Tagging of very large corpora: Topic-Focus Articulation


After a brief characterization of the theory of the topic-focus articulation of the sentence (TFA), rules are formulated that determine the assignment of appropriate values of the TFA attribute in the process of syntactico-semantic tagging of a very large corpus of Czech.

1 Introduction: The Prague Dependency Treebank (PDT)

PDT is a corpus (a part from the Czech National Corpus), tagged on the following levels:

1. morphemic (POS annotations using a very large number of tags, as is required by the language, with rich inflection; cf. (Hajič and Hladká, 1997));

2. 'analytic' (dependency syntax, with nodes for all word occurrences, also for punctuation marks etc., and with the tags for morphemic units and for basic kinds of surface syntactic relations (Subject, Object, Adverbial, Adjunct), cf. (Hajič));

3. tectogrammatical (underlying) syntax, with a much more detailed classification of syntactic relations and with nodes for autosemantic lexical occurrences only (rather than function words), with indices corresponding to the syntactic relations, such as Actor, Addressee, Objective (Patient), Locative, Manner, Means, etc., and to morphological values such as Preterite (Anterior), Conditional, Plural, etc., and also as the prototypical values of 'in', 'into', 'on', 'from', etc.; correlates of functional words (and morphemes) on this level have the form of indices of lexical node labels.¹

2 Representing Topic Focus Articulation (TFA) in TGTSs

2.1 A brief characterization of TFA

The tectogrammatical tree structures (TGTSs) should capture not only the syntactic (dependency) relations, but also the TFA of the utterances in the corpus, since TFA is expressed by grammatical means and is relevant for the meaning of the sentence (even for its truth conditions), i.e. it constitutes one of the basic aspects of underlying structures. The semantic relevance of TFA can be illustrated by examples such as (1), which is a translation of the Czech ex. (1') (the capitals denote the placement of the intonation centre, i.e. the focus proper):²

(1) (a) English is spoken in the SHETLANDS.
    (b) In the Shetlands, ENGLISH is spoken.

(1') (a) Anglicky se mluví na Shetlandských OSTROVECH.

¹ An exception concerns coordinating conjunctions, which, in PDT, are treated as head nodes of the coordinated groups. This makes it possible to represent the tectogrammatical structures of all sentences as trees (rather than using more-dimensional networks); in this point, PDT differs from the theoretical assumptions of the Praguian Functional Generative Description (now discussed in (Hajičová et al., 1998)).

² In the prototypical case the intonation centre is characterized by falling (or rising-falling) stress, but there are also cases in which (similarly as in questions, to a certain degree) the centre has a rising stress. This concerns utterances displaying a feature of hesitation or incompleteness, cf. (M.,); often also with greetings (such as Czech Dobré jitro [Good morning]) a difference of this kind marks the 'starting' token, connected with the expectation of an answering token, which exhibits a falling stress. Although in a sentence containing occurrences of both a rising and a falling stress the former expresses a contrastive (part of) topic, we prefer to analyze it as the focus in a sentence without an occurrence of the latter; in such a position, the rising stress regularly is carried by an item referring to 'new' information. In written texts, some occurrences of the rising stress are marked by a semicolon or by '...'.
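
As an informal illustration of why TFA has to be recorded in a TGTS in addition to the dependency relations, the two readings of example (1) can be sketched as trees that share the same nodes and relations and differ only in which node carries the focus proper. This is a minimal sketch in Python, not the PDT annotation format: the class TectoNode, its field names, the helper functions and the label "Predicate" for the root are all assumptions made for the illustration, and the assignment of Objective and Locative to the two dependents is likewise only illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TectoNode:
    """One autosemantic lexical occurrence (hypothetical encoding, not PDT's)."""
    lemma: str
    functor: str  # syntactic relation, e.g. "Objective" or "Locative"
    # Correlates of function words and morphemes are stored as indices on
    # the lexical node rather than as separate nodes.
    indices: Dict[str, str] = field(default_factory=dict)
    is_focus_proper: bool = False  # bearer of the intonation centre
    children: List["TectoNode"] = field(default_factory=list)


def find_focus_proper(root: TectoNode) -> Optional[TectoNode]:
    """Return the first node marked as focus proper in a depth-first walk."""
    if root.is_focus_proper:
        return root
    for child in root.children:
        found = find_focus_proper(child)
        if found is not None:
            return found
    return None


def build_reading(focus_lemma: str) -> TectoNode:
    """Build a tree for example (1) with the focus proper on focus_lemma."""
    return TectoNode(
        lemma="speak",
        functor="Predicate",  # assumed label for the root of the tree
        children=[
            TectoNode("English", "Objective",
                      is_focus_proper=(focus_lemma == "English")),
            # The preposition 'in' is an index of the noun, not a node.
            TectoNode("Shetlands", "Locative", {"preposition": "in"},
                      is_focus_proper=(focus_lemma == "Shetlands")),
        ],
    )


reading_a = build_reading("Shetlands")  # (1)(a): ... in the SHETLANDS.
reading_b = build_reading("English")    # (1)(b): ... ENGLISH is spoken.
```

In this sketch the two trees are indistinguishable by lemma and functor alone, so the difference between (1)(a) and (1)(b) can only be stated through the extra marking on the nodes.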

