Large scale syntactic annotation for Dutch. Gertjan van Noord University of Groningen




Context: Wide-coverage Parsing
- Assign syntactic structure to a sentence
- A necessary step in determining its meaning

Context: Wide-coverage Parsing (2)
- Met de verrekijker zie ik de man (With the binoculars, I see the man)
- De man met de verrekijker zie ik (The man with the binoculars, I see)

[Figure: dependency structures for the two sentences. In the first, the PP "met de verrekijker" is a modifier (mod) of the verb "zie"; in the second, it is a modifier inside the object NP "de man".]

Context: Wide-coverage Parsing (3)
- Dit is de vrouw die de mannen hebben gezien (This is the woman whom the men have seen)
- Dit is de vrouw die de mannen heeft gezien (This is the woman who has seen the men)

[Figure: dependency structures for the two sentences. With "hebben", the relative pronoun "die" is the direct object (obj1) of "zie" and "de mannen" is the subject (su); with "heeft", "die" is the subject and "de mannen" is the direct object.]


Parsing: state of the art
- Full parsing is: fragile, slow, inaccurate
- This is no longer true!
- Improvements: robustness, efficiency, disambiguation
- Corpora!

Syntactic Annotation - past
- Penn Treebank (1989)
- By linguists (students)
- Resource for NLP research/development:
  - train (statistical) models
  - evaluate (statistical) models
- Revolution in NLP

Syntactic Annotation - this talk
- For Dutch
- Manually corrected: by linguists (students), with the Alpino parser and related tools
- Fully automatic: with the Alpino parser and related tools; huge; many more applications

Overview
- Syntactically annotated corpora are great!
- Small manually corrected treebanks:
  - ... for disambiguation in a parser
- Huge automatically created treebanks:
  - ... for improved disambiguation in a parser
  - ... for corpus linguistics
  - ... for information extraction / question answering

Alpino
- Parser for Dutch
- Characteristics: wide-coverage, robust, accurate
- Formalism: Stochastic Attribute Value Grammar
  - linguistic sophistication
  - principled account of disambiguation
- Output: CGN Dependency Structures

CGN Dependency Structures
- CGN: Corpus of Spoken Dutch
- Abstract representation of syntactic analysis
- De facto standard
- Hierarchical information: which words belong together
- Relational information: su, obj1, obj2, pc, ...
- Categorial information: np, pp, smain, ...

Vier jonge Rotterdammers willen deze zomer per auto naar Japan (Four young people from Rotterdam want to go to Japan by car this summer)
[Figure: dependency structure. "Vier jonge Rotterdammers" is the subject (su) of "wil"; "deze zomer" and "per auto" are modifiers (mod); "naar Japan" is a locative/directional complement (ld).]

Er was een tijd dat Amerika met bossen overdekt was (There was a time when America was covered with forests)
[Figure: dependency structure. The subject NP "een tijd" contains the clausal complement (vc) "dat Amerika met bossen overdekt was", in which "met bossen" modifies the participle "overdek".]

Extrinsic Motivation

corpus                                  # sentences   length   accuracy %   exact %
Alpino Treebank (newspaper)             7136          20       89.1         41.5
CLEF questions (tuned for questions)    1745          11       96.3         82.1
D-Coi-Gr Treebank                       8857          15       88.4         48.3
D-Coi WR-P-E-E (newsletters)            90            20       81.1         31.1
D-Coi WR-P-P-B (children's book)        276           7        93.5         79.0

Accuracy: in terms of named dependencies

Syntactic Analysis in Alpino
- Lexicon
  - over 200,000 entries (including many named entities)
  - extensive set of heuristics for unseen words and word sequences
  - mapped to attribute-value matrices
  - organized as an inheritance network
- POS-tagger removes unlikely lexical categories
- Grammar
  - rewrite rules where categories are attribute-value matrices
  - unification
  - rule set organized as an inheritance network
- Parser
  - constructs a parse forest: a compact representation of all possible parses
  - selects the best parse: disambiguation

Ambiguity in Alpino
[Figure: average number of readings (y-axis, 0 to 15,000) plotted against sentence length in words (x-axis, 5 to 15).]

Ambiguity
- The expected lexical and structural ambiguities
- Many, many, many unexpected, absurd ambiguities
- Many "don't care" ambiguities
- Longer sentences have millions of parses

Er was een tijd dat Amerika met bossen overdekt was
[Figure: the intended dependency structure. The subject NP "een tijd" contains the clausal complement (vc) "dat Amerika met bossen overdekt was".]

Er was een tijd dat Amerika met bossen overdekt was
[Figure: an absurd alternative parse in which "dat Amerika" is analysed as a predicative NP (predc), "dat" as a determiner, and "overdekt was" as an adjective plus the noun "was".]

Er was een tijd dat Amerika met bossen overdekt was
[Figure: another alternative parse in which "overdekt" is analysed as a predicative adjective (predc) of "ben" rather than as a passive participle.]

Vier jonge Rotterdammers willen deze zomer per auto naar Japan
[Figure: an absurd alternative parse of the same sentence as a verb-initial clause (sv1), in which "vier" is analysed as a verb taking an object NP.]

Door de overboeking vertrok een groep toeristen uit het hotel (Because of the overbooking, a group of tourists left the hotel)
[Figure: dependency structure. "Door de overboeking" is a modifier (mod), "een groep toeristen" is the subject (su) of "vertrek", and "uit het hotel" is a locative/directional complement (ld).]
- Zempléni: unambiguously literal sentence
- Alpino: 13 parses

Door de overboeking vertrok een groep toeristen uit het hotel
[Figure: an alternative parse in which "een groep" is the subject (su) and "toeristen uit het hotel" is the direct object (obj1) of "vertrek".]

Disambiguation Model
- Identify features for disambiguation: arbitrary characteristics of parses
- Training the model: assign a weight to each feature by
  - increasing the weights of features in the correct parse
  - decreasing the weights of features in incorrect parses
- Applying the model:
  - for each parse, sum the weights of the features occurring in it
  - select the parse with the highest sum
- Maximum Entropy
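The "applying the model" step can be sketched in a few lines of Python. This is a minimal illustration, not Alpino's implementation: the three weights are taken from the slides on good and bad features, but the assignment of features to the two parses is invented for the example, and the function names `score` and `select_best` are hypothetical.

```python
from collections import Counter

def score(parse_features, weights):
    """Sum weight * count over the features occurring in one parse."""
    return sum(weights.get(f, 0.0) * n for f, n in parse_features.items())

def select_best(parses, weights):
    """Return the index of the parse with the highest summed weight."""
    return max(range(len(parses)), key=lambda i: score(parses[i], weights))

# Weights as listed on the slides about good and bad features.
weights = {
    "f2(was,noun)": -0.0585366,
    "r3(np_det_n,2,was)": -0.0410466,
    "s1(subj_topic)": 0.039418,
}

# Hypothetical feature counts for two competing parses (assumption).
parses = [
    Counter({"s1(subj_topic)": 1}),
    Counter({"f2(was,noun)": 1, "r3(np_det_n,2,was)": 1}),
]

best = select_best(parses, weights)
print(best, [round(score(p, weights), 7) for p in parses])
```

Features without a learned weight contribute nothing here, which is one simple way to treat unseen features; the slides do not say how Alpino handles them.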

Training
- Requires a corpus of correct and incorrect parses
- Alpino Treebank:
  - newspaper part (cdbl) of the Eindhoven corpus
  - 145,000 words
  - manually checked syntactic annotations (Leonoor van der Beek, ...)
  - CGN Dependency Structures
- Generate all parses with Alpino, and use the treebank to classify each parse

Features
- Describe arbitrary properties of parses
- Need not be independent of each other
- Can encode a variety of linguistic (and other) preferences
- Linguistic insights!

Feature templates

r1(rule)                 Rule has been applied
r2(rule,n,subrule)       The N-th daughter of Rule is constructed by SubRule
r2_root(rule,n,word)     The N-th daughter of Rule is Word
r2_frame(rule,n,frame)   The N-th daughter of Rule is a word with subcat frame Frame
r3(rule,n,word)          The N-th daughter of Rule is headed by Word
mf(cat1,cat2)            Cat1 precedes Cat2 in the mittelfeld
f1(pos)                  POS-tag Pos occurs
f2(word,pos)             Word has POS-tag Pos
h(heur)                  Unknown-word heuristic Heur has been applied

Dependency feature templates

dep35(sub,role,word)     Sub is the Role dependent of Word
dep34(sub,role,pos)      Sub is the Role dependent of a word with POS-tag Pos
dep23(subpos,role,pos)   A word with POS-tag SubPos is the Role dependent of a word with POS-tag Pos

Some non-local features
- In a coordinated structure, the conjuncts are parallel or not
- In an extraction structure, the extraction is local or not
- In an extraction structure, the extracted element is a subject
- Constituent ordering in the mittelfeld:
  - pronoun precedes full np
  - accusative pronoun precedes dative pronoun
  - dative full np precedes accusative full np

Features indicating bad parses

  -0.0707213   h1(long)
  -0.0585366   f2(was,noun)
  -0.0507852   f2(tot,vg)
  -0.0497879   h1(decap(not_begin))
  -0.0494901   s1(extra_from_topic)
  -0.0411195   r3(np_det_n,2,was)
  -0.0410466   f2(op,prep)
  -0.0372584   f2(kan,noun)
  -0.0337606   h1(skip)

Features indicating good parses

  0.0741717   f2(en,vg)
  0.064064    dep35(en,vg,/obj1,prep,tussen)
  0.0549897   f2(word,verb(passive))
  0.0461192   r2(non_wh_topicalization(np),1,np_pron_weak)
  0.039418    s1(subj_topic)
  0.0387447   dep23(pron(wkpro,nwh),/su,verb)

Results Parse Selection
- Alpino Treebank
- Ten-fold cross-validation
- The model should select the best parse for each sentence out of at most 1000 parses per sentence
- Accuracy: proportion of correct named dependencies

Results Parse Selection

             accuracy %
baseline     61.5
oracle       89.2
model        84.0

rate: 81.5
exact: 55

Wrap up
- So far:
  - background about Alpino
  - manually annotated treebank to train and test the disambiguation component
- Next: applications of automatically constructed treebanks

Automatically constructed treebanks
- Corpora automatically annotated with the Alpino parser:
  - Twente News Corpus (TwNC): 500M words, newspapers
  - D-Coi: 55M words, including Dutch Wikipedia and Dutch Europarl
  - LASSY: 450M words, to be decided
- Interesting applications ...

TwNC

#sentences                   100%    30,000,000
#words                               500,000,000
#sentences without parse     0.2%    100,000
#sentences with fragments    8%      2,500,000
#single full parse           92%     27,500,000

Millions of dependency structures
- Compressed archives of XML files
- Pseudo-random access (dictd gzip)
- Storage requirements: 10% of original
- Mostly by Geert Kloosterman

Example
(Long lines are cut off as on the slide.)

<?xml version="1.0" encoding="iso-8859-1"?>
<top>
  <node rel="top" cat="smain" begin="0" end="10">
    <node rel="su" frame="determiner(het,nwh,nmod,pro,nparg)" pos="det" begin="0" end="1" root="dat" word="
    <node rel="" frame="verb(hebben,past(sg),transitive)" pos="verb" begin="1" end="2" root="wek" word="w
    <node rel="obj1" cat="np" begin="2" end="10">
      <node rel="det" frame="determiner(de)" pos="det" begin="2" end="3" root="de" word="de" infl="de"/>
      <node rel="" frame="noun(de,both,sg)" pos="noun" begin="3" end="4" root="woede" word="woede" gen="d
      <node rel="mod" cat="pp" begin="4" end="10">... </node>
    </node>
  </node>
  <sentence>dat wekte de woede van Turkse inwoners van de wijk.</sentence>
  <comments>
    <comment>q#ad19940103-0125-776-2 Dat wekte de woede van Turkse inwoners van de wijk. 1 1-0.0396969573
  </comments>
</top>

Treebank Tools
- DtView
- DtEdit
- DtSearch

DtView

DtSearch
- XPath standard
- Search queries can refer to:
  - hierarchical relations
  - grammatical relations
  - syntactic category
  - surface order
  - lemma, other attributes
- Matches:
  - display sentence
  - display sentence with brackets
  - display matching part of sentence
  - your own style sheets

DtSearch Example

dtsearch -s -q //node[../@cat="smain" and @rel="obj2" and not(@cat="pp") and ./@begin = ../@begin] .

[Haar] ging het goed af.
[Ons] staat helemaal geen Big Brother-scenario voor ogen.
[Ook hun] past enige schroom.
[Zelfs de bloeddorstigste tegenstander] adviseerde hij nog zijn gedrag wat aan te passen.
[Die] geef ik voor de wedstrijd een zoen...
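The query selects indirect objects (obj2) that are not PPs and that start at the same position as their main clause, that is, fronted indirect objects. The following Python sketch applies the same four conditions by hand, because the standard library's ElementTree supports only a subset of XPath. It is not DtSearch itself, and the XML is a hand-written, simplified stand-in for an Alpino analysis (assumption): only the attributes the query needs are filled in, and the relation labels other than obj2 are illustrative.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified tree for "Haar ging het goed af" (assumption,
# not actual Alpino output).
XML = """<top>
  <node rel="top" cat="smain" begin="0" end="5">
    <node rel="obj2" pos="pron" begin="0" end="1" word="Haar"/>
    <node rel="hd" pos="verb" begin="1" end="2" word="ging"/>
    <node rel="su" pos="pron" begin="2" end="3" word="het"/>
    <node rel="mod" pos="adj" begin="3" end="4" word="goed"/>
    <node rel="svp" pos="part" begin="4" end="5" word="af"/>
  </node>
</top>"""

def fronted_obj2(root):
    """Nodes with rel=obj2, cat != pp, whose parent is an smain starting at the same position."""
    hits = []
    for parent in root.iter("node"):
        if parent.get("cat") != "smain":
            continue
        for child in parent.findall("node"):
            if (child.get("rel") == "obj2"
                    and child.get("cat") != "pp"
                    and child.get("begin") == parent.get("begin")):
                hits.append(child)
    return hits

def words(node):
    """Surface words below a node, in sentence order."""
    leaves = [n for n in node.iter("node") if n.get("word")]
    return [n.get("word") for n in sorted(leaves, key=lambda n: int(n.get("begin")))]

matches = [words(n) for n in fronted_obj2(ET.fromstring(XML))]
print(matches)
```

A library with full XPath 1.0 support could evaluate the original query string directly; the manual loop is used here only to keep the sketch free of third-party dependencies.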


Application: Selection Restrictions for Improved Disambiguation
- Use an automatically parsed corpus to learn selection restrictions
- Example: Bier drinkt de vrouw (Beer, the woman drinks)
- Lexical features:
  - dep35(woman,obj1,drink)
  - dep35(beer,su,drink)
  - dep35(woman,su,drink)
  - dep35(beer,obj1,drink)
- Such features are too infrequent to be useful; the training corpus is too small to estimate weights for them

Some Actually Occurring Bad Parses
(1) a. Campari moet u gedronken hebben
       Campari must have drunk you
       You must have drunk Campari
    b. De wijn die Elvis zou hebben gedronken als hij wijn zou hebben gedronken
       The wine Elvis would have drunk if he had drunk wine
       The wine that would have drunk Elvis if he had drunk wine
    c. De paus heeft tweehonderd daklozen te eten gehad
       The pope had two hundred homeless people for dinner

Extract lexical dependencies
[Figure: dependency structure of a wh-question (whq) with the lemmas waar, en, wanneer, drink, Elvis, wijn; the coordination "waar en wanneer" is co-indexed with a modifier (mod) of "drink".]

Extracted dependencies:
  crd/cnj(en, waar)
  crd/cnj(en, wanneer)
  w/body(en, drink)
  /mod(drink, en)
  /obj1(drink, wijn)
  /su(drink, Elvis)
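The core of the extraction can be sketched as a recursive walk that pairs the head lemma of each phrase with the head lemma of each of its other daughters. This is a simplified illustration under stated assumptions: the tree is a hand-built nested dictionary rather than Alpino XML, the head daughter is marked with the relation `hd` (the label Alpino uses for heads), and the coordination and co-indexing cases shown on the slide are not handled. The function names are hypothetical.

```python
def head_lemma(node):
    """Lemma of a word node, or of the head daughter of a phrase."""
    if "root" in node:
        return node["root"]
    for child in node.get("children", []):
        if child["rel"] == "hd":
            return head_lemma(child)
    return None

def dependencies(node):
    """Collect (relation, head lemma, dependent lemma) triples below a node."""
    triples = []
    head = head_lemma(node)
    for child in node.get("children", []):
        dep = head_lemma(child)
        if child["rel"] != "hd" and head is not None and dep is not None:
            triples.append((child["rel"], head, dep))
        triples.extend(dependencies(child))
    return triples

# Hand-built tree for "Elvis drinkt de wijn" (assumption, not Alpino output).
tree = {
    "rel": "top", "cat": "smain", "children": [
        {"rel": "su", "root": "Elvis"},
        {"rel": "hd", "root": "drink"},
        {"rel": "obj1", "cat": "np", "children": [
            {"rel": "det", "root": "de"},
            {"rel": "hd", "root": "wijn"},
        ]},
    ],
}

triples = dependencies(tree)
print(triples)
```

Counting such triples over a large parsed corpus gives the frequencies that the association scores on the following slides are computed from.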

Number of lexical dependencies

tokens                       480,000,000
types                        100,000,000
types with frequency 20      2,000,000

Bilexical preference
- Pointwise Mutual Information (Fano 1961, Church and Hanks 1990):

  $$I(r(w_1, w_2)) = \log \frac{f(r(w_1, w_2))}{f(r(w_1, \_))\, f(\_(\_, w_2))}$$

- Compare the actual frequency with the expected frequency
- Example: I(/obj1(drink, melk))
  - f(/obj1(drink, melk)): 195
  - f(/obj1(drink, _)): 15713
  - f(_(_, melk)): 10172
  - expected: 0.34
  - the actual frequency is about 560 times as big
  - its log: 6.3
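The worked example can be checked with a short Python sketch. The three counts come from the slide; the corpus total N is an assumption, taken here to be the 480,000,000 dependency tokens reported on the previous slide, and the natural logarithm is assumed because it matches the reported value of about 6.3. With these choices the results come out close to, though not exactly at, the rounded figures on the slide.

```python
import math

def expected_count(head_count, dep_count, total):
    """Count expected if head and dependent co-occurred independently."""
    return head_count * dep_count / total

def pmi(pair_count, head_count, dep_count, total):
    """Natural log of the ratio between observed and expected count."""
    return math.log(pair_count / expected_count(head_count, dep_count, total))

N = 480_000_000          # assumption: total number of dependency tokens
pair = 195               # f(/obj1(drink, melk))
head = 15713             # f(/obj1(drink, _))
dep = 10172              # f(_(_, melk))

expected = expected_count(head, dep, N)
score = pmi(pair, head, dep, N)
print(round(expected, 2), round(pair / expected), round(score, 2))
```

Working with raw counts and an explicit N is equivalent to using relative frequencies in the formula, since the factors of N cancel down to a single one.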

Examples of high bilexical preferences

bijltje            gooi neer      13
duimschroef        draai aan      13
peentje            zweet          13
traantje           pink weg       13
boontje            dop            12
centje             verdien bij    12
champagne fles     ontkurk        12
dorst              les            12

Examples of high scoring objects of drink

biertje      small glass of beer        8
borreltje    strong alcoholic drink     8
glaasje      small glass                8
pilsje       small glass of beer        8
pintje       small glass of beer        8
pint         glass of beer              8
wijntje      small glass of wine        8
alcohol      alcohol                    7
bier         beer                       7

Lexical preferences between verbs and modifiers

overlangs           snijd door     12
welig               tier           12
dunnetjes           doe over       11
stief moederlijk    bedeel         11
on zedelijk         betast         11
stierlijk           verveel        11
cum laude           studeer af     10
hermetisch          grendel af     10
ingespannen         tuur           10
instemmend          knik           10
kostelijk           amuseer        10

Lexical preferences between nouns and adjectives

endoplasmatisch     reticulum
zelfrijzend         bakmeel
waterbesparende     douchekop
ongeblust           kalk
onbevlekt           ontvangenis
ingegroeid          teennagel
knapperend          haardvuur
geconsacreerde      hostie
bezittelijk         voornaamwoord
pientere            pookje
afgescheurde        kruisband
beklemtoond         lettergreep


Can you guess?

put               bodemloze
sponde            echtelijke
bandiet           eenarmige
zelfverrijking    exhibitionistische
vuist             gebalde
wenkbrauw         gefronst
nonsens           baarlijke, klinkklare
veldtocht         tiendaagse


Using association scores as disambiguation features
- New features z(p, r) for each POS-tag p and dependency r
- The feature applies if there is an r-dependency between word w1 (with POS-tag p) and word w2
- The count of this feature is given by I(r(w1, w2))
- Only for positive I
- NB: limited number of features; the treebank is large enough to estimate their weights

Example
- Melk drinkt de baby niet (Milk, the baby does not drink)
- Analysis 1: z(verb,/obj1)=6, z(verb,/su)=3
- Analysis 2: z(verb,/obj1)=0, z(verb,/su)=0
- Weight of z(verb,/obj1): 0.0101179
- Weight of z(verb,/su): 0.00877976
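The arithmetic behind this example is small enough to spell out. The sketch below uses the counts and weights from the slide and computes only the contribution of the two z features to each analysis; all other features of the real model are left out, and the names `z_count` and `contribution` are hypothetical.

```python
def z_count(association_score):
    """Association scores are used as feature counts only when positive."""
    return max(association_score, 0.0)

def contribution(z_values, weights):
    """Weighted sum of the z feature counts of one analysis."""
    return sum(weights[name] * z_count(value) for name, value in z_values.items())

# Weights and counts as given on the slide.
weights = {"z(verb,/obj1)": 0.0101179, "z(verb,/su)": 0.00877976}
analysis_1 = {"z(verb,/obj1)": 6, "z(verb,/su)": 3}
analysis_2 = {"z(verb,/obj1)": 0, "z(verb,/su)": 0}

c1 = contribution(analysis_1, weights)
c2 = contribution(analysis_2, weights)
print(round(c1, 8), c2)
```

Analysis 1 gains roughly 0.087 from these two features while analysis 2 gains nothing, so the association scores push the model towards the first analysis.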

Experiment 1
Ten-fold cross-validation, Alpino Treebank

                  fscore %   err.red. %   exact %   CA %
standard          87.41      74.60        52.0      87.02
+self-training    87.91      77.38        54.8      87.51

Experiment 2
Full system, D-Coi Treebank (Trouw newspaper)

                  prec %   rec %   fscore %   CA %
standard          90.77    90.49   90.63      90.32
+self-training    91.19    90.89   91.01      90.73


Application: Extraposition of comparatives out of topic
- Reviewer: extraposition of a comparative out of topic is impossible:
  *Lager was de koers nog nooit dan bij opening (The price was never lower than at opening)
- The Alpino grammar allows this
- We can search for the relevant pattern

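Dependency Structure
[Figure: dependency structure of "Lager was de koers nog nooit dan bij opening". The fronted predicative AP (predc) is headed by "laag" and contains the comparative complement (obcomp) "dan bij opening"; "de koers" is the subject (su), and "nog" and "nooit" are adverbial modifiers (mod).]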

DtSearch queries

//node[@cat="smain" and ./node[./node[@rel="obcomp"]]/@begin = @begin]

//node[@cat="smain" and ./node[./node[@rel="obcomp"]/@end > ../node[@rel=""]/@begin]/@begin = @begin]

Extraposed obcomp out of topic
- Liever benadrukt hij die tegenstellingen dan de bedriegelijke harmonie
- Nog eerder zal de machtige Mekong droogvallen dan dat de co-premier zijn macht uit handen geeft
- Zo intens lelijk zijn mijn voeten in de loop van een decennium geworden dat ik de mensenmassa's op het strand er in de zomer niet mee wil lastigvallen
- Eerder brengt men een hemel vol wolken in kaart dan dit oeuvre
- Veel eerder vindt er een herschikking in het midden plaats dan dat er werkelijk massaal uit dat midden wordt gevlucht
- Eerder is er sprake van het kabinet ondanks Kok dan het kabinet-kok
- Liever sluis ik honderden en honderden guldens door aan loodgieter, fietsenmaker en elektricien dan dat ik zelf ook maar één vinger uitsteek naar het fonteintje bij het toilet, een kapot achterlicht of een weigerende stofzuiger
- liever waren ze onafhankelijk dan dat ze zich aan iemand bonden
- Liever is Jim schuldig aan een sprong, dan de prooi van een aanvechting

- eerder gaat zoo'n kameel door het oog van een naald, dan dat een rijke in zou gaan in het koninkrijk der hemelen


Application: Question Answering, and Similar Words
(2) By whom was John Lennon killed?
(3) Where was he killed?
(4) How often was he hit?
(5) What are Google-bombs?
(6) How high is the Dom-tower in Utrecht?
(7) In what year did its construction start?
(8) Who was the first architect?

Background
- QA system based on Alpino: JOOST
- Best result in CLEF2005 for Dutch; third result overall
- Best result in CLEF2006 for Dutch; Dutch was made more difficult than other languages
- No results known yet for CLEF2007


Strategy
- Analyse the question into a dependency structure
- Compare this dependency structure with the dependency structures of all potential answers
- Potential answers are paragraphs returned by IR from newspaper texts and Dutch Wikipedia
- Use many other techniques in addition
- Ontological information

Ontological information for QA
(9) Who is Javier Solana?
(10) Which soccer player won the Golden Ball in 1999?
(11) In which American state is Iron Mountain?
(12) Which French president opened the Channel Tunnel?

Discover Ontological Information
- Similar words occur in similar contexts
- Dependency relations: a more fine-grained notion of context
  - subject-verb
  - verb-object
  - adjective-noun
  - coordination
  - apposition
  - prepositional complement

Vectors describing contexts
- Every word is represented by an n-dimensional vector
- Every dimension is a context characteristic
- Every cell is a (function of the corresponding) frequency

         zie.obj   verf.obj   verzorg.obj   laat uit.obj   ...
bus      50        5          1             0              ...
hond     56        1          5             8              ...
truck    43        4          0             0              ...

Similarity Measure
- Dice: $\mathrm{Dice}(v, w) = \dfrac{2 \sum_i \min(v_i, w_i)}{\sum_i (v_i + w_i)}$
- other possibilities ...
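As an illustration, the Dice measure can be applied to the three example vectors from the slide on context vectors. The sketch assumes the min-based form of Dice (twice the summed minima divided by the summed totals) and uses only the four dimensions visible on that slide, so the resulting numbers are illustrative rather than results from the actual system.

```python
def dice(v, w):
    """Min-based Dice coefficient between two non-negative count vectors."""
    shared = 2 * sum(min(a, b) for a, b in zip(v, w))
    total = sum(a + b for a, b in zip(v, w))
    return shared / total if total else 0.0

# Columns: zie.obj, verf.obj, verzorg.obj, laat uit.obj (first four dimensions only).
vectors = {
    "bus":   [50, 5, 1, 0],
    "hond":  [56, 1, 5, 8],
    "truck": [43, 4, 0, 0],
}

bus_truck = dice(vectors["bus"], vectors["truck"])
bus_hond = dice(vectors["bus"], vectors["hond"])
hond_truck = dice(vectors["hond"], vectors["truck"])
print(round(bus_truck, 3), round(bus_hond, 3), round(hond_truck, 3))
```

On these counts "bus" comes out closer to "truck" than to "hond", which is the kind of ranking the nearest-neighbour lists on the later slides are built from. Raw frequencies are used here; the next slide mentions mutual information as an alternative cell weighting.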

Feature Weights
- frequency
- mutual information
- other possibilities ...

Data used

subject-verb                5,639,140
verb-object                 2,642,356
adjective-noun              3,262,403
coordination                965,296
apposition                  526,337
prepositional complement    770,631

Results for BMW
Volkswagen, Mercedes, Honda, Chrysler, Audi, Volvo, Ford, Toyota, Fiat, Peugeot, Opel, Mitsubishi, Renault, Mazda, Jaguar, General Motors, Rover, Nissan, VW, Porsche

Results for Sony
Matsushita, Toshiba, Time Warner, JVC, Hitachi, Nokia, Samsung, Motorola, Philips, Siemens, Apple, Canon, IBM, PolyGram, Thomson, Mitsubishi, Kodak, Pioneer, AT&T, Sharp

Hinault
Kübler, Vermandel, Bruyère, Depredomme, Mottiat, Merckx, Depoorter, De Bruyne, Argentin, Schepers, Criquielion, Dierickx, Van Steenbergen, Kint, Bartali, Ockers, Coppi, Fignon, Kelly, De Vlaeminck

Beatles
Rolling Stones, Stones, John Lennon, Jimi Hendrix, Tina Turner, Bob Dylan, Elvis Presley, Michael Jackson, The Beatles, David Bowie, Prince, Genesis, Mick Jagger, The Who, Elton John, Barbra Streisand, Led Zeppelin, Eric Clapton, Diana Ross, Janis Joplin

Paris
Londen, Brussel, Moskou, Washington, Berlijn, New York, Rome, Madrid, Bonn, Wenen, Peking, Frankfurt, Athene, Tokio, München, Barcelona, Praag, Antwerpen, Stockholm, Tokyo

Grenoble
Rouen, Saint Etienne, Pau, Saint-Etienne, Rennes, Marne-la-Vallée, Aix, Orléans, Toulouse, Montpellier, Amiens, Strasbourg, Lyon, Lens, Avignon, Clermont-Ferrand, Straatsburg, Caen, Bayonne, Limoges

Results for Wim Kok
Elco Brinkman, Frits Bolkestein, Hans van Mierlo, W. Kok, Kok, Ruud Lubbers, Den Uyl, John Major, Jacques Wallage, Wallage, Thijs Wöltgens, Hedy d'Ancona, Relus ter Beek, Klaus Kinkel, Balladur, Kinkel, Van Mierlo, Jacques Chirac, Kooijmans, Jan Pronk

huis (house)
woning, gebouw, pand, auto, straat, kantoor, kamer, boerderij, tuin, winkel, kerk, brug, huisje, appartement, hotel, flat, muur, boom, paleis, villa
(house, building, house, car, street, office, room, farm, garden, shop, church, bridge, small house, apartment, hotel, flat, wall, tree, palace, villa)

verliefdheid (infatuation, love)
jaloezie, verraad, afgunst, weerzin, romance, hartstocht, overspel, passie, erotiek, vriendschap, obsessie, schuldgevoelen, fascinatie, vergankelijkheid, seksualiteit, animositeit, seks, lust, verlangen, zeeroof
(jealousy, treason, envy, dislike, romance, passion, adultery, passion, eroticism, friendship, obsession, feelings of guilt, fascination, transience, sexuality, animosity, sex, lust, desire, piracy)

witlof (chicory)
broccoli, prei, spruitje, knolselderij, andijvie, courgette, sperzieboon, zuurkool, worteltje, bleekselderij, bloemkool, snijboon, aubergine, peen, zilveruitje, ijsbergsla, koolsoort, winterpeen, doperwtjes, komkommer
(broccoli, leek, sprout, celeriac, endive, zucchini, butter bean, sauerkraut, carrot, blanched celery, cauliflower, haricot, aubergine, carrot, onion, iceberg lettuce, cabbage, carrot, peas, cucumber)

Conclusion
- Syntactically annotated corpora are perhaps potentially somewhat useful

It's Free!
http://www.let.rug.nl/vannoord/alp/alpino/
http://www.let.rug.nl/vannoord/trees/
http://www.let.rug.nl/vdplas/sets/browse.php
http://www.let.rug.nl/gosse/sets/