C-value method for multi-word term extraction

Vergelijkbare documenten

SAMPLE 11 = + 11 = + + Exploring Combinations of Ten + + = = + + = + = = + = = 11. Step Up. Step Ahead

Add the standing fingers to get the tens and multiply the closed fingers to get the units.

Luister alsjeblieft naar een opname als je de vragen beantwoordt of speel de stukken zelf!

LDA Topic Modeling. Informa5ekunde als hulpwetenschap. 9 maart 2015

Chapter 4 Understanding Families. In this chapter, you will learn

FOR DUTCH STUDENTS! ENGLISH VERSION NEXT PAGE. Toets Inleiding Kansrekening 1 8 februari 2010

MyDHL+ Van Non-Corporate naar Corporate

ALGORITMIEK: answers exercise class 7

Classification of triangles

Engels op Niveau A2 Workshops Woordkennis 1

Question-Driven Sentence Fusion is a Well-Defined Task. But the Real Issue is: Does it matter?

2019 SUNEXCHANGE USER GUIDE LAST UPDATED

Bin packing and scheduling

MyDHL+ ProView activeren in MyDHL+

(1) De hoofdfunctie van ons gezelschap is het aanbieden van onderwijs. (2) Ons gezelschap is er om kunsteducatie te verbeteren

General info on using shopping carts with Ingenico epayments

Pesten onder Leerlingen met Autisme Spectrum Stoornissen op de Middelbare School: de Participantrollen en het Verband met de Theory of Mind.

Settings for the C100BRS4 MAC Address Spoofing with cable Internet.

Preschool Kindergarten

Four-card problem. Input

After that, the digits are written after each other: first the row numbers, followed by the column numbers.

TECHNISCHE UNIVERSITEIT EINDHOVEN Faculteit Wiskunde en Informatica. Examination 2DL04 Friday 16 november 2007, hours.

Handleiding Installatie ADS

Group work to study a new subject.

FOD VOLKSGEZONDHEID, VEILIGHEID VAN DE VOEDSELKETEN EN LEEFMILIEU 25/2/2016. Biocide CLOSED CIRCUIT

Calculator spelling. Assignment

The genesis of the game is unclear. Possibly, dominoes originates from China and the stones were brought here by Marco Polo, but this is uncertain.

04/11/2013. Sluitersnelheid: 1/50 sec = 0.02 sec. Frameduur= 2 x sluitersnelheid= 2/50 = 1/25 = 0.04 sec. Framerate= 1/0.

LONDEN MET 21 GEVARIEERDE STADSWANDELINGEN 480 PAGINAS WAARDEVOLE INFORMATIE RUIM 300 FOTOS KAARTEN EN PLATTEGRONDEN

Bijlage 2: Informatie met betrekking tot goede praktijkvoorbeelden in Londen, het Verenigd Koninkrijk en Queensland

TOEGANG VOOR NL / ENTRANCE FOR DUTCH : lator=c&camp=24759

FOR DUTCH STUDENTS! ENGLISH VERSION NEXT PAGE

The first line of the input contains an integer $t \in \mathbb{n}$. This is followed by $t$ lines of text. This text consists of:

Data Handling Ron van Lammeren - Wageningen UR

SHICO: SHIFTING CONCEPTS OVER TIME

Cambridge Assessment International Education Cambridge International General Certificate of Secondary Education. Published

8+ 60 MIN Alleen te spelen in combinatie met het RIFUGIO basisspel. Only to be played in combination with the RIFUGIO basicgame.

Firewall van de Speedtouch 789wl volledig uitschakelen?

Genetic code. Assignment

Appendix A: List of variables with corresponding questionnaire items (in English) used in chapter 2

Het beheren van mijn Tungsten Network Portal account NL 1 Manage my Tungsten Network Portal account EN 14

0515 DUTCH (FOREIGN LANGUAGE)

GOVERNMENT NOTICE. STAATSKOERANT, 18 AUGUSTUS 2017 No NATIONAL TREASURY. National Treasury/ Nasionale Tesourie NO AUGUST

How to install and use dictionaries on the ICARUS Illumina HD (E652BK)

z x 1 x 2 x 3 x 4 s 1 s 2 s 3 rij rij rij rij

Eye Feature Detection Towards Automatic Strabismus Screening

Intercultural Mediation through the Internet Hans Verrept Intercultural mediation and policy support unit

FOR DUTCH STUDENTS! ENGLISH VERSION NEXT PAGE

FOR DUTCH STUDENTS! ENGLISH VERSION NEXT PAGE

FOR DUTCH STUDENTS! ENGLISH VERSION NEXT PAGE. Toets Inleiding Kansrekening 1 7 februari 2011

Quality requirements concerning the packaging of oak lumber of Houthandel Wijers vof ( )

ETS 4.1 Beveiliging & ETS app concept

B1 Woordkennis: Spelling

ANGSTSTOORNISSEN EN HYPOCHONDRIE: DIAGNOSTIEK EN BEHANDELING (DUTCH EDITION) FROM BOHN STAFLEU VAN LOGHUM

Introductie in flowcharts

Example. Dutch language lesson. Dutch & German Language Education Pieter Wielick

NMOZTMKUDLVDKECVLKBVESBKHWIDKPDF-WWUS Page File Size 9,952 KB 29 May, 2016

I.S.T.C. Intelligent Saving Temperature Controler

OUTDOOR HD BULLET IP CAMERA PRODUCT MANUAL

Opgave 2 Geef een korte uitleg van elk van de volgende concepten: De Yield-to-Maturity of a coupon bond.

L.Net s88sd16-n aansluitingen en programmering.

Shipment Centre EU Quick Print Client handleiding [NL]

Karen J. Rosier - Brattinga. Eerste begeleider: dr. Arjan Bos Tweede begeleider: dr. Ellin Simon

Handleiding Zuludesk Parent

!!!! Wild!Peacock!Omslagdoek!! Vertaling!door!Eerlijke!Wol.!! Het!garen!voor!dit!patroon!is!te!verkrijgen!op! Benodigdheden:!!

Ae Table 1: Aircraft data. In horizontal steady flight, the equations of motion are L = W and T = D.

Lichamelijke factoren als voorspeller voor psychisch. en lichamelijk herstel bij anorexia nervosa. Physical factors as predictors of psychological and

OPEN TRAINING. Onderhandelingen met leveranciers voor aankopers. Zeker stellen dat je goed voorbereid aan de onderhandelingstafel komt.

i(i + 1) = xy + y = x + 1, y(1) = 2.

CHROMA STANDAARDREEKS

Borstkanker: Stichting tegen Kanker (Dutch Edition)

Ontpopping. ORGACOM Thuis in het Museum

RECEPTEERKUNDE: PRODUCTZORG EN BEREIDING VAN GENEESMIDDELEN (DUTCH EDITION) FROM BOHN STAFLEU VAN LOGHUM

Innovaties in de chronische ziekenzorg 3e voorbeeld van zorginnovatie. Dr. J.J.W. (Hanneke) Molema, Prof. Dr. H.J.M.

Het is geen open boek tentamen. Wel mag gebruik gemaakt worden van een A4- tje met eigen aantekeningen.

CBSOData Documentation

Competencies atlas. Self service instrument to support jobsearch. Naam auteur

0515 FOREIGN LANGUAGE DUTCH

Basic operations Implementation options

Surveys: drowning in data?

DALISOFT. 33. Configuring DALI ballasts with the TDS20620V2 DALI Tool. Connect the TDS20620V2. Start DALISOFT

Installatie van Windows 10 op laptops. Windows 10 installation on laptops

My Inspiration I got my inspiration from a lamp that I already had made 2 years ago. The lamp is the you can see on the right.

Global TV Canada s Pulse 2011

TRYPTICH PROPOSED AMENDMENT TO THE ARTICLES OF ASSOCIATION ("AMENDMENT 2") ALTICE N.V.

Zo werkt het in de apotheek (Basiswerk AG) (Dutch Edition)

3e Mirror meeting pren April :00 Session T, NVvA Symposium

Datamodelleren en databases 2011

Understanding and being understood begins with speaking Dutch

Activant Prophet 21. Prophet 21 Version 12.0 Upgrade Information

Academisch schrijven Inleiding

Using association statistics to identify fixed expressions

Chromosomal crossover

Ir. Herman Dijk Ministry of Transport, Public Works and Water Management

Impact en disseminatie. Saskia Verhagen Franka vd Wijdeven

Tentamen Objectgeorienteerd Programmeren

Kritisch Denken van Informatie

L.Net s88sd16-n aansluitingen en programmering.

Researchcentrum voor Onderwijs en Arbeidsmarkt The role of mobility in higher education for future employability

Intermax backup exclusion files

Transcriptie:

C-value method for multi-word term extraction Ismail Fahmi i.fahmi@rug.nl Alfa-informatica, RuG Seminar in Statistics and Methodology May 23, 2005 1

Agenda Example and Objective Terms vs Words Methods of Term Extraction C-value Method Experiment: Corpus Pre-processing Filtering Calculate C-value Evaluation Conclusion 2

Example Given a sentence from the Elsevier's Medical Encyclopedia: Bij chronische myeloide leukemie bestaan de klachten meestal uit vermoeidheid (als gevolg van bloedarmoede) en een opgezette milt. In case of chronic myeloide leukaemia the complaints generally consist of fatigue (as a result of anaemia) and a swollen milt. We want to extract terms from the sentence: chronische myeloide leukemie, bloedarmoede 3

Terms vs Words Terms:... refer to a defined concept... (ISO704). represent a limited number of part-ofspeeches: nouns, verbs, adjectives, and adverbs. Words: have functions in general reference... (Sager, 1990). represent all part-of-speeches. 4

Characteristics of Terms Kageura et al. 1996: Unithood: the degree of strength or stability of syntagmatic combinations or collocation. Termhood: the degree to which a linguistic unit is related to domain-specific concepts. 5

Methods of Term Extraction (1) Linguistic approaches, using linguistic filters, e.g.: Noun + Noun Adj + Noun Noun + suffix (-ase, -in) ( (Adj N)+ ( ( (Adj N)* (N Prep)? ) (Adj N)* ) ) N 6

Methods of Term Extraction (2) Statistical approaches: Unithood: Association ratio (close to the mutual information) Loglikelihood extracting word bigrams Termhood: compound-noun-based measure, Imp (Nakagawa, 1998) C-value (Frantzi et al., 1996) 7

C-value Method (1) C-value(a) = { log 2 a. f a SvOutPlaceObject log 2 a f a 1 P Ta b T a f b if a is not nested, otherwise. a b a f(a) Ta P(Ta) f(b) = candidate term (e.g., myeloide leukemie) = longer candidate terms (e.g., chronische myeloide leukemie) = length of the candidate term (number of words) = frequency of occurrence of a in the corpus = set of extracted candidate terms that contain a = number of candidate terms in Ta = frequency of occurrence of longer candidate term b in the corpus. 8

C-value method (2) C-value method: assigns a termhood to a candidate string using statistical characteristics of the candidate string: frequency of occurrence in the corpus frequency as part of other longer candidate terms number of these longer candidate terms the length of the candidate string (words) Furthermore, C-value method: enhances the common statistical measure (frequency of occurrence) frequency of occurrence only best for high frequency terms makes the measure sentitive to a particular type of multi-word terms, the nested terms C-value gives more weight to the nested terms 9

Experiment: Calculating C-value Step 1: Tag the corpus Step 2: Extract strings using a linguistic filter Step 3: Calculate the C-value 10

Step 1: Tag the corpus (1) Linguistic filter needs a tagged (POS) corpus Use Brill Tagger or Alpino Parser Corpus: Elsevier's Medical Encyclopedia 21000 sentences 367239 words 11

Step 1: Tag the corpus (2) Example of a record in the corpus: <med> <tw>leukemie</tw> <r>genhe</r> <bt> <media> <nua>me256</nua> <kolbr>1</kolbr>... </media>... <ka>symptomen</ka>... <bt> </med> <a>de verschijnselen van de acute vormen lijken veel op elkaar. Moeheid, bleekheid (als gevolg van bloedarmoede), koorts, keelontsteking, spontane blauwe plekken, lymfeklierzwellingen en vergroting van lever en milt zijn de voornaamste verschijnselen. Bij chronische leukemie is het begin van de ziekte vaak sluipend en duurt het vaak lang voordat de patiënt klachten krijgt. Bij chronische myeloide leukemie bestaan de klachten meestal uit vermoeidheid (als gevolg van <vw><st></st>bloedarmoede</vw>) en een opgezette milt. Bij chronische lymfatische leukemie kunnen tevens de lymfeklieren opgezet zijn en treden er vaak infecties op. </a> a = paragraph (to be extracted for further steps) The record is tagged in an XML structure. Some terms are already annotated with some defined tags, e.g., vw and st. 12

Step 1: Tag the corpus (3) Clean up the sentences and tokenize, for example: MEDENC-12227 Bij chronische myeloide leukemie bestaan de klachten meestal uit vermoeidheid ( als gevolg van bloedarmoede ) en een opgezette milt. Tag the sentence using Brill Tagger (Drenth, 1997): MEDENC-12227 Bij/Prep(voor) chronische/adj(attr,stell,verv_neut) myeloide/n(soort,ev,neut) leukemie/n(soort,ev,neut) bestaan/n (soort,ev,neut) de/art(bep,zijd_of_mv,neut) klachten/n (soort,mv,neut) meestal/adv(gew,geen_func,stell,onverv) uit/prep (voor) vermoeidheid/n(soort,ev,neut) (/Punc(haak_open) als/conj (onder,met_fin) gevolg/n(soort,ev,neut) van/prep(voor) bloedarmoede/n(soort,ev,neut) )/Punc(haak_sluit) en/conj(neven) een/art(onbep,zijd_of_onzijd,neut) opgezette/v (trans,verl_dw,verv_neut) milt/v(trans,ott,3,ev)./punc(punt) parsing error 13

Step 2: Extract candidate strings (1) extract strings of maximum length, after we do not find any string of this length, decrease the length by 1 and make a new attempt, remove the word tags, create a list of strings of each length with their frequency of occurrence, concatenate the lists; the longest string appear at the top. 14

Step 2: Extract candidate strings (2) Filter used in this experiment (Justeson, Katz, 1995): ( (Adj N)+ ( ( (Adj N)* (N Prep)? ) (Adj N)* ) ) N Implementation of the filter in a Perl extended regular expression: m{ ((?:(?:(?:(?:(?: (?:\w+ \/ Adj) (?:\w+ \/ N) )\s )+ (?:(?:(?:(?: (?:\w+ \/ Adj) (?:\w+ \/ N) )\s )* (?: (?:\w+ \/ N) \s (?:\w+ \/ Prep) \s )? ) ( (?:\w+ \/ Adj) (?:\w+ \/ N) )* )))) \w+ \/ N ) }xg 15

Step 2: Extract candidate strings (3) Bij chronische myeloide leukemie bestaan de klachten... Bij chronische myeloide myeloide leukemie bestaan chronische myeloide leukemie chronische myeloide Bij chronische myeloide leukemie... Filter myeloide leukemie bestaan chronische myeloide leukemie myeloide leukemie chronische myeloide 16

Step 2: Extract candidate strings (4) Candidate strings extracted from the example: MEDENC-12227 chronische/adj myeloide/n leukemie/n bestaan/n MEDENC-12227 chronische/adj myeloide/n leukemie/n MEDENC-12227 myeloide/n leukemie/n bestaan/n MEDENC-12227 gevolg/n van/prep bloedarmoede/n MEDENC-12227 chronische/adj myeloide/n MEDENC-12227 myeloide/n leukemie/n MEDENC-12227 leukemie/n bestaan/n MEDENC-12227 myeloide/n MEDENC-12227 leukemie/n MEDENC-12227 bestaan/n MEDENC-12227 klachten/n MEDENC-12227 vermoeidheid/n MEDENC-12227 gevolg/n MEDENC-12227 bloedarmoede/n 17

Step 2: Extract candidate strings (5) Remove the word tags and create a list of the candidate strings with their frequency of occurrence, sort by their length (descending). A list of candidate strings containing the word myeloide Freq Candidate string 1 CHRONISCHE MYELOIDE LEUKEMIE BESTAAN 4 CHRONISCHE MYELOIDE LEUKEMIE 2 ACUTE MYELOIDE LEUKEMIE 1 MYELOIDE LEUKEMIE BESTAAN 8 MYELOIDE LEUKEMIE 4 CHRONISCHE MYELOIDE 2 ACUTE MYELOIDE 8 MYELOIDE Freq >= 3 1089 >= 2 2618 >= 1 17383 # candidate strings 18

Step 3: Calculate the C-value (1) Algorithm (Frantzi et al., 2000): for all strings a of maximum length calculate C-value(a) = log 2 a. f(a); if C-value(a)>= Threshold add a to output list; for all substrings b revise t(b); revise c(b); for all smaller strings a in descending order if a apperas for the first time C-value(a) = log 2 a. f(a) else C-value(a) = log 2 a. (f(a) 1/c(a). t(a)) if C-value(a) >= Threshold add a to output list; for all substrings b revise t(b); revise c(b); 19

Step 3: Calculate the C-value (2) Start with the longest string, e.g.: a = CHRONISCHE MYELOIDE LEUKEMIE BESTAAN a = 4 f(a) = 1 C-value(CHRONISCHE MYELOIDE LEUKEMIE BESTAAN) = log 2 a.f(a) = log 2 4. 1 = 2 20

Step 3: Calculate the C-value (3) For a nested string, e.g.: a = CHRONISCHE MYELOIDE LEUKEMIE a = 3 f(a) = 4 Appear in a longer string: b = CHRONISCHE MYELOIDE LEUKEMIE BESTAAN f(b) = 1 P(Ta)= 1 (number of different longer strings) C-value(CHRONISCHE MYELOIDE LEUKEMIE) = log 2 a f a 1 P Ta b T a f b = log 2 3. (4-1/1) = 4.75 21

Step 3: Calculate the C-value (4) How to determine that myeloide leukemie is a term, while acute myeloide is not? a = MYELOIDE LEUKEMIE a = 2 f(a) = 8 is a substring of some longer strings: f(b): 1 CHRONISCHE MYELOIDE LEUKEMIE BESTAAN 4 CHRONISCHE MYELOIDE LEUKEMIE 2 ACUTE MYELOIDE LEUKEMIE 1 MYELOIDE LEUKEMIE BESTAAN ------------------------------- Σf(b) = 8 P(Ta) = 4 C-value(MYELOIDE LEUKEMIE) = log 2 a f a 1 P Ta b T a f b = log 2 2. (8-8/4) = 6 22

Step 3: Calculate the C-value (5) Now, we calculate the C-value for acute myeloide a = ACUTE MYELOIDE a = 2 f(a) = 2 is a substring of a longer string: f(b): 2 ACUTE MYELOIDE LEUKEMIE Σf(b) = 2 (total frequency of the longer strings) P(Ta) = 1 (number of different longer strings) C-value(ACUTE MYELOIDE) = log 2 a f a 1 P Ta b T a f b = log 2 2. (2-2/1) = 0 23

Step 3: Calculate the C-value (6) Acute myeloide only occurs as a substring of one longer term. Its high degreee of dependence to a term gives negative effect to its termhood. Myeloide leukemie occurs as a substring of many longer terms. This condition shows its high degree of independence, which then decreases the negative effect of being a substring. Intuition: The more often a candidate term occurs alone and longer its size, the higher its termhood. The more often a candidate term occurs as a substring, the lower its termhood. But, the more the number of longer terms in which the candidate term occurs, the higher its termhood. 24

Step 3: Calculate the C-value (7) The C-value of terms containing the word myeloide Rank C-val P(Ta) f(b) f(a) Candidate Term --------------------------------------------------------------------- 267. 6.00 4 8 8 MYELOIDE LEUKEMIE 392. 4.75 1 1 4 CHRONISCHE MYELOIDE LEUKEMIE 607. 3.17 0 0 2 ACUTE MYELOIDE LEUKEMIE 2397. 2.00 0 0 1 CHRONISCHE MYELOIDE LEUKEMIE BESTAAN 4485. 1.50 2 5 4 CHRONISCHE MYELOIDE 15940. 0.00 1 2 2 ACUTE MYELOIDE 16882. 0.00 1 1 1 MYELOIDE LEUKEMIE BESTAAN The C-value of the top 10 candidate terms in the rank Rank C-val P(Ta) f(b) f(a) Candidate Term ------ ------- ----- ----- ----- ------------------- 1. 69.83 6 7 71 DIKKE DARM 2. 62.00 2 2 63 DUNNE DARM 3. 60.78 9 11 62 RODE BLOEDCELLEN 4. 60.70 40 52 62 GROOT AANTAL 5. 55.50 10 15 57 WITTE BLOEDCELLEN 6. 50.00 0 0 50 ERNSTIGE GEVALLEN 7. 48.00 0 0 48 CENTRAAL ZENUWSTELSEL 8. 31.00 1 2 33 VOORNAAMSTE VERSCHIJNSELEN 9. 31.00 1 1 32 HOGE BLOEDDRUK 10. 30.00 0 0 30 BELANGRIJKE ROL 25

Method of Evaluation Gold standard 4950 terms 2638 candidate terms Precision Manual evaluation up to top 500 candidate terms (by Lonneke) 26

Evaluation Precision for 500 candidate terms by C-value and baseline (frequency) Majority of the candidate terms have never been seen as nested. Therefore, C-value treats them in a similar way to frequency. 27

Evaluation (2) Precision for 50 candidate terms that appeared as nested (substring) Number of nested candidate terms: C-value, 58; frequency, 117. The differences were assigned to 0 by the C- value. 28

Evaluation (3) Problem with the C-value in this experiment: C-val P(Ta) f(b) f(a) Candidate term ---------------------------------------------------------- 6.00 0 0 3 FAMILIAIRE ADENOMATEUZE POLYPOSIS COLI 1.00 2 6 4 POLYPOSIS COLI 0.00 1 3 3 FAMILIAIRE ADENOMATEUZE POLYPOSIS 0.00 3 9 3 ADENOMATEUZE POLYPOSIS 0.00 1 3 3 ADENOMATEUZE POLYPOSIS COLI A string which occurs in only one longer string will be assigned to 0 by the C-value, although it is a term. Solution (future work): Incorporate context information, by using part of speech elements of modifiers which appear before or after the candidate strings. 29

Conclusion The linguistic filter is very important because it contributes in the very early selection of candidate terms. Most of the candidate terms from the corpus are never seen as nested. Only 117 of 2638 are nested. Therefore C-value treated them in a similar way to frequency. C-value kicks out nested substrings. These candidate strings decrease the percision of the frequency. The corpus (encyclopedia) is well structured. Results on other corpora can not be claimed to be better or worse before we conduct experiments. 30