Language Resources for Information Extraction: demands and challenges in practice
Christos Tsalidis
tsalidis@neurolingo.gr

Different types of LRs
- Alphabets and character sets (Greek, English, mixed)
- Electronic dictionaries:
  - Vocabularies (gazetteers) as domain descriptors: person names, company names, places, job titles, etc.
  - Morphological lexica (lemma vs. word form)
  - Terminological lexica (term vs. lemma)
  - Thesauri (word sense vs. lemma): synonyms, antonyms, synonym sets
  - Taxonomies and ontologies: semantic categories and relations, inference rules
- Spell checking and fuzzy matching for identification of incorrect expressions
- Grammar rules: recognition of multi-word expressions and terms, named entities, specific events

Alphabets
Extra information is needed to complete the functionality provided by OSes and development libraries:
- Letter definition
- Letter class
- Phoneme class

Vocabularies (Gazetteers)
- Word lists (mostly nouns) of a specific domain (e.g. person names)
- No need for detailed morphological specification
- Simple morphological generation rules
- Lists extracted automatically from the customer's legacy systems

Morphological Dictionary
Morphological analysis of the lemma συγχρονικός ("synchronic", ADJ)

The LexEdit Application

Morphological Entry

The ThesEdit Application
An example of a small dictionary (20 lemmas) of synonyms and antonyms in English

Thesaurus Entry

Grammar rules: The Kanon formalism
Kanon (from the Greek word κανών, "rule") is a feature-based grammar formalism used to describe and recognize specific morphosyntactic patterns in text documents.
Rule definitions use lexical features such as:
- full lemma (the verb "to increase", i.e. increase, increases, increased, increasing)
- word form (increasing)
- morphosyntactic attributes (noun_sing_nom, verb_pass_pres)
- morphological attributes (words ending in -ing, -ful)
- orthographic attributes (words starting with a capital letter)
The formalism is the core component of a number of NLP applications, such as the MNEMOSYNE software, which is used for:
- multi-word term identification (e.g. in the biomedical domain)
- Named Entity Recognition (NER)
- text mining and information extraction
- grammar checking
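The idea of matching a sequence of tokens against per-token feature constraints can be illustrated with a minimal Python sketch. This is not the Kanon syntax or implementation; the `Token` class, the constraint helpers and the attribute names `noun_plur_nom` and `verb_act_past` are hypothetical and only mirror the kinds of features listed above.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Token:
    form: str                           # surface word form
    lemma: str = ""                     # dictionary lemma
    attrs: frozenset = frozenset()      # morphosyntactic attributes

Constraint = Callable[[Token], bool]

# Hypothetical constraint builders, one per feature type named on the slide.
def lemma_is(lemma: str) -> Constraint:
    return lambda t: t.lemma == lemma

def form_is(form: str) -> Constraint:
    return lambda t: t.form == form

def has_attr(attr: str) -> Constraint:
    return lambda t: attr in t.attrs

def ends_with(suffix: str) -> Constraint:
    return lambda t: t.form.endswith(suffix)

def capitalized() -> Constraint:
    return lambda t: t.form[:1].isupper()

def find_matches(tokens: List[Token], pattern: List[List[Constraint]]) -> List[int]:
    """Return the start indices where consecutive tokens satisfy every slot."""
    hits = []
    n = len(pattern)
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if all(all(c(tok) for c in slot) for slot, tok in zip(pattern, window)):
            hits.append(i)
    return hits

tokens = [
    Token("Prices", "price", frozenset({"noun_plur_nom"})),
    Token("increased", "increase", frozenset({"verb_act_past"})),
    Token("in", "in"),
    Token("Athens", "Athens", frozenset({"noun_sing_nom"})),
]

# Any form of the lemma "increase", then "in", then a capitalized singular noun.
pattern = [
    [lemma_is("increase")],
    [form_is("in")],
    [capitalized(), has_attr("noun_sing_nom")],
]

print(find_matches(tokens, pattern))  # → [1]
```

Each slot is a list of constraints that must all hold for one token, so a rule can combine lemma, form, morphological and orthographic tests on the same position.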

Information extraction: The Mnemosyne system (1/2)
- Analyzes large volumes of information
- Input data: different formats (HTML, PDF, TXT) stored on various media (file, database, web page)
- Analyzers:
  - Text analyzers extract information from textual sources and generate appropriate semantic annotations.
  - Specialized analyzers ensure the transfer of extracted information to specific destinations and formats (XML, database, etc.).
  - Fuzzy matching analyzers compare the extracted information (named entities: persons, organisations, addresses, dates, etc.) to the data stored in an existing corporate database system, using both lexicographical and statistical mechanisms.
- Fully customizable process pipeline. Application-specific analyzers can be created if needed.
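The fuzzy matching step can be sketched in Python as follows. The slides do not say which lexicographical or statistical mechanisms Mnemosyne uses, so this sketch substitutes the standard library's `difflib.SequenceMatcher` as a stand-in; the function names, the sample records and the 0.85 threshold are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lowercase and collapse whitespace before comparison."""
    return " ".join(name.lower().split())

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def best_match(extracted: str, records: list, threshold: float = 0.85):
    """Return (record, score) for the closest record, or (None, score) below threshold."""
    scored = [(similarity(extracted, r), r) for r in records]
    score, record = max(scored, default=(0.0, None))
    return (record, score) if score >= threshold else (None, score)

# Hypothetical corporate database entries.
records = ["Kyriakos Mouratidis", "Maria Papadopoulou"]

print(best_match("Kiriakos  Mouratidis", records))
print(best_match("John Smith", records))
```

A production matcher would likely weight name parts separately and handle transliteration variants; the threshold here is only a placeholder to be tuned against real data.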

Information extraction: The Mnemosyne system (2/2)
Different analyzers use various language resources (vocabularies of different languages, spelling and morphological dictionaries, domain-specific dictionaries, thesauri, etc.) together with the Kanon rules to assign semantic annotations to the extracted information.

Step 1: Sentence splitting
Input text → Sentence analysis

Input text:
Μετά την αντικατάσταση αυτή η νέα σύνθεση του Διοικητικού Συμβουλίου του οποίου η θητεία λήγει την 26.9.2010 έχει ως κατωτέρω: 1. Κυριάκος Μουρατίδης του Θεοφίλου, που γεννήθηκε στη Θεσσαλονίκη το έτος 1952, κάτοικος Θεσσαλονίκης οδός Πλατεία Ναυαρίνου 3, ως Πρόεδρος και Διευθύνων Σύμβουλος.
(English gloss: "After this replacement, the new composition of the Board of Directors, whose term expires on 26.9.2010, is as follows: 1. Kyriakos Mouratidis, son of Theofilos, born in Thessaloniki in 1952, resident of Thessaloniki, 3 Navarinou Square, as Chairman and Managing Director.")

Output:
<span offset="889" length="165">
  <contents> 1. Κυριάκος Μουρατίδης του Θεοφίλου, που γεννήθηκε στη Θεσσαλονίκη το έτος 1952, κάτοικος Θεσσαλονίκης οδός Πλατεία Ναυαρίνου 3, ως Πρόεδρος και Διευθύνων Σύμβουλος. </contents>
  <annotations>
    <tag name="sseqno">3</tag>
  </annotations>
</span>
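A sentence splitter that produces spans with an offset, a length and a sequence number, as in the output above, can be sketched in Python. The boundary rule below is a simplifying assumption, not the Mnemosyne algorithm: it breaks after `. ! ? : ;` when followed by whitespace, and skips punctuation that directly follows a digit so that "26.9.2010" and the enumeration marker "1." stay intact. As a consequence, a sentence that ends in a number is not split. The 1-based `sseqno` is also an assumption.

```python
import re
from dataclasses import dataclass

@dataclass
class Span:
    offset: int     # character offset of the sentence in the input text
    length: int     # sentence length in characters
    contents: str   # the sentence text
    sseqno: int     # sentence sequence number (assumed 1-based here)

# Sentence-final punctuation not preceded by a digit and followed by space or end.
_BOUNDARY = re.compile(r"(?<!\d)[.!?:;]+(?=\s|$)")

def split_sentences(text: str) -> list:
    spans = []

    def emit(start: int, end: int) -> None:
        # Skip leading whitespace so offsets point at the first real character.
        while start < end and text[start].isspace():
            start += 1
        if start < end:
            spans.append(Span(start, end - start, text[start:end], len(spans) + 1))

    start = 0
    for match in _BOUNDARY.finditer(text):
        emit(start, match.end())
        start = match.end()
    emit(start, len(text))  # trailing text without final punctuation
    return spans

SAMPLE = (
    "Μετά την αντικατάσταση αυτή η νέα σύνθεση του Διοικητικού Συμβουλίου "
    "του οποίου η θητεία λήγει την 26.9.2010 έχει ως κατωτέρω: "
    "1. Κυριάκος Μουρατίδης του Θεοφίλου, που γεννήθηκε στη Θεσσαλονίκη "
    "το έτος 1952, κάτοικος Θεσσαλονίκης οδός Πλατεία Ναυαρίνου 3, "
    "ως Πρόεδρος και Διευθύνων Σύμβουλος."
)

for span in split_sentences(SAMPLE):
    print(span.sseqno, span.offset, span.length, span.contents[:30])
```

Offsets here are relative to `SAMPLE`, so they differ from the offset 889 on the slide, which refers to a position in the full source document.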

Step 2: Tokenization & Lexical Identification

Input text:
Μετά την αντικατάσταση αυτή η νέα σύνθεση του Διοικητικού Συμβουλίου του οποίου η θητεία λήγει την 26.9.2010 έχει ως κατωτέρω: 1. Κυριάκος Μουρατίδης του Θεοφίλου, που γεννήθηκε στη Θεσσαλονίκη το έτος 1952, κάτοικος Θεσσαλονίκης οδός Πλατεία Ναυαρίνου 3, ως Πρόεδρος και Διευθύνων Σύμβουλος.

Token annotations:
<annotations>
  <tag name="ttext">Κυριάκος</tag>
  <tag name="vocabs">pfname+psname</tag>
  <tag name="lexy">{κυριάκος,masc+n+nom+sing}</tag>
  <tag name="ortho">nrwrd+fcwrd+wthltrs</tag>
</annotations>
<annotations>
  <tag name="ttext">μουρατίδης</tag>
  <tag name="vocabs">psname</tag>
  <tag name="lexy"/>
  <tag name="ortho">nrwrd+fcwrd+wthltrs</tag>
</annotations>
<annotations>
  <tag name="ttext">θεοφίλου</tag>
  <tag name="vocabs">PFName+PSName</tag>
  <tag name="lexy">{Θεόφιλος,GEN+MASC+N+SING}</tag>
  <tag name="ortho">nrwrd+fcwrd+wthltrs</tag>
</annotations>
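The kind of per-token annotation shown above can be approximated with a small Python sketch that tokenizes the text and looks each token up in gazetteers and a morphological lexicon. Everything here is an assumption made for illustration: the toy `VOCABS` and `LEXICON` contents, the reading of `FcWrd` as "first letter capital", `AcWrd` as "all capitals" and `WthLtrs` as "contains letters", and the regex tokenizer. The slides do not define these attributes or describe the actual lookup.

```python
import re

# Toy gazetteers keyed by vocabulary name (entries are illustrative assumptions).
VOCABS = {
    "PFName": {"κυριάκος", "θεοφίλου"},
    "PSName": {"κυριάκος", "μουρατίδης", "θεοφίλου"},
}

# Toy morphological lexicon: lowercased word form -> (lemma, attributes).
LEXICON = {
    "κυριάκος": ("Κυριάκος", {"MASC", "N", "NOM", "SING"}),
    "θεοφίλου": ("Θεόφιλος", {"GEN", "MASC", "N", "SING"}),
}

# A token is a run of word characters or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def ortho_attrs(token: str) -> set:
    """Guess orthographic attributes; the attribute meanings are assumptions."""
    attrs = set()
    if any(ch.isalpha() for ch in token):
        attrs.add("WthLtrs")
    if len(token) > 1 and token.isupper():
        attrs.add("AcWrd")
    elif token[:1].isupper():
        attrs.add("FcWrd")
    return attrs

def annotate(text: str) -> list:
    """Return one annotation dict per token, loosely mirroring the slide's tags."""
    annotations = []
    for match in TOKEN_RE.finditer(text):
        token = match.group()
        key = token.lower()
        annotations.append({
            "ttext": token,
            "vocabs": {name for name, words in VOCABS.items() if key in words},
            "lexy": LEXICON.get(key),   # None when the form is not in the lexicon
            "ortho": ortho_attrs(token),
        })
    return annotations

for annotation in annotate("Κυριάκος Μουρατίδης του Θεοφίλου"):
    print(annotation)
```

As on the slide, a surname such as Μουρατίδης can be found in a vocabulary while having no entry in the morphological lexicon.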

Step 3: Named Entities Recognition

Kanon rule:
[IRULE="PERSON_3_1", TTEXT=TagPerson("PERSON_3_1","PERSON","%n%s%f",$x1,$x2,$x3)] => \
  [TTEXT==$x1, ORTHO->AnyOfOAttrs([FcWrd,AcWrd]), LEXY->HasNoneMAttrs([ART]), VOCABS->AnyAndNoneOfVocabs([PFName],[PExcept])],
  [TTEXT==$x2, ORTHO->AnyOfOAttrs([FcWrd,AcWrd]), VOCABS->NoneOfVocabs([PExcept])],
  [TTEXT=="του"],
  [TTEXT==$x3, ORTHO->AnyOfOAttrs([FcWrd,AcWrd]), VOCABS->NoneOfVocabs([PExcept])] / ;

Input text:
Μετά την αντικατάσταση αυτή η νέα σύνθεση του Διοικητικού Συμβουλίου του οποίου η θητεία λήγει την 26.9.2010 έχει ως κατωτέρω: 1. Κυριάκος Μουρατίδης του Θεοφίλου, που γεννήθηκε στη Θεσσαλονίκη το έτος 1952, κάτοικος Θεσσαλονίκης οδός Πλατεία Ναυαρίνου 3, ως Πρόεδρος και Διευθύνων Σύμβουλος.

Output:
<span offset="892" length="32">
  <contents> Κυριάκος Μουρατίδης του Θεοφίλου </contents>
  <annotations>
    <tag name="ttext">person</tag>
    <tag name="irule">person_3_1</tag>
  </annotations>
</span>
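The logic of the PERSON_3_1 rule can be paraphrased in Python as a four-token window check. This is a hand-written approximation, not a Kanon interpreter: the toy `PFNAME`, `PEXCEPT` and `ARTICLES` sets are assumptions, a single capitalization test stands in for `AnyOfOAttrs([FcWrd,AcWrd])`, the article list stands in for `HasNoneMAttrs([ART])`, and the `"%n%s%f"` format argument is not interpreted.

```python
import re

PFNAME = {"κυριάκος"}       # toy first-name vocabulary (assumption)
PEXCEPT = {"πρόεδρος"}      # toy exception vocabulary (assumption)
ARTICLES = {"ο", "η", "το", "του", "της", "τον", "την", "οι", "τα", "των"}

def tokenize(text: str) -> list:
    """Return (token, offset) pairs for every run of word characters."""
    return [(m.group(), m.start()) for m in re.finditer(r"\w+", text)]

def is_capitalized(word: str) -> bool:
    # Covers both first-capital and all-capital words in this sketch.
    return word[:1].isupper()

def person_3_1(text: str) -> list:
    """Find '<first name> <word> του <word>' sequences and return them as spans."""
    toks = tokenize(text)
    spans = []
    for i in range(len(toks) - 3):
        (x1, start), (x2, _), (lit, _), (x3, last) = toks[i:i + 4]
        k1, k2, k3 = x1.lower(), x2.lower(), x3.lower()
        if (is_capitalized(x1) and k1 not in ARTICLES
                and k1 in PFNAME and k1 not in PEXCEPT
                and is_capitalized(x2) and k2 not in PEXCEPT
                and lit == "του"
                and is_capitalized(x3) and k3 not in PEXCEPT):
            end = last + len(x3)
            spans.append({
                "offset": start,
                "length": end - start,
                "contents": text[start:end],
                "ttext": "PERSON",
                "irule": "PERSON_3_1",
            })
    return spans

sentence = "1. Κυριάκος Μουρατίδης του Θεοφίλου, που γεννήθηκε στη Θεσσαλονίκη το έτος 1952"
for span in person_3_1(sentence):
    print(span)
```

Run on the sentence alone, the match starts 3 characters in, which is consistent with the slide: the sentence span begins at offset 889 and the person span at 892.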

Thank you for your attention!
http://www.neurolingo.com