Software-gestütztes Arbeiten mit Historischen Texten (Software-Supported Work with Historical Texts). Text Mining in den Geisteswissenschaften (Text Mining in the Humanities): Text Mining Applications. Martin-Luther-Universität Halle/S., 2011/03/02. Natural Language Processing Group, Department of Computer Science, University of Leipzig
Agenda
- Applications
  - Text completion: by retrieval, by text mining
  - Graph applications: Text to Social Networks
- Conclusion, incl. preparation for the oral examination
Text Completion
Text source
Text Completion: By (information) retrieval
Text completion today I. Source: papyri.info
Text completion today II. Problem 1: Lots of texts need to be read. Do you read the first hit in the same unbiased way as the last one? Source: papyri.info
Text completion today III. Problem 2: Finding the right signal words. Which words are good indicators for text completion? Source: papyri.info
Text Completion: By (text) mining
Spell checking today
Spell checking: some results (Source: c't, issue 23, 2007)
Spell checking: a brief overview
Simple algorithmic tasks:
- Detection of wrong words
- Correction of wrong words
Approaches: semantics, syntax, string similarity
Some German examples (the errors are the point and are left uncorrected):
- Inter-word error (missing space after "firmiert"): Das am 19. Oktober erscheinende Album firmiertnämlich wieder unter EAV.
- Real-word error ("viel" instead of "fiel"): Er viel auf den Boden und verletzt sich dabei.
- Non-word error ("Brüdele" instead of "Brüderle"): Was solche Berufspolitiker wie Herr Brüdele mit ihrer Praxisferne sagen, das ist doch nur zum Lachen.
- Abbreviations ("max."): Auf zwei Sätze mit einem Zeitlimit von max. 40 Minuten wurde in der Vorrunde am Samstag sowie am Sonntagvormittag gespielt.
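The detection task can be illustrated with a minimal dictionary-based sketch: any token not found in a known-word lexicon is flagged as a correction candidate. The tiny lexicon below is an invented stand-in for a real full-form word list.

```python
# Minimal sketch of dictionary-based non-word error detection:
# every token absent from the lexicon becomes a correction candidate.
# The lexicon is a toy assumption, not the resource used in the lecture.
LEXICON = {"er", "fiel", "auf", "den", "boden", "und", "verletzte", "sich", "dabei"}

def detect_non_words(sentence):
    """Return all tokens of `sentence` that the lexicon does not know."""
    tokens = sentence.lower().rstrip(".").split()
    return [t for t in tokens if t not in LEXICON]

print(detect_non_words("Er fiel auf den Bodden und verletzte sich dabei"))
```

Note that this only catches non-word errors; real-word errors such as "viel" for "fiel" require the semantic or syntactic context models discussed below.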
Pre-processing of text and training of data
Currently processed corpora: TLG, PHI7, PHI7_INS, PHI_DDP, epiduke
Pre-processing:
- All texts are segmented into sentences.
- Meta information such as dating or classification is extracted.
- Tokenisation
Training:
- Features (e.g. signal words) for every word are pre-computed in the background (up to hundreds of millions of data sets).
- Features are classified by different approaches.
Scoring the overall list:
- Main idea/assumption: every known word in a corpus is a potential candidate for text completion. That means: TLG about 1.7M words, epiduke about 550K words.
- Every approach delivers an independent list of candidates, each with a score between 0 and 1.
- The overall candidate list is ranked by the sum of a word's individual scores from the selected algorithms.
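The scoring step above can be sketched in a few lines: each approach returns an independent candidate list with scores in [0, 1], and the overall list is ranked by the per-word sum. The approach names and all numbers below are invented for illustration.

```python
# Sketch of combining independent candidate lists by summing scores.
from collections import defaultdict

def combine(*score_lists):
    """Sum per-candidate scores over all approaches and rank descending."""
    total = defaultdict(float)
    for scores in score_lists:
        for word, score in scores.items():
            total[word] += score
    return sorted(total.items(), key=lambda item: item[1], reverse=True)

semantic  = {"Τροίαν": 0.8, "Τροίας": 0.6}   # co-occurrence profile overlap
syntactic = {"Τροίαν": 0.7, "Τροίη": 0.5}    # n-gram context fit
similar   = {"Τροίας": 0.3}                  # string similarity

ranking = combine(semantic, syntactic, similar)
print(ranking[0][0])  # Τροίαν ranks first
```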
Methodology
Task 1: Detection of critical words
Detection of words by Leiden Conventions (Source: Wikipedia):
- [abc]: letters missing from the original text due to lacuna, but restored by the editor
- <ab>: characters erroneously omitted by the ancient scribe, restored by the editor
- [[abc]]: deleted letters
...
Task 1: Finding candidates (EpiDoc encodings of Leiden sigla)
- V.3 Erased and lost: <del rend="erasure"><gap reason="lost" quantity="3"/></del> : [...]
- V.3 Erased and lost: <del rend="erasure"><gap reason="lost" quantity="5"/></del> : [ c.5 ]
- V.3 Erased and lost, extent unknown: <del rend="erasure"><gap reason="lost" extent="unknown"/></del> : [---]
- VI.1 Text struck over erasure: <add place="overstrike">αβγ</add> : abc
- VI.1 Overstruck text, incomprehensible: <add place="overstrike"><orig>αβγ</orig></add> : ABC
- VI.1 Overstruck text, ambiguous: <add place="overstrike"><unclear>αβγ</unclear></add> : abc
- VI.2 Overstruck text, lost but restored: <add place="overstrike"><supplied reason="lost">αβγ</supplied></add> : [abc]
- VI.3 Overstruck text, completely lost: <add place="overstrike"><gap reason="lost" quantity="3" unit="character"/></add> : [...]
Source: Gabriel Bodard (et al.), (2006-2009), EpiDoc Cheat Sheet: Krummrey-Panciera sigla & EpiDoc tags, version 1085, accessed: 2010-07-04, available: <http://epidoc.svn.sourceforge.net/viewvc/epidoc/trunk/guidelines/msword/cheatsheet.doc>
Task 1: Correcting words in two steps
Finding words that contain an error (detection of candidates):
- Ancient texts: Leiden conventions
- Modern texts: trust in a word's correctness estimated by likelihood ratio
- Redundancy: negative information for misspelled or fragmentary words
Task 2: Correction of critical words
- Semantically best word (co-occurrences)
- Syntactically best word (n-grams)
- Most string-similar word (Levenshtein, FastSS)
- Word length (stoichedon texts)
- Best word by domain classification, e.g.:
  - Mathematics and mechanics
  - Centuries
  - Cities
  - Jurisdiction
  - Slave trading
Task 2: String similarity
Partly damaged words can be reconstructed by computing the most string-similar word.
Algorithms:
- Levenshtein
- FastSS
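A compact reference implementation of the Levenshtein distance named above (a sketch; FastSS, also mentioned on the slide, is a precomputed-index variant for fast lookup and is not shown):

```python
# Levenshtein edit distance via the classic two-row dynamic programme.
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("γίγνηται", "γίνηται"))  # 1: one deleted letter
```

Candidates can then be ranked by ascending distance to the damaged word.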
Task 2: Classification data Live Demonstration.
Finding the best-fitting word
Toy sample: A b C d <E> G h.
Semantic approach:
- Features: sentence-based co-occurrences (function words filtered)
- Toy sample: A, C, G are selected as the semantic profile
- Looking for words that have the best overlap with the semantic profile (all permutations are possible)
- Real-world example: <E> = Τροία: Τροίας, Τρωΐα, Τροίᾳ, Τροίαν, Τροίη, Ἴλιος, Ἴλου
Syntactic approach:
- Method: looking for immediately neighbouring words (bi-gram level)
- Toy sample: d, G are selected as features
Word similarity:
- Method: letter bi-gram overlap per word
- Real-world examples: γίγνηται and γίνηται, or συναγαγόντες and ξυναγαγόντες
Also: Named Entity list and word length
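The letter bi-gram overlap used for word similarity can be sketched as follows; scoring the overlap with the Dice coefficient is an assumption here, since the slide does not name the exact formula:

```python
# Letter bi-gram similarity: decompose each word into character bi-grams
# and score the overlap of the two bi-gram sets with the Dice coefficient.
def bigrams(word):
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice(a, b):
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))

# The pair from the slide: Attic ξυν- vs koine συν- spelling.
print(dice("συναγαγόντες", "ξυναγαγόντες"))
```

Unlike Levenshtein, this needs no alignment and is robust to small transpositions, but it is uninformative for very short words, which have almost no bi-grams.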
Which approach for what?
Semantic approach:
- If a word typically occurs in a static semantic context
- If a word occurs in a context that has a significant amount of content words
Syntactic approach:
- If a word occurs in a fairly static syntactic pattern
- If a word is part of a multi-word expression such as King Alexander the Great
Word similarity:
- If a misspelled or fragmentary word still has enough recovered letters
- Yields no meaningful signal for very short words (e.g. 2 letters); otherwise no restriction
Word length:
- If the word length is known
Named Entity list:
- If it makes sense to restrict the candidate list to person names (e.g. deletion)
An example: What is the ORIGINAL missing word?
The text: Οὐιβίῳ Ἀλεξά[ν]δρῳ τῷ κρατίστῳ ἐπιστρατήγῳ παρὰ Ἀντ[ωνίου Δ]όμνου τοῦ καὶ Φιλαντι[νό]ου Ἀντωνίο[υ Ῥωμανο]ῦ Τραιανείου τοῦ κα[ὶ Στρα]τείου Ἀντινοέως. [οὐκ ἂν] εἰς τοῦτο προήχθ[η]ν, ἐπιτρόπων [μέγιστ]ε, μέ[τριος] καὶ ἀπράγμων ὢν ἄνθρ[ωπος,] εἰ μὴ [ὕβρι]ν τὴν μ[εγ]ίστην ἐπεπόνθ[ειν ὑπὸ] Ὡρίωνο[ς κ]ωμογρα[μ]ματέως Φ[ι]λαδελφεί[ας τῆ]ς Ἡρακλείδου μερίδο[ς] τοῦ Ἀρσινοίτου. [οὗ χά]ριν μην[ύ]ω παρὰ τ[ὰ ἀ]πειρημένα ἑα[υτὸ]ν ἐνσείσαντα εἰς τὴν κωμογραμματείαν [μ]ήτε σιτολογήσαντα μήτε πρ[α]κτορεύσαντα παντελῶς ἄπορον ὄν[τ]α. δι᾽ ἣν αἰτίαν καὶ πρότερον οὐ διέλιπον ἐντυγχάνων καὶ νῦν ἀξιῶ, ἐάν σου τῇ τύχῃ δόξ[ῃ], ἀκοῦσαί μου π[ρ]ὸς αὐτὸν πρὸς τὸ τυχεῖν με τῆς ἀπὸ σοῦ [μι]σοπονήρου ἐγδ[ι]κίας, ἵν᾽ ὦ ὑπὸ [σ]οῦ κατὰ πάντα βεβοηθ(ημένος). διευτύχει. Ἀντώνιος Δόμνος ἐπιδέδωκα.
Solution: several strategies I
Solution: several strategies II
Text to Social Networks
Starting Point: Panionion
Small World
Definition/motivation: What is the average path length in a graph? The average path length is typically not larger than 7.
Simple proof of concept (using XING):
- Every one of my contacts has on average about 73 contacts (1st and 2nd level).
- log_73(6,800,000,000) ≈ 5.28
- Small-world property (Milgram)
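The back-of-the-envelope estimate above can be reproduced directly: with about 73 contacts per person, chains of length log base 73 of the world population suffice to reach anyone.

```python
import math

# Small-world estimate from the slide: ~73 contacts per person,
# world population of 6.8 billion (a ca.-2011 figure).
world_population = 6_800_000_000
contacts_per_person = 73

path_length = math.log(world_population, contacts_per_person)
print(round(path_length, 2))  # 5.28
```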
Methodology
Several graph properties

Columns: (1) Complete graph, (2) w_id>=100 && freq(word)>1, (3) w_id>=300 && freq(word)>1, (4) w_id>=500 && freq(word)>1, (5) Named Entities, (6) Normalised Named Entities, (7) Normalised Text and Named Entities

Graph properties:
- Number of nodes: 538,572 | 388,929 | 363,359 | 353,618 | 1,149 | 4,487 | 2,178
- Number of co-occurrences: 57,762,474 | 34,818,138 | 25,615,956 | 21,004,538 | 15,436 | 126,188 | 152,856
- Number of significant co-occurrences: 30,382,422 | 21,739,476 | 17,687,582 | 15,462,940 | 14,876 | 69,858 | 84,124
- Percentage: 0.53 | 0.62 | 0.69 | 0.74 | 0.96 | 0.55 | 0.55
- Average degree: 56.41 | 55.90 | 48.68 | 43.73 | 12.95 | 15.57 | 38.62

Argumentation trail properties:
- Number of trails: >10^8 | >10^8 | >10^8 | >10^8 | 361,094 | 7,958,240 | 3,087,581
- Average degree: 15.34 | 9.93 | 7.70 | 6.79 | 7.03 | 7.77 | 9.93
- Average degree of internal node (trail length 2): 31.34 | 21.08 | 14.33 | 11.45 | 7.02 | 10.15 | 12.31
- Average degree of internal node (trail length 3): 301.38 | 362.56 | 285.86 | 231.39 | 55.66 | 76.06 | 81.86
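As a sanity check on the figures: the graph-properties "Average degree" row matches the number of significant co-occurrences divided by the number of nodes (this reading of the table is an inference; the lecture does not spell the definition out).

```python
# Complete-graph column: significant co-occurrences per node.
significant_cooccurrences = 30_382_422
nodes = 538_572

avg_degree = significant_cooccurrences / nodes
print(round(avg_degree, 2))  # 56.41, as in the table
```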
Visualisation of two argumentation trails
Association chains: from Plato to Alexander the Great? Social network built by co-occurrence analysis on the TLG corpus