Text Mining in den ehumanities Aspects of eaqua and etraces Technische Universität Darmstadt Darmstadt, 2011/07/11 Natural Language Processing Group Department of Computer Science University of Leipzig
eaqua Working methodologies: Humanists: theory driven Computer scientists: model driven Common interest: TEXT 2
eaqua Working methodologies: Humanists (close reading): qualitative (e. g. working with quotations, dictionaries) Computer scientists (distant reading): quantitative (e. g. co-occurrences) 3
ehumanities as interdisciplinary work Humanities Computer Science 4
Agenda Search & find in well investigated corpora Text correction Text re-use 5
Co-occurrence analysis (text mining) semantic spaces 6
Given scenario: How to identify relevant information? How to make research on classical texts if Thousands of researchers already investigated in several centuries texts such as ancient Greek and the amount of texts is relatively closed Computer scientist's way of thinking (Google like): 7
Semantic space of Herodotus 8
Obvious co-occurrences of Herodotus 9
Rare co-occurrences of Herodotus 10
Viewing different semantic spaces I 11
Viewing different semantic spaces II 12
Viewing different semantic spaces III 13
Viewing different semantic spaces IV 14
Viewing different semantic spaces V 15
A brief summary There are different views on semantic spaces: Full view: Typically unusable Student's view vs. Professor's view: PRO: statistically motivated CON: Different meanings/usage of a word (e.g. by geographical separation) are aggregated Semantic spaces: Spatial and temporal separation of semantic spaces PRO: There is NO need for thresholds. PRO: Natural restriction of the semantic space. Deeper graph mining approaches to support exploration of semantic space and focus on interesting details (candidates) 16
Co-occurrence analysis (text mining) Contrastive semantics 17 c
New requirements & motivation Motivation: Knowledge consists of concepts being associated. Especially, historians are interested in knowledge. However: After 100 of years scholarly research a lot of information/knowledge is well-known. Obvious knowledge is well-known. Critical computing for identifying new knowledge: Latent relations. Manual work: constrastive relations vs. Theory of Selective Perception New Requirements: Find associations being unexpected (also statistically) Opposite of introductory example from psycholinguistics research Scoring of relations not by a statistical measure but something fitting better to historians interests 18
A visualisation of the contrastive semantics Source: F. Baumgardt: Visualisierung von Kookkurrenzgraphen. Bachelorarbeit Abteilung Automatische Sprachverarbeitung, Universität Leipzig, 2010. 19
Some examples Relation of (Ἀτθίδος, νομάδας) engl.: (Attika, nomadic people) Context: Historiography, Autochthony is combined with nomadism Importance: νομάδας is the 111. most significant co-occurrence of Ἀτθίδος (hard to find manually) Reviewed by Charlotte Schubert Relation of (Ὀδόμαντοι, πέος) engl.: (Odomastai (a folk in Thrace), penis) Context: Found in an Ancient comedy (Aristophanes, 5 BC) Relation of (κοπρολόγος, ψάλτρια) - engl.: (shit collector, dancing girl) Context: ἀστυνόμοι engl.: (protecting the city, public festivals) Found in Aristotle (4 BC) 20
Text Completion 21
Text source 22
Text Completion By (information) retrieval 23
Text completion today I Source: papyri.info 24
Text completion today II Problem 1: Lots of texts needs to be read. Do you read the first hit in the same unbiased way as the last one? Source: papyri.info 25
Text completion today III Problem 2: Finding the right signal words. Which words are good indicators for text completion? Source: papyri.info 26
Text Completion By (text) mining 27
Task 1: Detection of critical words Detection of words by Leiden Conventions (Source: Wikipedia): [abc]: letters missing from the original text due to lacuna, but restored by the editor <ab>: characters erroneously omitted by the ancient scribe, restored by the editor [[abc]]: deleted letters... 28
Task 1: Correcting words in two steps Finding words that contain an error (Detection of candidates) Ancient texts: Leiden conventions Modern texts: Trust of correctness by likelihood ratio Redundancy: negative information for misspelled or fragmentary words 29
Task 2: Correction of critical words Semantically best word (co-occurrences) Syntactically best word (N-gram) String similar best word (Levenshtein, FastSS) Word length (Stoichedon texts) Best word by domain classification by Mathematics and mechanics Centuries Cities Jurisdiction Slave trading 30
An example What is the ORIGINAL missing word? 31
The text Οὐιβίῳ Ἀλεξά [ν]δρῳ τῷ κρατίστῳ ἐπιστρατήγῳ παρὰ Ἀντ[ωνίου Δ]όμν ου τοῦ καὶ Φιλαντι[νό]ο υ Ἀντωνίο [υ Ῥωμανο]ῦ Τραιανείου τοῦ κα [ὶ Στρα]τ είου Ἀντινοέως. [οὐκ ἂν] εἰς τοῦτο προήχθ [η]ν, ἐ πιτρόπων [μέγιστ]ε, μέ[τριος] καὶ ἀπρά γ μων ὢν ἄνθρ [ωπος,] ε ἰ μὴ [ὓβρι]ν τὴν μ [εγ]ίστην ἐπ επόνθ[ειν ὑπὸ] Ὡρίωνο[ς κ]ω μογρα[μ]μ ατέως Φ[ι]λαδελφεί [ας τῆ]ς Ἡρακλε ί δου μερίδο [ς] τ οῦ Ἀρ σινοίτου. [οὗ χά]ριν μην[ύ]ω παρὰ τ[ὰ ἀ]πειρημένα ἑα [υτὸ]ν ἐνσείσαν τα εἰς τὴν κωμογραμματείαν [μ]ήτε σιτολογήσαντα μήτε πρ [α]κτορεύσαντ α παντελῶς ἄπορον ὄν[τ]α. δι ἣ ν αἰτίαν κ αὶ πρότερον οὐ διέλιπον ἐντυγχά νων καὶ νῦ ν ἀξιῶ, ἐάν σου τῇ τύχῃ δόξ[ῃ], ἀκοῦσ αί μου π[ρ]ὸς αὐτὸν πρὸς τὸ τυχεῖν με τ ῆ ς ἀπὸ σοῦ [μι]σοπονήρου ἐγδ[ι]κίας, ἵν ὦ ὑπὸ [σ]ο ῦ κατὰ π άντα β ε βοηθ(ημένος). διευτύχει Ἀντώνιος Δόμν ο ς ἐπιδέδωκα. 32
Solution several strategies I 33
Solution several strategies II 34
Text Re-use 35
Bird's eye view Humanities ehumanities Digital Humanities Computer Science 36
Text Re-use A computer scientist's point of view 37
A computer scientist thinks... -... in language models -... in pseudo algorithms -... about complexity reduction of algorithms -... about efficient text re-use models 38
Complexity of algorithms Naive method: comparing every sentence with all other sentences TLG: 5,500,000*5,500,000 = 3.025e13 comparisons Assumption: Comparison rate of 1000 sentences/sec. This process would run about 3.025e10 seconds or more than 959 years. Even if we would compare only sentences with all significant phrases we would need about one year. That's why: Usage of divide & conquer strategies Intelligent pre-clustering of data Using occurrences of Plato, work titles or roles of Plato's works Using significant terms of Plato's work 39
Text Re-use and quotations A Digital Humanity's point of view 40
A digital humanist thinks... -... how to store textual data in order to preserve all necessary information such as described by Leiden Conventions, additional comments and so on - Typically a digital humanists tend to use XML for this task like TEI P5, TEI MSS, or EpiDoc -... how to annotate additional information such as person names (fragmentary authors) or locations -... how to combine textual data and mining data (manual or automatized) such as quotations 41
Text re-use: Where to set the start and end tag of an XML annotation? αἱ μῆτραί τε καὶ ὑστέραι λεγόμεναι... αἱ δὲ ἐν ταῖς γυναιξὶν μῆτραί τε καὶ ὑστέραι λεγόμεναι... αἱ δ' ἐν ταῖς γυναιξὶν αὖ μῆτραί τε καὶ ὑστέραι λεγόμεναι... αἱ δ' ἐν ταῖς γυναιξὶ μῆτραί τε καὶ ὑστέραι λεγόμεναι... 42
How to annotate iff there are overlaps? I - A simple example (Source; http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-quote.html): Lexicography has shown little sign of being affected by the work of followers of J.R. Firth, probably best summarized in his slogan, You shall know a word by the company it keeps (Firth, 1957). - There are two chances of annotating the quotation: - document internal annotations (often used by humanisists) such as Lexicography has shown little sign of being affected by the work of followers of J.R. Firth, probably best summarized in his slogan, <quote>you shall know a word by the company it keeps</quote> <ref>(firth, 1957)</ref> - document external annotations (mainly used by computer scientist for reasons of readability) such as - in GraphML - Canonical Text Services - RDF 43
How to annotate iff there are overlaps? II - Toy sample sentences: A B <quote source=1>c D E <quote source=2>f G H I J</quote> K L</quote> - Intra document annotations does not work (without XML hacks) - Inter document annotations can deal with those kinds of problems such as CTS, GraphML 44
Text Re-use in 6 steps An ehumanities point of view 45
An e-humanist thinks... - about infrastructure (combining several textual resources) - about Humanities Computing (problem focused improvements of algorithms) 46
6 levels of text re-use - Level 1: Pre-processing - Level 2: Feature training - Level 3: Feature selection (Fingerprinting) - Level 4: Linking - Level 5: Scoring - Level 6: Post-processing 47
Level 6: Post processing I Source (Plot): John Lee: A Computational Model of Text Reuse in Ancient Literary Texts, 2009. The same order is more trustworthy than a sole and highly similar link. 48
Level 6: Post processing A text re-use from a document with a high text re-use coverage is more trustworthy than from a less frequently re-used text. A text re-use from a section of a document with a high text re-use temperature is more trustworthy than from a less frequently re-used part of a document. 49
Accessing & visualisation of text re-use A Humanities' point of view 50
Literal citations: portal II 51
Literal citations: portal III 52
Literal citations: Visualisation I Plato: Timaeus 91b7 ff. αἱ δ' ἐν ταῖς γυναιξὶν αὖ μῆτραί τε καὶ ὑστέραι λεγόμεναι διὰ τὰ αὐτὰ ταῦτα ζῷον ἐπιθυμητικὸν ἐνὸν τῆς παιδοποιίας ὅταν ἄκαρπον παρὰ τὴν ὥραν χρόνον πολὺν γίγνηται χαλεπῶς ἀγανακτοῦν ϕέρει καὶ πλανώμενον πάντῃ κατὰ τὸ σῶμα τὰς τοῦ πνεύματος διεξόδους ἀποϕράττον ἀναπνεῖν οὐκ ἐῶν εἰς ἀπορίας τὰς ἐσχάτας ἐμβάλλει καὶ νόσους παντοδαπὰς ἄλλας παρέχει μέχριπερ ἂν ἑκατέρων ἡ ἐπιθυμία καὶ ὁ ἔρως συναγαγόντες οἷον ἀπὸ δένδρων καρπὸν καταδρέψαντες ὡς εἰς ἄρουραν τὴν μήτραν ἀόρατα ὑπὸ σμικρότητος καὶ ἀδιάπλαστα ζῷα κατασπείραντες καὶ πάλιν διακρίναντες μεγάλα ἐντὸς ἐκθρέψωνται καὶ μετὰ τοῦτο εἰς ϕῶς ἀγαγόντες ζῴων ἀποτελέσωσι γένεσιν αἱ δ' ἐν ταῖς γυναιξὶν αὖ μῆτραί τε καὶ ὑστέραι λεγόμεναι διὰ τὰ αὐτὰ ταῦτα ζῷον ἐπιθυμητικὸν ἐνὸν τῆς παιδοποιίας ὅταν ἄκαρπον περὶ τὴν ὥραν χρόνον πολὺν γίγνηται χαλεπῶς ἀγανακτοῦν ϕέρει καὶ πλανώμενον πάντῃ κατὰ τὸ σῶμα τὰς τοῦ πνεύματος διεξόδους ἀποϕράττον ἀναπνεῖν οὐκ ἐῶν εἰς ἀπορίας τὰς ἐσχάτας ἐμβάλλει καὶ νόσους παντοδαπὰς ἄλλας παρέχει μέχριπερ ἂν ἑκατέρων ἡ ἐπιθυμία καὶ ὁ ἔρως ξυναγαγόντες οἷον ἀπὸ δένδρων καρπὸν καταδρέψαντες ὡς εἰς ἄρουραν τὴν μήτραν ἀόρατα ὑπὸ σμικρότητος καὶ ἀδιάπλαστα ζῷα κατασπείραντες καὶ πάλιν διακρίναντες μεγάλα ἐντὸς ἐκθρέψωνται καὶ μετὰ τοῦτο εἰς ϕῶς ἀγαγόντες ζῴων ἀποτελέσωσι γένεσιν περὶ δὲ τῆς μήτρας ὅτι τε ζῷόν ἐστι καὶ αὕτη καὶ τὰ ἀπὸ τοῦ πατρὸς ἐξερχόμενα μόρια ταῦτα πάλιν λέγει Πλάτων αἱ δ' ἐν ταῖς γυναιξὶν αὖ μῆτραί τε καὶ ὑστέραι λεγόμεναι διὰ τὰ αὐτὰ ταῦτα ζῷον ἐπιθυμητικὸν ἐνὸν τῆς παιδοποιίας ὅταν ἄκαρπον παρὰ τὴν ὥραν χρόνον πολὺν γίνηται χαλεπῶς ἀγανακτοῦν ϕέρει καὶ πλανώμενον πάντῃ κατὰ τὸ σῶμα τὰς τοῦ πνεύματος διεξόδους ἀποϕράττον καὶ ἀναπνεῖν οὐκ ἐῶν εἰς ἀπορίας τὰς ἐσχάτας ἐμβάλλει καὶ νόσους παντοδαπὰς ἄλλας παρέχει μέχριπερ ἂν ἑκατέρων ἡ ἐπιθυμία καὶ ὁ ἔρως ξυναγαγόντες οἷον ἀπὸ δένδρων καρπὸν καταδρέψαντες ὡς εἰς ἄρουραν τὴν μήτραν ἀόρατα ὑπὸ σμικρότητος καὶ ἀδιάπλαστα ζῷα κατασπείραντες καὶ πάλιν διακρίναντες μεγάλα ἐντὸς ἐκθρέψωνται καὶ μετὰ τοῦτο εἰς ϕῶς ἀγαγόντες ζῴων ἀποτελέσωσι γένεσιν 53
Literal citations: Visualisation II Plato: Timaeus 91b7 ff. αἱ δ' ἐν ταῖς γυναιξὶν αὖ μῆτραί τε καὶ ὑστέραι λεγόμεναι διὰ τὰ αὐτὰ ταῦτα ζῷον ἐπιθυμητικὸν ἐνὸν τῆς παιδοποιίας ὅταν ἄκαρπον παρὰ τὴν ὥραν χρόνον πολὺν γίγνηται χαλεπῶς ἀγανακτοῦν ϕέρει καὶ πλανώμενον πάντῃ κατὰ τὸ σῶμα τὰς τοῦ πνεύματος διεξόδους ἀποϕράττον ἀναπνεῖν οὐκ ἐῶν εἰς ἀπορίας τὰς ἐσχάτας ἐμβάλλει καὶ νόσους παντοδαπὰς ἄλλας παρέχει μέχριπερ ἂν ἑκατέρων ἡ ἐπιθυμία καὶ ὁ ἔρως συναγαγόντες οἷον ἀπὸ δένδρων καρπὸν καταδρέψαντες ὡς εἰς ἄρουραν τὴν μήτραν ἀόρατα ὑπὸ σμικρότητος καὶ ἀδιάπλαστα ζῷα κατασπείραντες καὶ πάλιν διακρίναντες μεγάλα ἐντὸς ἐκθρέψωνται καὶ μετὰ τοῦτο εἰς ϕῶς ἀγαγόντες ζῴων ἀποτελέσωσι γένεσιν αἱ δ' ἐν ταῖς γυναιξὶν αὖ μῆτραί τε καὶ ὑστέραι λεγόμεναι διὰ τὰ αὐτὰ ταῦτα ζῷον ἐπιθυμητικὸν ἐνὸν τῆς παιδοποιίας ὅταν ἄκαρπον περὶ τὴν ὥραν χρόνον πολὺν γίγνηται χαλεπῶς ἀγανακτοῦν ϕέρει καὶ πλανώμενον πάντῃ κατὰ τὸ σῶμα τὰς τοῦ πνεύματος διεξόδους ἀποϕράττον ἀναπνεῖν οὐκ ἐῶν εἰς ἀπορίας τὰς ἐσχάτας ἐμβάλλει καὶ νόσους παντοδαπὰς ἄλλας παρέχει μέχριπερ ἂν ἑκατέρων ἡ ἐπιθυμία καὶ ὁ ἔρως ξυναγαγόντες οἷον ἀπὸ δένδρων καρπὸν καταδρέψαντες ὡς εἰς ἄρουραν τὴν μήτραν ἀόρατα ὑπὸ σμικρότητος καὶ ἀδιάπλαστα ζῷα κατασπείραντες καὶ πάλιν διακρίναντες μεγάλα ἐντὸς ἐκθρέψωνται καὶ μετὰ τοῦτο εἰς ϕῶς ἀγαγόντες ζῴων ἀποτελέσωσι γένεσιν περὶ δὲ τῆς μήτρας ὅτι τε ζῷόν ἐστι καὶ αὕτη καὶ τὰ ἀπὸ τοῦ πατρὸς ἐξερχόμενα μόρια ταῦτα πάλιν λέγει Πλάτων αἱ δ' ἐν ταῖς γυναιξὶν αὖ μῆτραί τε καὶ ὑστέραι λεγόμεναι διὰ τὰ αὐτὰ ταῦτα ζῷον ἐπιθυμητικὸν ἐνὸν τῆς παιδοποιίας ὅταν ἄκαρπον παρὰ τὴν ὥραν χρόνον πολὺν γίνηται χαλεπῶς ἀγανακτοῦν ϕέρει καὶ πλανώμενον πάντῃ κατὰ τὸ σῶμα τὰς τοῦ πνεύματος διεξόδους ἀποϕράττον καὶ ἀναπνεῖν οὐκ ἐῶν εἰς ἀπορίας τὰς ἐσχάτας ἐμβάλλει καὶ νόσους παντοδαπὰς ἄλλας παρέχει μέχριπερ ἂν ἑκατέρων ἡ ἐπιθυμία καὶ ὁ ἔρως ξυναγαγόντες οἷον ἀπὸ δένδρων καρπὸν καταδρέψαντες ὡς εἰς ἄρουραν τὴν μήτραν ἀόρατα ὑπὸ σμικρότητος καὶ ἀδιάπλαστα ζῷα κατασπείραντες καὶ πάλιν διακρίναντες μεγάλα ἐντὸς ἐκθρέψωνται καὶ μετὰ τοῦτο εἰς ϕῶς ἀγαγόντες ζῴων ἀποτελέσωσι γένεσιν 54
Likelihood vs. Witnesses (Taxi problem) How trustworthy is a set of reuses for a source? 55
Micro view visualisation 56
Macro view visualisation 57
Macro view: Hands-on solutions Middle Platonism Neoplatonism 58
A brief summary Algorithms 59
Classes of mining tools Unsupervised Supervised Bootstrapping Pattern Manual
Bird's eye view Humanities ehumanities Digital Humanities Computer Science 61