Software-gestütztes Arbeiten mit Historischen Texten Text Mining in den Geisteswissenschaften Text Mining Applications

Software-gestütztes Arbeiten mit Historischen Texten: Text Mining in den Geisteswissenschaften (Software-Supported Work with Historical Texts: Text Mining in the Humanities)
Text Mining Applications
Martin-Luther-Universität Halle/S., Halle/S., 2011/03/02
Natural Language Processing Group, Department of Computer Science, University of Leipzig

Agenda
- Applications
  - Text completion
    - By retrieval
    - By text mining
  - Graph applications: Text to Social Networks
- Conclusion, incl. preparation for oral examination

Text Completion

Text source

Text Completion: By (information) retrieval

Text completion today I (Source: papyri.info)

Text completion today II (Source: papyri.info)
Problem 1: Many texts need to be read. Do you read the first hit in the same unbiased way as the last one?

Text completion today III (Source: papyri.info)
Problem 2: Finding the right signal words. Which words are good indicators for text completion?

Text Completion: By (text) mining

Spell checking today

Spell checking: some results (Source: c't, issue 23, 2007)

Spell checking: a brief overview

Simple algorithm tasks:
- Detection of wrong words
- Correction of wrong words

Approaches:
- Semantics
- Syntax
- String similarity

Some German examples:
- Inter-word error (missing space in "firmiertnämlich"): Das am 19. Oktober erscheinende Album firmiertnämlich wieder unter EAV.
- Real-word error ("viel" instead of "fiel"): Er viel auf den Boden und verletzt sich dabei.
- Non-word error ("Brüdele"): Was solche Berufspolitiker wie Herr Brüdele mit ihrer Praxisferne sagen, das ist doch nur zum Lachen.
- Abbreviations ("max."): Auf zwei Sätze mit einem Zeitlimit von max. 40 Minuten wurde in der Vorrunde am Samstag sowie am Sonntagvormittag gespielt.

Pre-processing of text and training of data

Currently processed corpora: TLG, PHI7, PHI7_INS, PHI_DDP, epiduke

Pre-processing:
- All texts are segmented into sentences.
- Meta information such as dating or classification is extracted.
- Tokenisation

Training:
- Features (e.g. "signal words") for every word are pre-computed in the background (up to hundreds of millions of records).
- Features are classified by different approaches.

Scoring the overall list:
- Main idea/assumption: every known word in a corpus is a potential candidate for text completion. That means about 1.7 million words for TLG and about 550,000 words for epiduke.
- Every approach delivers an independent list of candidates, each with a score between 0 and 1.
- The overall candidate list is scored by summing each word's individual scores from the selected algorithms.
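The scoring step can be illustrated with a minimal Python sketch. It assumes each approach returns a mapping from candidate word to a score in [0, 1]; the function name `combine_scores`, the approach names, and all scores are hypothetical and not taken from the slides.

```python
from collections import defaultdict


def combine_scores(candidate_lists):
    """Sum the per-approach scores (each in [0, 1]) for every candidate word.

    candidate_lists maps an approach name to a {word: score} dictionary.
    Returns (word, total) pairs, best candidate first.
    """
    totals = defaultdict(float)
    for approach, scores in candidate_lists.items():
        for word, score in scores.items():
            if not 0.0 <= score <= 1.0:
                raise ValueError(f"{approach}: score for {word!r} is outside [0, 1]")
            totals[word] += score
    # Sort by descending total, then alphabetically to keep ties stable.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical scores from three independent approaches.
approaches = {
    "semantic": {"Τροίας": 0.9, "Τροίαν": 0.6},
    "syntactic": {"Τροίας": 0.5, "Ἴλιος": 0.4},
    "string_similarity": {"Τροίαν": 0.7, "Τροίας": 0.8},
}
ranking = combine_scores(approaches)
```

Summing unweighted scores favours candidates that several approaches agree on; whether the original system weights the approaches is not stated in the slides.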

Methodology

Task 1: Detection of critical words

Detection of words by Leiden Conventions (Source: Wikipedia):
- [abc]: letters missing from the original text due to a lacuna, but restored by the editor
- <ab>: characters erroneously omitted by the ancient scribe, restored by the editor
- [[abc]]: deleted letters
- ...
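As a rough illustration of this detection step, the following Python sketch finds the three bracket types listed above with regular expressions. The pattern names and the function `find_critical_spans` are hypothetical, and the sketch covers only these three sigla, not the full Leiden Conventions.

```python
import re

# One pattern per siglum from the list above (labels are illustrative).
LEIDEN_PATTERNS = [
    ("deleted", re.compile(r"\[\[([^\[\]]*)\]\]")),
    # Single brackets only: must not be part of a [[...]] pair.
    ("restored_lacuna", re.compile(r"(?<!\[)\[([^\[\]]*)\](?!\])")),
    ("scribal_omission", re.compile(r"<([^<>]*)>")),
]


def find_critical_spans(text):
    """Return (kind, content) pairs for every marked span, in text order."""
    hits = []
    for kind, pattern in LEIDEN_PATTERNS:
        for match in pattern.finditer(text):
            hits.append((match.start(), kind, match.group(1)))
    hits.sort()
    return [(kind, content) for _, kind, content in hits]


spans = find_critical_spans("μ[εγ]ίστην <ab> [[abc]]")
```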

Task 1: Finding candidates

Section, description, EpiDoc markup and Leiden rendering:
- V.3 Erased and lost: <del rend="erasure"><gap reason="lost" quantity="3"/></del>; Leiden: [...]
- V.3 Erased and lost: <del rend="erasure"><gap reason="lost" quantity="5"/></del>; Leiden: [ c.5 ]
- V.3 Erased and lost: <del rend="erasure"><gap reason="lost" extent="unknown"/></del>; Leiden: [---]
- VI.1 Text struck over erasure: <add place="overstrike">αβγ</add>; Leiden: abc
- VI.1 Overstruck text, incomprehensible: <add place="overstrike"><orig>αβγ</orig></add>; Leiden: ABC
- VI.1 Overstruck text, ambiguous: <add place="overstrike"><unclear>αβγ</unclear></add>; Leiden: abc
- VI.2 Overstruck text, lost but restored: <add place="overstrike"><supplied reason="lost">αβγ</supplied></add>; Leiden: [abc]
- VI.3 Overstruck text, completely lost: <add place="overstrike"><gap reason="lost" quantity="3" unit="character"/></add>; Leiden: [...]

Source: Gabriel Bodard (et al.), (2006-2009), _EpiDoc Cheat Sheet: Krummrey-Panciera sigla & EpiDoc tags_, version 1085, accessed: 2010-07-04, available: <http://epidoc.svn.sourceforge.net/viewvc/epidoc/trunk/guidelines/msword/cheatsheet.doc>
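A hedged sketch of how such markup could be turned into completion candidates, using Python's standard `xml.etree.ElementTree`. The wrapping `<ab>` element and the function names are assumptions for illustration, and the sample has no XML namespace; real EpiDoc files use the TEI namespace, so tag names would need the `{http://www.tei-c.org/ns/1.0}` prefix.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment built from the markup patterns above (no TEI namespace).
SAMPLE = """<ab>
  <del rend="erasure"><gap reason="lost" quantity="3"/></del>
  <del rend="erasure"><gap reason="lost" extent="unknown"/></del>
  <add place="overstrike"><supplied reason="lost">αβγ</supplied></add>
</ab>"""


def find_gaps(xml_text):
    """List every <gap> with its reason and length (None if the extent is unknown)."""
    root = ET.fromstring(xml_text)
    gaps = []
    for gap in root.iter("gap"):
        quantity = gap.get("quantity")
        gaps.append({
            "reason": gap.get("reason"),
            "length": int(quantity) if quantity is not None else None,
        })
    return gaps


def find_supplied(xml_text):
    """List the text of every <supplied> element (editorial restorations)."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter("supplied")]


gaps = find_gaps(SAMPLE)
restorations = find_supplied(SAMPLE)
```

The known gap length can later serve as the word-length feature, while existing restorations give a reference to compare computed candidates against.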

Task 1: Correcting words in two steps

Finding words that contain an error (detection of candidates):
- Ancient texts: Leiden conventions
- Modern texts: trust in correctness by likelihood ratio
- Redundancy: negative information for misspelled or fragmentary words

Task 2: Correction of critical words
- Semantically best word (co-occurrences)
- Syntactically best word (N-gram)
- String-similar best word (Levenshtein, FastSS)
- Word length (Stoichedon texts)
- Best word by domain classification, e.g. by:
  - Mathematics and mechanics
  - Centuries
  - Cities
  - Jurisdiction
  - Slave trading

Task 2: String similarity

Partly damaged words can be reconstructed by computing the most string-similar word.

Algorithms:
- Levenshtein
- FastSS
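To make the string-similarity step concrete, here is a minimal Python sketch of the Levenshtein edit distance with a simple nearest-word lookup. FastSS, the second algorithm named above, is an indexing technique for speeding up such lookups and is not implemented here. The function names and the two-word vocabulary are illustrative assumptions.

```python
import unicodedata


def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    # Normalise so that accented Greek letters compare as single characters.
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def most_similar(fragment, vocabulary):
    """Return the vocabulary word with the smallest edit distance to fragment."""
    return min(vocabulary, key=lambda word: (levenshtein(fragment, word), word))


best = most_similar("ξυναγαγόντες", ["συναγαγόντες", "γίνηται"])
```

A linear scan like this is only practical for small vocabularies; for corpora with over a million word forms an index is needed.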

Task 2: Classification data

Live demonstration.

Finding the best-fitting word

Toy sample: A b C d <E> G h.

Semantic approach:
- Features: sentence-based co-occurrences (function words filtered)
- Toy sample: A, C, G are selected as the semantic profile
- Looking for words that have the best overlap with the semantic profile (all permutations are possible)
- Real-world example: <E>=Τροία: Τροίας, Τρωΐα, Τροίᾳ, Τροίαν, Τροίη, Ἴλιος, Ἴλιος, Ἴλου

Syntactical approach:
- Method: looking for immediately neighbouring words (bi-gram level)
- Toy sample: d, G are selected as features

Word similarity:
- Method: letter bi-gram overlap (word level)
- Real-world examples: γίγνηται and γίνηται, or συναγαγόντες and ξυναγαγόντες

Named Entity list and word length
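The semantic and word-similarity ideas can be sketched in a few lines of Python. Two assumptions are made here: the semantic overlap is counted as the number of shared profile words, and the letter bi-gram overlap is measured with the Dice coefficient. The slides do not say which overlap measure is actually used, and the co-occurrence data below is a toy invention.

```python
import unicodedata


def letter_bigrams(word):
    """Return the set of adjacent letter pairs of a word (NFC-normalised)."""
    word = unicodedata.normalize("NFC", word)
    return {word[i:i + 2] for i in range(len(word) - 1)}


def bigram_overlap(a, b):
    """Dice coefficient over letter bi-grams (an assumed overlap measure)."""
    bigrams_a, bigrams_b = letter_bigrams(a), letter_bigrams(b)
    if not bigrams_a or not bigrams_b:
        return 1.0 if a == b else 0.0
    return 2 * len(bigrams_a & bigrams_b) / (len(bigrams_a) + len(bigrams_b))


def rank_by_profile(profile, cooccurrences):
    """Rank candidate words by how many profile words they co-occur with."""
    ranked = [(word, len(profile & partners)) for word, partners in cooccurrences.items()]
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Toy sample from the slide: A, C, G form the semantic profile of the gap <E>.
profile = {"A", "C", "G"}
# Hypothetical co-occurrence sets for three candidate words.
cooccurrences = {"E": {"A", "C", "G"}, "F": {"A", "X"}, "H": {"Y"}}
ranking = rank_by_profile(profile, cooccurrences)

similarity = bigram_overlap("γίγνηται", "γίνηται")
```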

Which approach for what?

Semantic approach:
- If a word typically occurs in a static semantic context
- If a word occurs in a context that has a significant number of content words

Syntactical approach:
- If a word occurs in a fairly static syntactic pattern
- If a word is part of a multi-word expression such as "King Alexander the Great"

Word similarity:
- If a misspelled or fragmentary word still has enough recovered letters
- Not meaningful for words with e.g. only 2 letters
- NO RESTRICTION

Word length:
- If the word length is known

Named Entity list:
- If it makes sense to restrict the candidate list to person names (e.g. deletion)

An example: What is the ORIGINAL missing word?

The text

Οὐιβίῳ Ἀλεξά [ν]δρῳ τῷ κρατίστῳ ἐπιστρατήγῳ παρὰ Ἀντ[ωνίου Δ]όμν ου τοῦ καὶ Φιλαντι[νό]ο υ Ἀντωνίο [υ Ῥωμανο]ῦ Τραιανείου τοῦ κα [ὶ Στρα]τ είου Ἀντινοέως. [οὐκ ἂν] εἰς τοῦτο προήχθ [η]ν, ἐ πιτρόπων [μέγιστ]ε, μέ[τριος] καὶ ἀπρά γ μων ὢν ἄνθρ [ωπος,] ε ἰ μὴ [ὓβρι]ν τὴν μ [εγ]ίστην ἐπ επόνθ[ειν ὑπὸ] Ὡρίωνο[ς κ]ω μογρα[μ]μ ατέως Φ[ι]λαδελφεί [ας τῆ]ς Ἡρακλε ί δου μερίδο [ς] τ οῦ Ἀρ σινοίτου. [οὗ χά]ριν μην[ύ]ω παρὰ τ[ὰ ἀ]πειρημένα ἑα [υτὸ]ν ἐνσείσαν τα εἰς τὴν κωμογραμματείαν [μ]ήτε σιτολογήσαντα μήτε πρ [α]κτορεύσαντ α παντελῶς ἄπορον ὄν[τ]α. δι ἣ ν αἰτίαν κ αὶ πρότερον οὐ διέλιπον ἐντυγχά νων καὶ νῦ ν ἀξιῶ, ἐάν σου τῇ τύχῃ δόξ[ῃ], ἀκοῦσ αί μου π[ρ]ὸς αὐτὸν πρὸς τὸ τυχεῖν με τ ῆ ς ἀπὸ σοῦ [μι]σοπονήρου ἐγδ[ι]κίας, ἵν ὦ ὑπὸ [σ]ο ῦ κατὰ π άντα β ε βοηθ(ημένος). διευτύχει Ἀντώνιος Δόμν ο ς ἐπιδέδωκα.

Solution: several strategies I

Solution: several strategies II

Text to Social Networks

Starting Point: Panionion

Small World

Definition/Motivation: What is the average path length in a graph? The average path length is typically not larger than 7.

Simple proof of concept (using XING):
- On average, each of my contacts has about 73 contacts (1st and 2nd level).
- $\log_{73}(6{,}800{,}000{,}000) \approx 5.28$, that is, with 73 contacts per person, between 5 and 6 steps suffice to reach 6.8 billion people.
- Small-world property (Milgram)
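The estimate can be checked with a short calculation. The sketch assumes the idealised model implied by the slide: every person has the same number of contacts and contact circles do not overlap, which overstates reach in a real network.

```python
import math

AVERAGE_CONTACTS = 73            # average contacts per person (from the slide)
POPULATION = 6_800_000_000       # world population figure used on the slide

# Number of steps k such that AVERAGE_CONTACTS ** k equals POPULATION.
steps = math.log(POPULATION, AVERAGE_CONTACTS)
rounded_steps = round(steps, 2)
```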

Methodology

Several graph properties

| Property | Complete graph | w_id>=100 && freq(word)>1 | w_id>=300 && freq(word)>1 | w_id>=500 && freq(word)>1 | Named Entities | Normalised Named Entities | Normalised Text and Named Entities |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Graph properties | | | | | | | |
| Number of nodes | 538,572 | 388,929 | 363,359 | 353,618 | 1,149 | 4,487 | 2,178 |
| Number of co-occurrences | 57,762,474 | 34,818,138 | 25,615,956 | 21,004,538 | 15,436 | 126,188 | 152,856 |
| Number of significant co-occurrences | 30,382,422 | 21,739,476 | 17,687,582 | 15,462,940 | 14,876 | 69,858 | 84,124 |
| Percentage | 0.53 | 0.62 | 0.69 | 0.74 | 0.96 | 0.55 | 0.55 |
| Average degree | 56.41 | 55.90 | 48.68 | 43.73 | 12.95 | 15.57 | 38.62 |
| Argumentation trail properties | | | | | | | |
| Number of trails | > 10^8 | > 10^8 | > 10^8 | > 10^8 | 361,094 | 7,958,240 | 3,087,581 |
| Average degree | 15.34 | 9.93 | 7.70 | 6.79 | 7.03 | 7.77 | 9.93 |
| Average degree of internal node (trail length 2) | 31.34 | 21.08 | 14.33 | 11.45 | 7.02 | 10.15 | 12.31 |
| Average degree of internal node (trail length 3) | 301.38 | 362.56 | 285.86 | 231.39 | 55.66 | 76.06 | 81.86 |
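The two derived graph properties can be recomputed from the counts. The formulas below are an inference from the figures, not something the slides state: "Percentage" matches significant co-occurrences divided by all co-occurrences, and the graph-level "Average degree" matches significant co-occurrences divided by the number of nodes.

```python
# Counts per graph variant, in the column order of the slide.
nodes = [538_572, 388_929, 363_359, 353_618, 1_149, 4_487, 2_178]
cooccurrences = [57_762_474, 34_818_138, 25_615_956, 21_004_538,
                 15_436, 126_188, 152_856]
significant = [30_382_422, 21_739_476, 17_687_582, 15_462_940,
               14_876, 69_858, 84_124]

# Share of co-occurrences that are significant (inferred formula).
percentage = [round(s / c, 2) for s, c in zip(significant, cooccurrences)]

# Significant co-occurrences per node (inferred formula).
average_degree = [round(s / n, 2) for s, n in zip(significant, nodes)]
```

The named-entity graph stands out: 96% of its co-occurrences are significant, compared with 53% in the complete graph.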

Visualisation of two argumentation trails

Association chains: From Plato to Alexander the Great?
Social network built by co-occurrence analysis on the TLG corpus
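How an association chain might be derived from co-occurrences can be sketched as follows. This is a simplified illustration: it links two names whenever they appear in the same sentence, without the significance filtering described earlier, and finds the shortest chain by breadth-first search. The function names and the three toy "sentences" are hypothetical and not taken from the TLG corpus.

```python
from collections import defaultdict, deque
from itertools import combinations


def build_cooccurrence_graph(sentences):
    """Link every pair of names that occur together in one sentence."""
    graph = defaultdict(set)
    for names in sentences:
        for a, b in combinations(sorted(set(names)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def association_chain(graph, start, goal):
    """Shortest chain of co-occurring names from start to goal (BFS), or None."""
    if start == goal:
        return [start]
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        for neighbour in sorted(graph.get(path[-1], ())):
            if neighbour in seen:
                continue
            if neighbour == goal:
                return path + [neighbour]
            seen.add(neighbour)
            queue.append(path + [neighbour])
    return None


# Toy input: the names found in each of three sentences.
sentences = [["Plato", "Aristotle"], ["Aristotle", "Alexander"], ["Alexander", "Philip"]]
graph = build_cooccurrence_graph(sentences)
chain = association_chain(graph, "Plato", "Alexander")
```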