Software-gestütztes Arbeiten mit Historischen Texten Text Mining in den Geisteswissenschaften. Text Re-use, Knowledge Transfer, and Applications

Software-gestütztes Arbeiten mit Historischen Texten: Text Mining in den Geisteswissenschaften
(Software-Supported Work with Historical Texts: Text Mining in the Humanities)
Text Re-use, Knowledge Transfer, and Applications
Martin-Luther-Universität Halle/S., 2011/01/27
Natural Language Processing Group, Department of Computer Science, University of Leipzig

Agenda
- Text re-use from different points of view: of a computer scientist, of a digital humanist, of an e-humanist, and of a humanist
- 6 levels of text re-use or quotation mining
- Some bird's eye view results and obvious problems
- Fragmentary authors

Introduction

More General Knowledge Transfer (Biology)
- Similarity of branches of the same knowledge; knowledge changes over time

More General Knowledge Transfer (Archaeology)
- Functionality of objects vs. object of interest
- Critical amount of reuse

More General Knowledge Transfer (Further aspects)
- Software: (PRO) modular programming, (CON) legal right problems
- Music: (CON) plagiarism
- Images: (PRO) object recognition
Properties of text reuse:
- Efficiency, cost reduction
- Plagiarism (least effort) vs. knowledge transfer (modern: research, ancient: philosophical debate)
- Level of text modification
Properties of ancient text reuse:
- Fragmentary texts
- Fragmentary authors
- E.g. in the Middle Ages: monks copied texts (sometimes with errors)
- Ancient authors reused text passages, but with modifications: language evolution, dialects

Definitions/Terminology
- Citation/quotation
  - Modern: (SOURCE, <TEXT>)
  - Ancient: mostly (, <TEXT>)
- Plagiarism
  - Modern: mostly (, <TEXT>)
- Text reuse
  - Legal right aspects are ignored; for this reason: (<TEXT>)
  - Literal citations
  - Parallel texts
  - Paraphrases
- Text reuse graph G=(V,E): V is the set of sentences, E the set of links between elements of V (hyper-textual structure in a digital library)

Language vs. Communication

Different research interests
- Humanities
- eHumanities
- Digital Humanities
- Computer Science

Text Re-use
A computer scientist's point of view

A computer scientist thinks...
- ... in language models
- ... in pseudo algorithms
- ... about complexity reduction of algorithms
- ... about efficient text re-use models

Complexity of algorithms
- Naive method: comparing every sentence with all other sentences
- TLG: 5,500,000 * 5,500,000 = 3.025e13 comparisons
- Assumption: a rate of 1,000 sentence comparisons per second
- This process would run for about 3.025e10 seconds, or more than 959 years.
- Even if we compared only sentences with all significant phrases, we would still need about one year.
That is why:
- Use of divide & conquer strategies
- Intelligent pre-clustering of data
  - Using occurrences of Plato, work titles, or roles of Plato's works
  - Using significant terms of Plato's work
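The runtime estimate can be checked with a few lines of arithmetic. This is a minimal sketch; the constant and function names are illustrative, and a 365-day year is assumed.

```python
# Figures taken from the slide; names are illustrative only.
SENTENCES = 5_500_000                    # sentences in the TLG
RATE_PER_SECOND = 1_000                  # assumed comparisons per second
SECONDS_PER_YEAR = 365 * 24 * 60 * 60    # assumption: 365-day year


def naive_comparisons(n_sentences: int) -> int:
    """Number of comparisons when every sentence meets every sentence."""
    return n_sentences * n_sentences


def runtime_years(comparisons: int, rate_per_second: int) -> float:
    """Convert a comparison count into years of runtime."""
    return comparisons / rate_per_second / SECONDS_PER_YEAR


comparisons = naive_comparisons(SENTENCES)
years = runtime_years(comparisons, RATE_PER_SECOND)
print(f"{comparisons:.4g} comparisons, roughly {years:.0f} years")
```

The quadratic term dominates: halving the number of candidate sentences through pre-clustering cuts the runtime to a quarter.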

Pseudo algorithm for Text Reuse

V = segment_corpus(c) with v1, v2, ..., vn ∈ V, v1 ∪ ... ∪ vn = c and vi ∩ vj = ∅ for i ≠ j

Training:
  for each vi ∈ V
    Fi = train_features(vi);

Linking:
  for each vi ∈ V
    for each fk ∈ Fi
      ei = (vi,vj) ∈ E = select all vj containing feature fk

Scoring:
  for each ei ∈ E
    si = scoring(ei = (vi,vj) ∈ E; Fi; Fj);
    if (si < threshold) { E = E \ {ei} }
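The three phases of the pseudo algorithm can be sketched in Python as follows. This is only an illustration under stated assumptions: re-use units are sentences split on full stops, features are word bigrams, and scoring is the Jaccard overlap of the feature sets. The slide fixes none of these choices, and the function bodies are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def segment_corpus(corpus: str) -> list[str]:
    # Assumption: one re-use unit per sentence, split on full stops.
    return [s.strip() for s in corpus.split(".") if s.strip()]


def train_features(unit: str) -> set[tuple[str, str]]:
    # Assumption: features are the word bigrams of the lower-cased unit.
    words = unit.lower().split()
    return set(zip(words, words[1:]))


def scoring(f_i: set, f_j: set) -> float:
    # Assumption: Jaccard overlap of the two feature sets.
    union = f_i | f_j
    return len(f_i & f_j) / len(union) if union else 0.0


def text_reuse(corpus: str, threshold: float) -> dict[tuple[int, int], float]:
    units = segment_corpus(corpus)
    # Training: one feature set per re-use unit.
    features = [train_features(u) for u in units]
    # Linking: an inverted index avoids comparing all pairs of units.
    index = defaultdict(set)
    for i, feats in enumerate(features):
        for f in feats:
            index[f].add(i)
    candidates = set()
    for ids in index.values():
        candidates.update(combinations(sorted(ids), 2))
    # Scoring: drop every link whose score is below the threshold.
    scores = {(i, j): scoring(features[i], features[j]) for i, j in candidates}
    return {edge: s for edge, s in scores.items() if s >= threshold}
```

Only units that share at least one feature are ever scored, which is the point of the linking phase.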

Types of Completeness of Text Re-use algorithms
- Extraction of fragmentary authors
- String approaches: GST, letter n-grams
- Syntactic approaches: Longest Common Consecutive Words, word n-grams, distance based co-occurrences
- Semantic approaches: semantic clustering, semantic graph based approach(es), latent relations, radius retrieval
- More complex approaches: DCT

Literal quotations: Longest Common Consecutive Words
Example: αἱ δ' ἐν ταῖς γυναιξὶν αὖ μῆτραί τε καὶ ὑστέραι λεγόμεναι διὰ τὰ αὐτὰ ταῦτα ζῷον ἐπιθυμητικὸν...
1. Iterative training of possible n-gram candidates (Training):
   αἱ δ'
   αἱ δ' ἐν
   αἱ δ' ἐν ταῖς
   αἱ δ' ἐν ταῖς γυναιξὶν
   ...
2. Removing all n-grams with a frequency below 2 and all n-grams with fewer than 5 words (Selection)
3. Removing prefixes and suffixes
4. Creating an inverted list for the n-grams detected above (mapping n-gram to sentence)
5. Collecting all sentences that share the same n-grams (citation candidates)
6. Computing similarity by word overlap (Dice)
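Steps 2 and 4 to 6 can be illustrated with a simplified sketch. It is not the full method: it uses fixed-length word 5-grams instead of iteratively grown n-grams, skips the prefix/suffix removal of step 3, and counts frequency as the number of sentences containing an n-gram. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def word_ngrams(words: list[str], n: int) -> list[tuple[str, ...]]:
    """All consecutive word n-grams of a tokenised sentence."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def dice(a: set, b: set) -> float:
    """Dice coefficient of two word sets."""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0


def citation_candidates(sentences: list[str], n: int = 5, min_freq: int = 2):
    tokenised = [s.split() for s in sentences]
    # Inverted list: n-gram -> ids of the sentences that contain it.
    inverted = defaultdict(set)
    for sid, words in enumerate(tokenised):
        for gram in word_ngrams(words, n):
            inverted[gram].add(sid)
    # Selection: keep n-grams that occur in at least min_freq sentences
    # and pair up the sentences that share them.
    pairs = set()
    for sids in inverted.values():
        if len(sids) >= min_freq:
            pairs.update(combinations(sorted(sids), 2))
    # Scoring: word overlap (Dice) between the two candidate sentences.
    return {(i, j): dice(set(tokenised[i]), set(tokenised[j])) for i, j in pairs}
```

For instance, `citation_candidates(["a b c d e f", "x a b c d e", "p q r s t u"])` links only the first two sentences, because they share the 5-gram `a b c d e`.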

Semantic quotations: a graph based approach I
Basic assumption: two semantically similar words have the same co-occurrences.
Example:
- laptop: {mouse, battery, display, portable}
- computer: {mouse, keyboard, display, mainframe}
Algorithm (simplified):
- Compute co-occurrences for all words
- Compare the co-occurrence profiles of two words:
  - Compute the intersection I of both co-occurrence profiles
  - Compute the union U of both co-occurrence profiles
  - Compute the ratio sim = card(I)/card(U)
Example (continued):
- I = {mouse, display}
- U = {mouse, battery, display, portable, keyboard, mainframe}
- sim = card(I)/card(U) = 2/6 = 1/3
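The ratio card(I)/card(U) can be reproduced with Python sets. The sketch below assumes sentence-level co-occurrence (two words co-occur if they appear in the same sentence) and no significance filtering; both function names are illustrative.

```python
from collections import defaultdict


def cooccurrence_profiles(sentences: list[str]) -> dict[str, set[str]]:
    """Map every word to the set of words it shares a sentence with."""
    profiles = defaultdict(set)
    for sentence in sentences:
        words = set(sentence.lower().split())
        for word in words:
            profiles[word] |= words - {word}
    return profiles


def profile_similarity(profile_a: set[str], profile_b: set[str]) -> float:
    """sim = card(I) / card(U) for two co-occurrence profiles."""
    union = profile_a | profile_b
    if not union:
        return 0.0
    return len(profile_a & profile_b) / len(union)


# The example profiles from the slide.
laptop = {"mouse", "battery", "display", "portable"}
computer = {"mouse", "keyboard", "display", "mainframe"}
sim = profile_similarity(laptop, computer)
print(sim)
```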

Semantic quotations: a graph based approach II

Semantic quotations: a graph based approach III
Toy sample corpus:
1. Copy from one, it's plagiarism; copy from two, it's research.
2. Plagiarism is not the same as copyright infringement.
3. Plagiarism is to copy from one but to copy from two is research.

Step 1: Co-occurrence analysis   Step 2: Graph based similarity   Step 3: Intersection
(copy, from, 1.0)                (copy, from, 1.0)                (copy, from)
(copy, one, 1.0)                 (copy, one, 1.0)                 (copy, one)
(copy, it's, 1.0)                (copy, it's, 1.0)                (copy, it's)
(copy, plagiarism, 0.8)          (copy, plagiarism, 0.8)          (copy, plagiarism)
(copy, research, 1.0)            (copy, research, 1.0)            (copy, research)
...                              ...                              ...
(plagiarism, from, 0.8)          (plagiarism, from, 0.8)          (plagiarism, from)
(plagiarism, one, 0.8)           (plagiarism, one, 0.8)           (plagiarism, one)
(plagiarism, it's, 0.8)          (plagiarism, it's, 0.8)          (plagiarism, it's)
(plagiarism, copy, 0.8)          (plagiarism, copy, 0.8)          (plagiarism, copy)
(plagiarism, research, 0.8)      (plagiarism, research, 0.8)      (plagiarism, research)
...                              ...                              ...

Listed in only one of the score columns: (copy, copyright, 0.1)

Step 4: Selection
(plagiarism, copyright)
(plagiarism, infringement)

Shannon's Noisy Channel Theorem vs. Kolmogorov Complexity
[Diagram: Shannon and Kolmogorov views of Text A, Text A', and a minimal program P]
Research question: What dissimilarity of A and A' still allows them to become LINKED?
If LINKED: Which minimal programs make no sense?
Halstead metric and McCabe metric

Text Re-use and quotations
A digital humanist's point of view

A digital humanist thinks...
- ... how to store textual data in order to preserve all necessary information, such as that described by the Leiden Conventions, additional comments, and so on
  - Typically, digital humanists tend to use XML for this task, e.g. TEI P5, TEI MSS, or EpiDoc
- ... how to annotate additional information such as person names (fragmentary authors) or locations
- ... how to combine textual data and mining data (manual or automated) such as quotations

Text re-use: Where to set the start and end tag of an XML annotation?
- αἱ μῆτραί τε καὶ ὑστέραι λεγόμεναι...
- αἱ δὲ ἐν ταῖς γυναιξὶν μῆτραί τε καὶ ὑστέραι λεγόμεναι...
- αἱ δ' ἐν ταῖς γυναιξὶν αὖ μῆτραί τε καὶ ὑστέραι λεγόμεναι...
- αἱ δ' ἐν ταῖς γυναιξὶ μῆτραί τε καὶ ὑστέραι λεγόμεναι...

How to annotate if there are overlaps? I
- A simple example (source: http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-quote.html):
  Lexicography has shown little sign of being affected by the work of followers of J.R. Firth, probably best summarized in his slogan, You shall know a word by the company it keeps (Firth, 1957).
- There are two ways of annotating the quotation:
  - document-internal annotations (often used by humanists) such as
    Lexicography has shown little sign of being affected by the work of followers of J.R. Firth, probably best summarized in his slogan, <quote>You shall know a word by the company it keeps</quote> <ref>(Firth, 1957)</ref>
  - document-external annotations (mainly used by computer scientists for reasons of readability) such as
    - GraphML
    - Canonical Text Services
    - RDF

How to annotate if there are overlaps? II
- Toy sample sentence:
  A B <quote source=1>C D E <quote source=2>F G H I J</quote> K L</quote>
- Intra-document annotations do not work (without XML hacks)
- Inter-document annotations such as CTS or GraphML can deal with those kinds of problems
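The idea behind document-external annotation can be shown without any XML: each quotation is stored as a span of token positions next to the untouched text, so spans may nest or overlap freely. The sketch below is a generic stand-off representation, not the CTS or GraphML data model; the `Quote` class and its fields are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quote:
    source: int  # identifier of the quoted source
    start: int   # index of the first quoted token (inclusive)
    end: int     # index after the last quoted token (exclusive)


# The toy sentence from the slide, kept free of inline markup.
tokens = "A B C D E F G H I J K L".split()

# Stand-off annotations: source 1 covers C..L, source 2 covers F..J.
quotes = [Quote(source=1, start=2, end=12), Quote(source=2, start=5, end=10)]


def quoted_text(quote: Quote) -> str:
    """Recover the quoted tokens from the unmodified text."""
    return " ".join(tokens[quote.start:quote.end])


def overlaps(a: Quote, b: Quote) -> bool:
    """True if two annotated spans share at least one token."""
    return a.start < b.end and b.start < a.end


for quote in quotes:
    print(quote.source, quoted_text(quote))
```

A partially overlapping span such as `Quote(3, 0, 4)` needs no special handling, which is exactly what inline tags cannot express without workarounds.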

Text Re-use in 6 steps
An eHumanities point of view

An e-humanist thinks...
- about infrastructure (combining several textual resources)
- about Humanities Computing (problem focused improvements of algorithms)

6 levels of text re-use
- Level 1: Pre-processing
- Level 2: Feature training
- Level 3: Feature selection (Fingerprinting)
- Level 4: Linking
- Level 5: Scoring
- Level 6: Post-processing

Level 1: Pre-processing
- Capitalisation (e.g. all letters to lowercase)
- Normalisation (e.g. removing all diacritics)
- Lemmatisation (e.g. replacing inflected words by their base form)
- Synonym replacement (e.g. replacing a word by its most frequent synonym)
- String similarity (words that are written similarly)
Result: cleaned text
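A minimal pre-processing sketch using only the Python standard library is given below. Diacritics are removed by Unicode decomposition; lemmatisation and synonym replacement are reduced to dictionary lookups, with the `lemmas` and `synonyms` tables as hypothetical placeholders for real resources. String similarity is left out.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFD) and drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preprocess(text: str, lemmas=None, synonyms=None) -> str:
    """Lowercase, strip diacritics, then apply optional lookup tables.

    lemmas and synonyms are hypothetical dictionaries whose keys must
    already be lowercased and free of diacritics.
    """
    lemmas = lemmas or {}
    synonyms = synonyms or {}
    cleaned = []
    for word in strip_diacritics(text.lower()).split():
        word = lemmas.get(word, word)        # lemmatisation
        word = synonyms.get(word, word)      # synonym replacement
        cleaned.append(word)
    return " ".join(cleaned)


print(preprocess("Πλάτων μῆτραί"))
```

Every normalisation step merges word forms that were distinct before, so each one trades precision for recall in the later linking step.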

Level 2: Training
A general overview
- Training means identifying the features of a re-use unit, on the
  - letter level (1st generation of text re-use models, starting in the 1970s), such as letter n-gram features
  - lexical level (2nd generation, in the 1990s):
    - syntactic, such as word n-gram features
    - semantic, such as word features

Level 2: Syntactical training details
- Syntactical feature
  - N-gram feature
    - Overlapping: Shingling, Longest Common Consecutive Words
    - Non overlapping: Local hash breaking, Global hash breaking
- Properties: overlapping vs. non overlapping features; constant vs. variable n-gram length

Level 3: Selection of training data
Building a re-use fingerprint
- Local selection strategies work with the knowledge within a re-use unit, such as
  - Local 0 mod p (e.g. position of a word within a re-use unit)
  - Random selection
  - Winnowing
- Global selection strategies work with global knowledge, such as a word list of the entire corpus:
  - Global 0 mod p (e.g. rank of a word, cf. Zipf's law)
  - Selection of special word classes such as nouns
  - Inverse Document Frequency (IDF) score
  - Minimum feature frequency selection
  - Maximum feature frequency selection
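Three of the listed selection strategies can be sketched as follows. The assumptions are: local 0 mod p keeps features by their position, global 0 mod p keeps features by their corpus frequency rank (the `rank` table is hypothetical), and winnowing keeps the feature with the smallest hash in every sliding window, taking the rightmost one on ties. CRC32 stands in for whatever hash function a real system would use.

```python
import zlib


def feature_hash(feature: str) -> int:
    # Assumption: CRC32 as a stable stand-in hash function.
    return zlib.crc32(feature.encode("utf-8"))


def local_0_mod_p(features: list[str], p: int) -> list[str]:
    """Keep every feature whose position in the re-use unit is 0 mod p."""
    return [f for pos, f in enumerate(features) if pos % p == 0]


def global_0_mod_p(features: list[str], rank: dict[str, int], p: int) -> list[str]:
    """Keep every feature whose corpus frequency rank is 0 mod p."""
    return [f for f in features if f in rank and rank[f] % p == 0]


def winnow(features: list[str], window: int) -> list[str]:
    """Keep the minimum-hash feature of every window of consecutive features."""
    hashes = [feature_hash(f) for f in features]
    selected = set()
    for start in range(max(1, len(hashes) - window + 1)):
        chunk = hashes[start:start + window]
        if not chunk:
            break
        smallest = min(chunk)
        # On ties, take the rightmost minimum inside the window.
        offset = len(chunk) - 1 - chunk[::-1].index(smallest)
        selected.add(start + offset)
    return [features[i] for i in sorted(selected)]
```

Winnowing guarantees that every window of `window` consecutive features contributes at least one feature to the fingerprint, which the purely positional 0 mod p selection does not guarantee after insertions or deletions.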

Level 4: Linking types
Comparing re-use units
- Intra corpus detection (text reuse)
- Inter corpus detection (modern: plagiarism; ancient: e.g. Bible)

Level 5: Scoring
- Similarity (level of similarity):
  - Word similarity: word resemblance, word containment
  - Feature similarity: feature resemblance, feature containment
- Further measures:
  - Levenshtein distance
  - Log likelihood ratio
  - many more
- Symmetric vs. asymmetric measures
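The difference between a symmetric and an asymmetric measure shows up directly in code. The sketch assumes the usual set-based definitions: resemblance is the size of the intersection over the size of the union, containment is the size of the intersection over the size of the first set. The sets may hold words or any other features.

```python
def resemblance(a: set, b: set) -> float:
    """Symmetric: shared items relative to all items of both units."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a: set, b: set) -> float:
    """Asymmetric: share of the items of a that also occur in b."""
    return len(a & b) / len(a) if a else 0.0


long_unit = {"w", "x", "y", "z"}
short_unit = {"x", "y"}

print(resemblance(long_unit, short_unit))   # same value in both directions
print(containment(long_unit, short_unit))   # how much of the long unit is covered
print(containment(short_unit, long_unit))   # how much of the short unit is covered
```

Containment is the natural choice when a short quotation is embedded in a much longer re-use unit, because resemblance would penalise the length difference.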

Level 6: Post processing I
Source (plot): John Lee: A Computational Model of Text Reuse in Ancient Literary Texts, 2009.
Links that appear in the same order are more trustworthy than a single, highly similar link.

Level 6: Post processing II
- A text re-use from a document with a high text re-use coverage is more trustworthy than one from a less frequently re-used text.
- A text re-use from a section of a document with a high text re-use temperature is more trustworthy than one from a less frequently re-used part of a document.

Accessing & visualisation of text re-use
A humanist's point of view

Literal citations: portal II

Literal citations: portal III

Literal citations: Visualisation I Plato: Timaeus 91b7 ff. αἱ δ' ἐν ταῖς γυναιξὶν αὖ μῆτραί τε καὶ ὑστέραι λεγόμεναι διὰ τὰ αὐτὰ ταῦτα ζῷον ἐπιθυμητικὸν ἐνὸν τῆς παιδοποιίας ὅταν ἄκαρπον παρὰ τὴν ὥραν χρόνον πολὺν γίγνηται χαλεπῶς ἀγανακτοῦν ϕέρει καὶ πλανώμενον πάντῃ κατὰ τὸ σῶμα τὰς τοῦ πνεύματος διεξόδους ἀποϕράττον ἀναπνεῖν οὐκ ἐῶν εἰς ἀπορίας τὰς ἐσχάτας ἐμβάλλει καὶ νόσους παντοδαπὰς ἄλλας παρέχει μέχριπερ ἂν ἑκατέρων ἡ ἐπιθυμία καὶ ὁ ἔρως συναγαγόντες οἷον ἀπὸ δένδρων καρπὸν καταδρέψαντες ὡς εἰς ἄρουραν τὴν μήτραν ἀόρατα ὑπὸ σμικρότητος καὶ ἀδιάπλαστα ζῷα κατασπείραντες καὶ πάλιν διακρίναντες μεγάλα ἐντὸς ἐκθρέψωνται καὶ μετὰ τοῦτο εἰς ϕῶς ἀγαγόντες ζῴων ἀποτελέσωσι γένεσιν αἱ δ' ἐν ταῖς γυναιξὶν αὖ μῆτραί τε καὶ ὑστέραι λεγόμεναι διὰ τὰ αὐτὰ ταῦτα ζῷον ἐπιθυμητικὸν ἐνὸν τῆς παιδοποιίας ὅταν ἄκαρπον περὶ τὴν ὥραν χρόνον πολὺν γίγνηται χαλεπῶς ἀγανακτοῦν ϕέρει καὶ πλανώμενον πάντῃ κατὰ τὸ σῶμα τὰς τοῦ πνεύματος διεξόδους ἀποϕράττον ἀναπνεῖν οὐκ ἐῶν εἰς ἀπορίας τὰς ἐσχάτας ἐμβάλλει καὶ νόσους παντοδαπὰς ἄλλας παρέχει μέχριπερ ἂν ἑκατέρων ἡ ἐπιθυμία καὶ ὁ ἔρως ξυναγαγόντες οἷον ἀπὸ δένδρων καρπὸν καταδρέψαντες ὡς εἰς ἄρουραν τὴν μήτραν ἀόρατα ὑπὸ σμικρότητος καὶ ἀδιάπλαστα ζῷα κατασπείραντες καὶ πάλιν διακρίναντες μεγάλα ἐντὸς ἐκθρέψωνται καὶ μετὰ τοῦτο εἰς ϕῶς ἀγαγόντες ζῴων ἀποτελέσωσι γένεσιν περὶ δὲ τῆς μήτρας ὅτι τε ζῷόν ἐστι καὶ αὕτη καὶ τὰ ἀπὸ τοῦ πατρὸς ἐξερχόμενα μόρια ταῦτα πάλιν λέγει Πλάτων αἱ δ' ἐν ταῖς γυναιξὶν αὖ μῆτραί τε καὶ ὑστέραι λεγόμεναι διὰ τὰ αὐτὰ ταῦτα ζῷον ἐπιθυμητικὸν ἐνὸν τῆς παιδοποιίας ὅταν ἄκαρπον παρὰ τὴν ὥραν χρόνον πολὺν γίνηται χαλεπῶς ἀγανακτοῦν ϕέρει καὶ πλανώμενον πάντῃ κατὰ τὸ σῶμα τὰς τοῦ πνεύματος διεξόδους ἀποϕράττον καὶ ἀναπνεῖν οὐκ ἐῶν εἰς ἀπορίας τὰς ἐσχάτας ἐμβάλλει καὶ νόσους παντοδαπὰς ἄλλας παρέχει μέχριπερ ἂν ἑκατέρων ἡ ἐπιθυμία καὶ ὁ ἔρως ξυναγαγόντες οἷον ἀπὸ δένδρων καρπὸν καταδρέψαντες ὡς εἰς ἄρουραν τὴν μήτραν ἀόρατα ὑπὸ σμικρότητος καὶ ἀδιάπλαστα ζῷα κατασπείραντες καὶ πάλιν διακρίναντες μεγάλα ἐντὸς 
ἐκθρέψωνται καὶ μετὰ τοῦτο εἰς ϕῶς ἀγαγόντες ζῴων ἀποτελέσωσι γένεσιν 39

Literal citations: Visualisation II
Plato: Timaeus 91b7 ff. (the same three passages as in Visualisation I)

Likelihood vs. Witnesses (Taxi problem)
How trustworthy is a set of reuses for a source?

Micro view visualisation

Macro view visualisation

Macro view: Hands-on solutions
- Middle Platonism
- Neoplatonism

A brief summary
Algorithms

Some results

Literal citations: Similarity thresholds
[Figure: Similarity distribution. x-axis: similarity score of textual references (0 to 1); y-axis: percentage of all references (0% to 20%).]

Similarity thresholds for some authors
[Figure: Similarity distribution of textual references separated by authors. x-axis: similarity of textual references (0.0 to 1.0); y-axis: frequency (0 to 100). Authors: PLATO (4th c. BC), PLUTARCHUS (2nd c. AD), GALENUS (2nd c. AD), PORPHYRIUS (3rd c. AD), CYRILLUS (5th c. AD), EUSEBIUS (4th c. AD), THEODORETUS (5th c. AD), STOBAEUS (5th c. AD), PROCLUS (5th c. AD), JOANNES PHILOPONUS (6th c. AD), SIMPLICIUS (6th c. AD).]
First result: As time passes, text reuse by later authors becomes much less literal.

Shannon's Noisy Channel Theorem
Main challenge: How to model the noisy channel? Or: extract and systematise relevant noise, such as
- Evolutionary changes:
  - Language evolution
  - Dialect changes
  - Paraphrases
  - Editorial changes
- Other changes:
  - Fragmentary words
  - Reordering of sentence structure

Distance in time vs. Text re-use similarity: Aristotle & Plato
Source: Maria Moritz: Informationsextraktion in den Altertumswissenschaften Fragmentarische Autoren - Extraktion altgriechischer Eigennamen und Belegstellen auf antiken Texten, 2011.

Different text types (genres) and time slices do not share the same re-use styles!

Problems with language model and text re-use

20 years after the fall of the Berlin Wall (Leipzig, Berlin)
We are the people!

Some results (problem focused)
- What is a good similarity threshold (literal citations)? Dissimilarity vs. fragments
  - Plato: a low threshold provides good results as well
  - Atthidographers: poor quality, precision less than 20%
- Multi-word expressions like King Alexander the Great (literal citations)
- Phrases: τοῦ Κυρίου ἡμῶν Ἰησοῦ Χριστοῦ (Engl.: of our Lord Jesus Christ). Again: We are the people!
- Editorial references to publications
- Works in different editions
- Embedded text reuse (relation between linking and scoring)
- Detection of boundaries of text reuse
We are the people!

Evaluation
Most critical point, since there is no gold standard on ancient texts, but...
- Modern texts: PAN corpus (literal text reuse), METER corpus (semantic text reuse), ?
- Ancient texts: own gold standard by
  - Extraction of text fragments (method for fragmentary authors)
  - Extraction of editorial references
  - Manual annotations by our researchers
  - Negative evaluation by several randomised corpora

Basic research questions: some interdisciplinary aspects
- Syntactic level (text reuse):
  - Taxi problem: likelihood vs. witnesses
  - How often does a text passage have to be reused to still exist nowadays? (similar to genes)
  - Modelling the noisy channel
- Semantic level (knowledge transfer):
  - Stability vs. reuse vs. significance
  - Volatility as reuse killer
  - Minimum reuse unit (testing the distributional semantics assumption)
  - Critical point for completeness of documents (interest)?
- Further research questions:
  - Relation between text reuse and gnomology
  - Influence of different kinds of normalisation of text

Fragmentary authors

What is a fragment? (Oxford English Dictionary, s.v. fragment)
- a part broken off or otherwise detached from a whole
- a part remaining or still preserved when the whole is lost or destroyed
- an extant portion of a writing or composition which as a whole is lost
- a portion of a work left uncompleted by its author
Berti Büchler, Fragmentary Texts and Digital Collections of Fragmentary Authors

Different kinds of fragments
- material fragments
- textual fragments

material fragments
- material fragments = physical remains of ancient evidence
- reconstruction of the monument

textual fragments (1)
- textual fragments = material fragments bearing textual evidence
- surviving broken off pieces of ancient writings

Task 1: Workflow person name extraction
- Step 1 (pattern): Extraction of candidates by patterns such as VN VN, VN ETH, VN LOC
- Step 2 (pattern): Resolving morphological dependencies using Morpheus
- Step 3 (unsupervised): Statistical evidence criterion
- Step 4 (unsupervised): Generating a similarity graph of those candidates and building valid concept classes
- Step 5 (supervised): Applying validated patterns on text in order to extract less frequent occurrences
- Step 6 (bootstrapping): Iterating steps 2-5

Task 1: Some results of the PN extractor
- Step 1: Extraction of candidates by patterns such as Ἑλλάνικος Λέσβιος (VN ETH)
- Step 2: Resolving morphological dependencies: removing candidates like Ἑλλάνικος Ἀκουσιλάῳ (VN VN)
- Step 3: Statistical evidence criterion, e.g. a minimum frequency of 4
- Step 4: Generating a similarity graph of those candidates and building valid concept classes, e.g.
  - Ἑλλάνικος Λέσβιος (VN ETH)
  - Ἑλλάνικος ὁ Λέσβιος (VN ZN ETH)
- Step 5: Applying validated patterns on text in order to extract less frequent occurrences:
  - Ἑλλάνικός τε ὁ Λέσβιος
  - Ἑλλάνικος δὲ ὁ Λέσβιός
  - Λέσβιος Ἑλλάνικος
  - ...
- Overall: after 1 iteration, 16 different versions of Hellanicus of Lesbos

Task 2: Extraction of fragments: Role of named entities

Columns:
(1) Complete graph
(2) w_id>=100 && freq(word)>1
(3) w_id>=300 && freq(word)>1
(4) w_id>=500 && freq(word)>1
(5) Named Entities
(6) Normalised Named Entities
(7) Normalised Text and Named Entities

                                                         (1)          (2)          (3)          (4)        (5)          (6)          (7)
Graph properties
  Number of nodes                                    538,572      388,929      363,359      353,618      1,149        4,487        2,178
  Number of co-occurrences                        57,762,474   34,818,138   25,615,956   21,004,538     15,436      126,188      152,856
  Number of significant co-occurrences            30,382,422   21,739,476   17,687,582   15,462,940     14,876       69,858       84,124
  Percentage                                            0.53         0.62         0.69         0.74       0.96         0.55         0.55
  Average degree                                       56.41        55.90        48.68        43.73      12.95        15.57        38.62
Argumentation trail properties
  Number of trails                                    > 10^8       > 10^8       > 10^8       > 10^8    361,094    7,958,240    3,087,581
  Average degree                                       15.34         9.93         7.70         6.79       7.03         7.77         9.93
  Average degree of internal node (trail length 2)     31.34        21.08        14.33        11.45       7.02        10.15        12.31
  Average degree of internal node (trail length 3)    301.38       362.56       285.86       231.39      55.66        76.06        81.86

Some results: Occurrence of author's name as fix points

TLG v0.1 (number of words: 155, sum of frequencies: 16312)
Rank     Word          Freq.
741      Πλάτων        8059
1555     Πλάτωνος      3495
3122     Πλάτωνα       1671
3612     Πλάτωνι       1418
22617    Πλάτων        205
31711    Πλατωνικῶν    141
49238    Πλάτωνος      84
53525    ΠΛΑΤΩΝΟΣ      76
62353    Πλάτωνα       63
63178    Πλατωνικὸν    62
63986    Πλατωνικοὶ    61
69635    Πλατωνικῆς    55
71604    Πλατωνικὸς    53
73838    ΠΛΑΤΩΝ        51
75004    Πλάτωνός      50
77550    Πλάτωνι       48
78821    Πλατωνικοῦ    47
83219    Πλάτωνά       44

TLG v0.2 (number of words: 26, sum of frequencies: 16092)
Rank     Word               Freq.
692      Πλάτων             8730
1554     Πλάτωνος           3813
3220     Πλάτωνα            1803
3794     Πλάτωνι            1534
75374    Πλάτωνός           55
85272    Πλάτωνά            47
141061   Πλάτωνοϲ           24
145464   Πλάτωνί            23
166139   _Πλάτων            19
187866   Πλάτωνες           16
241920   Πλάτωνας           11
699357   Πλάτων_καὶ         2
699358   Πλάτωνο            2
1066134  Πλάτων             1
1066135  Πλάτων_βούλεται    1
1066136  Πλάτων_τὸν         1
1066137  Πλάτων_ἐν          1
1066138  Πλάτων_ὁ           1

TLG v0.3 (number of words: 13, sum of frequencies: 16093)
Rank     Word        Freq.
689      Πλάτων      8758
1556     Πλάτωνος    3813
3226     Πλάτωνα     1804
3786     Πλάτωνι     1538
75460    Πλάτωνός    55
85351    Πλάτωνά     47
141115   Πλάτωνοϲ    24
145498   Πλάτωνί     23
187888   Πλάτωνες    16
241916   Πλάτωνας    11
695698   Πλάτωνο     2
1044813  Πλάτωνε     1
1044814  Πλάτωνάς    1

Task 2: Extraction of fragments: Possible ways?
- Option 1: Statistical based
- Option 2: Pattern based
- Option 3: Completely different?
Approach labels on the slide: Supervised, Pattern, Unsupervised

Workflow extraction of citations
- Step 1 (pattern): Building significant patterns that include the author name in the nominative, and defining a classification:
  - FN V REF
  - FN PRE TIT V
  - FN PAR PRE TIT REF
  Tag legend: FN = first name, V = verb, PRE = preposition, PAR = particle, TIT = work, REF = reference
- Step 2 (unsupervised): Resolving morphological dependencies between noun and verb
- Step 3 (supervised): Validating the patterns against a standardized fragmentary author
- Step 4 (pattern): Adapting patterns
Source: Maria Moritz: Informationsextraktion in den Altertumswissenschaften Fragmentarische Autoren - Extraktion altgriechischer Eigennamen und Belegstellen auf antiken Texten, 2011.
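Applying such tag patterns to a text amounts to matching tag sequences over already tagged tokens. The sketch below assumes the tokens arrive as (token, tag) pairs from some upstream tagger that is not shown; the function name and the data layout are hypothetical, and the morphological check of step 2 is omitted.

```python
# The three tag patterns named on the slide.
PATTERNS = [
    ("FN", "V", "REF"),
    ("FN", "PRE", "TIT", "V"),
    ("FN", "PAR", "PRE", "TIT", "REF"),
]


def match_patterns(tagged, patterns=PATTERNS):
    """Return (pattern, tokens) for every position where a pattern fits.

    tagged is a list of (token, tag) pairs; how the tags are produced
    is outside the scope of this sketch.
    """
    tags = [tag for _, tag in tagged]
    hits = []
    for start in range(len(tags)):
        for pattern in patterns:
            end = start + len(pattern)
            if tuple(tags[start:end]) == pattern:
                hits.append((pattern, [token for token, _ in tagged[start:end]]))
    return hits


tagged = [("Πλάτων", "FN"), ("ἔφη", "V"), ("( Phaed. 60b )", "REF")]
for pattern, tokens in match_patterns(tagged):
    print(" ".join(pattern), "->", " ".join(tokens))
```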

Results of the citations extractor
- Step 1: Building significant patterns and a classification
  - Πλάτων ἔφη ( Phaed. 60b ) (FN V REF)
  - Examples per tag: FN > Πλάτων (Plato); V > ἔφη, φησι (wrote); PRE > ἐν (in); PAR > καὶ (like); TIT > Πρωταγόρᾳ (Protagoras); REF > ( Phaed. 60B )
- Step 2: Resolving morphological dependencies between verb and noun: removing candidates like Ἰησοῦν ἔφη (FN[acc] V[nom])
- Step 3: Validating patterns against a standardized fragmentary author
  - remove/improve patterns which don't work well, e.g. FN PRE TIT (no known title borders)
  - use known titles to improve
- Step 4: Adapting patterns
Source: Maria Moritz: Informationsextraktion in den Altertumswissenschaften Fragmentarische Autoren - Extraktion altgriechischer Eigennamen und Belegstellen auf antiken Texten, 2011.

Again: textual fragments
Athenaeus, Deipnosophistai 10.67 (447c)

Ἑλλάνικος δ' ἐν Κτίσεσι καὶ ἐκ ῥιζῶν, φησι κατασκευάζεται τὸ βρῦτον γράφων ὧδε πίνουσι δέ βρῦτον ἔκ τινων ῥιζῶν, καθάπερ οἱ Θρᾷκες ἐκ τῶν κριθῶν. Ἑκαταῖος δ' ἐν δευτέρῳ Περιηγήσεως εἰπὼν περὶ Αἰγυπτίων ὡς ἀρτοφάγοι εἰσὶν ἐπιφέρει τάς κριθάς ἐς τὸ πῶμα καταλέουσιν. ἐν δέ τῇ τῆς Εὐρώπης περιόδῳ Παίονάς φησι πίνειν βρῦτον ἀπὸ τῶν κριθῶν καὶ παραβίην ἀπὸ κέγχρου καὶ κόνυζαν. ἀλείφονται δέ, φησίν, ἐλαίῳ ἀπὸ γάλακτος. καὶ ταῦτα μέν ταύτῃ.

Hellanicus in The Foundings says that beer is made also of rye; he writes as follows: They drink beer made of rye, as the Thracians drink it made of barley. Hecataeus, in the second book of his Description, after saying of the Egyptians that they were bread-eaters, continues: They grind up the barley to make the drink. And in The Description of Europe he says that the Paeonians drink a beer made from barley, also parabias, made from millet, and even fleabane. They also anoint themselves, he says, with an oil made from milk. So much for that. (trans. Gulick)

textual fragments = quotations of lost works embedded into other texts

Why was it reused?
- They drink beer made of rye, as the Thracians drink it made of barley.
- the Paeonians drink a beer made from barley, also parabias, made from millet, and even fleabane. They also anoint themselves, he says, with an oil made from milk.
Some significance related properties:
- tf.idf: Except for Thracians and Paeonians, all other words have a term weight of 0 (function words) or are weak content words.
- Difference analysis: no discriminating words
- Log-likelihood ratio: no discriminating words
- Dale-Chall Readability Index: [6.59; 9.36], AVG: 7.85 (level of the 9th to 10th grade of a secondary school)
Is there any measurable content in these fragments?

Why was it reused? (continued)
Dissimilarities in the contextual usage (TLG) for the same two fragments:
- (milk, oil): 72%
- (fleabane, millet): 92%
- (parabias, millet): 97%
- (fleabane, parabias): 94%
- (barley, fleabane): 94%
- ...
- (rye, barley): 80%