TRACER - Preprocessing

Σχετικά έγγραφα
TRACER - Preprocessing

ΚΥΠΡΙΑΚΗ ΕΤΑΙΡΕΙΑ ΠΛΗΡΟΦΟΡΙΚΗΣ CYPRUS COMPUTER SOCIETY ΠΑΓΚΥΠΡΙΟΣ ΜΑΘΗΤΙΚΟΣ ΔΙΑΓΩΝΙΣΜΟΣ ΠΛΗΡΟΦΟΡΙΚΗΣ 19/5/2007

[1] P Q. Fig. 3.1

HOMEWORK 4 = G. In order to plot the stress versus the stretch we define a normalized stretch:

Instruction Execution Times

Phys460.nb Solution for the t-dependent Schrodinger s equation How did we find the solution? (not required)

ΚΥΠΡΙΑΚΟΣ ΣΥΝΔΕΣΜΟΣ ΠΛΗΡΟΦΟΡΙΚΗΣ CYPRUS COMPUTER SOCIETY 21 ος ΠΑΓΚΥΠΡΙΟΣ ΜΑΘΗΤΙΚΟΣ ΔΙΑΓΩΝΙΣΜΟΣ ΠΛΗΡΟΦΟΡΙΚΗΣ Δεύτερος Γύρος - 30 Μαρτίου 2011

Fractional Colorings and Zykov Products of graphs

Advanced Subsidiary Unit 1: Understanding and Written Response

the total number of electrons passing through the lamp.

ΕΛΛΗΝΙΚΗ ΔΗΜΟΚΡΑΤΙΑ ΠΑΝΕΠΙΣΤΗΜΙΟ ΚΡΗΤΗΣ. Ψηφιακή Οικονομία. Διάλεξη 7η: Consumer Behavior Mαρίνα Μπιτσάκη Τμήμα Επιστήμης Υπολογιστών

The Simply Typed Lambda Calculus

Section 1: Listening and responding. Presenter: Niki Farfara MGTAV VCE Seminar 7 August 2016

Living and Nonliving Created by: Maria Okraska

ANSWERSHEET (TOPIC = DIFFERENTIAL CALCULUS) COLLECTION #2. h 0 h h 0 h h 0 ( ) g k = g 0 + g 1 + g g 2009 =?

Math 6 SL Probability Distributions Practice Test Mark Scheme

2 Composition. Invertible Mappings

Potential Dividers. 46 minutes. 46 marks. Page 1 of 11

ΕΙΣΑΓΩΓΗ ΣΤΗ ΣΤΑΤΙΣΤΙΚΗ ΑΝΑΛΥΣΗ

Statistical Inference I Locally most powerful tests

Section 8.3 Trigonometric Equations

Κατανοώντας και στηρίζοντας τα παιδιά που πενθούν στο σχολικό πλαίσιο

Section 9.2 Polar Equations and Graphs

Εργαστήριο Ανάπτυξης Εφαρμογών Βάσεων Δεδομένων. Εξάμηνο 7 ο

derivation of the Laplacian from rectangular to spherical coordinates

Models for Probabilistic Programs with an Adversary

Lecture 2. Soundness and completeness of propositional logic

3.4 SUM AND DIFFERENCE FORMULAS. NOTE: cos(α+β) cos α + cos β cos(α-β) cos α -cos β

6.1. Dirac Equation. Hamiltonian. Dirac Eq.

ΚΥΠΡΙΑΚΗ ΕΤΑΙΡΕΙΑ ΠΛΗΡΟΦΟΡΙΚΗΣ CYPRUS COMPUTER SOCIETY ΠΑΓΚΥΠΡΙΟΣ ΜΑΘΗΤΙΚΟΣ ΔΙΑΓΩΝΙΣΜΟΣ ΠΛΗΡΟΦΟΡΙΚΗΣ 24/3/2007

Code Breaker. TEACHER s NOTES

ΤΕΧΝΟΛΟΓΙΚΟ ΠΑΝΕΠΙΣΤΗΜΙΟ ΚΥΠΡΟΥ ΤΜΗΜΑ ΝΟΣΗΛΕΥΤΙΚΗΣ

ΚΥΠΡΙΑΚΗ ΕΤΑΙΡΕΙΑ ΠΛΗΡΟΦΟΡΙΚΗΣ CYPRUS COMPUTER SOCIETY ΠΑΓΚΥΠΡΙΟΣ ΜΑΘΗΤΙΚΟΣ ΔΙΑΓΩΝΙΣΜΟΣ ΠΛΗΡΟΦΟΡΙΚΗΣ 6/5/2006

Every set of first-order formulas is equivalent to an independent set

Econ 2110: Fall 2008 Suggested Solutions to Problem Set 8 questions or comments to Dan Fetter 1

Jesse Maassen and Mark Lundstrom Purdue University November 25, 2013

4.6 Autoregressive Moving Average Model ARMA(1,1)

Main source: "Discrete-time systems and computer control" by Α. ΣΚΟΔΡΑΣ ΨΗΦΙΑΚΟΣ ΕΛΕΓΧΟΣ ΔΙΑΛΕΞΗ 4 ΔΙΑΦΑΝΕΙΑ 1

Capacitors - Capacitance, Charge and Potential Difference

ΑΓΓΛΙΚΑ Ι. Ενότητα 7α: Impact of the Internet on Economic Education. Ζωή Κανταρίδου Τμήμα Εφαρμοσμένης Πληροφορικής

The challenges of non-stable predicates

Η ΨΥΧΙΑΤΡΙΚΗ - ΨΥΧΟΛΟΓΙΚΗ ΠΡΑΓΜΑΤΟΓΝΩΜΟΣΥΝΗ ΣΤΗΝ ΠΟΙΝΙΚΗ ΔΙΚΗ

LESSON 14 (ΜΑΘΗΜΑ ΔΕΚΑΤΕΣΣΕΡΑ) REF : 202/057/34-ADV. 18 February 2014

Ordinal Arithmetic: Addition, Multiplication, Exponentiation and Limit

Homework 3 Solutions

ΚΥΠΡΙΑΚΗ ΕΤΑΙΡΕΙΑ ΠΛΗΡΟΦΟΡΙΚΗΣ CYPRUS COMPUTER SOCIETY ΠΑΓΚΥΠΡΙΟΣ ΜΑΘΗΤΙΚΟΣ ΔΙΑΓΩΝΙΣΜΟΣ ΠΛΗΡΟΦΟΡΙΚΗΣ 11/3/2006

SPEEDO AQUABEAT. Specially Designed for Aquatic Athletes and Active People

Calculating the propagation delay of coaxial cable

9.09. # 1. Area inside the oval limaçon r = cos θ. To graph, start with θ = 0 so r = 6. Compute dr

Elements of Information Theory

UNIVERSITY OF CAMBRIDGE INTERNATIONAL EXAMINATIONS International General Certificate of Secondary Education

EE512: Error Control Coding

Block Ciphers Modes. Ramki Thurimella

TMA4115 Matematikk 3

ίκτυο προστασίας για τα Ελληνικά αγροτικά και οικόσιτα ζώα on.net e-foundatio // itute: toring Insti SAVE-Monit

Matrices and Determinants

Paper Reference. Paper Reference(s) 1776/04 Edexcel GCSE Modern Greek Paper 4 Writing. Thursday 21 May 2009 Afternoon Time: 1 hour 15 minutes

Πώς μπορεί κανείς να έχει έναν διερμηνέα κατά την επίσκεψή του στον Οικογενειακό του Γιατρό στο Ίσλινγκτον Getting an interpreter when you visit your

«ΨΥΧΙΚΗ ΥΓΕΙΑ ΚΑΙ ΣΕΞΟΥΑΛΙΚΗ» ΠΑΝΕΥΡΩΠΑΪΚΗ ΕΡΕΥΝΑ ΤΗΣ GAMIAN- EUROPE

Συστήματα Διαχείρισης Βάσεων Δεδομένων

Physical DB Design. B-Trees Index files can become quite large for large main files Indices on index files are possible.

Démographie spatiale/spatial Demography

Cambridge International Examinations Cambridge International General Certificate of Secondary Education

Overview. Transition Semantics. Configurations and the transition relation. Executions and computation

Example of the Baum-Welch Algorithm

* * GREEK 0543/02 Paper 2 Reading and Directed Writing May/June 2009

14 Lesson 2: The Omega Verb - Present Tense

Η ΠΡΟΣΩΠΙΚΗ ΟΡΙΟΘΕΤΗΣΗ ΤΟΥ ΧΩΡΟΥ Η ΠΕΡΙΠΤΩΣΗ ΤΩΝ CHAT ROOMS

ΠΑΝΕΠΙΣΤΗΜΙΟ ΚΥΠΡΟΥ - ΤΜΗΜΑ ΠΛΗΡΟΦΟΡΙΚΗΣ ΕΠΛ 133: ΑΝΤΙΚΕΙΜΕΝΟΣΤΡΕΦΗΣ ΠΡΟΓΡΑΜΜΑΤΙΣΜΟΣ ΕΡΓΑΣΤΗΡΙΟ 3 Javadoc Tutorial

Cambridge International Examinations Cambridge International General Certificate of Secondary Education

Review Test 3. MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.

Solutions to the Schrodinger equation atomic orbitals. Ψ 1 s Ψ 2 s Ψ 2 px Ψ 2 py Ψ 2 pz

Practice Exam 2. Conceptual Questions. 1. State a Basic identity and then verify it. (a) Identity: Solution: One identity is csc(θ) = 1

b. Use the parametrization from (a) to compute the area of S a as S a ds. Be sure to substitute for ds!

ΠΑΝΕΠΙΣΤΗΜΙΟ ΚΥΠΡΟΥ ΤΜΗΜΑ ΠΛΗΡΟΦΟΡΙΚΗΣ. ΕΠΛ342: Βάσεις Δεδομένων. Χειμερινό Εξάμηνο Φροντιστήριο 10 ΛΥΣΕΙΣ. Επερωτήσεις SQL

(1) Describe the process by which mercury atoms become excited in a fluorescent tube (3)

How to register an account with the Hellenic Community of Sheffield.

ΦΥΛΛΟ ΕΡΓΑΣΙΑΣ Α. Διαβάστε τις ειδήσεις και εν συνεχεία σημειώστε. Οπτική γωνία είδησης 1:.

Galatia SIL Keyboard Information

EPL 603 TOPICS IN SOFTWARE ENGINEERING. Lab 5: Component Adaptation Environment (COPE)

CHAPTER 25 SOLVING EQUATIONS BY ITERATIVE METHODS

Writing for A class. Describe yourself Topic 1: Write your name, your nationality, your hobby, your pet. Write where you live.

Εγκατάσταση λογισμικού και αναβάθμιση συσκευής Device software installation and software upgrade

Section 7.6 Double and Half Angle Formulas

Οδηγίες Αγοράς Ηλεκτρονικού Βιβλίου Instructions for Buying an ebook

(C) 2010 Pearson Education, Inc. All rights reserved.

Dynamic types, Lambda calculus machines Section and Practice Problems Apr 21 22, 2016

Αναερόβια Φυσική Κατάσταση

5.4 The Poisson Distribution.

Cambridge International Examinations Cambridge International General Certificate of Secondary Education

Right Rear Door. Let's now finish the door hinge saga with the right rear door

Στο εστιατόριο «ToDokimasesPrinToBgaleisStonKosmo?» έξω από τους δακτυλίους του Κρόνου, οι παραγγελίες γίνονται ηλεκτρονικά.

CS 150 Assignment 2. Assignment 2 Overview Opening Files Arrays ( and a little bit of pointers ) Strings and Comparison Q/A

DESIGN OF MACHINERY SOLUTION MANUAL h in h 4 0.

ΓΕΩΠΟΝΙΚΟ ΠΑΝΕΠΙΣΤΗΜΙΟ ΑΘΗΝΩΝ ΤΜΗΜΑ ΑΓΡΟΤΙΚΗΣ ΟΙΚΟΝΟΜΙΑΣ & ΑΝΑΠΤΥΞΗΣ

Στεγαστική δήλωση: Σχετικά με τις στεγαστικές υπηρεσίες που λαμβάνετε (Residential statement: About the residential services you get)

Πέτρος Γ. Οικονομίδης Πρόεδρος και Εκτελεστικός Διευθυντής

Συντακτικές λειτουργίες

Πρόβλημα 1: Αναζήτηση Ελάχιστης/Μέγιστης Τιμής

Transcript:

TRACER - Preprocessing Marco Büchler etrap Research Group Göttingen Centre for Digital Humanities Institute of Computer Science Georg August University Göttingen, Germany

Overview What is preprocessing? Overview of preprocessing techniques Hacking Conclusion with some test questions

Reminder: Current approach

Pre-step: Segmentation - an example

Pre-step: Segmentation

Naïve (computational) method Compare every text chunk (like sentence) with each other. TLG: 5,500,000*5,500,000 = 3.025e13 comparisons Assumption: Comparison rate of 1000 sentences/sec. This process would run about 3.025e10 seconds or more than 959 years.

Naïve vs. advanced techniques Naïve method: This process would run about 3.025e10 seconds or more than 959 years. Advanced techniques: Can break this down to 1-2 processor month for TLG. Simulations on Big Data have shown (given squared complexity) that...... by increasing the amount of data of billions words, we would need several 100 processor centuries of computational power.

Question What do you associated with preprocessing?

Foundations for preprocessing Zipfian Law

Implications of the Zipfian Law Approx. 50% of all words occur only once Approx. 16% of all words occur only twice Approx. 8% of all words occur three times... Approx. 90% of all words in a corpus occur 10 times or less The top 300 700 most frequent words cover already about 50% of all tokens (depending language)

Question What does lemmatisation mean for this plot?

Preprocessing

Question What does lemmatisation mean for this plot?

Preprocessing: Directed Graph Normalisation e.g. lemmatisation

Preprocessing: Indirected Graph Normalisation e.g. synonyms, string similarity

Hacking - Installation guide for TRACER 1) Download Tracer from http://etrap.gcdh.de/hackathon/tracer/ 2) In the command line or terminal go to your download folder by typing cd (change directory). Example: cd Downloads. 3) Unzip tracer.tar.gz and change to the TRACER folder 4) Start the tool with the command: java -Xmx600m -Dfile.encoding=UTF-8 -Dde.gcdh.medusa.config.ClassConfig=conf/tracer_config.xml -jar tracer.jar Explanation: -Xmx600m (up to 600 MB memory), -Dfile.encoding sets the encoding of your input file (optionally), -Dde.gcdh.medusa.config.ClassConfig (configuration file)

Configuration

Hacking Tasks: Run the KJV text... 1)... without preprocessing 2)... 1) + with removing diachritics 3)... 2) + lower case option 4)... 3) + lemmatisation 5)... 4) + synonym handling 6)... 3) + string similarity normalisation 7)... 4) + string similarity normalisation 8)... 4) + reduced string approach

Hacking Questions: Compare the input file with the KJV.prep file for all 1) to 8) preprocessing techniques. Which methods seems to work best for you? Which does make no sense for the dataset? Compare all KJV.meta files containing some numbers! How many words have changed and by which method? (optional and advanced) what is the number of word types for each preprocessing (can be derived from KJV.prep.inv first column)

Preprocessing 1) without preprocessing Hint: Configuration file can be found in ${TRACER_HOME}/conf/tracer_conf.xml All values show false

Preprocessing 2) Removing diachritics Hint: BoolRemoveDiachritics is switched on by value true

Preprocessing 3) Lower case Hint: boolmakealllowercase is switched on by value true

Preprocessing 4) Lemmatising text Hint: boollemmatisation is switched on by value true Lemmatisation can be configured by <property name="baseform_file_name" value="data/corpora/bible/bible.lemma" />

Preprocessing 5) Synonym handling Hint: boolreplacesynonyms is switched on by value true Synonyms can be configured by <property name="synonyms_file_name" value="data/corpora/bible/bible.syns" />

Preprocessing 6) String similarity for normalising variants Hint: boolreplacestringsimilarwords is switched on by value true Thresholds:

Hacking Tasks: Redo with your own data (where possible)... 1)... without preprocessing 2)... 1) + with removing diachritics 3)... 2) + lower case option 4)... 3) + lemmatisation 5)... 4) + synonym handling 6)... 3) + string similarity normalisation 7)... 4) + string similarity normalisation 8)... 4) + reduced string approach Configuration of your own file (rename it, if necessary, as.txt -file):

Open issue: Fragmentary words

Open issue: Fragmentary words dealing with gaps and Leiden Convention Οὐιβίῳ Ἀλεξάά [ν]δρῳ τῷ κρατίστῳ ἐπιστρατήγῳ παρὰ Ἀντ[ωνίου Δ]όμνά ου τοῦ καὶ Φιλαντι[νό]οά υά Ἀντωνίοά [υ Ῥωμανο]ῦά Τραιανείου τοῦ καά [ὶ Στρα]τά είου Ἀντινοέως. [οὐκ ἂν] εἰς τοῦτο προήχθά [η]νά, ἐά πιτρόπων [μέγιστ]εά, μέ[τριος] καὶ ἀπράά γά μων ὢνά ἄνθρά [ωπος,] εά ἰ μὴ [ὓβρι]ν τὴν μά [εγ]ίστηνά ἐπά επόνθ[ειν ὑπὸ] Ὡρίωνο[ς κ]ωά μογρα[μ]μά ατέως Φ[ι]λαδελφείά [ας τῆ]ς Ἡρακλεά ίά δου μερίδοά [ς] τά οῦ Ἀρά σινοίτου. [οὗ χά]ριν μην[ύ]ω παρὰ τ[ὰ ἀ]πειρημένα ἑαά [υτὸ]νά ἐνσείσανά τα εἰς τὴν κωμογραμματείανά [μ]ήτε σιτολογήσαντα μήτε πρά [α]κτορεύσαντά α παντελῶς ἄπορον ὄν[τ]αά. δι ἣά ν αἰτίαν κά αὶ πρότερον οὐ διέλιπον ἐντυγχάά νων καὶ νῦά ν ἀξιῶ, ἐάν σου τῇ τύχῃ δόξ[ῃ], ἀκοῦσά αί μου π[ρ]ὸς αὐτὸν πρὸς τὸ τυχεῖν με τά ῆά ςά ἀπὸ σοῦ [μι]σοπονήρου ἐγδ[ι]κίας, ἵν ὦ ὑπὸά [σ]οά ῦά κατὰά πά άντα βά εά βοηθ(ημένος). διευτύχει Ἀντώνιος Δόμνά οά ς ἐπιδέδωκα.

Gap between knowledge and experience

Test questions Statement: My lemmatisation tool <XYZ> is able to compute the baseforms of 80% of all tokens in a corpus. Good or bad???

Test questions Fact file: Language variants Different writing styles (some) Dialects Diachritics OCR errors Question: What is the difference for you?

Test questions Fact file: Language variants Different writing styles (some) Dialects Diachritics OCR errors Question: What do you think is the difference for the computer?

Importance of preprocessing Cleaning and harmonising the data When working with a new corpus (not only language but also same language in a different epoch or geographical region can take up to 70% of the overall time. Preprocessing mantra: Garbage in, garbage out.

Thank you! "Stealing from one is plagiarism, stealing from many is research" (Wilson Mitzner, 1876-1933) Visit us at http://etrap.gcdh.de