TRACER - Preprocessing Marco Büchler etrap Research Group Göttingen Centre for Digital Humanities Institute of Computer Science Georg August University Göttingen, Germany
Overview What is preprocessing? Overview of preprocessing techniques Hacking Conclusion with some test questions
Reminder: Current approach
Pre-step: Segmentation - an example
Pre-step: Segmentation
Naïve (computational) method Compare every text chunk (e.g. every sentence) with every other. TLG: 5,500,000 * 5,500,000 = 3.025e13 comparisons Assumption: a comparison rate of 1,000 sentences/sec. This process would run for about 3.025e10 seconds, or more than 959 years.
Naïve vs. advanced techniques Naïve method: this process would run for about 3.025e10 seconds, or more than 959 years. Advanced techniques: can bring this down to 1-2 processor months for the TLG. Simulations on big data have shown (given the quadratic complexity) that by increasing the amount of data to billions of words, we would need several hundred processor centuries of computational power.
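A quick back-of-the-envelope check of these figures, as a minimal Java sketch (the corpus size and the comparison rate are the assumptions stated on the previous slide):

public class NaiveComplexity {
    public static void main(String[] args) {
        long sentences = 5_500_000L;                          // TLG sentence count from the slide
        double comparisons = (double) sentences * sentences;  // all-pairs comparison: 3.025e13
        double rate = 1000.0;                                 // assumed comparisons per second
        double seconds = comparisons / rate;                  // 3.025e10 seconds
        double years = seconds / (365.25 * 24 * 3600);
        System.out.printf("%.3e comparisons, %.3e s, about %.0f years%n",
                comparisons, seconds, years);                 // ~959 years
    }
}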
Question What do you associate with preprocessing?
Foundations for preprocessing Zipfian Law
Implications of the Zipfian Law Approx. 50% of all words occur only once Approx. 16% of all words occur only twice Approx. 8% of all words occur three times... Approx. 90% of all words in a corpus occur 10 times or less The top 300-700 most frequent words already cover about 50% of all tokens (depending on the language)
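These proportions can be checked on any corpus. A minimal sketch that counts token frequencies and reports the share of hapax legomena and the coverage of the most frequent words (the input file path, whitespace/non-word tokenisation, and the top-500 cutoff are assumptions for illustration):

import java.nio.file.*;
import java.util.*;

public class ZipfStats {
    public static void main(String[] args) throws Exception {
        String text = new String(Files.readAllBytes(Paths.get(args[0]))).toLowerCase();
        Map<String, Integer> freq = new HashMap<>();
        long tokens = 0;
        for (String w : text.split("\\W+")) {     // crude tokenisation, good enough for a sketch
            if (w.isEmpty()) continue;
            freq.merge(w, 1, Integer::sum);
            tokens++;
        }
        long hapax = freq.values().stream().filter(c -> c == 1).count();
        List<Integer> counts = new ArrayList<>(freq.values());
        counts.sort(Comparator.reverseOrder());
        long top = counts.stream().limit(500).mapToLong(Integer::longValue).sum();
        System.out.printf("types=%d, hapax=%.1f%% of types, top-500 coverage=%.1f%% of tokens%n",
                freq.size(), 100.0 * hapax / freq.size(), 100.0 * top / tokens);
    }
}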
Question What does lemmatisation mean for this plot?
Preprocessing
Question What does lemmatisation mean for this plot?
Preprocessing: Directed Graph Normalisation, e.g. lemmatisation
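Lemmatisation can be pictured as a directed graph: every word form points to its base form, so the relation is a simple one-way mapping. A minimal sketch (the word forms and lemmas below are invented for illustration):

import java.util.*;

public class DirectedNormalisation {
    public static void main(String[] args) {
        // Directed edges: inflected form -> lemma
        Map<String, String> lemma = new HashMap<>();
        lemma.put("spake", "speak");
        lemma.put("spoken", "speak");
        lemma.put("speaketh", "speak");
        String token = "spake";
        // Unknown forms fall back to themselves
        System.out.println(lemma.getOrDefault(token, token)); // "speak"
    }
}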
Preprocessing: Undirected Graph Normalisation, e.g. synonyms, string similarity
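Synonym handling, by contrast, is an undirected relation: all members of a synonym set are equivalent to each other, and one member is chosen as the representative. A minimal sketch (the synonym sets are invented for illustration):

import java.util.*;

public class UndirectedNormalisation {
    public static void main(String[] args) {
        // Undirected relation: every member of a set maps to one representative
        List<Set<String>> synSets = List.of(
                Set.of("begin", "commence", "start"),
                Set.of("big", "large", "great"));
        Map<String, String> canon = new HashMap<>();
        for (Set<String> set : synSets) {
            String rep = Collections.min(set);      // deterministic choice of representative
            for (String w : set) canon.put(w, rep);
        }
        System.out.println(canon.get("commence"));  // "begin"
    }
}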
Hacking - Installation guide for TRACER
1) Download TRACER from http://etrap.gcdh.de/hackathon/tracer/
2) In the command line or terminal, go to your download folder by typing cd (change directory). Example: cd Downloads.
3) Unzip tracer.tar.gz and change to the TRACER folder
4) Start the tool with the command: java -Xmx600m -Dfile.encoding=UTF-8 -Dde.gcdh.medusa.config.ClassConfig=conf/tracer_config.xml -jar tracer.jar
Explanation: -Xmx600m (up to 600 MB of memory), -Dfile.encoding sets the encoding of your input file (optional), -Dde.gcdh.medusa.config.ClassConfig (configuration file)
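Put together, a typical terminal session might look like this (the name of the unpacked folder is an assumption; adjust it to whatever the archive actually extracts to):

cd Downloads
tar -xzf tracer.tar.gz
cd tracer
java -Xmx600m -Dfile.encoding=UTF-8 -Dde.gcdh.medusa.config.ClassConfig=conf/tracer_config.xml -jar tracer.jar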
Configuration
Hacking Tasks: Run the KJV text...
1) ... without preprocessing
2) ... 1) + removing diacritics
3) ... 2) + lower case option
4) ... 3) + lemmatisation
5) ... 4) + synonym handling
6) ... 3) + string similarity normalisation
7) ... 4) + string similarity normalisation
8) ... 4) + reduced string approach
Hacking Questions: Compare the input file with the KJV.prep file for all preprocessing techniques 1) to 8). Which method seems to work best for you? Which makes no sense for this dataset? Compare all KJV.meta files, which contain some numbers! How many words were changed, and by which method? (Optional, advanced) What is the number of word types for each preprocessing step? (This can be derived from the first column of KJV.prep.inv; see the command sketch below.)
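For the optional question, assuming each line of KJV.prep.inv carries one word type in its first, tab-separated column (the column layout is an assumption), the number of types can be counted on the command line:

cut -f1 KJV.prep.inv | sort -u | wc -l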
Preprocessing 1) Without preprocessing Hint: the configuration file can be found in ${TRACER_HOME}/conf/tracer_config.xml All boolean preprocessing values are set to false (see the sketch below)
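As a sketch, the relevant block of tracer_config.xml for the baseline run could look like this, with every preprocessing switch off. The property names are taken from the following slides; the exact element layout is an assumption modelled on the property lines quoted on the lemmatisation and synonym slides:

<property name="BoolRemoveDiachritics" value="false" />
<property name="boolmakealllowercase" value="false" />
<property name="boollemmatisation" value="false" />
<property name="boolreplacesynonyms" value="false" />
<property name="boolreplacestringsimilarwords" value="false" />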
Preprocessing 2) Removing diacritics Hint: BoolRemoveDiachritics is switched on by the value true
Preprocessing 3) Lower case Hint: boolmakealllowercase is switched on by value true
Preprocessing 4) Lemmatising text Hint: boollemmatisation is switched on by value true Lemmatisation can be configured by <property name="baseform_file_name" value="data/corpora/bible/bible.lemma" />
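The format of bible.lemma is not shown on the slide; a plausible layout is one word form and its base form per line, tab-separated. The entries below are hypothetical and only illustrate the idea:

spake	speak
spoken	speak
waters	water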
Preprocessing 5) Synonym handling Hint: boolreplacesynonyms is switched on by value true Synonyms can be configured by <property name="synonyms_file_name" value="data/corpora/bible/bible.syns" />
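Likewise, the layout of bible.syns is not shown here; one hypothetical convention is one tab-separated synonym set per line, with the members treated as equivalent. The entries below are invented for illustration:

begin	commence	start
big	large	great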
Preprocessing 6) String similarity for normalising variants Hint: boolreplacestringsimilarwords is switched on by value true Thresholds:
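String-similarity normalisation merges spelling variants whose distance falls below a threshold. The sketch below uses plain Levenshtein edit distance as one possible measure; TRACER's actual similarity measure and threshold semantics may differ, and the threshold value is an assumption:

public class StringSimilarity {
    // Classic Levenshtein edit distance via dynamic programming
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // Variants closer than the threshold are treated as the same word
        int threshold = 2;  // assumed value for illustration
        System.out.println(levenshtein("colour", "color") < threshold); // distance 1 -> true
    }
}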
Hacking Tasks: Redo with your own data (where possible)...
1) ... without preprocessing
2) ... 1) + removing diacritics
3) ... 2) + lower case option
4) ... 3) + lemmatisation
5) ... 4) + synonym handling
6) ... 3) + string similarity normalisation
7) ... 4) + string similarity normalisation
8) ... 4) + reduced string approach
Configuration of your own file (rename it, if necessary, to a .txt file):
Open issue: Fragmentary words
Open issue: Fragmentary words dealing with gaps and Leiden Convention Οὐιβίῳ Ἀλεξάά [ν]δρῳ τῷ κρατίστῳ ἐπιστρατήγῳ παρὰ Ἀντ[ωνίου Δ]όμνά ου τοῦ καὶ Φιλαντι[νό]οά υά Ἀντωνίοά [υ Ῥωμανο]ῦά Τραιανείου τοῦ καά [ὶ Στρα]τά είου Ἀντινοέως. [οὐκ ἂν] εἰς τοῦτο προήχθά [η]νά, ἐά πιτρόπων [μέγιστ]εά, μέ[τριος] καὶ ἀπράά γά μων ὢνά ἄνθρά [ωπος,] εά ἰ μὴ [ὓβρι]ν τὴν μά [εγ]ίστηνά ἐπά επόνθ[ειν ὑπὸ] Ὡρίωνο[ς κ]ωά μογρα[μ]μά ατέως Φ[ι]λαδελφείά [ας τῆ]ς Ἡρακλεά ίά δου μερίδοά [ς] τά οῦ Ἀρά σινοίτου. [οὗ χά]ριν μην[ύ]ω παρὰ τ[ὰ ἀ]πειρημένα ἑαά [υτὸ]νά ἐνσείσανά τα εἰς τὴν κωμογραμματείανά [μ]ήτε σιτολογήσαντα μήτε πρά [α]κτορεύσαντά α παντελῶς ἄπορον ὄν[τ]αά. δι ἣά ν αἰτίαν κά αὶ πρότερον οὐ διέλιπον ἐντυγχάά νων καὶ νῦά ν ἀξιῶ, ἐάν σου τῇ τύχῃ δόξ[ῃ], ἀκοῦσά αί μου π[ρ]ὸς αὐτὸν πρὸς τὸ τυχεῖν με τά ῆά ςά ἀπὸ σοῦ [μι]σοπονήρου ἐγδ[ι]κίας, ἵν ὦ ὑπὸά [σ]οά ῦά κατὰά πά άντα βά εά βοηθ(ημένος). διευτύχει Ἀντώνιος Δόμνά οά ς ἐπιδέδωκα.
Gap between knowledge and experience
Test questions Statement: My lemmatisation tool <XYZ> is able to compute the base forms of 80% of all tokens in a corpus. Good or bad?
Test questions Fact file: Language variants Different writing styles (some) Dialects Diacritics OCR errors Question: What is the difference for you?
Test questions Fact file: Language variants Different writing styles (some) Dialects Diacritics OCR errors Question: What do you think is the difference for the computer?
Importance of preprocessing Cleaning and harmonising the data When working with a new corpus (not only a new language, but also the same language in a different epoch or geographical region), preprocessing can take up to 70% of the overall time. Preprocessing mantra: garbage in, garbage out.
Thank you! "Stealing from one is plagiarism, stealing from many is research" (Wilson Mizner, 1876-1933) Visit us at http://etrap.gcdh.de