A hybrid approach to compiling bilingual dictionaries of medical terms from parallel corpora
Georgios Kontonatsios
georgios.kontonatsios@cs.man.ac.uk
14th October 2014
Overview
- Background: Parallel Corpus, Problem, Motivation
- Methods: Random Forest Classifier, Statistical Phrase Alignment, Hybrid Approach
- Experiments: English-Greek & English-Romanian, Error Analysis
- Conclusions: Discussion, Future Work
Background: Parallel Corpus
A parallel corpus is a collection of documents in a source language paired with their direct translations in a target language
English: Abraxane monotherapy is indicated for the treatment of metastatic breast cancer
Greek: η μονοθεραπεία με abraxane ενδείκνυται για τη θεραπεία μεταστατικού καρκίνου μαστού
Background: Parallel Corpus
1) Useful for SMT (statistical machine translation)
- Koehn (2005) trained 110 SMT systems (11 languages) in three weeks
2) Relatively scarce resources
- Available for finance, law, medicine, etc.
3) Excellent resources for mining bilingual terminologies
- Exact translations => no missing translations of terms
- Sentence aligned => limited search space of candidate translations
- Same size => term frequencies are comparable
Background: Problem
Parallel Corpus => Term Alignment => Dictionary of MWTs (multi-word terms)
Input sentence pair:
- Abraxane monotherapy is indicated for the treatment of metastatic breast cancer
- η μονοθεραπεία με abraxane ενδείκνυται για τη θεραπεία μεταστατικού καρκίνου μαστού
Extracted dictionary entry:
- metastatic breast cancer => μεταστατικού καρκίνου μαστού
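Because the corpus is sentence aligned, the candidate translations of a source term can be limited to word sequences of the aligned target sentence. The sketch below illustrates that search space only. Whitespace tokenisation, the 4-word length cap and the helper name `candidate_ngrams` are assumptions made for illustration, not details from the talk.

```python
def candidate_ngrams(target_sentence, max_len=4):
    """Return all contiguous word n-grams (1..max_len) of a target sentence.

    Assumption: tokens are separated by whitespace; max_len is illustrative.
    """
    tokens = target_sentence.split()
    candidates = []
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            candidates.append(" ".join(tokens[i:i + n]))
    return candidates


greek = ("η μονοθεραπεία με abraxane ενδείκνυται για τη θεραπεία "
         "μεταστατικού καρκίνου μαστού")
candidates = candidate_ngrams(greek)

# The correct translation of "metastatic breast cancer" is among the candidates.
print("μεταστατικού καρκίνου μαστού" in candidates)
```

A term alignment method then only has to rank these few candidates per sentence pair, rather than every phrase in the target side of the corpus.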
Background: Biomedical Domain
Existing resources in the biomedical domain remain incomplete
UMLS
- A multilingual terminological resource (more than 20 languages)
- Indexes ~7.6M English terms
- ~6.3M missing translations
Expand UMLS for English-Greek and English-Romanian
[Figure: bar chart of % coverage of English UMLS per language. Bar values as extracted: 16.44%, 1.72%, 2.88%, 2.59%, 2.43%, 1.21%, 1.79%, 3.26%, 0.55%, 2.06%, 1.40%. Language labels are not recoverable.]
Methodology: Term Alignment Pipeline
Parallel Corpus => MetaMap => Term Alignment => Link to UMLS
English: Abraxane monotherapy is indicated for the treatment of metastatic breast cancer
UMLS link: C0278488, Neoplastic Process
Greek: η μονοθεραπεία με abraxane ενδείκνυται για τη θεραπεία μεταστατικού καρκίνου μαστού
Methodology: Term Alignment Algorithms
Supervised machine learning method: Random Forest (RF) Classifier (EACL 2014, EMNLP 2014)
- Exploits the internal structure of terms (character n-gram feature representation)
- Requires positive and negative instances for training
- Out-of-domain seed dictionary (i.e., BabelNet)
Unsupervised approach: Statistical Phrase Alignment (SPA) (Koehn et al., 2003)
- Part of Moses SMT (Koehn et al., 2007), an out-of-the-box solution
- Exploits co-occurrences of source and target terms
- Works well for frequently occurring terms
- Performance decreases for rare terms
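To make the character n-gram representation concrete, the sketch below turns a candidate term pair into a sparse feature dictionary. It is an illustration under assumptions: the n-gram lengths (2 to 3), the case and space normalisation, the `src:`/`tgt:` prefixes and the function names are all hypothetical and may differ from the features used in the cited RF work.

```python
from collections import Counter


def char_ngrams(term, n_min=2, n_max=3):
    """Count character n-grams of a term, ignoring case and spaces.

    Assumption: n_min and n_max are illustrative defaults.
    """
    text = term.lower().replace(" ", "")
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def pair_features(source_term, target_term):
    """Join source and target n-gram counts into one sparse feature dict."""
    features = {}
    for gram, count in char_ngrams(source_term).items():
        features["src:" + gram] = count
    for gram, count in char_ngrams(target_term).items():
        features["tgt:" + gram] = count
    return features


features = pair_features("breast cancer", "καρκίνου μαστού")
print(sorted(features)[:5])
```

Feature dictionaries for known translation pairs (positive instances) and for mismatched pairs (negative instances) could then be vectorised and passed to any random forest implementation.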
Methodology: Hybrid Approach
For a source term s to be translated, RF and SPA each suggest N ranked candidate translations
- RF ranks candidates by classification margin
- SPA ranks candidates by translation probability
Example: type 2 diabetes mellitus
SPA
1) του σακχαρώδη διαβήτη τύπου 2
2) σακχαρώδη διαβήτη τύπου 2
3) σακχαρώδους διαβήτη τύπου 2
RF
1) διαβήτη τύπου 2
2) διαβήτη τύπου 2 και καρδιακή
3) σακχαρώδη διαβήτη τύπου 2
Methodology: Hybrid Approach
- Dictionaries containing N candidate translations per term have a limited number of applications (e.g., SMT)
- To enrich existing terminologies, human curators need to post-edit the output of term alignment methods
- Objective: improve the precision of the highest-ranking candidates (precision at n=1)
- Voting: take the intersection of the RF and SPA candidate lists and rank the shared candidates by SPA translation probability
Example: type 2 diabetes mellitus
SPA
1) του σακχαρώδη διαβήτη τύπου 2
2) σακχαρώδη διαβήτη τύπου 2
3) σακχαρώδους διαβήτη τύπου 2
RF
1) διαβήτη τύπου 2
2) διαβήτη τύπου 2 και καρδιακή
3) σακχαρώδη διαβήτη τύπου 2
Voting
1) σακχαρώδη διαβήτη τύπου 2
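The voting step described on this slide can be sketched in a few lines. The function name `vote` and the input shapes are assumptions, and the probabilities in the example are invented for illustration; only the candidate strings come from the slide.

```python
def vote(rf_candidates, spa_candidates):
    """Keep candidates proposed by both methods, ranked by SPA probability.

    rf_candidates: list of candidate strings proposed by RF.
    spa_candidates: list of (candidate, translation_probability) pairs.
    """
    rf_set = set(rf_candidates)
    shared = [(cand, prob) for cand, prob in spa_candidates if cand in rf_set]
    shared.sort(key=lambda pair: pair[1], reverse=True)
    return [cand for cand, _ in shared]


# Candidates for "type 2 diabetes mellitus"; the probabilities are made up.
spa = [
    ("του σακχαρώδη διαβήτη τύπου 2", 0.5),
    ("σακχαρώδη διαβήτη τύπου 2", 0.3),
    ("σακχαρώδους διαβήτη τύπου 2", 0.1),
]
rf = [
    "διαβήτη τύπου 2",
    "διαβήτη τύπου 2 και καρδιακή",
    "σακχαρώδη διαβήτη τύπου 2",
]
result = vote(rf, spa)
print(result)
```

When the two lists share no candidate the result is empty, so the hybrid method proposes nothing for that term. This is one way a gain in top-1 precision can come at the cost of recall.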
Experiments: Corpora
EMEA (Tiedemann, 2009), a biomedical parallel corpus from the European Medicines Agency
- 1.5K sentence-aligned documents in 22 languages
- Drug usage guidelines
en-el
- 372K sentences
- 17,907 unique English MWTs
en-ro
- 321K sentences
- 16,625 unique English MWTs
Experiments: Evaluation
- Randomly sampled 1,000 English MWTs
- For each English MWT, we selected the top 20 translation candidates
- Methods compared on en-el and en-ro: RF, SPA, Voting
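The slides do not spell out how precision and recall are computed, so the sketch below fixes one plausible reading as an explicit assumption: a term counts as correct at n if any of its top n candidates is a reference translation, precision is measured over terms for which the method returned at least one candidate, and recall over all sampled terms. The function name and data layout are hypothetical.

```python
def precision_recall_at_n(predictions, gold, n):
    """Compute top-n precision and recall under the assumptions stated above.

    predictions: dict mapping a source term to its ranked candidate list.
    gold: dict mapping a source term to its set of correct translations.
    """
    answered = 0
    correct = 0
    for term, references in gold.items():
        top_n = predictions.get(term, [])[:n]
        if not top_n:
            continue  # the method proposed nothing for this term
        answered += 1
        if any(candidate in references for candidate in top_n):
            correct += 1
    precision = correct / answered if answered else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy data, invented for illustration.
gold = {"a": {"x"}, "b": {"y"}, "c": {"z"}, "d": {"w"}}
predictions = {"a": ["x", "q"], "b": ["q", "y"], "c": []}

print(precision_recall_at_n(predictions, gold, 1))
print(precision_recall_at_n(predictions, gold, 2))
```

Under this reading, a method that abstains on many terms can reach high precision while its recall stays low.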
Experiments: Results
[Figure: precision (y-axis, 0.3 to 1) against number of candidate translations per source term (x-axis, 0 to 20) for RF, SPA and RF+SPA, English-Greek dataset]
Experiments: Results
[Figure: precision (y-axis, 0.3 to 1) against number of candidate translations per source term (x-axis, 0 to 20) for RF, SPA and RF+SPA, English-Romanian dataset]
Experiments: Results
[Figure: recall (y-axis, 0.25 to 0.7) against number of candidate translations per source term (x-axis, 0 to 20) for RF, SPA and RF+SPA, English-Greek dataset]
Experiments: Results
[Figure: recall (y-axis, 0.2 to 0.7) against number of candidate translations per source term (x-axis, 0 to 20) for RF, SPA and RF+SPA, English-Romanian dataset]
Error Analysis
RF
- Partial matches. Source: urea cycle disorder. Target: διαταραχών του κύκλου της ουρίας, glossed (disorder) (cycle) (urea)
- Discontinuous translations. Source: metabolic diseases. Target: boli ereditare de metabolism, glossed (diseases) (hereditary) (metabolic)
SPA
- Statistically-based tool
- Performance largely affected by term frequency
- Measured as top-20 precision on terms of varying frequency
Error Analysis
[Figure: top-20 precision (y-axis, 0 to 1) of SPA, RF and RF + SPA by term frequency range ([100 200], [50 100], [25 50], [15 25], [10 15], [5 10], [1 5]), English-Greek dataset. Performance decreases for lower frequency terms.]
Error Analysis
[Figure: top-20 precision (y-axis, 0 to 1) of SPA, RF and RF + SPA by term frequency range ([100 200], [50 100], [25 50], [15 25], [10 15], [5 10], [1 5]), English-Romanian dataset]
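The per-frequency breakdown in these charts can be reproduced by grouping source terms into the frequency ranges shown on the x-axis. The sketch below assumes each range is closed on the left and open on the right, with 200 included in the last range. The slides do not state how the shared boundaries are handled, and the function names are hypothetical.

```python
BINS = [(1, 5), (5, 10), (10, 15), (15, 25), (25, 50), (50, 100), (100, 200)]


def frequency_bin(freq):
    """Return the (low, high) range containing freq, or None if out of range.

    Assumption: low <= freq < high, except that 200 falls in (100, 200).
    """
    for low, high in BINS:
        if low <= freq < high:
            return (low, high)
    if freq == BINS[-1][1]:
        return BINS[-1]
    return None


def precision_by_bin(term_freqs, term_correct):
    """Fraction of correctly translated terms in each frequency range.

    term_freqs: dict mapping a term to its corpus frequency.
    term_correct: dict mapping a term to True if a correct translation
    appears among its top 20 candidates.
    """
    totals = {}
    hits = {}
    for term, freq in term_freqs.items():
        bin_ = frequency_bin(freq)
        if bin_ is None:
            continue
        totals[bin_] = totals.get(bin_, 0) + 1
        hits[bin_] = hits.get(bin_, 0) + int(term_correct.get(term, False))
    return {b: hits[b] / totals[b] for b in totals}


# Toy data, invented for illustration.
by_bin = precision_by_bin({"a": 3, "b": 4, "c": 120},
                          {"a": True, "b": False, "c": True})
print(by_bin)
```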
Discussion
Hybrid approach
- Compilation of bilingual terminologies from parallel corpora
- Enriches UMLS with two under-resourced languages
Observations
- Substantially improves top-1 precision of RF and SPA
- Outperforms SPA when translating low-frequency terms
- Low recall
Future Work
Investigate integration of bilingual terminologies with SMT
[Diagram: boxes labelled Parallel corpus, SPA, SPA, RF, Phrase table, SMT and LM. Annotations: lower top-1 precision, poor performance for low-frequency terms.]
Questions?