Zero shot translation and language visualization with mbert


NATIONAL AND KAPODISTRIAN UNIVERSITY OF ATHENS
SCHOOL OF SCIENCE
DEPARTMENT OF INFORMATICS AND TELECOMMUNICATIONS

BSc THESIS

Zero shot translation and language visualization with mbert

Athanasios P. Kalampokas

Supervisors: Manolis Koubarakis, Professor
Christos-Charalampos Papadopoulos, MSc candidate

Athens
June 2021

A. Kalampokas 1


BSc THESIS

Athanasios P. Kalampokas
S.N.: 1115201400054

Supervisors: Manolis Koubarakis, Professor
Christos-Charalampos Papadopoulos, MSc candidate


SUMMARY

Google's BERT model was released in 2018 and managed to surpass almost all existing approaches to machine learning NLP tasks. Recently, a multilingual version of BERT was released: multilingual BERT, or mbert for short. In this thesis, mbert's capability to translate words between English and other languages without further training is examined. More precisely, two different methods are used. The first uses mbert's masked language model, along with a given prompt/template, in order to predict the correct translation of a word. The second method builds representations of the various languages and translates a word by subtracting from, and then adding to, its representation the representations of the languages between which we want to translate. Then, again by using mbert's masked language model on the result, the prediction is made for the translation.

Aside from translation, I examine how much information about the language a word belongs to is retained in mbert's representations. By using the t-SNE algorithm, the two main components of the representations are obtained, and they are used for a two-dimensional plot of the words.

In both cases the results are encouraging. Both translation methods often find the correct translation, although their usability is somewhat constrained by mbert's limited vocabulary for many languages. The word scatterplot clearly shows that words cluster according to the language they belong to. The experiments below are an extension of the experiments performed in the paper It's not Greek to mBERT, by Hila Gonen, Shauli Ravfogel, Yanai Elazar and Yoav Goldberg of the Computer Science department of Bar Ilan University.

SUBJECT AREA: Natural language processing
KEYWORDS: language models, machine learning, BERT, mbert, translation, word visualization


CONTENTS

1. INTRODUCTION
2. RELATED WORK
   2.1 Embeddings and language models
   2.2 Attention mechanism and transformers
   2.3 The language model BERT
   2.4 The multilingual version of BERT, mbert
3. ZERO SHOT TRANSLATION WITH mbert
   3.1 Template based approach
   3.2 Analogies based approach
   3.3 POS analysis
4. VISUALIZATION
   4.1 The tsne algorithm
   4.2 Representation plots
5. CONCLUSION
APPENDIX
REFERENCES

LIST OF IMAGES:

Image 1: Transformer architecture
Image 2: BERT architecture
Image 3: BERT input
Image 4: Estonian
Image 5: Latvian
Image 6: Lithuanian
Image 7: French
Image 8: Italian
Image 9: Spanish
Image 10: Portuguese
Image 11: Catalan
Image 12: Romanian
Image 13: Latin
Image 14: Bengali
Image 15: Hindi
Image 16: Kannada
Image 17: Lak
Image 18: Malayalam
Image 19: Tamil
Image 20: Telugu
Image 21: Arabic
Image 22: Farsi
Image 23: Danish
Image 24: Finnish
Image 25: Norwegian
Image 26: Swedish
Image 27: Bulgarian
Image 28: Russian
Image 29: Ukrainian
Image 30: Croatian
Image 31: Czech
Image 32: Slovak
Image 33: Polish
Image 34: Armenian
Image 35: Greek
Image 36: Dutch
Image 37: German
Image 38: Tatar
Image 39: Turkish
Image 40: Basque
Image 41: Breton
Image 42: Hungarian
Image 43: Japanese
Image 44: Korean
Image 45: Irish
Image 46: Welsh
Image 47: Hebrew
Image 48: Georgian
Image 49: Albanian
Image 50: Ket
Image 51: All languages

LIST OF TABLES:

Table 1: Template results
Table 2: Analogies results
Table 3: Template POS results
Table 4: Analogies POS results
Table 5: Translations between other languages

1. INTRODUCTION

Nowadays, with the advent and spread of machine learning, and especially of deep learning (a subset of machine learning), many domains have incorporated it: traditional methods have been replaced by machine learning methods, or modified so that they can be used with machine learning algorithms. Now that the computational capabilities of computers, as well as the quantities of available data, have significantly increased, neural networks (i.e. deep learning) are at the forefront of many of these domains and applications.

One such domain is natural language processing (NLP). NLP deals with building computational algorithms to automatically analyze and represent human language. For a long time, the majority of methods used to study NLP problems employed shallow machine learning models and time-consuming, hand-crafted features. This led to problems such as the curse of dimensionality, since linguistic information was represented with sparse representations (vectors of very high dimension). However, with the recent popularity and success of word embeddings, neural-based models have achieved superior results on various language-related tasks compared to traditional machine learning models (like SVMs and logistic regression) [1].

In this thesis, I examine the capability of mbert (a transformer-based, multilingual, deep learning model, described later) to correctly translate words between languages without any further training (zero shot). I also visualize the word representations extracted from mbert for various languages, and show that these visualizations cluster according to language.

The rest of this thesis is organized as follows. Chapter 2 presents related work. Chapter 3 shows the results of zero-shot translation with two different methods, as well as a part-of-speech analysis. Chapter 4 presents the visualizations of various languages. Finally, a conclusion follows.

2. RELATED WORK

This chapter offers an overview of some commonly used NLP mechanisms like word embeddings and language models, as well as the more recent attention mechanism and the transformer architecture. Then the BERT language model and its multilingual counterpart mbert are discussed.

2.1 Embeddings and language models

But how are words represented, so that machine learning algorithms can work with them? The answer is numerical vectors. Earlier, more basic methods, such as one-hot encoding and tf-idf, generate sparse, very high dimensional vectors to represent the words fed to a machine learning model. These methods have now been mostly abandoned, or are used as a baseline for more complex and computationally efficient ones, like word embeddings.

An embedding is also a vector that represents a word in the corpus vocabulary. Its values are learned through supervised techniques, such as neural network models trained on tasks like sentiment analysis or document classification, or through unsupervised techniques like statistical analysis of documents. Embeddings try to capture the semantic, contextual and syntactic meaning of the words in our vocabulary, based on their usage in sentences. Word embeddings, though more complicated to generate, are preferred over the more basic methods in terms of scalability, sparsity and contextual dependency [2].

Some models for generating word embeddings are:

- word2vec, which uses the CBOW method (where we try to predict a target word based on its context) and the skip-gram method (the reverse of CBOW, where we try to predict the context words for a given target word) in order to calculate the embedding of a word.
- GloVe, which is similar to word2vec but, unlike it, does not rely only on local statistics (local context information of words); it also incorporates global statistics (word co-occurrence) to obtain word vectors [3].

- ELMo, a word embedding method for representing a sequence of words as a corresponding sequence of vectors. Character-level tokens are taken as the inputs to a bidirectional LSTM which produces word-level embeddings. ELMo embeddings are context sensitive, meaning that different embeddings will be produced for words that have the same spelling but different meanings (unlike word2vec and GloVe). E.g., the word bank will have different representations in the phrases river bank and bank balance [4].

2.2 Attention mechanism and transformers

Up to a point, recurrent neural networks such as LSTMs and GRUs had been firmly established as state-of-the-art approaches to sequence modeling and transduction problems such as language modeling and machine translation. Attention mechanisms [5] were also used with such recurrent neural architectures, but a new module, the transformer, has been proposed, which relies entirely on self-attention to compute representations of its input and output.

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. Usually the queries, keys and values are packed into matrices. Most commonly, scaled dot-product attention is used: after multiplying the query and key matrices, the result is scaled and a softmax is applied to it, before multiplying with the values. So, for given queries Q, keys K and values V:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

The queries, keys and values can be either word embeddings, for example from GloVe (in the first attention layer), or intermediate representations from previous layers. The key/value/query concepts come from retrieval systems.
For example, when you type a query to search for some video on YouTube, the search engine will map your query against a set of keys (video title, description, etc.) associated with candidate videos in the database, and then present you the best matched videos (values).

Multi-head attention is widely used: the queries, keys and values are projected h different times to lower dimensions, each of those projections is passed through scaled dot-product attention, and the results are concatenated to calculate the final values.
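As a concrete illustration, the scaled dot-product attention above can be written in a few lines of NumPy. This is a toy sketch with made-up matrices, not the actual transformer implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # compatibility of each query with each key
    weights = softmax(scores, axis=-1)  # each row of weights sums to 1
    return weights @ V, weights         # output = weighted sum of the values

# Toy example: 2 queries attending over 3 key-value pairs.
Q = np.array([[1.0, 0.0], [0.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V = np.array([[1.0], [2.0], [3.0]])
out, w = scaled_dot_product_attention(Q, K, V)
```

Multi-head attention simply runs several such computations in parallel on different linear projections of Q, K and V, and concatenates the results.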

Attention gave birth to the transformer module, whose architecture relies mostly on attention mechanisms. It also allows greater parallelisation, and transformer-based models surpassed previous state-of-the-art performance in NLP tasks. The architecture of the transformer is the following:

Figure 1: The architecture of the transformer

The left part is called the encoder and the right part is called the decoder. In actuality, the transformer module is composed of 6 such encoders and 6 such decoders. The encoder has 2 sublayers: one that performs multi-head attention, and a feed-forward layer after that. Residual connections and normalization are also used, as can be seen in the image. The decoder is similar to the encoder, with the difference that it has an extra layer to perform multi-head attention on the encoder's output. In addition, self-attention in the decoder is masked so that positions don't attend to subsequent positions. Input and output embeddings are simply embedding layers / word embeddings.

But aside from the above very brief and superficial description of the workings of attention and the transformer module, no further details will be explored here. More important is the BERT model, presented below.

2.3 The language model BERT

BERT [6] stands for Bidirectional Encoder Representations from Transformers. It is a language model trained on data from the BooksCorpus (800M words) and English Wikipedia (2,500M words). The two tasks on which it was trained are MLM (masked language modelling) and NSP (next sentence prediction).

In MLM, random words of a sentence are masked and the model then tries to predict them; 15% of all tokens in each sequence are masked at random. The MLM task helps to train a deep bidirectional representation, which is more powerful than left-to-right training, right-to-left training, or a concatenation of the two (as ELMo does).

In NSP, given a pair of sentences, we try to predict whether the second sentence follows the first one. 50% of the training sentence pairs have the second sentence labeled as IsNext, while the rest have it labeled as NotNext.

Both these tasks are general enough to train bidirectional representations that perform well with minimal fine-tuning on many other tasks, beating plenty of previously set benchmarks. Even tasks that are not similar to MLM and NSP gain increased performance with BERT. Some of the tasks in which BERT set a new record at the time of release are question answering, named entity recognition, multi-genre natural language inference, text classification [7] and so on. It is important to note that BERT is mostly used with a fine-tuning approach rather than a feature-based one.
In fine-tuning for a certain task, usually a single layer is added on top of the existing architecture and all the parameters are learned from the training dataset: that is, both the existing BERT parameters (which change slightly, hence the term fine-tuning) and the new parameters introduced by the added layer. In the feature-based approach (as with ELMo), the pretrained representations are used as inputs for a downstream task, and only the parameters of the added layers are learned. The feature-based approach also works with BERT, but not as well as fine-tuning.
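The MLM masking procedure described above can be sketched in plain Python. This is a simplified illustration of BERT's published 80/10/10 masking rule (the toy vocabulary and token list are made up), not the original implementation:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    # BERT's masking rule: select ~15% of positions; of those,
    # 80% become [MASK], 10% become a random token, 10% stay unchanged.
    rng = random.Random(seed)
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok  # the model is trained to predict the original token here
            r = rng.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = rng.choice(vocab)
            # else: leave the token as-is
    return masked, targets

vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
tokens = ["the", "cat", "sat", "on", "the", "mat"] * 50  # 300 toy tokens
masked, targets = mask_tokens(tokens, vocab)
```

Keeping 10% of the selected tokens unchanged, and replacing 10% with random tokens, prevents the model from only ever seeing [MASK] at prediction positions during training.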

BERT is composed of transformer encoders with bidirectional self-attention heads. BERTbase has 12 such encoders, for around 110 million parameters, and BERTlarge has 24 of them, with about 340 million parameters.

Figure 2: The series of BERT's 12 encoder layers

The tokenizer used is WordPiece. BERT can take as input one or two sequences, depending on the task at hand. The [CLS] token is always placed as the first input token; its representation after going through BERT may be used for classification tasks. After the first sequence, the token [SEP] is placed, which indicates that the first sequence has ended, followed by the second sequence (if it exists). Along with the tokens, BERT requires an input of segments and positions. Segments differentiate between the first and second sequence, and positions encode the absolute position of each token in the sequence. For example:

Figure 3: How the input is formatted when given to BERT

The hidden representations of the tokens become more task-specific the deeper we go into the encoder layers, while in the earlier layers the representations are more general. Sometimes we use the last layer's hidden representation of a token; often we compute an aggregate (sum, mean, concatenation, etc.) of the last few hidden representations. This is mostly trial and error, with no rule of thumb, and varies among tasks.

Still, the reason for BERT's state-of-the-art performance on natural language understanding tasks is not yet well understood. Current research has focused on investigating the relationship between BERT's output and carefully chosen input sequences, on the analysis of internal vector representations through probing classifiers, and on the relationships represented by attention weights.

The impact of BERT has been huge. By October 2020, almost every single English-language query on Google was processed by BERT [8].

2.4 The multilingual version of BERT, mbert

A multilingual version of BERT, supporting 104 languages, was developed and released along with BERT. Each language's training data came from its entire Wikipedia dump. However, the size of the Wikipedia for a given language varies greatly, and therefore low-resource languages may be underrepresented in the neural network model. To counteract this, exponentially smoothed weighting was applied to the data during pre-training. The result is that high-resource languages are undersampled and low-resource ones are oversampled. E.g. in the original distribution, English would be sampled 1000x more often than Icelandic, but after smoothing it is only sampled 100x more often [9].

Since its release, many experiments have been carried out with mbert, like cross-lingual document classification, part-of-speech tagging, NER etc. Zero-shot transfer on these tasks showed some success [10]. It was also shown that mbert representations carry a language-specific and a language-neutral component. The language-specific component can be approximated by taking the mean vector of a sufficient number of word or sentence representations of mbert that belong to the same language. It follows that the language-neutral component of a word or sentence can be obtained by subtracting this language-specific component from its representation [11]. This is the core idea behind one of the two translation methods I tested with mbert.

Results for tasks like language generation and cloze tests (in essence, masked language modelling) were not nearly as encouraging, though (especially for lower-resource languages), and in any case the performance of monolingual models surpasses that of mbert in these kinds of trials. A very interesting paper [12] performs such measurements with mbert for English, German and the Nordic languages. The English and German monolingual BERTs are capable of capturing syntactic phenomena of these languages (although mbert performed well too, on classifying whether an auxiliary is the main auxiliary of its sentence) and perform well at generating coherent, varied language given a subject, even though they are not trained on this objective specifically. This is not the case with mbert, which, when generating language, many times even fails to stick to the language of the seed, much less stay on topic.
Overall, the results show that mbert cannot substitute its monolingual counterparts (however, comparison was only possible with the English and German BERTs, as Nordic ones did not exist, at least at the time the paper was written).

3. ZERO SHOT TRANSLATION WITH mbert

The objective was to recreate the experiments of another paper [13], in which the authors evaluate how well mbert performs at zero-shot translation (from English to other languages). The dataset used is NorthEuraLex, a lexical database that holds the translations of 1016 words (concepts) across 107 languages. The authors tried two different approaches: a template-based one and an analogies-based one.
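A minimal sketch of the analogies idea, assuming we already have word vectors in hand. The vectors below are made up, and the nearest-neighbour lookup stands in for the final step (in the thesis, the shifted vector is instead fed to mbert's MLM head):

```python
import numpy as np

def lang_mean(vectors):
    # Approximate a language's "specific component" as the mean of its word vectors [11].
    return vectors.mean(axis=0)

def analogy_translate(vec, src_vecs, tgt_vecs, tgt_words):
    # Subtract the source-language component, add the target-language component,
    # then look up the closest target-language word.
    shifted = vec - lang_mean(src_vecs) + lang_mean(tgt_vecs)
    dists = np.linalg.norm(tgt_vecs - shifted, axis=1)
    return tgt_words[int(np.argmin(dists))]

# Toy, made-up vectors: an "English" cluster near the origin and a "German"
# cluster shifted by (10, 10), mimicking a language-specific offset.
rng = np.random.default_rng(0)
en_vecs = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]) + rng.normal(0, 0.05, (3, 2))
de_vecs = np.array([[10.0, 10.0], [14.0, 10.0], [10.0, 14.0]]) + rng.normal(0, 0.05, (3, 2))
de_words = ["hund", "katze", "haus"]

guess = analogy_translate(en_vecs[1], en_vecs, de_vecs, de_words)
```

Because the within-language geometry is preserved by the shift, the second English vector lands next to the second German vector.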

16 languages were used for both methods. The authors also visualized the words from different languages (using the t-SNE algorithm), and showed that they cluster according to language identity. In recreating the experiments, 47 languages were used for the template method and 11 for the analogies method, and the results are presented for each language separately, not in aggregate as in the paper. A POS (part-of-speech) analysis was also done for the translated words. It naturally turned out that most of the successfully translated words were either nouns or adjectives. Numbers were also translated mostly correctly; verbs did not have much success, however.

One important limitation, for which there is no straightforward solution (other than retraining mbert with a much larger vocabulary), is that only words that are tokenized by mbert's tokenizer (WordPiece) into a single token can be used for the task. That means a concept in the NorthEuraLex dataset is used only if its English word is tokenized into a single token. And even if this happens for the English word, if its translation in another language does not get tokenized into a single token as well, then that translation is omitted. As a result, low-resource languages may end up with very few tokens (e.g. for Korean, only 42 concepts had both the English word and the Korean translation tokenized into a single token, while for Spanish this number is 452).

3.1 Template based approach

This is a prompt-based method to query mbert and extract a translation for an English word. A prompt is a phrase with a [MASK] token in place of one of its words. By querying a language model with a prompt, we get as output the words in the model's vocabulary that are the most likely candidates for replacing the [MASK] token (according to the model).
Of course, the context (meaning the words of the prompt and their order) plays a big part in the result predicted by the model, so it is obvious that in such methods the choice of the prompt is an important issue. It has been shown that the prompt has a significant effect on the output and on the kind of knowledge that can be extracted from language models. There are several methods for choosing a prompt. It can be created manually, or fancier techniques may be used to generate an adequate prompt: for example, mining-based generation (where the most frequent words from a huge corpus that describe the relation to be queried between a subject and an object are chosen), dependency-based prompts (which use a dependency parser and take syntactic analysis into consideration), paraphrasing, etc. A paper containing research on prompt generation can be found in the references [14].
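The template query and the acc1/acc5/acc10 bookkeeping used in this section can be sketched as follows. This is a minimal illustration; `preds` is a made-up stand-in for the ranked tokens that mbert's MLM head would return, and the English-word filtering is applied to all ranks here rather than only to acc1:

```python
def make_prompt(source_word, language):
    # The template used in this chapter: "The word SOURCE in LANGUAGE is [MASK]."
    return f"The word {source_word} in {language} is [MASK]."

def hit_at_k(ranked_predictions, gold, source_word, k):
    # acc@k check: skip the English source word itself when the model echoes it.
    ranked = [p for p in ranked_predictions if p != source_word]
    return gold in ranked[:k]

prompt = make_prompt("dog", "French")
# Hypothetical ranked MLM output; real predictions come from mbert's MLM head.
preds = ["dog", "chien", "animal", "chat"]
hit1 = hit_at_k(preds, "chien", "dog", 1)
```

A language's acc@k score is then simply the fraction of its surviving single-token concepts for which `hit_at_k` is true.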

The template used for translating an English source word to a target language is the following:

The word SOURCE in LANGUAGE is [MASK].

After feeding this template to mbert, the last hidden representation of the [MASK] token is given as input to mbert's MLM (masked language model) head to predict the words suitable to replace the [MASK] token. Specifically, the code loops over every word in the dataset. If the English word does not get tokenized to a single token, it is skipped. Otherwise, for every one of the 47 languages, if the translation in that language gets tokenized into a single token, the template is used for that language and the top 10 predicted words for the [MASK] token are retrieved.

The metrics I used are acc1, acc5 and acc10. They indicate whether the correct translation is the first predicted word, in the top 5 predicted words, or in the top 10 predicted words, respectively. In some cases the first predicted word is the English word itself; when that happens, the second predicted word is used to calculate acc1. For example, if a language ends up with 60 words (as already mentioned, only single-token words are kept) and 40 of them are in the top ten predictions when the template was used for those words, then the acc10 of that language is 40/60.

Below are the results for each language. Along with the acc1, acc5 and acc10 scores, there is also a listing of how many tokens were used:

Table 1: Template results

Language      #tokens  acc1   acc5   acc10
Turkish       199      0.336  0.537  0.603
Russian       224      0.638  0.821  0.879
Welsh         143      0.174  0.293  0.335
Farsi         202      0.103  0.396  0.519
Bulgarian     172      0.459  0.633  0.703
Tatar         103      0.058  0.223  0.281

Norwegian   336      0.357  0.607  0.672
Ukrainian   150      0.653  0.813  0.853
Ket         288      0.0    0.0    0.03
Kannada     40       0.0    0.0    0.075
Albanian    156      0.121  0.275  0.346
Japanese    214      0.065  0.154  0.214
Korean      42       0.238  0.452  0.476
Portuguese  402      0.495  0.651  0.713
Latvian     107      0.289  0.401  0.439
Tamil       41       0.024  0.170  0.365
Dutch       347      0.495  0.726  0.783
Spanish     452      0.530  0.741  0.803
Hungarian   259      0.308  0.459  0.528
French      429      0.538  0.743  0.811
Danish      318      0.396  0.638  0.685
Arabic      191      0.157  0.350  0.486
Czech       196      0.474  0.627  0.678
Italian     352      0.568  0.755  0.840
Hebrew      158      0.329  0.481  0.594
Hindi       88       0.238  0.375  0.420
Swedish     347      0.314  0.561  0.631
Armenian    63       0.428  0.619  0.746
Georgian    57       0.298  0.578  0.649
Bengali     56       0.017  0.178  0.267

Catalan     405      0.439  0.651  0.708
Lak         21       0.0    0.0    0.0
Finnish     143      0.433  0.608  0.657
Polish      163      0.515  0.631  0.687
Romanian    243      0.263  0.477  0.559
Breton      139      0.151  0.273  0.309
Malayalam   17       0.0    0.0    0.0
Latin       155      0.354  0.470  0.496
Lithuanian  77       0.311  0.428  0.506
Telugu      43       0.0    0.0    0.023
German      436      0.167  0.293  0.321
Croatian    189      0.433  0.582  0.629
Estonian    194      0.319  0.438  0.505
Greek       56       0.464  0.660  0.660
Slovak      186      0.451  0.618  0.688
Basque      132      0.227  0.356  0.454
Irish       132      0.090  0.189  0.227

As expected, languages with the same or a similar alphabet to English (the Latin-script ones) perform better than most. What was not expected is that the Slavic languages Russian and Ukrainian perform the best. Hebrew, Georgian and Armenian perform well too. Low-resource languages (Lak, Tamil, etc.) and Asian languages (Japanese, Korean, Bengali, etc.) perform the worst. In addition, among the Germanic languages, German performance was subpar compared to Dutch. In many cases the list of the top 10 predicted words contains the correct translation but in another gender (for nouns and adjectives), or contains synonyms of it. These do not contribute to the calculated accuracies, so if they were taken into consideration, some accuracies would improve. This mostly affects acc1, since the word matching the reference translation will usually still appear in the top 10 tokens even if synonyms or variants of it appear before it.
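The evaluation just described can be sketched in a few lines. This is a minimal sketch: the call to mBERT's MLM head is left abstract, and the demo prediction lists are fabricated for illustration; only the template construction and the acc1/acc5/acc10 bookkeeping, including the rule that skips an echoed English source word, are spelled out:

```python
def make_prompt(word, language):
    # The template used in this thesis.
    return f"The word {word} in {language} is [MASK]."

def hit_at_k(preds, gold, source_word, k):
    # If the model's first guess merely echoes the English source word,
    # drop it so the next candidate counts (the acc1 rule described above).
    preds = list(preds)
    if preds and preds[0] == source_word:
        preds = preds[1:]
    return gold in preds[:k]

def accuracies(examples):
    """examples: list of (top-10 predictions, gold translation, English word)."""
    n = len(examples)
    return {
        f"acc{k}": sum(hit_at_k(p, g, s, k) for p, g, s in examples) / n
        for k in (1, 5, 10)
    }

# Tiny fabricated example (not real model output):
demo = [
    (["house", "maison", "chat"], "maison", "house"),  # echo dropped, acc1 hit
    (["chien", "chat", "oiseau"], "chat", "cat"),      # acc5 hit, not acc1
]
print(accuracies(demo))  # -> {'acc1': 0.5, 'acc5': 1.0, 'acc10': 1.0}
```

In the real pipeline the prediction lists would come from feeding `make_prompt(word, language)` to mBERT and decoding the top 10 MLM logits at the [MASK] position.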

The code and files for the template method can be found here.

3.2 Analogies based approach

The second method used for translation revolves around the idea that a word representation in mBERT can be broken into a language-specific and a language-neutral component. Language representations (the language-specific components) are calculated for each language. The translation of a word is done by taking mBERT's output embedding for the word, subtracting from it the source language (English) representation and adding to it the target language representation. The result is given to mBERT's MLM head and the top 10 predicted words are retrieved. The same metrics were used as in the template approach (acc1, acc5, acc10). The language representations for 12 languages (English and 11 more) are calculated using the TED talks dataset. This is done for a language as follows: 5000 random sentences from the TED talks dataset are chosen. From each sentence a random word is chosen (as long as its token does not start with #, meaning that it is a whole word and not part of a word). Note that this way the representations of the randomly chosen words are affected by their context, as they are part of a sentence. The mBERT representations of the chosen words are averaged to get a vector representing the language. After obtaining those language representations, the process is the same as in the template method (looping over the words in the dataset). The difference is that instead of using the template when making predictions, the procedure described above is performed (subtracting and adding language representations to the mBERT representation of an English word). The results are the following:

Table 2: Analogies results

Language  #tokens  acc1   acc5   acc10
Greek     56       0.035  0.178  0.803
Russian   224      0.017  0.174  0.696
Spanish   452      0.024  0.168  0.548
Farsi     202      0.024  0.163  0.544

French    429      0.013  0.163  0.510
Italian   352      0.019  0.159  0.497
Arabic    191      0.0    0.125  0.403
Hebrew    158      0.031  0.139  0.354
Japanese  214      0.009  0.140  0.303
Korean    42       0.0    0.071  0.261
Turkish   199      0.015  0.065  0.100

Results were worse than with the template method for almost every one of the tested languages, except for Greek, which reached an 80% acc10 score, and Farsi and Japanese, which saw minor improvements in their acc10 scores. It was also observed that results varied from acc5 to acc10 much more than they did with the template method, where they mostly varied from acc1 to acc5. This means that with this method the correct translation is likely to be among the last 5 of the 10 predicted tokens. Generally, the acc1 and acc5 scores are very poor. Moreover, a few examples were run for translation between languages other than English, with some limited success. The method was able to translate some words correctly from Greek to Russian and from French to Italian, but it did not work well in other cases, such as translating from a source language into Turkish or Hebrew (this might well be a tokenizer issue, since the translations in those languages may need more than one token to be represented by mBERT, though relatively simple, common words were chosen). The code and files for the analogies method can be found here.

3.3 POS analysis

Along with the evaluation of the translation task, a part-of-speech analysis of the correctly translated tokens (acc10) was done. As already stated above, nouns, adjectives and numbers had better performance. Other than that, the part-of-speech analysis was not particularly illuminating; nevertheless, the results are presented here. The tags used are:

NN: noun, singular
NNS: noun, plural
NNP: proper noun, singular
VB: verb
VBG: verb, gerund
VBD: verb, past tense
VBN: verb, past participle
JJ: adjective

JJS: adjective, superlative
IN: preposition
RB: adverb
PRP: personal pronoun
CD: cardinal number
WP: wh-pronoun
WRB: wh-adverb
DT: determiner
CC: coordinating conjunction
MD: modal

Template part of speech results

The results of the part-of-speech analysis for the template method are presented in the tables below. An N/A value means that there were no tokens in that language carrying that specific tag.

Table 3: Template POS results

Tag Greek Welsh Ket Tamil Bengali Armenian
NN 24/29 32/93 1/226 9/22 9/24 21/22
NNS N/A 0/1 0/6 1/1 0/1 N/A
NNP N/A 0/12 0/6 N/A 3/9 N/A
VB N/A 0/2 0/10 N/A 0/5 N/A
VBG N/A 0/2 N/A N/A N/A N/A
VBD N/A N/A N/A N/A N/A N/A
VBN N/A N/A 0/1 N/A N/A N/A
JJ 0/1 8/11 0/14 1/4 1/3 11/12
JJS N/A N/A 0/1 N/A N/A N/A
IN 1/2 0/2 0/2 0/1 0/3 1/3
RB 5/11 3/7 0/10 0/3 1/4 9/12
PRP N/A N/A 0/4 0/1 0/1 0/3
CD 4/6 3/4 0/4 4/6 1/3 4/4
WP N/A N/A N/A N/A 0/1 1/2
WRB 0/3 0/2 0/2 N/A N/A 0/2
DT 0/1 1/1 0/1 0/1 N/A 0/1
CC 2/2 1/2 0/1 0/2 0/2 0/2
MD N/A N/A N/A N/A N/A N/A

Tag Estonian Slovak Romanian Catalan Lithuanian Irish
NN 55/133 82/122 78/156 202/292 20/48 18/92
NNS 0/1 N/A 1/1 3/4 1/1 N/A
NNP 12/12 12/12 12/14 11/12 N/A 0/1
VB N/A 2/4 7/12 15/18 N/A 0/3
VBG N/A N/A 1/1 1/2 N/A 0/1
VBD N/A N/A N/A N/A N/A N/A
VBN 1/1 1/1 N/A 0/1 N/A 0/1
JJ 9/11 8/9 12/20 21/29 N/A 4/14
JJS 0/1 N/A N/A N/A N/A N/A
IN 4/5 4/5 3/4 6/6 3/5 3/4
RB 9/15 8/17 11/17 12/21 6/9 1/6
PRP 1/5 2/5 1/3 1/3 2/5 0/2
CD 4/4 4/5 6/7 8/9 2/3 2/4
WP 1/2 1/1 0/2 2/2 1/2 0/1
WRB 0/1 2/2 2/2 2/2 2/2 N/A
DT 1/1 0/1 1/1 1/1 N/A 1/1
CC 1/2 2/2 1/2 2/2 2/2 1/2
MD N/A N/A N/A 0/1 N/A N/A

Tag Kannada Latvian Dutch Portuguese Czech Farsi
NN 1/16 24/67 182/238 207/284 92/135 73/125
NNS N/A N/A 3/5 3/4 2/2 2/3
NNP N/A 5/5 12/12 2/14 2/3 0/11
VB N/A 0/1 9/13 13/19 2/3 0/3
VBG N/A N/A N/A 1/2 N/A 0/1
VBD N/A N/A 1/1 N/A N/A 0/1
VBN N/A N/A 1/1 1/1 1/1 N/A
JJ 0/4 0/4 22/28 25/33 7/10 12/18
JJS N/A N/A 1/1 N/A N/A 0/1
IN 0/4 3/3 5/6 3/3 3/5 5/7
RB 0/6 7/12 17/22 13/20 9/18 3/14
PRP 0/1 2/4 3/3 4/5 2/5 0/5
CD 2/4 3/4 9/9 9/9 7/7 8/8
WP N/A 1/2 2/2 2/2 2/2 0/2
WRB 0/2 1/2 2/2 2/2 2/3 N/A
DT 0/1 0/1 1/1 0/1 N/A 1/1
CC 2/2 1/2 1/2 2/2 2/2 1/2
MD N/A N/A 1/1 0/1 N/A N/A

Tag Malayalam Russian Korean Telugu Lak Croatian
NN 0/3 126/141 14/29 0/21 0/10 79/123
NNS 0/1 3/4 N/A N/A N/A 2/2
NNP N/A 12/12 N/A 0/2 N/A 1/1
VB N/A 2/2 0/2 N/A 0/3 0/4
VBG N/A N/A N/A N/A N/A N/A
VBD N/A N/A N/A N/A N/A N/A
VBN N/A N/A N/A N/A N/A 0/1
JJ 0/3 13/15 4/5 0/2 N/A 4/12
JJS N/A N/A N/A N/A N/A N/A
IN 0/1 4/4 0/1 1/4 0/1 4/6
RB 0/4 18/25 1/2 0/6 0/1 12/18
PRP 0/1 4/5 0/1 0/1 0/2 2/5
CD 0/2 8/8 N/A 0/4 0/1 9/9
WP N/A 2/2 N/A N/A 0/1 2/2
WRB N/A 3/3 N/A N/A N/A 2/2
DT 0/1 1/1 1/1 0/1 0/1 0/1
CC 0/1 1/2 0/1 0/1 N/A 1/2
MD N/A N/A N/A 0/1 0/1 0/1

Tag Basque Italian Finnish Spanish French Georgian
NN 39/88 209/250 48/79 256/329 246/309 13/17
NNS 1/2 4/4 1/1 4/4 4/4 0/1
NNP N/A 13/13 N/A 15/15 13/13 10/11
VB 4/10 7/10 2/15 16/21 13/17 N/A
VBG 0/1 1/1 N/A 3/3 2/2 N/A
VBD N/A 1/1 N/A 1/1 1/1 N/A
VBN N/A N/A 1/1 0/1 0/1 N/A
JJ 3/5 23/26 9/10 30/34 28/34 6/7
JJS 0/1 N/A 1/1 N/A 1/1 N/A
IN 2/3 6/8 5/5 5/6 4/5 0/1
RB 6/9 12/17 14/23 14/18 16/19 4/10
PRP 0/3 4/5 2/5 3/3 4/5 0/3
CD 3/5 9/9 8/8 9/9 9/9 3/3
WP 0/1 1/1 1/1 2/2 2/2 N/A
WRB 0/1 3/3 0/1 N/A 3/3 N/A
DT 0/1 1/1 1/1 1/1 0/1 0/1
CC 2/2 2/2 1/2 2/2 2/2 1/2
MD N/A 0/1 N/A 0/1 0/1 0/1

Tag Japanese Breton Tatar Polish Albanian Hebrew
NN 45/198 28/94 17/59 72/109 25/94 63/105
NNS 0/2 2/3 1/3 0/1 0/1 1/2
NNP N/A 0/5 10/11 1/1 11/13 N/A
VB 0/1 0/5 0/3 N/A 0/4 1/5
VBG 0/1 0/1 N/A N/A N/A 0/2
VBD N/A N/A N/A N/A N/A N/A
VBN N/A N/A 0/1 1/1 0/1 N/A
JJ N/A 5/10 1/6 7/9 5/14 9/14
JJS 0/1 N/A N/A N/A N/A N/A
IN 0/1 2/4 0/4 4/5 0/3 3/6
RB 1/5 2/5 0/8 15/21 5/8 7/11
PRP 0/2 0/4 0/4 1/4 3/5 3/4
CD 0/1 3/4 0/2 6/6 3/6 3/4
WP 0/1 0/1 N/A 1/1 N/A 1/2
WRB N/A 0/1 N/A 1/2 0/3 N/A
DT N/A N/A 0/1 1/1 0/1 1/1
CC 0/1 1/2 N/A 1/2 1/2 2/2
MD N/A N/A 0/1 N/A 1/1 N/A

Tag Hungarian Bulgarian Norwegian Turkish Hindi Latin
NN 78/176 82/112 144/230 73/133 19/47 61/107
NNS 2/2 2/3 3/3 1/3 1/1 N/A
NNP 12/12 12/12 12/12 13/13 4/8 0/6
VB 2/3 N/A 6/15 0/1 0/2 1/2
VBG 1/1 N/A N/A N/A N/A 0/1
VBD N/A N/A N/A N/A N/A N/A
VBN N/A 1/1 1/1 N/A N/A N/A
JJ 18/22 5/9 18/23 17/20 4/7 2/6
JJS N/A N/A 0/1 N/A N/A N/A
IN 4/6 2/6 5/6 2/3 1/2 5/6
RB 9/16 10/15 20/24 8/11 1/8 4/14
PRP 2/5 1/3 2/4 1/4 1/4 1/4
CD 5/7 5/6 8/9 5/6 4/5 3/4
WP 2/2 0/1 2/2 0/2 0/1 N/A
WRB 0/1 0/1 2/2 0/1 N/A 0/2
DT 1/1 0/1 1/1 0/1 0/1 0/1
CC 1/2 1/2 2/2 0/1 2/2 0/1
MD 0/1 N/A 0/1 N/A N/A 0/1

Tag German Ukrainian Swedish Arabic Danish
NN 50/306 86/96 137/245 66/132 141/220
NNS 2/4 2/3 3/3 0/1 3/3
NNP 0/13 N/A 12/12 6/6 12/12
VB 12/17 0/1 4/9 2/12 6/13
VBG 0/2 N/A N/A N/A N/A
VBD 1/1 N/A N/A N/A N/A
VBN 1/1 N/A 1/1 N/A 1/1
JJ 24/29 9/9 22/27 3/11 16/21
JJS 0/1 N/A N/A N/A 0/1
IN 8/10 3/4 5/5 2/4 5/5
RB 22/29 14/20 19/24 6/14 17/20
PRP 4/5 4/5 2/5 1/2 2/5
CD 9/9 5/5 8/8 4/4 8/9
WP 2/2 1/2 2/2 0/2 2/2
WRB 3/3 1/2 1/2 N/A 2/2
DT 0/1 1/1 1/1 1/1 1/1
CC 2/2 2/2 2/2 2/2 2/2
MD 0/1 N/A 0/1 N/A 0/1

Analogies part of speech results

Table 4: Analogies POS results

Tag Greek Arabic French Korean Farsi Russian
NN 24/29 48/132 162/309 7/29 71/125 98/141
NNS N/A 0/1 1/4 N/A 0/3 3/4
NNP N/A 6/6 8/13 N/A 6/11 7/12
VB N/A 2/12 5/17 0/2 0/3 2/2
VBG N/A N/A 2/2 N/A 0/1 N/A
VBD N/A N/A 0/1 N/A 0/1 N/A
VBN N/A N/A 0/1 N/A N/A N/A
JJ 0/1 3/11 17/34 2/5 8/18 11/15
JJS N/A N/A 1/1 N/A 0/1 N/A
IN 1/2 2/4 2/5 0/1 4/7 2/4
RB 10/11 10/14 10/19 1/2 10/14 18/25
PRP 0/1 0/2 1/5 0/1 4/5 4/5
CD 5/6 3/4 5/9 N/A 5/8 7/8
WP N/A 1/2 1/2 N/A 0/2 1/2
WRB 2/3 N/A 1/3 N/A N/A 2/3
DT 1/1 1/1 0/1 1/1 1/1 1/1
CC 2/2 2/2 2/2 0/1 2/2 1/2
MD N/A N/A 0/1 N/A N/A N/A

Tag Italian Japanese Spanish Turkish Hebrew
NN 137/250 64/198 189/329 13/133 32/105
NNS 1/4 0/2 1/4 0/3 1/2
NNP 4/13 N/A 5/15 0/13 N/A
VB 3/10 0/1 9/21 0/1 2/5
VBG 0/1 0/1 2/3 N/A 0/2
VBD 1/1 N/A 1/1 N/A N/A
VBN N/A N/A 0/1 N/A N/A
JJ 14/26 N/A 17/34 5/20 3/14
JJS N/A 1/1 N/A N/A N/A
IN 3/8 0/1 3/6 0/3 2/6
RB 8/17 1/5 10/18 0/11 7/11
PRP 0/5 1/2 0/3 1/4 2/4
CD 2/9 1/1 5/9 1/6 3/4
WP 0/1 0/1 1/2 0/2 1/2
WRB 1/3 N/A 1/2 0/1 N/A
DT 1/1 N/A 0/1 0/1 1/1
CC 1/2 0/1 2/2 0/1 2/2
MD 0/1 N/A 0/1 N/A N/A
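The vector arithmetic behind the analogies approach of Section 3.2 can be sketched with plain NumPy. This is a sketch under simplifying assumptions: the 4-dimensional toy vectors below stand in for mBERT's 768-dimensional outputs, and the final MLM-head lookup that turns the shifted vector into the top-10 tokens is omitted:

```python
import numpy as np

def language_representation(sampled_word_vectors):
    # Average the mBERT vectors of randomly sampled whole-word tokens
    # (tokens not starting with '#') to represent a language.
    return np.mean(sampled_word_vectors, axis=0)

def shift_language(word_vec, src_repr, tgt_repr):
    # Remove the source-language component and add the target one;
    # in the real pipeline the result is decoded by mBERT's MLM head.
    return word_vec - src_repr + tgt_repr

# Toy demonstration: pretend each word vector is a language-neutral
# "meaning" plus a fixed language offset.
rng = np.random.default_rng(0)
meaning = rng.normal(size=4)
en_offset = np.array([5.0, 0.0, 0.0, 0.0])
el_offset = np.array([0.0, 5.0, 0.0, 0.0])

en_repr = language_representation([m + en_offset for m in rng.normal(size=(50, 4))])
el_repr = language_representation([m + el_offset for m in rng.normal(size=(50, 4))])

shifted = shift_language(meaning + en_offset, en_repr, el_repr)
# `shifted` now lies close to the same meaning carrying the Greek offset.
```

The averaging over 50 random samples cancels most of the meaning-specific variation, leaving mostly the language offset, which is exactly what makes the subtract-and-add step work.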

4. VISUALIZATION

In this chapter, the tokens/words of each language are plotted in a two-dimensional space. For each word, two components that capture the variation of the corresponding vector (each token is represented by mBERT with a 768-dimensional vector) are extracted using the t-SNE algorithm. These components are then used for the plot. By visualizing all the words in this manner, it is possible to draw conclusions about words belonging to the same language or to similar ones. Words of the same language are clustered together, and clusters of similar languages tend to end up close to each other. Therefore mBERT's way of representing the words of different languages is sufficient to capture language identity, even for low-resource languages with only a handful of tokens in mBERT's vocabulary. After describing the workings of the t-SNE algorithm and the process followed to plot the tokens (the [MASK] token from the template approach is used), figures of each language's plotted tokens are presented along with some commentary.

4.1 The t-SNE algorithm

t-distributed Stochastic Neighbor Embedding (t-SNE) is an unsupervised non-linear technique primarily used for visualizing high-dimensional data [15]. It is similar in purpose to the PCA algorithm (principal component analysis), which tries to find the dimensions (components) along which the dataset shows the most variance. Basically, t-SNE works by measuring the similarities of datapoints in the high-dimensional space. It centers a Gaussian distribution over each datapoint, measures the density of all the other points under that distribution and renormalizes, obtaining a set of probabilities for all datapoints that are proportional to the computed similarities. It then uses a Student-t distribution with one degree of freedom, which yields an analogous set of probabilities in the low-dimensional space.
Finally, it tries to make the probabilities in the low-dimensional space reflect those in the high-dimensional space as closely as possible. By running t-SNE on high-dimensional representations/vectors, it is possible to obtain two or more components that sufficiently capture the patterns of the original high-dimensional representations. These components can then be used for visualization or other purposes.
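The projection step used for the plots below can be sketched as follows, assuming scikit-learn's `TSNE` implementation; random vectors stand in for the real 768-dimensional mBERT [MASK] representations, since producing those requires running the model:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-ins for real mBERT [MASK] representations (768-dimensional each).
rng = np.random.default_rng(0)
reps = rng.normal(size=(60, 768))
labels = ["Greek"] * 30 + ["Latin"] * 30  # one language label per word

# Project to 2 components; perplexity must stay below the sample count.
coords = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(reps)
print(coords.shape)  # one (x, y) point per word, ready for a scatter plot
```

In the actual experiments, each language's points would then be drawn with its own color so that clusters and overlaps between languages become visible.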

4.2 Representation plots

The t-SNE algorithm was used to plot, in 2 dimensions, the last-layer [MASK] token representations of the words in the dataset (single tokens only) from the template method, for all languages. Essentially, t-SNE took as input the 768-dimensional representations and returned the two components, which were used for the plots. The representations clearly cluster according to language, although there are cases in which the representations of different languages overlap. This happens with similar languages (e.g., Russian and Ukrainian, Spanish and Portuguese, etc.) and was expected behaviour. There are 4 cases:

- The words of a language form a clearly separate cluster, and this cluster is not close to other clusters.
- They form a separate cluster, but it is close to the clusters of similar languages (like Italian and French).
- A cluster is formed by many languages whose words overlap.
- A cluster is formed by many languages, but their words do not overlap.

The plots of each language's words are presented one by one below, along with observations:

Figure 4: Estonian

Figure 5: Latvian
Figure 6: Lithuanian

These are the plots for the languages of the Baltic states. Latvian and Lithuanian form a cluster at the top left, although their words do not overlap: half the cluster is Lithuanian, the other half Latvian. Estonian forms a cluster on its own.

Figure 7: French
Figure 8: Italian

Figure 9: Spanish
Figure 10: Portuguese

Figure 11: Catalan
Figure 12: Romanian

Figure 13: Latin

The clusters of the Latin languages are close to one another. Spanish and Portuguese overlap, but the others have separate clusters, with the exception of some words common to several or all Latin languages, as can be seen from the plots.

Figure 14: Bengali

Figure 15: Hindi
Figure 16: Kannada

Figure 17: Lak
Figure 18: Malayalam

Figure 19: Tamil
Figure 20: Telugu

Here are the languages spoken in the Indian subcontinent: Bengali, Hindi, Kannada, Lak, Tamil, Telugu and Malayalam. They all overlap with each other and form a cluster which is very close to another cluster (on the right) formed by the overlapping Farsi and Arabic.

Figure 21: Arabic
Figure 22: Farsi

Arabic and Farsi overlap and form a cluster, as mentioned above.

Figure 23: Danish
Figure 24: Finnish

Figure 25: Norwegian
Figure 26: Swedish

For the languages of the Nordic countries, there are 2 separate clusters which lie close together: one for Finnish and one formed by the overlapping Danish, Swedish and Norwegian. These clusters are also close to the Estonian cluster.

Figure 27: Bulgarian
Figure 28: Russian

Figure 29: Ukrainian

Next we have 3 Slavic languages, Bulgarian, Russian and Ukrainian, that make up one big cluster. Russian overlaps with both Ukrainian and Bulgarian, but the latter two do not overlap much with one another. This cluster is close to Romanian, Greek and Armenian.

Figure 30: Croatian
Figure 31: Czech

Figure 32: Slovak
Figure 33: Polish

Here are the plots of 4 other Slavic languages: Croatian, Czech, Slovak and Polish. The first 3 overlap and form a cluster, while Polish has its own cluster, slightly above the others.

Figure 34: Armenian
Figure 35: Greek

The clusters of Greek and Armenian are close to each other. They are both close to the Slavic cluster of Russian, Bulgarian and Ukrainian, as well as to the Farsi/Arabic cluster. Also, Greek is close to Latin and Romanian, while Armenian is close to the languages of India.

Figure 36: Dutch
Figure 37: German

Dutch and German form clearly separate clusters and, unsurprisingly, they are close to one another, since both are Germanic languages.

Figure 38: Tatar
Figure 39: Turkish

Tatar and Turkish form a cluster together, though their words do not overlap: the upper part is Tatar, the lower Turkish.

Figure 40: Basque
Figure 41: Breton

Figure 42: Hungarian
Figure 43: Japanese

Figure 44: Korean
Figure 45: Irish

Figure 46: Welsh

Basque, Breton, Hungarian, Japanese, Korean, Irish and Welsh form their own clusters, with no other languages especially close to them. This is to be expected, since they are unique compared to all the others.

Figure 47: Hebrew
Figure 48: Georgian

Figure 49: Albanian
Figure 50: Ket

Hebrew, Georgian, Ket and Albanian have separate clusters with no overlap from other languages. Strangely enough, the Georgian and Hebrew clusters are close, and the same goes for Ket and Albanian, although they share no common alphabet or roots. Fun fact: the number of native Ket speakers as of 2020 is only 20 people! And here are all the languages together:

Figure 51: All languages

The code and files for the visualization can be found here.

5. CONCLUSION

The main object of this thesis was recreating the It's not Greek to mBERT paper [13], extending it to more languages and doing a more comprehensive visualization. Much kudos to the authors of the paper for doing such interesting research. All in all, for both methods the results on the translation task were pretty decent, even for many lower-resource languages. The analogies method was also capable, to some extent, of translation between languages other than English. However, a serious limitation is mBERT's limited vocabulary. The t-SNE plots showed that language identity can be captured by the mBERT representations, and aside from a few unexpected patterns (e.g., the Georgian and Hebrew representations being close, though there could be syntactic similarities; the writer speaks neither of these languages), everything else confirmed this.

APPENDIX

Analogies translations in other languages

As mentioned, some tests were performed with translation between languages other than English, using the analogies method. Here are some examples (they can also be found in the code for the analogies method) along with the results, i.e. the top ten predicted tokens:

Table 5: Translations between other languages

word: padre (Spanish → French)
predictions: ['fratello', 'πατερα', 'ojciec', 'vater', 'отца', 'padres', 'отец', 'madre', 'father', 'padre']
correct translation: père (not found)

word: οικογένεια (Greek → Russian)
predictions: ['famiglia', 'عايلة', '##משפחת', 'семьи', 'семеиство', 'परिवार', 'породице', 'семья', 'משפחת', 'οικογενεια']
correct translation: семьи (found)

word: fille (French → Italian)
predictions: ['sœur', 'fiica', 'hija', 'κορη', 'tochter', 'filles', 'femme', 'filla', 'figlia', 'fille']
correct translation: figlia (found)

word: ένα (Greek → Farsi)
predictions: ['ਇਕ', 'एक', 'এক', 'یکی', 'ενος', 'μια', 'ενας', 'εναν', 'یک', 'ενα']
correct translation: یک (found)

word: bueno (Spanish → Japanese)
predictions: ['なお', 'андреевич', 'しい', '檎', '爺', '険', 'buenas', '##וטין', 'bueno']
correct translation: しい (found)

word: nuit (French → Turkish)
predictions: ['journee', '##לילה', 'nacht', 'soiree', 'noche', 'ночь', 'noite', 'nocy', 'notte', 'nuit']
correct translation: gece (not found, though mBERT's vocabulary might not contain it)

mBERT's vocabulary for Greek

mBERT's limited vocabulary is a big obstacle when performing such tasks. For Greek, only 56 out of more than 1000 English words had their Greek translation tokenized into a single token. This means that only 56 Greek translations appear as-is in mBERT's vocabulary. To give the reader an idea of the vocabulary's contents and the limitations it imposes, the Greek tokens that are part of it are shown below. There are only 491 of them, including the letters of the Greek alphabet:

[' ', 'α', 'β', 'γ', 'δ', 'ε', 'ζ', 'η', 'θ', 'ι', 'κ', 'λ', 'μ', 'ν', 'ξ', 'ο', 'π', 'ρ', 'ς', 'σ', 'τ', 'υ', 'φ', 'χ', 'ψ', 'ω', 'ϕ', 'του', 'και', 'το', 'της', 'την', 'απο', 'με', 'τον', 'να', 'που', 'στην', 'των', 'σε', 'στο', 'για', 'ειναι', 'τα', 'οι', 'τη', 'τους', 'ηταν', 'στη', 'ως', 'στις', 'τις', 'μια', 'ενα', 'στον', 'στα', 'δεν', 'οτι', 'κατα', 'αλλα', 'εχει', 'ειχε', 'μετα', 'δυο', 'ενω', 'οπως', 'οποια', 'θα', 'επισης', 'αυτο', 'οπου', 'καθως', 'οταν', 'οποιο', 'μεχρι', 'χρονια', 'αυτη', 'στους', 'μεταξυ', 'προς', 'εχουν', 'πρωτη', 'εγινε', 'αν', 'θεση', 'πιο', 'μπορει', 'ενας', 'ομως', 'ομαδα', 'πολη', 'περιοχη', 'οποιος', 'ονομα', 'αιωνα', 'μονο', 'πολυ', 'περιοδο', 'καθε', 'μαζι', 'βρισκεται', 'συμφωνα', 'διαρκεια', 'αργοτερα', 'ετσι', 'τοτε', 'μερος', 'ενος', 'ειχαν', 'μεσα', 'πως', 'περιπου', 'σημερα', 'εναν', 'πρωταθλημα', 'λογω', 'β ', 'υπο', 'α ', 'εως', 'πρωτο', 'πριν', 'φορα', 'πανω', 'κυριως', 'αθηνα', 'νεα', 'ελλαδα', 'ωστοσο', 'μεγαλη', 'αυτα', 'ταινια', 'συνεχεια', 'χωρις', 'εργο', 'εθνικη', 'σειρα', 'επι', 'αποτελει', 'μιας', 'χωριο', 'ιστορια', 'τμημα', 'γεννηθηκε', 'εν', 'ομαδες', 'υπαρχουν', 'ιδια', 'πολλες', 'υπαρχει', 'εκει', 'σαν', 'φορες', 'τελος', 'αποτελεσμα', 'γιος', 'εκτος', 'τρεις', 'μεσω', 'κατω', 'πολης', 'κοινοτητα', 'μελος', 'πολλα', 'συστημα', 'αλλες', 'ετων', 'αγωνες', 'εποχη', 'αφου', 'αυτες', 'εδρα', 'αρχικα', 'χωρα', 'ωστε', 'εκ', 'τελικα', 'ακομα', 'ολα', 'μη', 'μεγαλο', 'εκανε', 'τοσο',
'ξεκινησε', 'κοντα', 'δε', 'τιτλο', 'κατηγορια', 'παρα', 'ιδιο', 'δηλαδη', 'αγιου', 'υπηρξε', 'ακομη', 'πρεπει', 'αρχες', 'κεντρο', 'αναφερεται', 'ρολο', 'ελληνικη', 'χρηση', 'οσο', 'αναμεσα', 'ελαβε', 'εργα', 'κι', 'βαση', 'θανατο', 'προεδρος', 'γυρω', 'χρονο', 'ηλικια', 'μορφη', 'οκτωβριου', 'οποιες', 'συνηθως', 'κυβερνηση', 'πηρε', 'ομαδας', 'μαιου', 'δευτερη', 'μελη', 'πολεμου', 'περι', 'πολεμο', 'ζωη', 'συχνα', 'ιουλιου', 'πρωτος', 'κατοικους', 'οχι', 'αυγουστου', 'τελευταια', 'εκλογες', 'μαρτιου', 'κορη', 'βασιλια', 'σεπτεμβριου', 'πλεον', 'απριλιου', 'ιουνιου', 'περιοχες', 'πεθανε', 'ονομασια', 'μαχη', 'αυτος', 'αλμπουμ', 'γινεται', 'ελλαδας', 'πανεπιστημιο', 'επειδη', 'γλωσσα', 'βορεια', 'μπορουν', 'σχετικα', 'ιανουαριου', 'νοεμβριου', 'πρωην', 'ελληνας', 'εκδοση', 'νοτια', 'οικογενεια', 'αλλη', 'αγωνα', 'αθηνων', 'κυπελλο', 'ιδιαιτερα', 'ειτε', 'στοιχεια', 'γεγονος', 'κυκλοφορησε', 'θεωρειται', 'δεκεμβριου', 'δημος', 'ντε', 'γ ', 'δευτερο', 'ολες', 'σχεδον', 'γινει', 'συμμετειχε', 'αλλων', 'μεγαλυτερη', 'παιδια', 'νεο', 'εκεινη', 'χωρας', 'βασιλιας', 'μεγαλυτερο', 'μας', 'αρχισε', 'χρησιμοποιειται', 'εναντιον', 'ενωση', 'διαφορα', 'τρια', 'ιδρυθηκε', 'εκκλησια', 'οποιοι', 'σχεση', 'φεβρουαριου', 'τροπο', 'ηπα', 'πρωτα', 'κομμα', 'παραδειγμα', 'περισσοτερο', 'μουσικη', 'διαφορες', 'βραβειο', 'κοσμο', 'κανει', 'επειτα', 'περιοχης', 'καποια', 'μετρα', 'εθνικης', 'ιδιος', 'αρκετα', 'γνωστο', 'ετος', 'σχολη', 'χωρες', 'χωρο', 'γνωστη', 'πολιτικη', 'πεντε', 'αυτου', 'δημου', 'τραγουδι', 'δυναμεις', 'δημο', 'πατερας', 'λιγο', 'παραγωγη', 'αυτων', 'ιωαννης', 'σελ', 'δεκαετια', 'ηδη', 'κερδισε', 'σημειο', 'διαστημα', 'πατερα', 'αριθμος', 'σεζον', 'ζωης', 'ευρωπη', 'παντρευτηκε', 'εταιρεια', 'παραλληλα', 'επιτυχια', 'μηκος', 'ανελαβε', 'μολις', 'τελη', 'ολη', 'ατομα', 'βιβλιο', 'γαλλια', 'βρισκονται', 'απογραφη', 'κατι', 'κατοικοι', 'ορος', 'αλλο', 'χιλιομετρα', 'ποτε', 'εποχης', 'σωμα', 'επαρχια', 'περιλαμβανει', 'αυτον', 'πληθυσμο', 'συμμετοχη', 'παγκοσμιο', 
'πισω', 'πρωτευουσα', 'περιπτωση', 'εγιναν', 'μαρια', 'συνολο', 'οποιων', 'βασιλειο', 'επιπλεον', 'νησι', 'εξης', 'αυτης', 'ανηκει', 'συνολικα', 'αναπτυξη', 'μονη', 'γερμανια', 'ελληνικης', 'χλμ', 'επιπεδο', 'αλλους', 'γνωστος', 'κατασταση', 'μαλιστα', 'αριθμο', 'μm',
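The single-token filter that makes this limitation bite can be sketched as below. The `wordpiece` function is a heavily simplified stand-in for mBERT's real WordPiece tokenizer (which would come from a library such as HuggingFace transformers), and the toy `vocab` is invented for this example; only the filtering logic matters:

```python
def wordpiece(word, vocab):
    # Greedy longest-match-first WordPiece, heavily simplified:
    # a stand-in for a real tokenizer, assuming a plain `vocab` set.
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        if end == start:  # no matching piece: unknown word
            return ["[UNK]"]
        start = end
    return pieces

def single_token_pairs(pairs, vocab):
    # Keep only (english, translation) pairs where both sides survive
    # as one vocabulary token, mirroring the filtering in Chapter 3.
    return [(en, tr) for en, tr in pairs
            if wordpiece(en, vocab) == [en] and wordpiece(tr, vocab) == [tr]]

vocab = {"father", "family", "πατερας", "οικογενεια", "οικογεν", "##εια"}
pairs = [("father", "πατερας"), ("family", "οικογενεια"), ("mother", "μητερα")]
print(single_token_pairs(pairs, vocab))
```

With a real tokenizer the effect is the same as described above: a word contributes to the evaluation only if it is present verbatim in the vocabulary, which for Greek leaves just 56 usable translations.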