Machine translation: what is good and what is bad

Machine translation (MT) is one of the best-known and most in-demand NLP applications, and millions of people use it on a daily basis. At the same time, criticism from professional translators can be clearly heard. We asked several translators with MT experience about the limitations of MT and about the areas in which MT providers and translators can help each other achieve a better MT experience.


In this post, we summarize feedback that a German MT provider collected from translators, and look at what they like and dislike when post-editing machine translation output.

 

We selected examples that show prominent errors in different language pairs, and we analyzed why statistical MT engines make these mistakes and what can be done to fix them.

 

Example 1 (De-En)

Source: Zeitweilig kamen über 90 % der Professor(inn)en nicht aus der eigentlichen Informatik 

MT output: Residing, over 90% of professors is not from the Computer Science 

Post-edited output: At times over 90% of professors are not from a Computer Science background

Issues and ways to fix them:

  1. “Zeitweilig” is mistranslated.

Issue 1 is explained by a lack of training data. The statistical model underlying the MT engine has not seen enough examples to find an adequate translation for the word “zeitweilig”.

  2. “professors is not” is an ungrammatical translation.

Issue 2 is more interesting. Statistical MT engines usually operate within a relatively narrow window of words, which here prevented the engine from selecting an English verb form that matches the German verb “kamen”. The verb appears five positions before “Professor(inn)en”, which made the verb-noun dependency statistically unreliable. The engine therefore based its decision on the word “nicht” and its surrounding context. However, the statistical uncertainty about the word form “Professor(inn)en” led to the wrong translation “professors is not”.

  3. “background” is omitted.

Machines do not think like humans: the MT engine could not recognize that it had to add “background” after “Computer Science”. Such cases are quite difficult for MT and can be resolved only by adding extra clean data to enhance the system.
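
The narrow-window effect described under Issue 2 can be illustrated with a small sketch. It is not the provider's engine: the helper `context_window` and the window size of 2 are assumptions chosen purely for illustration, and real statistical systems define their context differently.

```python
def context_window(tokens, index, size):
    """Return the tokens up to `size` positions left and right of tokens[index]."""
    start = max(0, index - size)
    return tokens[start:index] + tokens[index + 1:index + 1 + size]


# Source sentence from Example 1, split on whitespace.
source = ("Zeitweilig kamen über 90 % der Professor(inn)en "
          "nicht aus der eigentlichen Informatik").split()

noun = source.index("Professor(inn)en")
verb = source.index("kamen")

# Number of token positions between the verb and the noun it governs.
distance = noun - verb

# Hypothetical window size; chosen only to show the effect.
narrow = context_window(source, noun, size=2)
wide = context_window(source, noun, size=5)

print(distance)           # positions between "kamen" and "Professor(inn)en"
print("kamen" in narrow)  # is the verb visible in the narrow window?
print("nicht" in narrow)  # is "nicht" visible in the narrow window?
print("kamen" in wide)    # is the verb visible in the wider window?
```

With the narrow window, the engine sees “nicht” next to the noun but not the verb “kamen”, which is consistent with the explanation above that the decision was driven by “nicht” and its neighbours.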

 

Example 2 (De-Es)

Source: Die Arbeiten können auch mit anderen Servicegeräten, die über die Funktion „3D-Modus“ verfügen, durchgeführt werden. 

MT output: Los trabajos se pueden modificar también con otras Servicegeräten a través de la función «encender», se realiza. 

Post-edited output: Los trabajos también se pueden llevar a cabo con otros aparatos para el mantenimiento del sistema que cuenten con la función «Modo 3D». 

Issues and ways to fix them:

  1. “Servicegeräten” is not translated. Although “Servicegerät” is in the terminology database for this client, it was not translated from German. The reason is that the system can translate the singular form (“Servicegerät”) but does not recognize the inflected plural form (“Servicegeräten”) and leaves it untranslated. Future iterations of the statistical MT system will account for morphological variants of terminology entries.

  2. “3D-Modus” is mistranslated. The expression “3D-Modus” was translated as “encender” because of a lack of informative context. Statistical MT makes translation decisions on the basis of the surrounding context. In this case, the quotation marks, in combination with the words to the left and right, confuse the engine because it rarely encountered such contexts during training. More high-quality data is needed to improve the translation.
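
One way to make a terminology lookup tolerant of inflected forms, as mentioned under issue 1, is to fall back to suffix stripping when the exact form is not found. The sketch below is only an illustration under stated assumptions: the glossary entry, the suffix list and the function `lookup_term` are hypothetical, and the provider's actual approach is not described in the feedback.

```python
# Hypothetical glossary with a single entry, keyed by the lowercased singular form.
GLOSSARY = {
    "servicegerät": "aparato para el mantenimiento del sistema",
}

# A rough, incomplete list of German inflectional endings (assumption).
# Longer suffixes come first so that "ern" is tried before "n".
GERMAN_SUFFIXES = ("ern", "en", "er", "es", "e", "n", "s")


def lookup_term(word, glossary):
    """Look up a term, falling back to naive suffix stripping for inflected forms."""
    key = word.lower()
    if key in glossary:
        return glossary[key]
    for suffix in GERMAN_SUFFIXES:
        if key.endswith(suffix):
            stem = key[:-len(suffix)]
            if stem in glossary:
                return glossary[stem]
    return None  # not a known term; leave it to the regular translation model


print(lookup_term("Servicegerät", GLOSSARY))
print(lookup_term("Servicegeräten", GLOSSARY))
print(lookup_term("Funktion", GLOSSARY))
```

This fallback only finds the glossary entry. It does not handle plurals formed with an umlaut change, and it returns the singular Spanish term, so the target side would still need to be inflected (“aparatos” in the post-edited output).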

 

Example 3 (De-Es)

Source: Siehe Beschreibung ab Seite 214

MT output: Véase la descripción de la página en adelante. 

Post-edited output: Véase la descripción de la página 214 en adelante. 

Issues and ways to fix them:

1. The number “214” was omitted after the word “Seite”.

One component of any MT system based on statistical principles is a language model of the target language (Spanish, in this example), which controls the fluency of the output. In this case, the imperfection of the model led to a translation that lacks “214” after “la página”. This error is relatively easy to fix by adding more monolingual data in the target language to improve the language model.
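
The effect of extra monolingual data on a language model can be shown with a toy bigram model. Everything in this sketch is an assumption made for illustration: the two tiny Spanish corpora are invented, numbers are replaced by a `<num>` placeholder, and add-one smoothing stands in for whatever smoothing a production system would use.

```python
from collections import Counter


def train_bigrams(sentences):
    """Count bigrams and their left-hand contexts in whitespace-tokenized text."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = sentence.split()
        vocab.update(tokens)
        for prev, word in zip(tokens, tokens[1:]):
            contexts[prev] += 1
            bigrams[(prev, word)] += 1
    return contexts, bigrams, vocab


def bigram_prob(prev, word, contexts, bigrams, vocab_size):
    """Add-one (Laplace) smoothed estimate of P(word | prev)."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)


# Invented corpora; "<num>" is a placeholder standing in for any number.
small_corpus = ["véase la página en cuestión", "la página en blanco"]
extra_corpus = [
    "véase la página <num>",
    "consulte la página <num> en adelante",
    "la descripción de la página <num>",
]

# Use one shared vocabulary so both models are smoothed the same way.
_, _, vocab = train_bigrams(small_corpus + extra_corpus)
vocab_size = len(vocab)

small_ctx, small_bi, _ = train_bigrams(small_corpus)
full_ctx, full_bi, _ = train_bigrams(small_corpus + extra_corpus)

p_en_small = bigram_prob("página", "en", small_ctx, small_bi, vocab_size)
p_num_small = bigram_prob("página", "<num>", small_ctx, small_bi, vocab_size)
p_en_full = bigram_prob("página", "en", full_ctx, full_bi, vocab_size)
p_num_full = bigram_prob("página", "<num>", full_ctx, full_bi, vocab_size)

print(p_en_small > p_num_small)  # small model: "página en" scores higher
print(p_num_full > p_en_full)    # with more data: "página <num>" scores higher
```

Trained on the small corpus alone, the model rates “página en” as more fluent than “página <num>”, which mirrors the omission of “214”. After the extra monolingual sentences are added, the preference flips.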

 

 

Some of the errors reported by translators have already been corrected in the current version of the MT system and should not reappear. Others will require more time, data and development effort to fix.

 

We hope this blog post gives you a better understanding of MT processes and helps you see which areas of MT research and development are promising to focus on in the near future.

 

Source: bmmt blog.

 

 

 
