How do I create a customised machine translation engine?

The press releases and general hype around machine translation (MT) issued by IT giants like Google, Skype/Microsoft and IBM have raised expectations of the technology in recent years. Such PR work has led even the BBC to question whether human translators are facing the end of the line. Set against this is a whole range of MT disasters that are equally well known on the internet.

So which is true? Is MT the permanent and modern-day solution to the problems apparently caused by the Tower of Babel or is it simply over-hyped nonsense that is practically useless for business use today?

Of course the situation does not fit nicely into either one of these extreme categories – rather the current state of technology is somewhere in the middle.

It is of course true that technological advances have improved the results of MT in recent years. For example, Google Translate and Skype’s translator can give you the gist of a text. They are built on huge quantities of data, from which they create statistical models of what a word is most likely to mean in a given context. Nevertheless, the hype seems to promise this in all languages and for all purposes. And this is where it falls down. The data used to train such engines is often IT-focused, so the engines handle that kind of content really well because they are familiar with it. But feed in something unfamiliar, such as regional speech (Liverpudlian, Glaswegian), poetry or a novel, and you quickly realise the limits of the technology. In addition, training data is not available in such huge quantities for unusual (long-tail) language pairs. So if you want Google Translate or another free, generic MT system to translate, say, Korean to German, the text may well have to go through English first. The chance of getting a word right from Korean to English then has to be multiplied by the chance of getting it right again from English to German, and by that point your chances of winning the UK lottery may actually be better than the chances of the machine producing a good translation.
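To make the arithmetic behind pivoting through English concrete, here is a minimal Python sketch. It assumes, purely for illustration, that each hop gets a word right with a fixed probability and that errors in the two hops are independent. The function names and the 0.9 figure are hypothetical and are not measurements of any real MT system.

```python
def pivot_accuracy(p_source_to_pivot: float, p_pivot_to_target: float) -> float:
    """Chance that a word survives both hops, assuming independent errors."""
    for p in (p_source_to_pivot, p_pivot_to_target):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie between 0 and 1")
    return p_source_to_pivot * p_pivot_to_target


def sentence_accuracy(p_word: float, n_words: int) -> float:
    """Chance that every word in a sentence is right, assuming independence."""
    return p_word ** n_words


# Hypothetical per-word accuracies for each hop (illustrative only).
korean_to_english = 0.9
english_to_german = 0.9

per_word = pivot_accuracy(korean_to_english, english_to_german)
direct_sentence = sentence_accuracy(0.9, 20)        # one hop, 20-word sentence
pivot_sentence = sentence_accuracy(per_word, 20)    # two hops, 20-word sentence

print(f"per-word accuracy via pivot: {per_word:.2f}")
print(f"20-word sentence, one hop:   {direct_sentence:.3f}")
print(f"20-word sentence, two hops:  {pivot_sentence:.3f}")
```

Under these assumptions, two hops at 0.9 each leave about 0.81 per word, and the chance of a fully correct 20-word sentence drops from roughly 0.12 to roughly 0.015. Real engines do not fail independently word by word, so this is a rough intuition rather than a prediction.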

In fact, the current technology does much better when trained on large amounts of data in a specific language pair and for a specific purpose. For example, our new MT engine for financial reporting was trained on over 2 million bilingual (German>English) segments (sentences or parts of sentences) from past translations, publicly available data from past annual reports and the IAS/IFRS accounting regulations. This means it does one thing – and one thing only: German>English financial reports. And it does it very well indeed.

So if you are thinking about MT for your translation workflows, here are some points to consider if you want to customise your own machine translation engine:

Do you have a large amount of data in one language pair and for a very specific domain – the more specific the better?
Is there a lot of publicly available bilingual and monolingual data for this specific domain?
Do you have the expertise in-house to customise the machine translation engine? Or can you find a service provider to do this for you?

If you need assistance with any of these points, we’ll be happy to help you build a customised machine translation engine for your purposes. Simply drop me a line:
Gill [at] linguagloss.com

Source: LinkedIn
