Starting your own NLP business – part 2

In this post we continue presenting the main groups of factors that influence any young NLP company, using the example of a Machine Translation (MT) startup in Europe. Here, we cover the following areas: Technology & Research, Data, Evaluation, Resources, and Translators & Post-editors. Please refer to the previous post for the rest of the discussion.





Technology and research


Commercial application of MT is many-sided. The consumer market is mostly aware of general-purpose free online MT systems, while translation buyers and translation agencies are mostly interested in post-edited MT. A third application is near-real-time contextual conversation (chat).

We believe that the future of MT lies in integrating these three applications into a unified translation framework that covers different types of content: publishable content (where MT combined with post-editing (PE) services reduces the translation cost), low-pageview content (extended localization, i.e. translating content that would not be translated otherwise) and instant communication.

The choice of a particular MT technology (statistical, rule-based or hybrid MT), a customization mechanism, a security level (on-premise, on a bmmt server or in the cloud), a level of integration with CAT tools or IT infrastructure, training and consultancy: all of these can either be chosen by the customer or recommended by experts for each particular MT adoption scenario.




Data

Modern MT is highly dependent on both parallel and monolingual data. While technology is the main factor driving MT forward as a product, the right data is often more important for system performance than the technology itself.

The primary and best-fitting source of bilingual information for training an MT system is translation memories (TMs), both internal and customers’. A common intuition holds that more data yields better translation accuracy. However, it is important to remember that data means more than just parallel TMs. Data also includes out-of-domain and in-domain monolingual corpora, terminology, glossaries, named entities, etc. All these linguistic components can contribute to the final MT product.

As an alternative to TMs, parallel data can be aggregated from shared repositories. For some MT projects, open corpora (mostly used in academia) or even data aggregated from the Web (where IP rights remain an open question) can also be used to enhance existing MT systems.
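Whatever the source, parallel data usually needs some cleaning before it is used for training. The following is a minimal sketch of such a step, not a description of any particular production pipeline: the function name `clean_segments`, the `max_ratio` threshold of 3.0 and the sample segments are all assumptions made for illustration.

```python
def clean_segments(pairs, max_ratio=3.0):
    """Return unique, non-empty (source, target) pairs with a plausible length ratio.

    `max_ratio` is an illustrative threshold (an assumption, not a recommendation):
    pairs whose word counts differ by more than this factor are treated as misaligned.
    """
    seen = set()
    cleaned = []
    for source, target in pairs:
        # Normalize whitespace so trivially different copies compare equal.
        source = " ".join(source.split())
        target = " ".join(target.split())
        if not source or not target:
            continue  # drop segments with an empty side
        src_len, tgt_len = len(source.split()), len(target.split())
        if max(src_len, tgt_len) / min(src_len, tgt_len) > max_ratio:
            continue  # drop likely misalignments
        key = (source.lower(), target.lower())
        if key in seen:
            continue  # drop duplicates
        seen.add(key)
        cleaned.append((source, target))
    return cleaned


# Hypothetical TM export: one good pair, one duplicate, one empty target,
# and one pair with an implausible length ratio.
tm = [
    ("Please wait.", "Bitte warten."),
    ("Please  wait.", "Bitte warten."),
    ("Warning", ""),
    ("OK", "Bitte lesen Sie vor der Inbetriebnahme die gesamte Anleitung durch."),
]
cleaned = clean_segments(tm)
```

In this toy example only the first pair survives. Real TMs would typically also need tag handling and language identification, which are left out here.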




Evaluation

MT quality evaluation is a hot topic for many, if not all, MT adopters. In contrast to academia, the primary measure of MT quality in industrial settings is human judgment (productivity testing and error analysis for post-edited MT, adequacy and fluency for gisting), while automatic evaluation is mostly used for preliminary assessment.
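To give a flavour of what a preliminary automatic check can look like, here is a minimal sketch of a word-level error rate: the edit distance between MT output and a reference translation, divided by the reference length. This is a deliberately simple stand-in and not one of the established metrics such as BLEU or TER (TER, for instance, also accounts for block shifts). The function names and the sample sentences are assumptions for illustration.

```python
def word_edit_distance(hypothesis, reference):
    """Minimum number of word insertions, deletions and substitutions
    needed to turn `hypothesis` into `reference` (Levenshtein distance)."""
    hyp, ref = hypothesis.split(), reference.split()
    previous = list(range(len(ref) + 1))  # distances from an empty hypothesis
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            current.append(min(
                previous[j] + 1,         # delete a hypothesis word
                current[j - 1] + 1,      # insert a reference word
                previous[j - 1] + cost,  # keep or substitute
            ))
        previous = current
    return previous[-1]


def word_error_rate(hypothesis, reference):
    """Edit distance normalized by the reference length in words."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference must not be empty")
    return word_edit_distance(hypothesis, reference) / ref_len


# Hypothetical MT output and reference translation.
mt_output = "the cat sat on mat"
reference = "the cat sat on the mat"
distance = word_edit_distance(mt_output, reference)
rate = word_error_rate(mt_output, reference)
```

If the post-edited version of the MT output is used as the reference, the same calculation gives a rough indication of how much editing was done.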

There are so many methods for translation quality evaluation available these days that one should be “a picky eater” and patient enough to choose the one that meets one's needs.

Currently, we recommend that our customers rely on productivity testing as the most reliable way to define post-editing compensation and to track system progress, ROI and TCO in post-edited and non-gisting scenarios. We believe that the future of translation quality evaluation lies in reliable methods of confidence estimation that will make it possible to automatically predict the effort associated with post-editing MT output.
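The arithmetic behind a productivity test is simple: compare translator throughput with and without MT on comparable material. The sketch below assumes throughput is measured in words per hour; the function names and the figures (1,000 words in 120 minutes from scratch versus 80 minutes with post-editing) are made up for illustration.

```python
def words_per_hour(word_count, minutes):
    """Throughput in source words per hour."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return word_count * 60.0 / minutes


def productivity_gain(baseline_wph, post_editing_wph):
    """Relative throughput increase of post-editing over translating from scratch."""
    if baseline_wph <= 0:
        raise ValueError("baseline throughput must be positive")
    return (post_editing_wph - baseline_wph) / baseline_wph


# Hypothetical measurements for the same translator on comparable texts.
baseline = words_per_hour(1000, 120)      # translation from scratch
post_editing = words_per_hour(1000, 80)   # post-editing MT output
gain = productivity_gain(baseline, post_editing)

print(f"From scratch: {baseline:.0f} words/hour")
print(f"Post-editing: {post_editing:.0f} words/hour")
print(f"Productivity gain: {gain:.0%}")
```

A meaningful test needs many segments, several translators and a quality check on the final output; a single pair of timings like the one above only shows the calculation.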




Resources

Building a decent MT system in commercial settings is impossible without considering four types of external and internal resources:

Hardware. There are four possible deployment scenarios:

  • On-premise: the MT system is installed and maintained on the customer’s servers;
  • Cloud-based: the MT system is installed and maintained on a shared cloud;
  • On the provider’s servers: the MT system is installed and maintained on the provider’s own servers;
  • Combined: a middle-ground scenario where the MT system is located in the cloud or on the provider’s servers, but data is transferred via secure channels.

Software. It is no secret that the majority of modern MT solutions rely on freely available open-source software such as Moses. Moses is not straightforward to install and operate, and it is only one part of the solution. MT integration will inevitably require developing or purchasing other software components, such as file format converters and tag handlers.

Human resources. A successful MT implementation invariably requires a range of expertise (in-house, on the client side or through consultancy): core MT expertise, general computational linguistics expertise, language expertise and IT expertise.

Knowledge and best practices. MT is a new technology, and there is typically more than one possible solution to the problems that arise at each stage of MT implementation. Besides, MT is a data-driven approach; consequently, an empirical (i.e. time-consuming) approach is often the only way to identify the best path in a given situation. That is why access to best practices and an MT knowledge base is crucial for success.

Translators and post-editors

In the majority of real-world use scenarios, professional MT is bundled with post-editing services. Post-editing is typically done by people who are, or used to be, professional translators. The post-editor’s and the translator’s skill sets overlap only partially. Many MT adopters have had difficulty turning translators’ and post-editors’ attitude towards MT into a positive one. Another challenge is motivating them to give the constructive, system-oriented feedback that is essential for improving the quality of MT engines.

One approach to changing this attitude in an LSP scenario (and the best one, in our view) is to (1) make translators and post-editors aware of the general principles, tips and tricks of both general and language-dependent post-editing, (2) increase internal knowledge of MT in general and (3) develop a flexible compensation scheme based on the productivity increase, with the possibility of postponing the compensation decision.

The world in which NLP startups currently live is complex: it includes many puzzling elements and dark corners. Some of them are harmful, others lead to triumph. MT is a perfect example of how cutting-edge research has been transformed into a technology in demand among translation buyers, providers of translation services and end users. That is why simplicity, transparency and collaboration are the three cornerstones of success.

Originally published


