Major trends in NLP: a review of 20 years of ACL research
The 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019) is starting this week in Florence, Italy. We took the opportunity to review major research trends in the animated NLP space and formulate some implications from the business perspective. The article is backed by a statistical and – guess what – NLP-based analysis of ACL papers from the last 20 years.
Natural language is one of the primary USPs of the human mind when compared to other species. NLP, a major buzzword in today’s tech discussion, deals with how computers can understand and generate language. The rise of NLP in the past decades is backed by several global developments – the universal hype around AI, rapid advances in the field of Deep Learning and an ever-increasing quantity of available text data. But what is the substance behind the buzz? In fact, NLP is a highly complex, interdisciplinary field that is constantly fed by high-quality fundamental research in linguistics, math and computer science. The ACL conference brings these different angles together. As the following chart shows, research activity has been flourishing in the past years:
Figure 1: Number of papers published at the ACL conference per year
In the following, we summarize some core trends in terms of data strategies, algorithms, tasks as well as multilingual NLP. The analysis is based on ACL papers published since 1998 which were processed using a domain-specific ontology for the fields of NLP and Machine Learning.
2. Data: working around the bottlenecks
The quantity of freely available text data is increasing exponentially, mainly due to the massive production of Web content. However, this large body of data comes with some key challenges. First, large data is inherently noisy. Think of natural resources such as oil and metal – they need a process of refining and purification before they can be used in the final product. The same goes for data. In general, the more “democratic” the production channel, the dirtier the data – which means that more effort has to be spent on cleaning it. For example, data from social media will require a longer cleaning pipeline. Among other things, you will need to deal with extravagances of self-expression such as smileys and irregular punctuation, which are normally absent in more formal settings such as scientific papers or legal contracts.
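To make the idea of a cleaning pipeline more concrete, here is a minimal sketch in Python. The patterns and the function name `clean_social_text` are our own illustrative assumptions and cover only a handful of Western-style smileys, URLs and repeated punctuation; a production pipeline would need far more rules and careful evaluation.

```python
import re
import unicodedata

# Hypothetical, deliberately small patterns (assumptions for illustration only).
URL = re.compile(r"https?://\S+")
# Smileys such as :-) ;) =D :P that are not glued to a word character.
EMOTICON = re.compile(r"(?<!\w)[:;=][-']?[)(\]\[DPp/\\|]+(?!\w)")
# Runs of the same punctuation mark, e.g. "!!!" or "???".
# Note: this also collapses an ellipsis "..." into a single ".".
REPEATED_PUNCT = re.compile(r"([!?.,])\1+")
WHITESPACE = re.compile(r"\s+")


def clean_social_text(text: str) -> str:
    """Apply a few normalisation steps typical for social media text."""
    text = unicodedata.normalize("NFKC", text)  # unify compatibility characters
    text = URL.sub(" ", text)                   # drop links
    text = EMOTICON.sub(" ", text)              # drop simple smileys
    text = REPEATED_PUNCT.sub(r"\1", text)      # "!!!" -> "!"
    return WHITESPACE.sub(" ", text).strip()    # collapse whitespace


print(clean_social_text("Great deal!!!   :-) see https://example.com now???"))
# Great deal! see now?
```

Whether smileys should be removed at all depends on the task: for sentiment analysis they may carry exactly the signal you are after, so mapping them to placeholder tokens can be a better choice than deleting them.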
The other major challenge is the labeled data bottleneck: most state-of-the-art algorithms are supervised. They not only need annotated data – they need Big Labeled Data. This is especially relevant for the advanced, complex algorithms of the Deep Learning family. Just as a child’s brain first needs a large amount of input before it can learn its native language, an algorithm that goes “deep” first needs a large quantity of data to capture language in its full complexity.
Traditionally, training data at smaller scale has been annotated manually. However, dedicated manual annotation of large datasets comes with efficiency trade-offs which are rarely acceptable, especially in the business context.
What are the possible solutions? On the one hand, there are some enhancements on the management side, including crowd-sourcing and Training Data as a Service (TDaaS). On the other hand, a range of automatic workarounds for the creation of annotated datasets has also been suggested in the machine learning community. The following chart shows some trends:
Figure 2: Discussion of approaches for creation and reuse of training data (number of mentions normalised by paper quantity in the respective year)
Clearly, pre-training has seen the biggest rise in the past five years. In pre-training, a model is first trained on a large, general dataset and subsequently fine-tuned with task-specific data and objectives. Its popularity is largely due to the fact that companies such as Google and Facebook are making huge models available out-of-the-box to the open-source community. In particular, pre-trained word embeddings such as Word2Vec and FastText, as well as pre-trained language models such as BERT (Devlin et al. 2018), allow NLP developers to jump to the next level. Transfer learning is another approach to reusing models across different tasks. If the reuse of existing models is not an option, one can leverage a small quantity of labeled data to automatically label a larger quantity of data, as is done in distant and weak supervision – note, however, that these approaches usually lead to a decrease in labeling precision.
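The idea of stretching a small labeled set over a larger unlabeled one can be sketched in a few lines of Python. This is a toy stand-in for weak supervision, not a method taken from any of the reviewed papers: the helper names `learn_cues` and `weak_label`, the seed data and the voting rule are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def learn_cues(seed):
    """Collect words that occur under exactly one label in the seed set."""
    labels_by_word = defaultdict(set)
    for text, label in seed:
        for word in set(text.lower().split()):
            labels_by_word[word].add(label)
    return {w: next(iter(ls)) for w, ls in labels_by_word.items() if len(ls) == 1}


def weak_label(text, cues):
    """Let cue words vote; return None (abstain) on no evidence or a tie."""
    votes = Counter(cues[w] for w in text.lower().split() if w in cues)
    if not votes:
        return None
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# Tiny hand-labeled seed set (made-up examples).
seed = [
    ("great product", "pos"),
    ("love this phone", "pos"),
    ("terrible battery", "neg"),
    ("awful this screen", "neg"),
]
cues = learn_cues(seed)

unlabeled = ["great camera", "awful terrible service", "great but awful", "arrived today"]
weak_labels = [weak_label(t, cues) for t in unlabeled]
print(weak_labels)  # ['pos', 'neg', None, None]
```

The abstentions and the crude cue words illustrate the precision trade-off mentioned above: the automatically produced labels are cheap, but they are noisier and less complete than manual annotation.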
3. Algorithms: a chain of disruptions in Deep Learning
In terms of algorithms, research in recent years has been strongly focussed on the Deep Learning family:
Figure 3: Discussion of Deep Learning algorithms (number of mentions normalised by paper quantity in the respective year)
Word embeddings are clearly taking off. In their basic form, word embeddings were introduced by Mikolov et al. (2013). The universal linguistic principle behind word embeddings is distributional similarity: a word can be characterized by the contexts in which it occurs. Thus, as humans, we normally have no difficulty completing the sentence “The customer signed the ___ today” with suitable words such as “deal” or “contract”. Word embeddings make it possible to do this automatically and are thus extremely powerful for addressing the very core of the context awareness issue.
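The distributional principle can be illustrated without any neural network. The following sketch is explicitly not word2vec: it builds plain co-occurrence count vectors from a made-up three-sentence corpus and compares them with cosine similarity. The function names, the window size and the corpus are assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """Count, for every word, the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


corpus = [
    "the customer signed the deal today",
    "the customer signed the contract today",
    "the customer ate the sandwich slowly",
]
vecs = context_vectors(corpus)

# "deal" and "contract" share all of their contexts, "sandwich" only one of them.
print(cosine(vecs["deal"], vecs["contract"]) > cosine(vecs["deal"], vecs["sandwich"]))  # True
```

Learned embeddings compress this kind of context information into dense, low-dimensional vectors, but the underlying intuition is the same: words that occur in similar contexts end up close to each other.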
While word2vec, the original embedding algorithm, is statistical and does not account for complexities of life such as ambiguity, context sensitivity and linguistic structure, subsequent approaches have enriched word embeddings with all kinds of linguistic information. And, by the way, you can embed not only words, but also other things such as senses, sentences and whole documents.
Neural Networks are the workhorse of Deep Learning (cf. Goldberg and Hirst (2017) for an introduction of the basic architectures in the NLP context). Convolutional Neural Networks have seen an increase in the past years, whereas the popularity of the traditional Recurrent Neural Network (RNN) is dropping. This is due, on the one hand, to the availability of more efficient RNN-based architectures such as LSTM and GRU. On the other hand, a new and pretty disruptive mechanism for sequential processing – attention – has been added to the sequence-to-sequence (seq2seq) model of Sutskever et al. (2014). If you use Google Translate, you might have noticed the leap in translation quality a couple of years ago – seq2seq was the culprit. And while seq2seq still relies on RNNs in the pipeline, the transformer architecture, another major advance from 2017, finally gets rid of recurrence and relies entirely on the attention mechanism (Vaswani et al. 2017).
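For readers who want to see what attention computes, here is a minimal pure-Python sketch of the scaled dot-product attention used in the transformer (Vaswani et al. 2017): each query is compared with all keys, the scores are turned into weights with a softmax, and the output is the weighted average of the values. It leaves out everything else a real transformer needs (learned projections, multiple heads, masking), and the toy vectors are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
# The query points strongly towards the first key, so the first value dominates.
context = attention([[10.0, 0.0]], keys, values)
print(context[0][0] > 0.99)  # True
```

Because every position can attend to every other position in a single step, no information has to be passed along a recurrent chain, which is what allows the transformer to drop recurrence altogether.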
Deep Learning is a vibrant and fascinating domain, but it can also be quite intimidating from the application point of view. When it feels that way, keep in mind that most developments are motivated by increased efficiency at Big Data scale, context awareness and scalability to different tasks and languages. For a mathematical introduction, Young et al. (2018) present an excellent overview of the state-of-the-art algorithms.
4. Consolidating various NLP tasks
When we look at specific NLP tasks such as sentiment analysis and named entity recognition, the inventories are much steadier than for the underlying algorithms. Over the years, there has been a gradual evolution from preprocessing tasks such as stemming, via syntactic parsing and information extraction, to semantically oriented tasks such as sentiment/emotion analysis and semantic parsing. This corresponds to the three “global” NLP development curves – syntax, semantics and context awareness – as described by Cambria et al. (2014). As we have seen in the previous section, the third curve – the awareness of a larger context – has already become one of the main drivers behind new Deep Learning algorithms.
From an even more general perspective, there is an interesting trend towards task-agnostic research. In Section 2, we saw how the generalization power of modern mathematical approaches has been leveraged in scenarios such as transfer learning and pre-training. Indeed, modern algorithms are developing amazing multi-tasking powers – thus, the relevance of the specific task at hand decreases. The following chart shows an overall decline in the discussion of specific NLP tasks since 2006:
Figure 4: Amount of discussion of specific NLP tasks
5. A note on multilingual research
With globalization, going international becomes an imperative for business growth. English is traditionally the starting point for most NLP research, but the demand for scalable multilingual NLP systems has been increasing in recent years. How is this need reflected in the research community? Think of different languages as different lenses through which we view the same world – they share many properties, a fact that is fully accommodated by modern learning algorithms with their increasing power for abstraction and generalization. Still, language-specific features have to be thoroughly addressed, especially in the preprocessing phase. As the following chart shows, the diversity of languages addressed in ACL research keeps increasing:
Figure 5: Frequent languages per year (> 10 mentions per language)
However, just as seen for NLP tasks in the previous section, we can expect a consolidation once language-specific differences have been neutralized for the next wave of algorithms. The most popular languages are summarised in Figure 6.
Figure 6: Languages addressed by ACL research
For some of these languages, research interest meets commercial attractiveness: languages such as English, Chinese and Spanish bring together large quantities of available data, huge native speaker populations and a large economic potential in the corresponding geographical regions. However, the abundance of “smaller” languages also shows that the NLP field is generally evolving towards a theoretically sound treatment of multilinguality and cross-linguistic generalisation.
Spurred by the global AI hype, the NLP field is exploding with new approaches and disruptive improvements. There is a shift towards modelling meaning and context dependence, probably the most universal and challenging aspects of human language. The generalisation power of modern algorithms allows for efficient scaling across different tasks, languages and datasets, thus significantly speeding up the ROI cycle of NLP developments and allowing for a flexible and efficient integration of NLP into individual business scenarios.
Follow us for a review of ACL 2019 and more updates on NLP trends!
- E. Cambria and B. White (2014). Jumping NLP Curves: A Review of Natural Language Processing Research. In IEEE Computational Intelligence Magazine – vol. 9(2).
- J. Devlin, M.-W. Chang, K. Lee and K. Toutanova (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
- Y. Goldberg and G. Hirst (2017). Neural Network Methods in Natural Language Processing. Morgan & Claypool Publishers.
- T. Mikolov et al. (2013). Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems – vol. 2 (NIPS’13).
- R. Prabhavalkar, K. Rao, T. Sainath, B. Li, L. Johnson and N. Jaitly (2017). A Comparison of Sequence-to-Sequence Models for Speech Recognition. In Proceedings of Interspeech 2017, pp. 939-943. doi:10.21437/Interspeech.2017-233.
- I. Sutskever, O. Vinyals, and Q. V. Le (2014). Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems – vol. 2 (NIPS’14).
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser and I. Polosukhin (2017). Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17).
- T. Young, D. Hazarika, S. Poria and E. Cambria (2018). Recent Trends in Deep Learning Based Natural Language Processing. In IEEE Computational Intelligence Magazine – vol. 13.