Michael Collins’ Natural Language Processing course review

We are in the middle of a paradigm shift, probably the biggest in history. The higher education system is changing: education is becoming free and accessible to anybody who has a computer with Internet access. Coursera.org is one of the first resources providing access to free online classes from leading universities worldwide. Many of the classes offered at Coursera are directly related to the areas we focus on at NLP People, i.e. NLP, data mining and machine learning.

We asked Plutarco Naranjo, one of the students who followed the recent NLP course at Coursera.org from the very beginning to the last day, to share his thoughts and experience with the class.

I took the Columbia University course on Natural Language Processing taught by Prof. Michael Collins at Coursera and was really pleased with it. I will review the course’s contents, focusing especially on the programming assignments, and give you my opinion on Prof. Collins’ presentations, the teaching materials and the course’s discussion forum.

The first assignment was on tagging, a leitmotif in the course. We had to identify gene names within biological texts by building a trigram hidden Markov model and a Viterbi decoder; NLTK or similar tools were not allowed. I took the opportunity to upgrade my Python development environment to PyDev and my Python programming skills to an intermediate level. Then I was able to translate the mathematical formulas into simple recursive functions.
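To give a flavour of what that involved, here is a minimal Python sketch of Viterbi decoding for a trigram HMM. It is not my actual submission: the transition function q, the emission function e and the tag names are assumptions for illustration, and the handling of rare words is left out.

```python
import math

def viterbi_trigram(words, tags, q, e):
    """Viterbi decoding for a trigram HMM tagger.

    words : tokens of one sentence
    tags  : the tag set, e.g. {"O", "I-GENE"}
    q     : q(t, u, v) = probability of tag t given the two previous tags (u, v)
    e     : e(w, t)    = probability of word w given tag t
    Returns the highest-probability tag sequence.
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    n = len(words)
    # pi[(k, u, v)] = best log-probability of a tag sequence ending in (u, v) at position k
    pi = {(0, "*", "*"): 0.0}
    bp = {}

    def tagset(k):
        return {"*"} if k <= 0 else tags

    for k in range(1, n + 1):
        word = words[k - 1]
        for u in tagset(k - 1):
            for v in tagset(k):
                best_score, best_w = float("-inf"), None
                for w in tagset(k - 2):
                    score = (pi.get((k - 1, w, u), float("-inf"))
                             + log(q(v, w, u)) + log(e(word, v)))
                    if score > best_score:
                        best_score, best_w = score, w
                pi[(k, u, v)] = best_score
                bp[(k, u, v)] = best_w

    # close with the STOP transition, then follow the back-pointers
    best_score, best_pair = float("-inf"), None
    for u in tagset(n - 1):
        for v in tagset(n):
            score = pi[(n, u, v)] + log(q("STOP", u, v))
            if score > best_score:
                best_score, best_pair = score, (u, v)

    seq = list(best_pair)
    for k in range(n, 2, -1):
        seq.insert(0, bp[(k, seq[0], seq[1])])
    return [t for t in seq if t != "*"]   # drop the "*" padding for one-word sentences
```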

 


In the second assignment we considered parsing, another recurrent topic in the class. Here we were given a training corpus consisting of questions, whose annotated parse trees had already been converted to Chomsky Normal Form, and we had to find the maximum-likelihood estimates for the production rules. Then we implemented the CKY algorithm to find the most probable parse tree for unseen questions. It took 200 years to go from Newton’s deterministic model of physics to the probabilistic model of quantum mechanics, but less than 40 years to transition from Chomsky’s deterministic grammars in the ’50s to probabilistic grammars in the ’90s.
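As a rough illustration of the decoding step, here is a minimal CKY sketch for a grammar in Chomsky Normal Form. The dictionaries q_binary and q_unary, holding the maximum-likelihood rule estimates count(rule) / count(X), and the default start symbol are assumptions for illustration, not the assignment’s actual data format.

```python
import math
from collections import defaultdict

def cky(words, q_binary, q_unary, start="S"):
    """CKY decoding for a PCFG in Chomsky Normal Form.

    q_binary[(X, Y, Z)] = q(X -> Y Z) and q_unary[(X, w)] = q(X -> w),
    both maximum-likelihood estimates count(rule) / count(X).
    Returns (log-probability, parse tree) with trees as nested tuples.
    """
    n = len(words)
    pi = defaultdict(lambda: float("-inf"))   # pi[(i, j, X)] = best log-prob for span i..j
    bp = {}

    # length-1 spans: terminal rules X -> word
    for i, w in enumerate(words):
        for (X, word), p in q_unary.items():
            if word == w:
                pi[(i, i, X)] = math.log(p)
                bp[(i, i, X)] = (X, w)

    # longer spans, shortest first
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for (X, Y, Z), p in q_binary.items():
                for s in range(i, j):          # split point between the Y and Z spans
                    score = math.log(p) + pi[(i, s, Y)] + pi[(s + 1, j, Z)]
                    if score > pi[(i, j, X)]:
                        pi[(i, j, X)] = score
                        bp[(i, j, X)] = (X, (i, s, Y), (s + 1, j, Z))

    def build(key):
        entry = bp[key]
        if len(entry) == 2:                    # terminal rule
            return entry
        X, left, right = entry
        return (X, build(left), build(right))

    root = (0, n - 1, start)
    if root not in bp:
        return float("-inf"), None             # no parse found
    return pi[root], build(root)
```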

Machine translation was the topic of the next assignment. We considered a statistical approach to machine translation based on an idea I found truly counter-intuitive: word alignments. IBM models 1 and 2 estimate the probabilities that, say, the fifth word in an English sentence is related to the eighth word in its Spanish translation, regardless of the linguistic content; syntax, semantics and pragmatics are just ignored. All that matters is the absolute position of words in a sentence. I guess you really need to be a machine to think like that. Of course, that’s only part of the story, because you also have to look at other probability distributions to come up with a feasible translation. So the task was to find the most probable word alignments given a subset of the Europarl parallel corpus. These alignments, we were told, would then feed into a more modern phrase-based translation system, as explained in detail in the lectures. For an example, check Moses.
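To make the alignment idea concrete, here is a minimal EM sketch in the spirit of IBM Model 1; the corpus format and names are assumptions for illustration, and Model 2 would add the position-dependent alignment parameters discussed in the lectures on top of these translation probabilities.

```python
from collections import defaultdict

def ibm_model1(corpus, iterations=5):
    """EM estimation of the IBM Model 1 translation parameters t(f | e).

    corpus : list of (english_words, foreign_words) pairs, each a list of tokens;
             a NULL word is added to the English side of every sentence.
    Returns a dict t[(f, e)] = probability of foreign word f given English word e.
    """
    corpus = [(["NULL"] + e, f) for e, f in corpus]

    # initialise t(f | e) uniformly over the foreign vocabulary
    foreign_vocab = {w for _, f_sent in corpus for w in f_sent}
    t = defaultdict(lambda: 1.0 / len(foreign_vocab))

    for _ in range(iterations):
        count_fe = defaultdict(float)          # expected counts c(e, f)
        count_e = defaultdict(float)           # expected counts c(e)
        for e_sent, f_sent in corpus:
            for f in f_sent:
                norm = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    delta = t[(f, e)] / norm   # posterior that f aligns to e
                    count_fe[(f, e)] += delta
                    count_e[e] += delta
        for (f, e) in count_fe:
            t[(f, e)] = count_fe[(f, e)] / count_e[e]
    return t

# A simple decoder then aligns each foreign word to the English word (or NULL)
# with the highest t(f | e) in that sentence pair.
```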

The final project was the same tagging task as the first assignment, but this time using global linear models (GLMs). A significant difference was that now we could freely define our own features, taking advantage of the GLM’s versatility. Interestingly, training was done with a variation of the good old perceptron algorithm. Decoding was again implemented using the ubiquitous Viterbi algorithm.
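Roughly, that training loop looks like the sketch below; the feature function and the decoder passed in are placeholders, standing in for the assignment’s actual feature templates and Viterbi decoder.

```python
from collections import defaultdict

def perceptron_train(data, feature_fn, decode, iterations=5):
    """Structured perceptron training for a global linear model tagger.

    data       : list of (words, gold_tags) pairs
    feature_fn : maps (words, tags) to a dict of feature counts
    decode     : returns the highest-scoring tag sequence for `words`
                 under the current weights (typically via Viterbi)
    Returns the learned weight vector as a dict of feature -> weight.
    """
    weights = defaultdict(float)
    for _ in range(iterations):
        for words, gold_tags in data:
            predicted = decode(words, weights)
            if predicted != gold_tags:
                # reward the gold features, penalise the predicted ones
                for feat, count in feature_fn(words, gold_tags).items():
                    weights[feat] += count
                for feat, count in feature_fn(words, predicted).items():
                    weights[feat] -= count
    return weights
```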

Some of us would have liked to try our hand at unsupervised learning with the Brown clustering algorithm, which was covered in the final set of lectures. However, there were many other interesting topics that could have been turned into programming exercises. For instance, Ratnaparkhi’s strategy for converting parsing into a tagging problem by means of log-linear models and beam search. Another good example is a quick and simple dependency parser based on global linear models and the perceptron, not to be confused with state-of-the-art second-order dependency parsers.

Michael Collins motivates every topic, develops intuition by working out in detail fun sentences like “the dog laughs”, and provides the mathematical insight essential to understanding the formulas and algorithms. His presentation style is well suited to an online audience: he uses a red pen and fills the slides with explanations and examples. It’s too bad we didn’t get his annotated slides, only the “clean” ones. However, we did get his own notes for most of the lectures, parts of what some of us hope will one day become a book. There is great coherence between the lectures, the texts and the exercises. The lecture notes go into more detail than the video lectures and contain a few theorems for the more mathematically inclined, yet his style is consistently clear, engaging and easy to follow.

 


The course had four graded quizzes and many in-lecture quizzes designed to ensure you were grasping the mathematical theory and the algorithms’ logic.

If you had trouble with any aspect of the course, you could find abundant help in the online forum. Prof. Collins and, especially, his teaching assistant Alexander Rush participated in the forum, but the overwhelming participation of students is what gave this educational medium a real advantage over traditional courses. The forum was manned by a crowd 24/7, and response times were generally good. In fact, when I had a question I often found it had already been posted by someone else and there was a whole discussion thread about it. Furthermore, there were many interesting and insightful postings, not only on the subjects of the course but also reflecting a broad range of scientific and technical concerns and curiosities.

The majority of students who expressed an opinion on the course were very satisfied with it. Those who had difficulty were, it seemed to me, new to machine learning or computational linguistics. I wouldn’t consider this an introductory course, but a graduate-level course where you need the preliminary background to really take advantage of the new material. Several people asked Prof. Collins for a follow-up course; I think it would also be a success.

The philosopher John Searle might never have imagined that his Chinese Room would be implemented as a statistical engine making linguistic inferences, but after this course it seems clear that the way machines process natural language entails no understanding of it at all. That’s not to say that machines that learn how to use their robotic bodies don’t have their own true machine understanding of the tasks they do; they just think like machines.

Plutarco Naranjo

Independent researcher

plutarco@lenguaje.com

About the author

Plutarco Naranjo holds a BS in Mathematics, an MSc in Computer Science and an MSc in Evolutionary and Adaptive Systems. He has done work for IBM, Apple, Autodesk, and others. He founded a company, Signum Cía. Ltda., in Ecuador which developed an electronic dictionary for Editorial Gredos, Spanish proofing tools for Microsoft and a popular spell checking and synonyms Web site www.lenguaje.com. Currently he is an independent researcher exploring ways to apply NLP techniques to the financial markets.

 

 

 
