Review of Bill Howe’s “Introduction to Data Science” course
We continue supporting the Coursera.org initiatives providing access to world-class educational courses. A few weeks ago we published a review of the Columbia University course on Natural Language Processing and got positive feedback from our readers. Now it is time to present another course, “Introduction to Data Science”, taught by Professor Bill Howe (University of Washington), which is of great interest to everybody involved in data mining. We asked Federico Leven to share his experience after taking the course.
A few days ago the Coursera “Introduction to Data Science” course taught by Prof. Bill Howe ended. It was a solid and entertaining introduction to data science. To get the most out of this course, students should probably have a good foundation in programming and SQL databases, but the resources provided in the course were enough to put you in a position to complete the assignments.
In this publication, I will review the programming assignments, quizzes, peer assessments, video lectures, additional readings and forum discussions. The resources that Prof. Howe and the other course organizers provided to the users were:
– a ready-to-use Linux virtual machine with all the software required for course-related tasks already installed,
– excellent instructions on how to download the required files from GitHub,
– instructions on how to create and configure an Amazon EMR cluster,
– good tutorials for beginners in Python, the programming language of choice in the course.
The course was taught in an eight-week format. Blocks of video lectures presented by Prof. Howe alternated with programming assignments. A one-week soft deadline was given to complete each task. Later submissions (before the hard deadline) were still accepted with a penalty of 50% of the points. No grade was given if the assignment was submitted after the hard deadline, but participants could still submit answers and check the results.
Twitter Sentiment Analysis in Python: This first assignment was an introduction to Python. The secondary goal was to get familiar with some basic concepts, like term frequencies, term rankings, etc. It was a good chance to refresh my Python knowledge after a few years without working with it. Some issues with the autograding system, which the staff tried to solve, eventually led to an extended hard deadline for submitting the results.
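To illustrate what term frequencies and term rankings mean here, the following is a minimal sketch rather than the assignment's actual solution. The function name `term_frequencies` and the sample tweets are hypothetical, and tokenization is simplified to lowercasing and splitting on whitespace.

```python
from collections import Counter


def term_frequencies(texts):
    """Return each term's share of all term occurrences in `texts`."""
    counts = Counter()
    for text in texts:
        # Simplified tokenization (assumption): lowercase, split on whitespace.
        counts.update(text.lower().split())
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


# Hypothetical sample data, not taken from the course.
tweets = ["good course good lectures", "good forum"]
freqs = term_frequencies(tweets)

# A term ranking is the list of terms ordered by descending frequency.
ranking = sorted(freqs, key=freqs.get, reverse=True)
```

Real tweets would need more careful tokenization (punctuation, hashtags, URLs), which this sketch leaves out.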
Database Assignment – Simple In-Database Text Analytics: The second assignment was a good chance for all the students with SQL knowledge to get a good grade. Again, the problems ranged from relatively easy operators like SELECT, UNION, etc. (the first problems were presented using relational algebra) to tricky SQL statements for matrix multiplication, similarity matrices, etc. In this case, the autograding worked like a charm.
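For readers wondering how matrix multiplication can be written in SQL at all, here is one possible sketch using Python's built-in `sqlite3` module. The table names `a` and `b`, the column names and the sample values are assumptions made for illustration, not the schema used in the course. Each matrix is stored as (row, column, value) triples, and the product is a join on the shared index followed by a grouped sum.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per matrix cell.
conn.executescript("""
CREATE TABLE a (row_num INTEGER, col_num INTEGER, value INTEGER);
CREATE TABLE b (row_num INTEGER, col_num INTEGER, value INTEGER);
""")

# A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]] (sample data).
conn.executemany("INSERT INTO a VALUES (?, ?, ?)",
                 [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)])
conn.executemany("INSERT INTO b VALUES (?, ?, ?)",
                 [(0, 0, 5), (0, 1, 6), (1, 0, 7), (1, 1, 8)])

# C[i][j] = sum over k of A[i][k] * B[k][j]:
# join on k, then aggregate per (i, j).
product = conn.execute("""
SELECT a.row_num, b.col_num, SUM(a.value * b.value)
FROM a JOIN b ON a.col_num = b.row_num
GROUP BY a.row_num, b.col_num
ORDER BY a.row_num, b.col_num
""").fetchall()

conn.close()
```

The triple representation suits sparse matrices, since cells with a value of zero can simply be omitted from the tables.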
MapReduce Algorithms: After the video lectures on the topic, in which Prof. Howe outlined the theory together with some solutions to the assignments, we started to get our hands really dirty. There was a mix of problems similar to those in assignments 1 and 2 that we had to solve using the MapReduce paradigm. The virtual machine provided for the course had a Python library that emulated the MapReduce paradigm, so there was no need to run the programs in a real MapReduce environment.
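The idea of emulating MapReduce in plain Python can be sketched as follows. This is not the library shipped with the course; `run_map_reduce`, `count_mapper` and `count_reducer` are hypothetical names, and the example is the classic word count on made-up documents.

```python
from collections import defaultdict


def run_map_reduce(records, mapper, reducer):
    """Run mapper and reducer in a single process (illustrative only)."""
    # Map phase: each record emits zero or more (key, value) pairs.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            # Shuffle phase: collect all values that share a key.
            groups[key].append(value)
    # Reduce phase: one reducer call per key, in sorted key order.
    results = []
    for key in sorted(groups):
        results.extend(reducer(key, groups[key]))
    return results


def count_mapper(record):
    doc_id, text = record
    for word in text.split():
        yield word, 1


def count_reducer(word, counts):
    yield word, sum(counts)


# Hypothetical input: (document id, text) pairs.
docs = [("d1", "map reduce map"), ("d2", "reduce")]
word_counts = run_map_reduce(docs, count_mapper, count_reducer)
```

Because the mapper and reducer only see one record or one key at a time, the same two functions could in principle be moved to a real cluster, which is what makes a local emulation useful for practice.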
There were 2 quizzes, both covering MapReduce. I think it was a good decision to implement the quizzes on an external platform (jsMapReduce) and Amazon EMR. This let us learn new tools: jsMapReduce as a testing tool for MapReduce scripts and Amazon EMR as a well-known cloud-based solution for Hadoop clusters.
The Amazon assignment was optional. It was well explained, and instructions to create and configure the cluster were given. The assignment was extensive and mostly dedicated to computing a histogram over a 0.5 TB dataset using Pig. The task involved solving various coding issues as well as finding the right cluster configuration.
In the course's own words, in addition to the programming assignments, there were two assignments graded through peer assessment. The first one entailed constructing a visualization using Tableau. The other was an open-ended mini-project in which students competed in a kaggle.com competition.
There was also an optional project in which we worked on a real data science problem proposed by a third-party organization with real needs.
Kaggle Peer Review: For some of us, it was a new challenge. A Kaggle competition is an exciting way to work with real project data, but at the same time it is a platform to learn, practice and observe your progress. The goal of this task was to sign up for a competition and propose a solution and possible improvements. It didn’t require any special technique. Participants could work with simple decision trees, but they were asked to propose a strategy to solve the problem and present the solution to peers in the course.
Optional Real World Project: We had a chance to browse the projects posted in the real-world projects forum and/or organizations seeking assistance with sub-forum implementations. Some of the projects called for skills we had developed in the first half of the course; others seemed more advanced given the content of the course. The objective of the assignment was not to propose a solution in terms of technology, but to come up with an effective data science strategy (random forest, regression, etc.).
Tableau Assignment: In this assignment, students were challenged to use Tableau to create a series of visualizations and use them to explore a dataset. We had a workbook in which we had to (1) replicate some worksheets, (2) propose a question related to the data and (3) create a custom dashboard to answer that question. Then the peers evaluated whether the worksheets were correctly replicated and whether the custom dashboard was effective.
The topics covered in the video lectures were the following:
Relational Databases, Relational Algebra
Basic Statistical Analysis
Topics in Machine Learning, Part 1
Topics in Machine Learning, Part 2
Perspectives Beyond MapReduce
The lectures by Prof. Howe were fun and of good length to watch on a busy schedule. They supported the projects quite well and I think that the professor did an excellent job with this course. Unfortunately, some topics like Machine Learning lacked a programming assignment to support and apply the theory, so we took them as a theoretical introduction. The Visualization lectures were of particular interest to me since I am an intermediate-level Tableau user. They ranged from basic concepts to some advanced tips, really useful for users of all levels. Finally, there were presentations of two products that go beyond the MapReduce paradigm: Wibidata gave a good product presentation, while the Datameer lecture was more of a sales pitch than a lecture, which some students (including me) considered inappropriate.
The unsung hero of the course is the Discussion Forum. It provided useful information from the staff, professor and students and fostered a community feel for the class. It also allowed more advanced students to mentor beginners, providing an engaging experience across skill levels.
I am very positive about the course. I had a background in MapReduce, graph visualization and programming, but the course covered everything from basic to more advanced techniques in all these areas. The assignments reinforced the lectures and were appropriate. On the other hand, some issues with the autograding system, complaints from some students about the peer evaluation grades, and some delays in the forum replies from the staff and Prof. Howe left a somewhat negative impression. According to the forums, most of the students are satisfied with the course and some of them are expecting a new session.
About the author
Federico Leven is an Information Systems Analyst from Universidad Tecnologica Nacional – Buenos Aires, Argentina. He has been working as an enterprise application developer for the last 15 years in different companies in Argentina and the US (Centralab, Entravision Corporation, ECommerce Partners and others). In the last 2 years Federico has been working with Hadoop, Python and R, developing predictive data models at Luminar Insights, a data analytics services company based in Denver, US, focused on the Latino market.