Tools and insights for a data scientist
Large-scale data analysis has been around for years, but only recently has industry begun to recognize it as a valuable way to improve a whole range of business processes. In this publication we have collected some of the many open resources available to everyone working in the field of data mining.
Part 1, by Ryan Swanstrom, lists the most important papers in the field of data science.
Part 2, by Fari Payandeh, lists more than 50 open-source tools for Big Data.
Part 3, by Greg Reda, shows how data analysis can be done with basic Unix commands.
Part 1. 7 Important Data Science Papers
It is back-to-school time, and here are some papers to keep you busy this school year. All the papers are free. This list is far from exhaustive, but these are some important papers in data science and big data.
Google Search
- PageRank – This is the paper that explains the algorithm behind Google search.
Hadoop
- MapReduce – This paper explains a programming model for processing large datasets. In particular, it is the programming model used in Hadoop.
- Google File System – Part of Hadoop is HDFS, an open-source version of the distributed file system described in this paper.
NoSQL
These are two of the papers that drove the NoSQL debate. Each paper describes a different type of storage system designed to be massively scalable.
Machine Learning
- Top 10 Algorithms in Data Mining (pdf download available) – This paper covers ten of the most important machine learning algorithms.
- A Few Useful Things to Know about Machine Learning – This paper is filled with tips, tricks, and insights to make machine learning more successful.
Bonus Paper
- Random Forests – One of the most popular machine learning techniques. It is heavily used in Kaggle competitions, even by the winners.
Originally published at Data Science 101.
About the author: Ryan Swanstrom lives in South Dakota, USA. He is a full-time web developer who builds data products and blogs about learning data science.
Part 2. 50+ Open Source Tools for Big Data
It was not easy to select a few out of the many Open Source projects. My objective was to choose the ones that best fit Big Data’s needs. What has changed in the world of Open Source is that the big players have become stakeholders: IBM’s alliance with Cloud Foundry, Microsoft providing a development platform for Hadoop, Dell’s OpenStack-Powered Cloud Solution, VMware and EMC partnering on cloud, and Oracle releasing its NoSQL database as Open Source.
“If you can’t beat them, join them”. History has vindicated the Open Source visionaries and advocates.
Hadoop Distributions
Cloud Operating System
- Cloud Foundry — By VMware
- OpenStack— Worldwide participation and well-known companies
Storage
- Fusion-io — Not open source itself, but very supportive of Open Source projects; flash-aware applications.
Development Platforms and Tools
- REEF — Microsoft’s Hadoop development platform
- Lingual — By Concurrent
- Pattern — By Concurrent
- Python — Awesome programming language
- Mahout — Machine learning library for Hadoop
- Impala — SQL query engine for Hadoop, by Cloudera
- R — MVP among statistical tools
- Storm — Stream processing by Twitter
- LucidWorks — Search, based on Apache Solr
- Giraph — Graph processing, used heavily at Facebook
NoSQL Databases
SQL Databases
- MySQL — Owned by Oracle
- MariaDB — Partnered with SkySQL
- PostgreSQL — Object-relational database
- TokuDB — Storage engine that improves RDBMS performance
Server Operating Systems
- Red Hat — The de facto OS for Hadoop servers
BI, Data Integration, and Analytics
Originally published at Big Data Studio.
About the author: Fari Payandeh is a Data Warehouse Technical Manager blogging about data science and managing the Scientific Forefront website.
Part 3. Useful Unix commands for data science
Imagine you have a 4.2GB CSV file. It has over 12 million records and 50 columns. All you need from this file is the sum of all values in one particular column.
How would you do it?
Writing a script in python/ruby/perl/whatever would probably take a few minutes and then even more time for the script to actually complete. A database and SQL would be fairly quick, but then you’d have to load the data, which is kind of a pain.
Thankfully, the Unix utilities exist and they’re awesome.
To get the sum of a column in a huge text file, we can easily use awk. And we won’t even need to read the entire file into memory.
Let’s assume our data, which we’ll call data.csv, is pipe-delimited ( | ), and we want to sum the fourth column of the file.
cat data.csv | awk -F "|" '{ sum += $4 } END { printf "%.2f\n", sum }'
The above line says:
1. Use the cat command to stream (print) the contents of the file to stdout (a variant that skips cat entirely is sketched just after this list).
2. Pipe the streaming contents from our cat command to the next one – awk.
3. With awk:
- Set the field separator to the pipe character (-F "|"). Note that this has nothing to do with our pipeline in point #2.
- Increment the variable sum with the value in the fourth column ($4). Since we used a pipeline in point #2, the contents of each line are being streamed to this statement.
- Once the stream is done, print out the value of sum, using printf to format the value with two decimal places.
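Incidentally, cat is not strictly required here: awk can read the file directly when given a filename argument. A minimal, equivalent variant of the same one-liner:

# same sum, without the extra cat process
awk -F "|" '{ sum += $4 } END { printf "%.2f\n", sum }' data.csv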
It took less than two minutes to run on the entire file – much faster than other options and written in a lot fewer characters.
Hilary Mason and Chris Wiggins wrote over at the dataists blog about the importance of any data scientist being familiar with the command line, and I couldn’t agree with them more. The command line is essential to my daily work, so I wanted to share some of the commands I’ve found most useful.
For those who are a bit newer to the command line than the rest of this post assumes, Hilary previously wrote a nice introduction to it.
Other commands
head & tail
Sometimes you just need to inspect the structure of a huge file. That’s where head and tail come in. head prints the first ten lines of a file, while tail prints the last ten. Optionally, you can include the -n parameter to change the number of lines displayed.
head -n 3 data.csv
# time|away|score|home
# 20:00||0-0|Jump Ball won by Virginia Commonwealt.
# 19:45||0-0|Juvonte Reddic Turnover.

tail -n 3 data.csv
# 0:14|Trey Davis Turnover.|62-71|
# 0:14||62-71|Briante Weber Steal.
# 0:00|End Game|End Game|End Game
wc (word count)
By default, wc will quickly tell you how many lines, words, and bytes are in a file. If you’re looking for just the line count, you can pass the -l parameter in.
I use it most often to verify record counts between files or database tables throughout an analysis.
wc data.csv
# 377 1697 17129 data.csv

wc -l data.csv
# 377 data.csv
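As a sketch of that file-to-file check (the file names here are hypothetical), wc -l also accepts several files at once and prints one count per file plus a combined total:

# line counts for two hypothetical extracts, plus a total
wc -l raw_export.csv cleaned_export.csv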
grep
Grep allows you to search through plain text files using regular expressions. I tend to avoid regular expressions when possible, but still find grep invaluable when searching through log files for a particular event.
There’s an assortment of extra parameters you can use with grep, but the ones I tend to use the most are -i (ignore case), -r (recursively search directories), -B N (N lines before), and -A N (N lines after).
grep -i -B 1 -A 1 steal data.csv
# 17:25||2-4|Darius Theus Turnover.
# 17:25|Terrell Vinson Steal.|2-4|
# 17:18|Chaz Williams made Layup. Assisted by Terrell Vinson.|4-4|
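As a quick illustration of the -r flag mentioned above (the logs/ directory and the search term are hypothetical):

# case-insensitive (-i), recursive (-r) search through a directory of log files
grep -r -i "timeout" logs/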
sed
Sed is similar to grep and awk in many ways; however, I find that I most often use it when I need to do some find-and-replace magic on a very large file. The usual occurrence is when I’ve received a CSV file that was generated on Windows and my Mac isn’t able to handle the carriage returns properly.
grep Block data.csv | head -n 3
# 16:43||5-4|Juvonte Reddic Block.
# 15:37||7-6|Troy Daniels Block.
# 14:05|Raphiael Putney Block.|11-8|

sed -e 's/Block/Rejection/g' data.csv > rejection.csv
# replace all instances of the word 'Block' in data.csv with 'Rejection'
# stream the results to a new file called rejection.csv

grep Rejection rejection.csv | head -n 3
# 16:43||5-4|Juvonte Reddic Rejection.
# 15:37||7-6|Troy Daniels Rejection.
# 14:05|Raphiael Putney Rejection.|11-8|
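The Windows carriage-return problem mentioned above can be handled the same way. A minimal sketch, assuming GNU sed and a hypothetical data_windows.csv; on a Mac, where BSD sed does not understand \r, tr is a portable alternative:

# GNU sed: strip the trailing carriage return from every line
sed -e 's/\r$//' data_windows.csv > data_unix.csv

# portable alternative: delete all carriage returns with tr
tr -d '\r' < data_windows.csv > data_unix.csv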
sort & uniq
Sort outputs the lines of a file in order based on a column key using the -k parameter. If a key isn’t specified, sort will treat each line as a concatenated string and sort based on the values of the first column. The -n and -r parameters allow you to sort numerically and in reverse order, respectively.
head -n 5 data.csv
# time|away|score|home
# 20:00||0-0|Jump Ball won by Virginia Commonwealt.
# 19:45||0-0|Juvonte Reddic Turnover.
# 19:45|Chaz Williams Steal.|0-0|
# 19:39|Sampson Carter missed Layup.|0-0|

head -n 5 data.csv | sort
# 19:39|Sampson Carter missed Layup.|0-0|
# 19:45|Chaz Williams Steal.|0-0|
# 19:45||0-0|Juvonte Reddic Turnover.
# 20:00||0-0|Jump Ball won by Virginia Commonwealt.
# time|away|score|home

# columns separated by '|', sort on column 2 (-k2), case insensitive (-f)
head -n 5 data.csv | sort -f -t'|' -k2
# time|away|score|home
# 19:45|Chaz Williams Steal.|0-0|
# 19:39|Sampson Carter missed Layup.|0-0|
# 20:00||0-0|Jump Ball won by Virginia Commonwealt.
# 19:45||0-0|Juvonte Reddic Turnover.
Sometimes you want to check for duplicate records in a large text file – that’s when uniq comes in handy. By using the -c parameter, uniq will output the count of occurrences along with the line. You can also use the -d and -u parameters to output only duplicated or unique records.
sort data.csv | uniq -c | sort -nr | head -n 7
# 2 8:47|Maxie Esho missed Layup.|46-54|
# 2 8:47|Maxie Esho Offensive Rebound.|46-54|
# 2 7:38|Trey Davis missed Free Throw.|51-56|
# 2 12:12||16-11|Rob Brandenberg missed Free Throw.
# 1 time|away|score|home
# 1 9:51||20-11|Juvonte Reddic Steal.

sort data.csv | uniq -d
# 12:12||16-11|Rob Brandenberg missed Free Throw.
# 7:38|Trey Davis missed Free Throw.|51-56|
# 8:47|Maxie Esho Offensive Rebound.|46-54|
# 8:47|Maxie Esho missed Layup.|46-54|

sort data.csv | uniq -u | wc -l
# 369 (unique lines)
While it’s sometimes difficult to remember all of the parameters for the Unix commands, getting familiar with them has been beneficial to my productivity and allowed me to avoid many headaches when working with large text files.
Originally published at http://www.gregreda.com/2013/07/15/unix-commands-for-data-science/
About the author: Greg Reda is a data nerd at GrubHub in Chicago. He is obsessed with all things data, statistics, music, and craft beer. He also blogs at his own website.