  • You’ve written that “For me, clean data (cleansed by somebody else) is dead. It has lost its soul.” When data loses its soul, what is the effect on the general public?


This wasn’t meant to be a religious statement—it is a personal attitude. What happens when somebody cleans the data is that it loses the vital details that help me to identify all the potential issues. I can no longer ‘debug’ the data. The result, typically, is that I am missing something that ultimately will limit the performance I can tease out of it.

Think of it as a detective game: if somebody has cleaned the scene of the crime, you will have a hard time finding out what actually happened and your chances of catching the bad guy decrease considerably.

So, what is the effect on the general public? The data will be less useful. It will take us longer than it should to realize that the way we collected the data had issues and that the insights gleaned from it are potentially biased. There are no obvious threats; it just limits our ability to utilize the full potential of the data.


  • Big data is a hot topic these days. What are the pros and cons of this development? What would you like to see emerge from all the big-data talk taking place right now?

There are two major pros of a very different nature. On the technology side, the advances in cheap data collection and storage make the advantages of data analytics available to all layers of both the economy and society: from schools and nonprofits to startups and the big players. At the same time, the hype around big data has fueled tremendous excitement about using data and created an unprecedented demand for data scientists, which in turn has translated into an influx of talent. I have always seen the opportunities, and soon, due to the increased manpower and skill, we might actually get to realize them. There might be a few disappointments on the rocky road, but I am excited about the future.


  • Of all the technological advancements in the big data field over the past few years, which would you say has been the most significant, and why?

Clearly, the most amazing advancement is the lower cost of collecting and storing data, combined with access that is extremely fast and easy. Gone are the days of storing data somewhere on tape and needing 30 days to access it. Today we live in a world where we can collect terabytes daily and still access some tiny slice of it from May 4, 2010, in about one minute. This possibility opens the door to much more elaborate analysis AND, more importantly, allows every analysis to start from the raw data rather than from the curated (and soulless) aggregated summaries.

