Summary of “The Future of Data Science – Data Science @ Stanford”
Participants: Dan Boneh, PhD, Professor of Computer Science & Engineering, Stanford University; Euan Ashley, MD, PhD, Associate Professor of Medicine and Genetics, Stanford University; Vijay Pande, PhD, Professor of Chemistry, Stanford University; Hector Garcia-Molina, PhD, Professor of Computer Science & Electrical Engineering, Stanford University; John Hennessy, PhD, President, Stanford University
How has data science revolutionised different fields? There has been tremendous technological advancement in computer science and engineering, and one major example is the change in the cost of sequencing the human genome. A decade ago it cost three billion dollars to fund the Human Genome Project; over the last ten years that cost has dropped to about 1,000 dollars, roughly a million-fold drop in the cost of sequencing a human genome. This is a transformative role played by data science. One example: a baby was admitted to the ICU at the children's hospital with long QT syndrome, a condition associated with cardiac arrest. This little baby had multiple cardiac arrests on the first day of her life. We know the genetic basis of that condition. With the standard approach, a genetic test would take three months; with this new technology we can sequence the whole genome, producing six billion data points in 24 hours, analyse it with algorithms that used to take days and now take hours, and put the results back in the hands of the treating physicians within days.
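The cost drop above can be sanity-checked with a quick calculation. The round figures here are assumptions based on the numbers quoted in the discussion; the exact historical costs varied.

```python
# Approximate sequencing costs quoted in the discussion (assumed round figures).
human_genome_project_cost = 3_000_000_000  # ~ $3 billion, early 2000s
cost_per_genome_today = 1_000              # ~ $1,000 quoted in the talk

fold_drop = human_genome_project_cost / cost_per_genome_today
print(f"{fold_drop:,.0f}-fold reduction")  # 3,000,000-fold reduction
```

Taken literally, these round numbers give a three-million-fold reduction, the same order of magnitude as the "million-fold" figure quoted.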
Another application of data science in health care is the design of small-molecule drugs. Why can't engineers design small molecules in an analogous way? This is a very difficult problem, and there are many answers to it. The data is so large that no single person can work through it and make sense of it. So the promise of data science is that we will be able to algorithmically gain insights that are genuinely non-obvious.
Data science also has applications in traditional fields such as finance, commerce and manufacturing. These are not as trendy, but they are still very important, and a lot of great things can happen there: products and services can become cheaper, more efficient and safer, because we have much more data about the process, about manufacturing, and about customers, their needs and their desires. So it is a combination of more data and better ways to do things: more data, better algorithms, better services.
One thing that is possible now with data science is the work going on in deep learning. The interesting thing about deep learning is that it will give computers what we would call common sense: the ability to perceive relationships. That is happening now because we have learned enough about machine learning and we have enough data. You want to learn about the world? Go read the internet, write down all the relationships you find there, and build a system that can reason about them. So we need both large amounts of data and large amounts of computing power. Today we spend lots of money to do things that people can still do easily with far less, but in ten years it will be much cheaper, and that will be liberating.
One application of data science within computer science is computer security, as attackers are constantly doing new things. Here it is not so much a matter of large amounts of data and new algorithms; it is the fact that new types of data are being collected in order to detect attacks. For example, LinkedIn is constantly bombarded with fake accounts. How do you detect fake accounts in the massive sea of accounts they have? That detection is all based on data science.
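The panel does not describe how LinkedIn's detection works, but the general idea of scoring accounts on behavioural signals can be sketched minimally. The features, thresholds and weights below are entirely hypothetical, not LinkedIn's actual signals.

```python
# Toy sketch of feature-based fake-account scoring. All features and
# thresholds are illustrative assumptions, not any real platform's model.
def suspicion_score(account: dict) -> float:
    score = 0.0
    if account["profile_fields_filled"] < 3:
        score += 1.0  # sparse, hastily created profile
    if account["connections"] == 0 and account["invites_sent"] > 50:
        score += 2.0  # mass inviting with no existing network
    if account["account_age_days"] < 1:
        score += 0.5  # brand-new account
    return score

accounts = [
    {"profile_fields_filled": 8, "connections": 120, "invites_sent": 5,
     "account_age_days": 900},   # ordinary-looking account
    {"profile_fields_filled": 1, "connections": 0, "invites_sent": 200,
     "account_age_days": 0},     # bot-like account
]
flagged = [a for a in accounts if suspicion_score(a) >= 2.0]
print(len(flagged))  # prints 1: only the bot-like account is flagged
```

In practice such rules would be replaced by a classifier trained on labelled accounts, but the principle is the same: new *types* of data (behavioural signals) matter more than sheer volume.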
But is this really a new field, or is it just more computer science and more statistics? Is anything fundamentally new happening here? Are new techniques and new ideas being developed, or is this just a new name for an old concept?
Data science is definitely new. There has been a lot of data-related work over the past years, but what is really new is that a group of influential people have discovered there is great value and power in data, and in mining and analysing it. That is what is exciting, and it does not imply that everything is done.
“Anything that has to call itself a science isn’t one.” – Jeff Ullman. For example, computer science has its foundations in many other disciplines, but over time those grew into a core body of knowledge that distinguishes what computer scientists know: algorithms, complexity, computability and automata. The same thing is happening here: a core body of knowledge is being built on statistics, mathematics and other disciplines such as computer science, and it will become increasingly distinct as our application frontiers move forward over time.
What are the techniques that make data science its own unique, distinctive field, and not just machine learning, or just databases, or the things we had in the past? Are there specific things one can learn in order to become a data scientist?
We are using a collection of existing tools from different disciplines, so it is not distinctive in that way, and it may never be. Statistics and computer science can embrace these techniques and develop them further as a natural evolution of their own disciplines. So we don't really know what makes a data scientist a data scientist, but anyone who wants to become one needs to know machine learning, computation and statistics; these are the general areas one must know in order to call oneself a data scientist.
What is the role of the university in this new world? How do we get data that is proprietary and surrounded by privacy regulations, when companies are simply not going to share it with us? How are we going to do data science in academia if we do not have the core data? Not all fields are data rich, but many fields within academia are: there is a lot of genomic data available in the university's hospitals, astronomy is very data rich, and universities also have education data, such as admissions data, which can at some level be shared with the data science community. Companies want help from universities in studying their data, and the university can advise them on what to do with it, but at the end of the discussion we still need to ask ourselves: how do we get the data? There is also a lot of synthetic data in the medical and pharmaceutical fields, where simulations are used to test the effect of a drug. But can we rely on models developed around synthetic data, given that we fit the algorithm to match that synthetic data? To generate good synthetic data we can ask a good question, but can we rely on those questions when they are domain specific? There are also cases where the data does not sit in one place and needs to be brought together. In reality there are many challenging problems in getting data that is usable; it is not a matter of getting a file from somebody and being all set to go. We have to organise the data; we have to understand what a field means, how an instrument was used to measure it, and how to combine a reading in one field with a related reading that was taken with a different instrument or measured in a different way. So there are many challenges. How do we represent the accuracy of the data itself?
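The point about combining readings taken with different instruments can be made concrete with a small sketch. Suppose two sites record the same measurement in different units; the field names, sites and values below are hypothetical, and the conversion factor for glucose (mmol/L to mg/dL) is approximate.

```python
# Minimal sketch of harmonising two datasets whose "glucose" field was
# recorded in different units. Record layouts here are illustrative.
MG_DL_PER_MMOL_L = 18.0  # approximate conversion factor for glucose

site_a = [{"patient": "p1", "glucose_mg_dl": 99.0}]   # recorded in mg/dL
site_b = [{"patient": "p2", "glucose_mmol_l": 5.5}]   # recorded in mmol/L

def normalise(record: dict) -> dict:
    # Convert mmol/L readings to mg/dL so all records share one unit.
    if "glucose_mmol_l" in record:
        return {"patient": record["patient"],
                "glucose_mg_dl": record["glucose_mmol_l"] * MG_DL_PER_MMOL_L}
    return record

combined = [normalise(r) for r in site_a + site_b]
```

Even this trivial case requires knowing how each field was measured; real integration also has to carry along the accuracy and provenance of each reading, which is exactly the challenge the panel raises.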
How does academia adapt the curriculum to teach data science skills? Computer science can be taught in a project-oriented way: CS 106 students see the value of computer science very quickly because they are actually doing it. We can imagine data science taught through projects where the project involves analysing data; the students may not collect the data themselves, which might not be possible. Education should move towards a more project-based style, which makes students more excited. There has been some success with this model, where at the beginning of the class the instructor puts many data sets together and often arranges for students to meet the people who collected the data, to better understand it. The medical school faces slightly different challenges: students there are being trained to be doctors, not data scientists. But data science will be such an important part of medicine that, going forward, we need students to be much more quantitatively aware than they currently are, and medical schools have some initiatives to train students to understand data and decide whether they can trust it. If we think about core data science, we have many students who learn data science techniques in the context of a specific field, e.g. biology, chemistry or engineering; we should also think about a core educational programme that trains people who go deeper into the fundamental techniques and their application, and are perhaps less focused on any single application area. That is a good model to think about.
In one example, a father went to complain to Target about his teenage daughter being bombarded with ads for cribs and diapers, only to find out that his daughter was pregnant before she had told him. So how does Target actually do this? It turns out they took data from mothers who signed up for their baby registry and started analysing what soon-to-be mothers buy. In their second trimester they tend to buy unscented lotion, and in their first twenty weeks they load up on supplements such as calcium and magnesium. Target could then quickly infer that someone was expecting based purely on her purchase patterns. That is an example of crossing the ethical border with data. Should there be limits on what data is collected, and what, in general, do we do about the ethical questions?
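The purchase-pattern inference in the anecdote can be sketched as a simple weighted signal. The products come from the story above, but the weights and threshold are illustrative assumptions, not Target's actual model, which would be a statistically trained predictor over many more products.

```python
# Toy version of the purchase-pattern signal described in the anecdote.
# Weights and threshold are hypothetical, chosen only for illustration.
SIGNALS = {
    "unscented lotion": 0.5,  # second-trimester purchase pattern
    "calcium": 0.25,          # supplement loaded up on in first 20 weeks
    "magnesium": 0.25,
}
THRESHOLD = 0.7  # hypothetical cut-off for targeting baby-related ads

def pregnancy_signal(basket: list[str]) -> float:
    # Sum the weights of any signal products present in the basket.
    return sum(SIGNALS.get(item, 0.0) for item in basket)

basket = ["unscented lotion", "calcium", "magnesium", "shampoo"]
print(pregnancy_signal(basket) >= THRESHOLD)  # prints True
```

That such a crude correlation over everyday purchases can expose something as private as a pregnancy is precisely what makes the ethical question pressing.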