Doing Data Science: Straight Talk from the Frontline


Authors: Cathy O'Neil and Rachel Schutt

Publisher: O'Reilly Media


Publish Date: November 3, 2013

ISBN-10: 1449358659

Pages: 408

File Type: PDF

Language: English



Book Preface

Data science is an emerging field in industry, and as yet, it is not well-defined as an academic subject. This book represents an ongoing investigation into the central question: “What is data science?” It’s based on a class called “Introduction to Data Science,” which I designed and taught at Columbia University for the first time in the Fall of 2012.

In order to understand this book and its origins, it might help you to understand a little bit about me and what my motivations were for creating the class.


In short, I created a course that I wish had existed when I was in college, but that was the 1990s, and we weren’t in the midst of a data explosion, so the class couldn’t have existed back then. I was a math major as an undergraduate, and the track I was on was theoretical and proof-oriented. While I am glad I took this path, and feel it trained me for rigorous problem-solving, I would have also liked to have been exposed then to ways those skills could be put to use to solve real-world problems.

I took a wandering path between college and a PhD program in statistics, struggling to find my field and place, a place where I could put my love of finding patterns and solving puzzles to good use. I bring this up because many students feel they need to know what they are “going to do with their lives” now, and when I was a student, I couldn’t plan to work in data science as it wasn’t even yet a field. My advice to students (and anyone else who cares to listen): you don’t need to figure it all out now. It’s OK to take a wandering path. Who knows what you might find? After I got my PhD, I worked at Google for a few years around the same time that “data science” and “data scientist” were becoming terms in Silicon Valley.

The world is opening up with possibilities for people who are quantitatively minded and interested in putting their brains to work to solve the world’s problems. I consider it my goal to help these students to become critical thinkers, creative solvers of problems (even those that have not yet been identified), and curious question askers. While I myself may never build a mathematical model that is a piece of the cure for cancer, or that identifies the underlying mystery of autism, or that prevents terrorist attacks, I like to think that I’m doing my part by teaching students who might one day do these things. And by writing this book, I’m expanding my reach to an even wider audience of data scientists who I hope will be inspired by this book, or learn tools in it, to make the world better and not worse.

Building models and working with data is not value-neutral. You choose the problems you will work on, you make assumptions in those models, you choose metrics, and you design the algorithms.

The solutions to all the world’s problems may not lie in data and technology, and in fact, the mark of a good data scientist is someone who can identify problems that can be solved with data and is well-versed in the tools of modeling and code. But I do believe that interdisciplinary teams of people that include a data-savvy, quantitatively minded, coding-literate problem-solver (let’s call that person a “data scientist”) could go a long way.

Origins of the Class

I proposed the class in March 2012. At the time, there were three primary reasons. The first will take the longest to explain.

Reason 1: I wanted to give students an education in what it’s like to be a data scientist in industry and give them some of the skills data scientists have.

I was working on the Google+ data science team with an interdisciplinary team of PhDs. There was me (a statistician), a social scientist, an engineer, a physicist, and a computer scientist. We were part of a larger team that included talented data engineers who built the data pipelines, infrastructure, and dashboards, as well as the experimental infrastructure (A/B testing). Our team had a flat structure.

Together our skills were powerful, and we were able to do amazing things with massive datasets, including predictive modeling, prototyping algorithms, and unearthing patterns in the data that had huge impact on the product.

We provided leadership with insights for making data-driven decisions, while also developing new methodologies and novel ways to understand causality. Our ability to do this was dependent on top-notch engineering and infrastructure. We each brought a solid mix of skills to the team, which together included coding, software engineering, statistics, mathematics, machine learning, communication, visualization, exploratory data analysis (EDA), data sense, and intuition, as well as expertise in social networks and the social space.

To be clear, no one of us excelled at all those things, but together we did; we recognized the value of all those skills, and that’s why we thrived. What we had in common was integrity and a genuine interest in solving interesting problems, always with a healthy blend of skepticism as well as a sense of excitement over scientific discovery. We cared about what we were doing and loved unearthing patterns in the data.

I live in New York and wanted to bring my experience at Google back to students at Columbia University because I believe this is stuff they need to know, and because I enjoy teaching. I wanted to teach them what I had learned on the job. And I recognized that there was an emerging data scientist community in the New York tech scene, and I wanted students to hear from them as well.

One aspect of the class was that we had guest lectures by data scientists currently working in industry and academia, each of whom had a different mix of skills. We heard a diversity of perspectives, which contributed to a holistic understanding of data science.

Reason 2: Data science has the potential to be a deep and profound research discipline impacting all aspects of our lives. Columbia University and Mayor Bloomberg announced the Institute for Data Sciences and Engineering in July 2012. This course created an opportunity to develop the theory of data science and to formalize it as a legitimate science.

Reason 3: I kept hearing from data scientists in industry that you can’t teach data science in a classroom or university setting, and I took that on as a challenge. I thought of my classroom as an incubator of data science teams. The students I had were very impressive and are turning into top-notch data scientists. They’ve contributed a chapter to this book, in fact.

Upload Date: November 17, 2019
