Doing Data Science
O'Reilly Media (publisher)
9781449358655 (ISBN)
Now that people are aware that data can make the difference in an election or a business model, data science as an occupation is gaining ground. But how can you get started working in a wide-ranging, interdisciplinary field that’s so clouded in hype? This insightful book, based on Columbia University’s Introduction to Data Science class, tells you what you need to know.
In many of these chapter-long lectures, data scientists from companies such as Google, Microsoft, and eBay share new algorithms, methods, and models by presenting case studies and the code they use. If you’re familiar with linear algebra, probability, and statistics, and have programming experience, this book is an ideal introduction to data science.
Topics include:
- Statistical inference, exploratory data analysis, and the data science process
- Algorithms
- Spam filters, Naive Bayes, and data wrangling
- Logistic regression
- Financial modeling
- Recommendation engines and causality
- Data visualization
- Social networks and data journalism
- Data engineering, MapReduce, Pregel, and Hadoop
Doing Data Science is a collaboration between course instructor Rachel Schutt, Senior VP of Data Science at News Corp, and data science consultant Cathy O’Neil, a senior data scientist at Johnson Research Labs, who attended and blogged about the course.
Cathy O’Neil earned a Ph.D. in math from Harvard, was a postdoc in the MIT math department, and was a professor at Barnard College, where she published a number of research papers in arithmetic algebraic geometry. She then chucked it and switched over to the private sector. She worked as a quant for the hedge fund D.E. Shaw in the middle of the credit crisis, and then for RiskMetrics, a risk software company that assesses risk for the holdings of hedge funds and banks. She is currently a data scientist on the New York start-up scene, writes a blog at mathbabe.org, and is involved with Occupy Wall Street.
Rachel Schutt is the Senior Vice President for Data Science at News Corp. She earned a Ph.D. in Statistics from Columbia University and was a statistician at Google Research for several years. She is an adjunct professor in Columbia’s Department of Statistics and a founding member of the Education Committee for the Institute for Data Sciences and Engineering at Columbia. She holds several pending patents based on her work at Google, where she helped build user-facing products by prototyping algorithms and building models to understand user behavior. She has a master's degree in mathematics from NYU, and a master's degree in Engineering-Economic Systems and Operations Research from Stanford University. Her undergraduate degree is in Honors Mathematics from the University of Michigan.
Chapter 1 Introduction: What Is Data Science?
Big Data and Data Science Hype
Getting Past the Hype
Why Now?
The Current Landscape (with a Little History)
A Data Science Profile
Thought Experiment: Meta-Definition
OK, So What Is a Data Scientist, Really?
Chapter 2 Statistical Inference, Exploratory Data Analysis, and the Data Science Process
Statistical Thinking in the Age of Big Data
Exploratory Data Analysis
The Data Science Process
Thought Experiment: How Would You Simulate Chaos?
Case Study: RealDirect
Chapter 3 Algorithms
Machine Learning Algorithms
Three Basic Algorithms
Exercise: Basic Machine Learning Algorithms
Summing It All Up
Thought Experiment: Automated Statistician
Chapter 4 Spam Filters, Naive Bayes, and Wrangling
Thought Experiment: Learning by Example
Naive Bayes
Fancy It Up: Laplace Smoothing
Comparing Naive Bayes to k-NN
Sample Code in bash
Scraping the Web: APIs and Other Tools
Jake’s Exercise: Naive Bayes for Article Classification
Chapter 5 Logistic Regression
Thought Experiments
Classifiers
M6D Logistic Regression Case Study
Media 6 Degrees Exercise
Chapter 6 Time Stamps and Financial Modeling
Kyle Teague and GetGlue
Timestamps
Cathy O’Neil
Thought Experiment
Financial Modeling
Exercise: GetGlue and Timestamped Event Data
Chapter 7 Extracting Meaning from Data
William Cukierski
The Kaggle Model
Thought Experiment: What Are the Ethical Implications of a Robo-Grader?
Feature Selection
David Huffaker: Google’s Hybrid Approach to Social Research
Chapter 8 Recommendation Engines: Building a User-Facing Data Product at Scale
A Real-World Recommendation Engine
Thought Experiment: Filter Bubbles
Exercise: Build Your Own Recommendation System
Chapter 9 Data Visualization and Fraud Detection
Data Visualization History
What Is Data Science, Redux?
A Sample of Data Visualization Projects
Mark’s Data Visualization Projects
Data Science and Risk
Data Visualization at Square
Ian’s Thought Experiment
Data Visualization for the Rest of Us
Chapter 10 Social Networks and Data Journalism
Social Network Analysis at Morningside Analytics
Social Network Analysis
Terminology from Social Networks
Thought Experiment
Morningside Analytics
More Background on Social Network Analysis from a Statistical Point of View
Data Journalism
Chapter 11 Causality
Correlation Doesn’t Imply Causation
OK Cupid’s Attempt
The Gold Standard: Randomized Clinical Trials
A/B Tests
Second Best: Observational Studies
Three Pieces of Advice
Chapter 12 Epidemiology
Madigan’s Background
Thought Experiment
Modern Academic Statistics
Medical Literature and Observational Studies
Stratification Does Not Solve the Confounder Problem
Is There a Better Way?
Research Experiment (Observational Medical Outcomes Partnership)
Closing Thought Experiment
Chapter 13 Lessons Learned from Data Competitions: Data Leakage and Model Evaluation
Claudia’s Data Scientist Profile
Data Mining Competitions
How to Be a Good Modeler
Data Leakage
How to Avoid Leakage
Evaluating Models
Choosing an Algorithm
A Final Example
Parting Thoughts
Chapter 14 Data Engineering: MapReduce, Pregel, and Hadoop
About David Crawshaw
Thought Experiment
MapReduce
Word Frequency Problem
Other Examples of MapReduce
Pregel
About Josh Wills
Thought Experiment
On Being a Data Scientist
Economic Interlude: Hadoop
Back to Josh: Workflow
So How to Get Started with Hadoop?
Chapter 15 The Students Speak
Process Thinking
Naive No Longer
Helping Hands
Your Mileage May Vary
Bridging Tunnels
Some of Our Work
Chapter 16 Next-Generation Data Scientists, Hubris, and Ethics
What Just Happened?
What Is Data Science (Again)?
What Are Next-Gen Data Scientists?
Being an Ethical Data Scientist
Career Advice
Index
Colophon
| Publication date (per publisher) | 3 December 2013 |
|---|---|
| Additional info | Illustrations (colour) |
| Place of publication | Sebastopol |
| Language | English |
| Dimensions | 178 x 233 mm |
| Weight | 513 g |
| Binding | Paperback |
| Subject area | Computer Science ► Databases ► Data Warehouse / Data Mining |
| ISBN-13 | 9781449358655 |
| Condition | New |