Data Science (eBook)
198 pages
Azhar Sario Hungary (publisher)
978-3-384-75553-7 (ISBN)
Your 2025 Blueprint to Master Data Science: From Zero to Generative AI Hero!
This book is your all-in-one launchpad into data science in 2025. Part 1 nails the mindset: computational + inferential thinking + real-world relevance, straight from Berkeley's Data 8. You'll master the CRISP-DM lifecycle, craft killer project proposals, and dissect Walmart's inventory genius. Day-one tools? Python (NumPy, Pandas), Git branching, SQL window functions, and Tableau dashboards. Ethics isn't an afterthought: fairness, bias audits, EU AI Act compliance, and DAMA-DMBOK governance are baked in. Visualization? Tufte's rules, GeoPandas maps, and public-health climate dashboards. Part 2 is the ML engine: hypothesis tests, A/B frameworks, causal DAGs with DoWhy, linear/logistic regression, decision trees to XGBoost, PCA, K-means, and Kaggle-winning feature hacks. Privacy? Differential privacy, federated learning, and HIPAA-safe GenAI. Every chapter ends with job-ready tutorials, LeetCode SQL, and real case studies; no fluff, just code you can run today.
Other books teach yesterday's tricks; this one arms you for 2025's frontier. While competitors recycle 2019 Kaggle notebooks, we weave Generative AI reality checks (LLM hallucinations vs. causal rigor) into every model. You won't just predict; you'll deploy production-grade pipelines with Airflow, dbt, Great Expectations, and Evidently monitoring. No ivory-tower theory: every concept ties to revenue, risk, or regulation. Unlike dense textbooks, our bite-sized code labs, Git workflows, and compliance checklists get you hired faster. This is the only guide that treats ethics, governance, and GenAI as core muscles, not side quests.
Copyright © 2025 Azhar ul Haque Sario. This work is independently produced under nominative fair use and has no affiliation with UC Berkeley, Stanford, DAMA, DASCA, or any cited institution or company.
Part 2: Core Machine Learning and Statistical Inference
Statistical Inference and Experimental Design
Welcome to the foundational pillar of "inferential thinking." This chapter is arguably the most important in your journey from someone who can describe data to someone who can make claims from data. We are moving beyond observing what is in our dataset and learning the rigorous process of drawing conclusions about the world the data came from.
This is the statistical foundation for decision-making. We will cover classic hypothesis testing, the engine behind modern A/B testing, and then leap into the rapidly expanding field of causal inference, the skill that separates a data analyst from a data scientist and strategic partner.
6.1: Fundamentals of Statistical Inference: Hypothesis Testing, p-values, and Confidence Intervals
Knowledge (Theory): The Language of Uncertainty
Statistical inference is the formal process of using sample data to make judgments about a larger population. The core idea is that we can never be 100% certain. Our sample is just one of many possible samples we could have drawn. Because of this random chance, we need a framework to quantify our uncertainty. Hypothesis testing is that framework.
It all starts with two competing claims:
The Null Hypothesis (H0): This is the default assumption, the "status quo," or the "boring" hypothesis. It represents the idea that nothing interesting is happening. For example: "This new website design has no effect on sales," or "There is no difference in test scores between the two groups." We always assume the null hypothesis is true until the data strongly convinces us otherwise.
The Alternative Hypothesis (H1 or Ha): This is the claim we are testing, the new idea. It's what we hope to find evidence for. For example: "The new website design increases sales," or "There is a difference in test scores."
Our data will help us decide between these two. How? By calculating a p-value.
The p-value is the most critical and most misunderstood concept in all of statistics. Let's be very clear: The p-value is not the probability that the null hypothesis is true.
Instead, the p-value is the "surprise factor." It's the probability of seeing data as extreme as ours (or more extreme) assuming the null hypothesis is true.
A high p-value (e.g., p = 0.50) means: "This data is not surprising at all. It looks exactly like the kind of random noise we'd expect if the null hypothesis were true." We fail to reject the null hypothesis.
A low p-value (e.g., p = 0.01) means: "Wow, this data would be extremely surprising (a 1-in-100 chance) if the null hypothesis were true. It's so surprising that we're starting to doubt the null hypothesis." We reject the null hypothesis in favor of the alternative.
The cutoff for "surprising" is called the significance level (alpha), typically set at 0.05.
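The "surprise factor" idea can be made concrete with a short simulation. This is a minimal sketch with an invented coin-flip scenario (not from the book): we observe 60 heads in 100 flips and ask how often a fair coin would produce a result at least that extreme.

```python
# Hypothetical example: is 60 heads in 100 flips surprising under H0?
# H0: the coin is fair (p = 0.5). We simulate the "null world" many times
# and count how often it produces a result at least as extreme as ours.
import random

random.seed(42)  # reproducible runs

observed_heads = 60
n_flips = 100
n_simulations = 10_000

extreme = 0
for _ in range(n_simulations):
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    # Two-sided: count outcomes at least as far from 50 as our observation.
    if abs(heads - 50) >= abs(observed_heads - 50):
        extreme += 1

p_value = extreme / n_simulations
print(f"Simulated p-value: {p_value:.4f}")
```

The simulated p-value should land near the exact binomial answer (roughly 0.05 to 0.06 for this setup): 60 heads is borderline surprising, which is exactly the gray zone where the alpha = 0.05 cutoff earns its keep.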
Finally, we have Confidence Intervals (CIs). While a p-value gives a simple "yes/no" decision, a confidence interval gives a range. A 95% confidence interval is a range of plausible values for the true population parameter. If we're testing the effect of a new drug, the p-value might tell us "Yes, the effect is significant." The 95% CI tells us how big that effect might be, for example, "We are 95% confident the true effect is between a 2-point and 10-point reduction in blood pressure." This is far more useful for making real-world decisions.
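The drug example above can be sketched in code. The numbers here (n = 100 patients, a mean reduction of 6 points, standard deviation 20) are assumptions chosen so that the resulting interval matches the 2-to-10-point range in the text; this uses the large-sample normal approximation rather than the t-distribution.

```python
# Sketch: a large-sample 95% confidence interval for a mean.
# All numbers are invented for illustration.
from math import sqrt
from statistics import NormalDist

n = 100             # sample size (patients)
sample_mean = 6.0   # observed average blood-pressure reduction (points)
sample_sd = 20.0    # observed standard deviation

z = NormalDist().inv_cdf(0.975)    # ~1.96 critical value for 95% coverage
margin = z * sample_sd / sqrt(n)   # critical value times the standard error

lower, upper = sample_mean - margin, sample_mean + margin
print(f"95% CI: ({lower:.2f}, {upper:.2f})")  # roughly (2.08, 9.92)
```

The interval, roughly a 2-to-10-point reduction, carries far more decision-making information than the bare statement "p < 0.05."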
Application (Tutorial): Putting scipy.stats to Work
In Python, we don't calculate p-values by hand. We use the scipy.stats library, a powerful toolkit for running these tests in seconds.
One-Sample t-test: We use this when we want to compare the average of our one sample to a known, fixed number.
Example: A company's support center claims its average customer call time is 180 seconds. We take a random sample of 50 calls and find the average is 195 seconds. Is our sample average significantly different from their claim of 180?
Python: We would use scipy.stats.ttest_1samp(our_sample_data, 180).
Interpretation: This function returns a t-statistic and a two-sided p-value. If the p-value is less than 0.05, we conclude the average call time differs from 180 seconds; and since our sample average (195 seconds) sits above the claim, the evidence points to calls being significantly longer.
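Here is a runnable sketch of that one-sample test. The call-time values are invented for illustration; in practice you would load your 50 sampled calls instead.

```python
# One-sample t-test: does the average call time differ from 180 seconds?
from scipy import stats

# Hypothetical sample of call durations in seconds (mean ~194)
call_times = [190, 200, 185, 210, 195, 180, 205, 198, 192, 188]

t_stat, p_value = stats.ttest_1samp(call_times, popmean=180)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# Expect a small p-value here (well below 0.05): this sample's mean is
# far enough above 180 that random noise is an unconvincing explanation.
```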
Two-Sample t-test: We use this when we want to compare the averages of two independent groups.
Example: A marketing team sends two different email subject lines (Group A and Group B) to two random sets of users. They want to know if there is a difference in the average purchase value from the users who opened the email.
Python: We would use scipy.stats.ttest_ind(group_a_purchases, group_b_purchases).
Interpretation: A low p-value would tell us: "Yes, there is a statistically significant difference in average purchase value between the two subject lines." This forms the analytical core of A/B testing.
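The two-sample version looks almost identical in code. The purchase values below are invented, and the groups are kept unrealistically small purely for readability.

```python
# Two-sample t-test: do the two subject lines drive different purchase values?
from scipy import stats

group_a_purchases = [20, 22, 19, 25, 30, 18, 24, 21]  # subject line A
group_b_purchases = [28, 35, 30, 32, 26, 40, 29, 33]  # subject line B

t_stat, p_value = stats.ttest_ind(group_a_purchases, group_b_purchases)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A p-value below 0.05 indicates a statistically significant difference
# in average purchase value between the two subject lines.
```

By default `ttest_ind` pools the two variances; if the groups' variances look very different, pass `equal_var=False` to run Welch's t-test instead.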
Job Skills
Statistical Hypothesis Testing: This is the formal skill of framing a business question (e.g., "Did the new feature work?") into a testable H0/H1 framework and interpreting the results.
scipy.stats: This is the practical, hands-on Python skill for executing the most common statistical tests.
Statistical Analysis: This is the broader skill of knowing which test to use, checking its assumptions (e.g., are the groups independent? Is the data normally distributed?), and clearly communicating the results to non-technical stakeholders.
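Assumption-checking can itself be done with scipy.stats. This is a minimal sketch (with invented data) of two common pre-flight checks before a two-sample t-test: Shapiro-Wilk for normality and Levene's test for equal variances.

```python
# Two common assumption checks before running a two-sample t-test.
from scipy import stats

group_a = [20, 22, 19, 25, 30, 18, 24, 21]  # invented illustration data
group_b = [28, 35, 30, 32, 26, 40, 29, 33]

# Normality within each group (H0: the sample comes from a normal distribution)
_, p_norm_a = stats.shapiro(group_a)
_, p_norm_b = stats.shapiro(group_b)

# Equal variances across groups (H0: the group variances are equal)
_, p_equal_var = stats.levene(group_a, group_b)

print(f"Shapiro p-values: A={p_norm_a:.3f}, B={p_norm_b:.3f}")
print(f"Levene p-value: {p_equal_var:.3f}")
# Large p-values mean no evidence against the assumptions. If Levene's
# test rejects, switch to Welch's test via ttest_ind(..., equal_var=False).
```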
6.2: Designing and Analyzing Live Experiments: The Framework for A/B Testing
Knowledge (Theory): The Gold Standard for Decisions
A/B testing (or more broadly, randomized controlled trials) is the primary commercial application of hypothesis testing. It is the gold standard for business decision-making because it solves the "correlation vs. causation" problem. By randomly assigning users to groups, we (on average) break any pre-existing correlations and ensure the only systematic difference between the groups is the one thing we are testing.
Here is the full lifecycle of a professional experiment:
Formulate the Hypothesis: This is the most important step. We must state a clear, testable question linked to a business metric.
Bad Hypothesis: "Our new homepage is better."
Good Hypothesis: "The new homepage design (version B) will cause a 2% or greater increase in the click-through rate on the 'Sign Up' button compared to the old design (version A)."
This translates directly to our H0/H1:
H0: Conversion Rate(A) = Conversion Rate(B). (The new button has no effect).
H1: Conversion Rate(B) > Conversion Rate(A). (The new button increases conversion rate).
Determine Sample Size (Power Analysis): We can't just run the test for "a few days." We must run it long enough to collect a large enough sample. A power analysis tells us the minimum sample size needed to reliably detect a meaningful effect. If we don't have enough "power," we might (incorrectly) conclude there is no difference, even when a real one exists.
Run the Test (Random Assignment): This is the magic. Users are randomly assigned to either the "control" (Group A) or the "treatment" (Group B, also called "variant"). This randomization is key. It ensures that, in large samples, both groups will have a similar mix of users (new vs. old, domestic vs. international, high-intent vs. low-intent), so we can be confident any difference we see is caused by our change.
Analyze the Results: After the test has run and collected the pre-determined sample size, we stop the test and check for statistical significance. We use the p-values and confidence intervals we learned about in 6.1. If our p-value is below 0.05, we declare a "winner" and can be confident the lift we see is real and not just random chance.
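The power-analysis step can be sketched with the standard normal-approximation formula for comparing two proportions. The baseline rate (10%) and minimum detectable effect (+2 percentage points, echoing the sign-up-button hypothesis above), alpha = 0.05, and power = 0.80 are all assumptions for illustration; libraries such as statsmodels expose equivalent calculators.

```python
# Back-of-the-envelope sample size for a two-proportion A/B test,
# using the textbook normal-approximation formula. All inputs are
# illustrative assumptions, not values from a real experiment.
from math import ceil, sqrt
from statistics import NormalDist

p1 = 0.10      # baseline conversion rate (control)
p2 = 0.12      # smallest lift worth detecting (+2 percentage points)
alpha = 0.05   # significance level (two-sided)
power = 0.80   # probability of detecting the effect if it is real

z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96
z_beta = NormalDist().inv_cdf(power)           # ~0.84
p_bar = (p1 + p2) / 2                          # average of the two rates

numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
             + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
n_per_group = ceil(numerator / (p1 - p2) ** 2)
print(f"Minimum sample size per group: {n_per_group}")  # ~3,841
```

Notice the scale: detecting a 2-point lift on a 10% baseline needs several thousand users per group, which is why "run it for a few days" is not a sampling plan.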
Application (Case Study): Is a 1% Lift Real?
Let's walk through a classic scenario.
A company is testing a new, green "Add to Cart" button (Version B) against its old, blue button (Version A).
Version A (Control): Shown to 5,000 users. 500 of them converted (added to cart).
Conversion Rate (A) = 500 / 5,000 = 10.0%
Version B (Treatment): Shown to 5,000 users. 550 of them converted.
Conversion Rate (B) = 550 / 5,000 = 11.0%
We see an observed lift of 1 percentage point (an 11% conversion rate is a 10% relative lift over a 10% rate). The business question is: Is this 1% lift a real, repeatable effect, or was Group B just "luckier" by random chance?
Since the outcome is binary (converted or not converted) and we have large samples, we don't use a t-test. We use a two-sample z-test for proportions.
H0: The true conversion rates are equal (p_A = p_B).
H1: The true conversion rate for B is greater than A (p_B > p_A).
We would use a library like statsmodels in Python to run this test (e.g., proportions_ztest). This test calculates the pooled proportion, the standard error, and then the z-score. This z-score tells us how many standard errors away from "no difference" our observed 1% lift is.
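The arithmetic behind that test is short enough to write out directly; `statsmodels.stats.proportion.proportions_ztest` wraps the same computation. Using the counts from the scenario above:

```python
# Pooled two-proportion z-test for the button experiment, computed by hand.
from math import sqrt
from statistics import NormalDist

conv_a, n_a = 500, 5000  # control: blue button
conv_b, n_b = 550, 5000  # treatment: green button

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)                # pooled proportion under H0
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # standard error under H0

z = (p_b - p_a) / se
p_value = 1 - NormalDist().cdf(z)   # one-sided, since H1 is p_B > p_A
print(f"z = {z:.3f}, one-sided p = {p_value:.4f}")
# z is about 1.63, so the one-sided p-value lands just above 0.05: on these
# sample sizes, a 1-point lift is suggestive but not yet conclusive.
```

This is a useful cautionary result: an observed lift that looks meaningful to the business can still fail to clear the significance bar, which is exactly why the sample size is fixed by power analysis before the test starts.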
| Publication date (per publisher) | 15.11.2025 |
|---|---|
| Language | English |
| Subject area | Mathematics / Computer Science ► Computer Science ► Networks |
| Keywords | causal inference • Data Ethics • data science 2025 • generative AI • machine learning • python pandas • SQL advanced |
| ISBN-10 | 3-384-75553-7 / 3384755537 |
| ISBN-13 | 978-3-384-75553-7 / 9783384755537 |
Digital Rights Management: no DRM
This eBook contains no DRM or copy protection. Passing it on to third parties is nevertheless not legally permitted, because the purchase grants only the rights to personal use.
File format: EPUB (Electronic Publication)
EPUB is an open standard for eBooks and is particularly well suited to fiction and non-fiction. The reflowable text adapts dynamically to the display and font size, which also makes EPUB a good fit for mobile reading devices.
System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You will need the free Adobe Digital Editions software.
eReader: This eBook can be read on (almost) all eBook readers. It is, however, not compatible with the Amazon Kindle.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You will need a free app.