AI and ML Unlocked (eBook)
150 pages
Publishdrive (publisher)
978-0-00-104716-7 (ISBN)
AI and ML Unlocked: A Course Book Bridging Fundamentals and Industry Challenges
From Foundational Concepts to Real-World Deployment and Ethical Considerations
Transform Your Understanding of Artificial Intelligence from Theory to Practice
In a world where artificial intelligence shapes everything from the photos on your phone to life-saving medical diagnoses, understanding how these systems work isn't just advantageous; it's essential. AI and ML Unlocked, written with the help of AI, bridges the critical gap between abstract mathematical concepts and the practical skills needed to build, deploy, and responsibly manage AI systems that create real value.
Why This Book Stands Apart
Most AI education falls into two camps: dense academic texts that bury practical insights under layers of theory, or superficial tutorials that show you how to use tools without understanding why they work. This book takes a third path: learning through building. Every mathematical concept connects directly to code you'll write. Every algorithm comes alive through projects you'll complete. Every ethical consideration emerges from real systems you'll design.
The 'Spiral of Understanding' Approach
Our unique pedagogical framework ensures deep, lasting comprehension:
Intuitive Foundation: Start with analogies and real-world examples that make complex ideas feel natural
Mathematical Clarity: Build rigorous understanding without drowning in notation
Hands-On Implementation: Strengthen knowledge through immediate practical application
Critical Analysis: Develop judgment about when, how, and whether to deploy different techniques
What You'll Master
Part I: The Foundation That Actually Matters
Move beyond memorizing definitions to understanding what makes machine learning fundamentally different from traditional programming. Grasp the mathematical concepts that power every AI system (linear algebra, calculus, and probability) through intuitive explanations and Python implementations that illuminate rather than intimidate.
Part II: Supervised and Unsupervised Learning in Action
Build classification and regression systems that solve real problems. Master decision trees, support vector machines, and clustering algorithms through projects with actual datasets. Learn not just how these algorithms work, but when to use each one and how to evaluate their performance honestly.
Part III: Deep Learning and Generative AI
Construct neural networks from scratch, then scale up to convolutional networks that can see and transformers that can understand language. Explore the cutting-edge world of generative AI and large language models, understanding both their remarkable capabilities and their significant limitations.
Part IV: The Production Reality
Bridge the notorious gap between promising prototypes and production systems. Master MLOps practices, learn to deploy models that can handle real-world scale and complexity, and understand how to monitor and maintain AI systems over time. Work through detailed case studies from healthcare, finance, and manufacturing.
Part V: Responsible AI Leadership
Develop the critical thinking skills to navigate bias, fairness, and explainability challenges. Understand the societal implications of AI systems and learn frameworks for making ethical decisions in high-stakes applications. Prepare for the evolving landscape of AI governance and regulation.
Chapter 3: Data and Preprocessing - The Unsung Heroes
Here's a truth that might surprise you: in most machine learning projects, you'll spend far more time working with data than building models. Data preprocessing isn't glamorous, but it's absolutely critical. A brilliant algorithm trained on poor data will fail, while a simple algorithm trained on high-quality, well-prepared data can achieve remarkable results.
The Importance of Data: Garbage In, Garbage Out
The phrase "garbage in, garbage out" is fundamental in data science. Your model can only learn patterns that exist in your training data. If that data is incomplete, biased, or irrelevant, your model will learn the wrong lessons.
Consider a resume screening system trained only on resumes from successful hires over the past 10 years. If historical hiring was biased toward certain demographics, the model will learn and perpetuate those biases. The algorithm isn't inherently biased—it's learning from biased historical data.
Data Quality Dimensions
High-quality data has several characteristics:
- Accuracy: The data correctly represents reality
- Completeness: No important information is missing
- Consistency: The same information is represented the same way everywhere
- Relevance: The data is actually useful for your problem
- Timeliness: The data is current and reflects the present situation
Data Structures and Types: Understanding Your Raw Materials
Data comes in many forms, and understanding these different types helps you choose appropriate preprocessing techniques and algorithms.
Tabular Data: The Familiar Spreadsheet
Tabular data is what most people think of when they hear "data"—rows and columns like a spreadsheet. Each row represents one observation (a customer, a transaction, a patient), and each column represents one feature or attribute.
```python
import pandas as pd
import numpy as np

# Creating sample customer data
customer_data = pd.DataFrame({
    'customer_id': [1001, 1002, 1003, 1004, 1005],
    'age': [25, 34, 28, 42, 31],
    'income': [45000, 78000, 52000, 95000, 63000],
    'city': ['New York', 'Chicago', 'New York', 'Los Angeles', 'Chicago'],
    'purchases_last_year': [12, 8, 15, 22, 9],
    'customer_since': ['2020-03-15', '2019-07-22', '2021-01-08', '2018-11-30', '2020-09-12']
})

print(customer_data.head())
print(f"\nData shape: {customer_data.shape}")
print(f"Data types:\n{customer_data.dtypes}")
```
Time-Series Data: When Order Matters
Time-series data is collected over time, and the order of observations matters. Stock prices, sensor readings, website traffic, and sales data are common examples.
```python
# Creating sample time-series data
dates = pd.date_range('2023-01-01', '2023-12-31', freq='D')
np.random.seed(42)

# Simulate daily sales with trend and seasonality
trend = np.linspace(100, 150, len(dates))
seasonality = 20 * np.sin(2 * np.pi * np.arange(len(dates)) / 365.25)
noise = np.random.normal(0, 5, len(dates))
sales = trend + seasonality + noise

sales_data = pd.DataFrame({
    'date': dates,
    'daily_sales': sales
})

print(sales_data.head())
print(f"Sales range: ${sales_data['daily_sales'].min():.2f} to ${sales_data['daily_sales'].max():.2f}")
```
Text Data: The Challenge of Human Language
Text data presents unique challenges because computers don't naturally understand human language. Text needs to be converted into numerical representations before machine learning algorithms can work with it.
```python
# Sample text data - customer reviews
reviews_data = pd.DataFrame({
    'review_id': [1, 2, 3, 4, 5],
    'rating': [5, 2, 4, 1, 5],
    'review_text': [
        'Absolutely love this product! Fast delivery and great quality.',
        'Disappointed with the purchase. Poor quality and overpriced.',
        'Good value for money. Works as expected.',
        'Terrible experience. Product broke after one day.',
        'Excellent service and amazing product quality!'
    ]
})

print(reviews_data)
print(f"\nAverage rating: {reviews_data['rating'].mean():.1f}")
```
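The reviews above are still raw text. As a minimal sketch of the conversion just described, the snippet below turns each review into a bag-of-words count vector using scikit-learn's CountVectorizer; it reuses the reviews_data frame from the previous example, and the choice of vectorizer is only one of several common options (TF-IDF weighting is another).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Convert each review into a vector of word counts (bag-of-words)
vectorizer = CountVectorizer()
X_text = vectorizer.fit_transform(reviews_data['review_text'])

print(f"Matrix shape: {X_text.shape}")                      # (5 reviews, vocabulary size)
print(f"Vocabulary sample: {sorted(vectorizer.vocabulary_)[:8]}")
```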
Image Data: Pixels as Features
Image data consists of pixels, where each pixel has color values. A grayscale image has one value per pixel (0-255), while color images typically have three values (RGB) per pixel.
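As a small, self-contained illustration of this idea (a sketch of our own, not code from the book), a grayscale image is simply a 2-D NumPy array of intensities that can be flattened into one feature per pixel:

```python
import numpy as np

# A tiny 4x4 "grayscale image": each entry is a pixel intensity from 0 to 255
image = np.array([
    [  0,  50, 100, 150],
    [ 25,  75, 125, 175],
    [ 50, 100, 150, 200],
    [ 75, 125, 175, 255]
], dtype=np.uint8)

print(f"Image shape: {image.shape}")      # (4, 4) -> 16 pixels
features = image.flatten()                # one feature per pixel
print(f"Feature vector length: {len(features)}")

# A color (RGB) image of the same size would have shape (4, 4, 3): three values per pixel
```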
Data Cleaning and Wrangling: Turning Mess into Gold
Real-world data is messy. It has missing values, inconsistent formats, duplicates, and errors. Data cleaning is the process of detecting and correcting these issues.
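Missing values get their own treatment below. For the other problems just mentioned, here is a brief sketch on made-up data (the orders frame and its values are purely illustrative) showing two routine clean-up steps: dropping duplicate rows and normalizing inconsistent text formats.

```python
import pandas as pd

# Illustrative messy records: a duplicated row and inconsistent city spellings
orders = pd.DataFrame({
    'order_id': [1, 2, 2, 3],
    'city': ['New York', 'new york ', 'new york ', 'Chicago'],
    'amount': [120.0, 80.0, 80.0, 95.0]
})

# Remove exact duplicate rows
orders = orders.drop_duplicates()

# Normalize text formatting so the same city is represented the same way everywhere
orders['city'] = orders['city'].str.strip().str.title()

print(orders)
```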
Handling Missing Values
Missing data is one of the most common issues you'll encounter. There are several strategies for dealing with it:
```python
# Creating data with missing values to demonstrate handling techniques
# (entries after 'Alice' and 'Bob' are illustrative placeholders)
messy_data = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol', 'Dave', 'Eve'],
    'age': [29, np.nan, 35, 41, np.nan],
    'income': [52000, 61000, np.nan, 58000, 72000]
})

print(messy_data)
```
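To make those strategies concrete, here is a short sketch built on the illustrative messy_data frame above; the specific fill choices (mean, median, a zero sentinel plus an indicator flag) are our own demonstration defaults, not the only reasonable ones.

```python
# Strategy 1: drop rows that contain any missing value
dropped = messy_data.dropna()

# Strategy 2: fill numeric gaps with a summary statistic
filled = messy_data.copy()
filled['age'] = filled['age'].fillna(filled['age'].mean())
filled['income'] = filled['income'].fillna(filled['income'].median())

# Strategy 3: fill with a sentinel value and keep an indicator column
flagged = messy_data.copy()
flagged['income_missing'] = flagged['income'].isna()
flagged['income'] = flagged['income'].fillna(0)

print(f"Rows before/after dropna: {len(messy_data)} -> {len(dropped)}")
print(filled)
print(flagged)
```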
The Kernel Trick
The real power of SVMs comes from the "kernel trick". Sometimes data isn't separable with a straight line, but it becomes separable if we transform it to a higher dimension. Kernels allow SVMs to implicitly work in these higher dimensions without explicitly computing the transformation.
Linear Kernel: Finds straight-line boundaries. Good for linearly separable data.
RBF (Radial Basis Function) Kernel: Creates circular/curved boundaries. Good for complex, non-linear patterns.
Polynomial Kernel: Creates polynomial-curved boundaries. Good for data with polynomial relationships.
```python
# Demonstrating kernels with non-linear data
# (scikit-learn imports repeated here so the snippet runs on its own)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Create circular data that isn't linearly separable
np.random.seed(42)
n_samples = 300

# Inner circle (class 0)
angles_inner = np.random.uniform(0, 2*np.pi, n_samples//2)
radii_inner = np.random.uniform(0, 1, n_samples//2)
inner_x = radii_inner * np.cos(angles_inner) + np.random.normal(0, 0.1, n_samples//2)
inner_y = radii_inner * np.sin(angles_inner) + np.random.normal(0, 0.1, n_samples//2)

# Outer ring (class 1)
angles_outer = np.random.uniform(0, 2*np.pi, n_samples//2)
radii_outer = np.random.uniform(2, 3, n_samples//2)
outer_x = radii_outer * np.cos(angles_outer) + np.random.normal(0, 0.1, n_samples//2)
outer_y = radii_outer * np.sin(angles_outer) + np.random.normal(0, 0.1, n_samples//2)

# Combine the data
X_circular = np.column_stack([
    np.concatenate([inner_x, outer_x]),
    np.concatenate([inner_y, outer_y])
])
y_circular = np.concatenate([np.zeros(n_samples//2), np.ones(n_samples//2)])

# Split and scale
X_train_circ, X_test_circ, y_train_circ, y_test_circ = train_test_split(
    X_circular, y_circular, test_size=0.3, random_state=42
)
scaler_circ = StandardScaler()
X_train_circ_scaled = scaler_circ.fit_transform(X_train_circ)
X_test_circ_scaled = scaler_circ.transform(X_test_circ)

# Compare linear vs RBF kernel on circular data
linear_svm = SVC(kernel='linear', random_state=42)
rbf_svm = SVC(kernel='rbf', random_state=42)
linear_svm.fit(X_train_circ_scaled, y_train_circ)
rbf_svm.fit(X_train_circ_scaled, y_train_circ)

linear_score = linear_svm.score(X_test_circ_scaled, y_test_circ)
rbf_score = rbf_svm.score(X_test_circ_scaled, y_test_circ)

print("\nCircular Data Classification:")
print(f"Linear SVM accuracy: {linear_score:.3f}")
print(f"RBF SVM accuracy: {rbf_score:.3f}")
print(f"RBF improvement: {rbf_score - linear_score:.3f}")

print("\nWhy RBF works better:")
print("- Linear SVM tries to draw straight lines through circular patterns")
print("- RBF SVM can create curved boundaries that follow the circular structure")
```
SVM Hyperparameters
SVMs have important hyperparameters that control their behavior:
C (Regularization parameter): Controls the trade-off between a smooth decision boundary and classifying every training point correctly. Higher C = less regularization = more complex boundaries.
gamma (for RBF kernel): Controls how far the influence of a single training example reaches. Higher gamma = more complex boundaries.
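As a quick, self-contained illustration of these two knobs (a sketch on synthetic make_moons data, not the book's dataset), note how larger C and gamma values let an RBF model fit its training set more and more tightly:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Synthetic non-linear data (illustrative only)
X_demo, y_demo = make_moons(n_samples=300, noise=0.3, random_state=42)

for C in [0.1, 1, 100]:
    for gamma in [0.1, 1, 10]:
        model = SVC(kernel='rbf', C=C, gamma=gamma).fit(X_demo, y_demo)
        train_acc = model.score(X_demo, y_demo)
        n_sv = model.support_vectors_.shape[0]
        print(f"C={C:>5}, gamma={gamma:>4}: train accuracy={train_acc:.2f}, support vectors={n_sv}")
```

Higher training accuracy here does not by itself mean better generalization; finding settings that also hold up on unseen data is exactly what the cross-validated grid search below is for.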
```python
# Hyperparameter tuning for SVM
from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 'auto', 0.001, 0.01, 0.1, 1]
}

# Grid search with cross-validation
svm_grid = SVC(kernel='rbf', random_state=42)
grid_search = GridSearchCV(svm_grid, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train_circ_scaled, y_train_circ)

print("SVM Hyperparameter Tuning Results:")
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")

# Test the best model
best_svm = grid_search.best_estimator_
test_score_tuned = best_svm.score(X_test_circ_scaled, y_test_circ)
print(f"Test accuracy with tuned parameters: {test_score_tuned:.3f}")

# Compare with default parameters
print(f"Improvement from tuning: {test_score_tuned - rbf_score:.3f}")
```
Performance Metrics: Beyond Accuracy
While accuracy is a good starting point, real-world problems often require more nuanced evaluation metrics. Let's explore advanced metrics that give deeper insights into model performance.
ROC Curves and AUC
The ROC (Receiver Operating Characteristic) curve plots True Positive Rate vs. False Positive Rate at various threshold settings. The AUC (Area Under Curve) summarizes this into a single number.
```python
# ROC Curves and AUC analysis
from sklearn.metrics import roc_curve, auc, roc_auc_score

# Get probability predictions from different models
models_for_roc = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(kernel='rbf', probability=True, ...
```
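As a minimal, self-contained sketch of the computation described above (our own synthetic example rather than the chapter's dataset), this is how roc_curve and roc_auc_score are typically applied to a model's probability scores:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit a model and get probability scores for the positive class
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]

# True/false positive rates at every threshold, summarized by the AUC
fpr, tpr, thresholds = roc_curve(y_test, scores)
print(f"AUC: {roc_auc_score(y_test, scores):.3f}")
```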
| Publication date (per publisher) | 3 September 2025 |
|---|---|
| Language | English |
| Subject area | Technology |
| ISBN-10 | 0-00-104716-7 / 0001047167 |
| ISBN-13 | 978-0-00-104716-7 / 9780001047167 |
Size: 16.4 MB
Copy protection: Adobe DRM
Adobe DRM is a copy-protection scheme intended to protect the eBook against misuse. The eBook is authorized to your personal Adobe ID at download time, and it can then be read only on devices that are also registered to that Adobe ID.
File format: EPUB (Electronic Publication)
EPUB is an open standard for eBooks and is particularly well suited to fiction and non-fiction. The text reflows dynamically to match the display and font size, which also makes EPUB a good choice for mobile reading devices.
Buying eBooks from abroad
For tax reasons, we can only sell eBooks within Germany and Switzerland. Unfortunately, we cannot fulfill eBook orders from other countries.