Fundamentals of Robust Machine Learning (eBook)
736 pages
Wiley (publisher)
978-1-394-29438-1 (ISBN)
An essential guide for tackling outliers and anomalies in machine learning and data science.
In recent years, machine learning (ML) has transformed virtually every area of research and technology, becoming one of the key tools for data scientists. Robust machine learning is a new approach to handling outliers in datasets, an often-overlooked aspect of data science. Ignoring outliers can lead to bad business decisions, wrong medical diagnoses, incorrect conclusions, or misjudged feature importance, to name just a few consequences.
Fundamentals of Robust Machine Learning offers a thorough but accessible overview of this subject by focusing on how to properly handle outliers and anomalies in datasets. There are two main approaches described in the book: using outlier-tolerant ML tools, or removing outliers before using conventional tools. Balancing theoretical foundations with practical Python code, it provides all the necessary skills to enhance the accuracy, stability and reliability of ML models.
Fundamentals of Robust Machine Learning readers will also find:
- A blend of robust statistics and machine learning principles
- Detailed discussion of a wide range of robust machine learning methodologies, from robust clustering, regression and classification, to neural networks and anomaly detection
- Python code with immediate application to data science problems
Fundamentals of Robust Machine Learning is ideal for undergraduate or graduate students in data science, machine learning, and related fields, as well as for professionals in the field looking to enhance their understanding of building models in the presence of outliers.
Resve Saleh (PhD, UC Berkeley) is a Professor Emeritus at the University of British Columbia. He worked for a decade as a professor at the University of Illinois and as a visiting professor at Stanford University. He was Founder and Chairman of Simplex Solutions, Inc., which went public in 2001. He is an IEEE Fellow and a Fellow of the Canadian Academy of Engineering.
Sohaib Majzoub (PhD, University of British Columbia) is an Associate Professor at the University of Sharjah, UAE. He also taught at the American University in Dubai, UAE, and at King Saud University, KSA, and was a visiting professor at Delft University of Technology in the Netherlands. He is a Senior Member of the IEEE.
A. K. MD. Ehsanes Saleh (PhD, University of Western Ontario) is a Professor Emeritus and Distinguished Professor in the School of Mathematics and Statistics, Carleton University, Ottawa, Canada. He also taught at Simon Fraser University, the University of Toronto, and Stanford University. He is a Fellow of the IMS and the ASA and an Honorary Member of the SSC, Canada.
Preface
Outliers are part of almost every real-world dataset. They can occur naturally as part of the characteristics of the data being collected. They can also be due to statistical noise in the environment that might be unavoidable. More commonly, they are associated with measurement or instrumentation error. Another source is human error, such as typographical mistakes or misreading the measurements of a device. Extreme outliers are often referred to as anomalies, and the legitimate data points are sometimes called inliers to distinguish them from outliers. While outliers may represent a small portion of the dataset, their impact can be quite significant.
The machine learning and data science techniques in use today largely ignore outliers and their potentially harmful effects. For many, outliers are somewhat of a nuisance during model building and prediction. They are hard to detect in both regression and classification problems. Therefore, it is easier to ignore them and hope for the best. Alternatively, various ad hoc techniques are used to remove them from the dataset even at the risk of inadvertently removing valuable inlier data in the process. But we have reached a point in data science where these approaches are no longer viable. In fact, new methods have emerged recently with great potential to properly address outliers and they should be investigated thoroughly.
The cost of ignoring this under‐reported and often overlooked aspect of data science can be significant. In particular, outliers and anomalies in datasets may lead to inaccurate models that result in making bad business decisions, producing questionable explanations of cause‐and‐effect, arriving at the wrong conclusions, or making incorrect medical diagnoses, just to name a few. A prediction is only as good as the model on which it is based, and if the model is faulty, so goes the prediction. Even one outlier can render a model unusable if it happens to be in the wrong location. Machine learning practitioners have not yet fully embraced a class of robust techniques that would provide more reliable models and more accurate predictions than is possible with present‐day methods. Robust methods are better‐suited to data science, especially when outliers are present. The overall goal of this book is to provide the rationale and techniques for robust machine learning and then build on that material toward robust data science.
This book is a comprehensive study of outliers in datasets and how to deal with them in machine learning. It evaluates the robustness of existing methods such as linear regression using least squares and Huber's method, and binary classification using the cross-entropy loss for logistic regression and neural networks, as well as other popular methods including k-nearest neighbors, support vector machines, and random forests. It provides a number of new approaches using the log-cosh loss, which is very important in robust machine learning. Furthermore, techniques that surgically remove outliers from datasets for both regression and classification problems are presented. The book is about the pursuit of methods and procedures that recognize the adverse effects that outliers can have on the models built by machine learning tools. It is intended to move the field toward robust data science, where the proper tools and methodologies are used to handle outliers. It introduces a number of new ideas and approaches to the theory and practice of robust machine learning and encourages readers to pursue further investigation in this field.
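As a small illustration of the contrast between least squares and Huber's method, the following sketch fits both to synthetic data containing one extreme outlier. It uses scikit-learn's `LinearRegression` and `HuberRegressor` with made-up data, not the book's own code, so treat the setup (true slope 2.0, one corrupted point) as an assumption for demonstration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

# Synthetic line y = 2x + 1 with mild Gaussian noise (assumed example data).
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(0, 0.3, 50)
y[-1] = 100.0  # corrupt one point: a single extreme outlier

ols = LinearRegression().fit(X, y)          # least squares: unbounded influence
huber = HuberRegressor(epsilon=1.35).fit(X, y)  # Huber: outlier is downweighted

# Huber's slope stays near the true value 2.0; the OLS slope is dragged upward.
print(ols.coef_[0], huber.coef_[0])
```

The single corrupted point shifts the least-squares slope noticeably, while Huber's bounded loss keeps its estimate close to the true slope.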
This book offers an interdisciplinary perspective on robust machine learning. The prerequisites are some familiarity with probability and statistics, as well as the basics of machine learning and data science. All three areas are covered in equal measure. For those who are new to the field and are looking to understand key concepts, we do provide the necessary introductory and tutorial material in each subject area at the beginning of each chapter. Readers with an undergraduate‐level knowledge of the subject matter will benefit greatly from this book.
You may have heard the phrase “regression to the mean.” In this book, we discuss “regression to the median.” The methods currently in use target the mean of the data to estimate the model parameters. However, the median is a better target because it is more stable in the presence of outliers. It is important to recognize that data science should be conducted using methods that are reliable and stable, which is what the median-based approach can offer. There are good reasons why we frequently hear phrases like “the median house price” or “the median household income”: the median holds the key to building outlier-tolerant models. Furthermore, robust methods offer stability and accuracy with or without outliers in the dataset.
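The stability of the median versus the mean is easy to demonstrate numerically. The sketch below (illustrative house prices, not data from the book) shows how a single anomalous value drags the mean far off while the median barely moves:

```python
import numpy as np

# Hypothetical house prices in $1000s; one extreme sale appended as an outlier.
prices = np.array([250, 260, 270, 280, 290, 300, 310])
with_outlier = np.append(prices, 5000)

print(np.mean(prices), np.median(prices))              # 280.0 280.0
print(np.mean(with_outlier), np.median(with_outlier))  # 870.0 285.0
```

One outlier triples the mean but shifts the median by less than 2%, which is exactly the stability that median-based estimation brings to model building.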
We use the term “robust machine learning” as many of the techniques originate in the field of robust statistics. The term “robust” may seem somewhat unusual and confusing to some, but it is a well‐established term in the field of statistics. It was coined in the 1950s and has been used ever since. Note that the term robust machine learning has been used in other contexts in the literature, but here we specifically refer to “outlier‐tolerant” methods.
One may wonder why robust methods have not already been incorporated into machine learning tools. This is in part due to the long history of non-robust estimation methods in statistics and their natural migration to the machine learning community over the past two decades. Attempts to use the L1 loss (which is robust) were not successful in the past, whereas the L2 loss (which is not robust) was much easier to understand and implement. The L2 loss is strongly tied to the Gaussian distribution, which made it even more compelling, especially in terms of the maximum likelihood procedure. The same can be said of the cross-entropy loss used in binary classification. Most practitioners today still employ least squares and cross-entropy methods, neither of which is robust in the presence of outliers. We will show that the log-cosh loss is robust, that it can be derived using maximum likelihood principles, and that it inherits all the nice properties required of a loss function for use in machine learning. This removes all the past reasons for not using robust methods.
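A minimal numerical sketch of why the log-cosh loss is robust (this is a generic implementation of the standard definition, not code from the book): for small residuals it behaves like the quadratic L2 loss, for large residuals it grows only linearly like the L1 loss, and its derivative tanh(r) is bounded, so no single outlier can exert unbounded pull on the fit.

```python
import numpy as np

def log_cosh_loss(r):
    """Log-cosh loss: ~ r^2/2 for small residuals, ~ |r| - log(2) for large ones."""
    # log(cosh r) = log(e^r + e^-r) - log 2; logaddexp avoids overflow for large |r|
    return np.logaddexp(r, -r) - np.log(2.0)

def log_cosh_grad(r):
    """Derivative is tanh(r), bounded in (-1, 1): outliers have limited influence."""
    return np.tanh(r)

small, large = 0.1, 100.0
print(log_cosh_loss(small), small**2 / 2)             # quadratic regime: nearly equal
print(log_cosh_loss(large), large - np.log(2.0))      # linear regime: nearly equal
print(log_cosh_grad(large))                           # ~1.0, vs. the L2 gradient of 100
```

Contrast this with the L2 loss, whose gradient equals the residual itself: an outlier with residual 100 pulls 100 times harder than an inlier with residual 1.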
The approach taken in this book regarding outliers is to show how to robustify existing methods and apply them to data science problems. It revisits a number of key machine learning tasks such as clustering, linear regression, logistic regression, and neural networks, and describes how they can all be robustified. It also covers the use of penalty estimators in the context of robust methods. In particular, the ridge, LASSO, and adaptive LASSO (aLASSO) methods are described and evaluated in terms of their ability to mitigate the effects of outliers. In addition, some very interesting approaches are described for variable ordering using aLASSO.
Outlier detection for regression and classification problems is addressed in detail. Previous approaches have not been able to perform this function without removing valuable data along with the outliers; the methods are essentially ad hoc in nature. In this book, practical solutions are provided using robust techniques. Of note are an iterative boxplot method for linear regression and a histogram-based method for classification problems. Anomaly detection is another form of outlier detection where the outliers lie at extreme locations and represent unusual and unexpected occurrences in the dataset. Identifying such anomalies is very important in detecting suspicious activity such as bank fraud, email spam, and network intrusion. The techniques described in this book include k-nearest neighbors (k-NN), DBSCAN, and Isolation Forest, as they are popular techniques in this category. Also included is a new method based on robust statistics and k-medians clustering called MADmax, which is shown to provide better results than current methods.
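To give a flavor of robust-statistics-based detection (this is the generic median/MAD rule, a building block of such methods, not the book's MADmax algorithm itself), the sketch below flags points whose robust z-score, computed from the median and the median absolute deviation rather than the mean and standard deviation, exceeds a threshold:

```python
import numpy as np

def mad_outliers(x, threshold=3.5):
    """Flag points whose robust z-score exceeds the threshold.

    Uses the median and the median absolute deviation (MAD), both of which
    tolerate outliers, instead of the mean and standard deviation.
    """
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    # 0.6745 makes MAD consistent with the std deviation for Gaussian data
    robust_z = 0.6745 * (x - med) / mad
    return np.abs(robust_z) > threshold

data = np.array([10.1, 9.8, 10.0, 10.2, 9.9, 10.1, 55.0])
print(mad_outliers(data))  # only the extreme last point is flagged
```

Because the median and MAD are themselves unaffected by the extreme point, the rule flags the anomaly without the masking problems that mean/standard-deviation rules suffer from.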
We wanted to write a book suitable for senior-level (fourth-year) undergraduate and first- or second-year graduate students that is also useful as a stand-alone guide for researchers and practitioners in the field. As a result, it is equal parts theory and practice. Detailed derivations, theoretical support for the methods, and a substantial amount of “know-how” and experience are part of every chapter, with code segments that can be executed by the reader to improve understanding. We found that viewing existing methods through the lens of outliers leads to a deeper understanding of how current methods work and why they may fail. In this sense, some new knowledge will be gained in every chapter.
The programming code provided in this book is based on Python, the workhorse language of the machine learning community, for which many libraries and utilities are available. We introduce code segments for all of the regression and classification techniques, as well as code for outlier removal, in the form of projects at the end of each chapter. The reader would be well served to follow along with the descriptions in the book while implementing the code in Python wherever possible. This is the best way to get the most out of this book.
The book is spread over 12 chapters. Chapter 1 begins with an introduction to...
| Publication date (per publisher) | 14 April 2025 |
|---|---|
| Language | English |
| Subject area | Mathematics / Computer Science ► Mathematics ► Statistics |
| Mathematics / Computer Science ► Mathematics ► Probability / Combinatorics | |
| ISBN-10 | 1-394-29438-7 / 1394294387 |
| ISBN-13 | 978-1-394-29438-1 / 9781394294381 |
Copy protection: Adobe DRM
Adobe DRM is a copy-protection scheme intended to protect the eBook from misuse. The eBook is authorized to your personal Adobe ID at download time, and you can then read it only on devices that are also registered to your Adobe ID.
Details on Adobe DRM
File format: EPUB (Electronic Publication)
EPUB is an open standard for eBooks and is particularly well suited to fiction and non-fiction. The text reflows dynamically to fit the display and font size, which also makes EPUB a good choice for mobile reading devices.
System requirements:
PC/Mac: You can read this eBook on a PC or Mac.
eReader: This eBook can be read on (almost) all eBook readers, but it is not compatible with the Amazon Kindle.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook.
Device list and additional notes
Buying eBooks from abroad
For tax law reasons we can only sell eBooks within Germany and Switzerland. Regrettably, we cannot fulfill eBook orders from other countries.