Ensemble Classification Methods with Applications in R (eBook)
John Wiley & Sons (Verlag)
978-1-119-42155-9 (ISBN)
An essential guide to two burgeoning topics in machine learning: classification trees and ensemble learning
Ensemble Classification Methods with Applications in R introduces the concepts and principles of ensemble classifier methods and includes a review of the most commonly used techniques. This important resource shows how ensemble classification has become a natural extension of individual classifiers. The text puts the emphasis on two areas of machine learning: classification trees and ensemble learning. The authors explore the basic characteristics of ensemble classification methods and explain the types of problems that can emerge in their application.
Written by a team of noted experts in the field, the text is divided into two main sections. The first section outlines the theoretical underpinnings of the topic and the second section is designed to include examples of practical applications. The book contains a wealth of illustrative cases of business failure prediction, zoology, ecology and others. This vital guide:
- Offers an important text that has been tested both in the classroom and at tutorials at conferences
- Contains authoritative information written by leading experts in the field
- Presents a comprehensive text that can be applied to courses in machine learning, data mining and artificial intelligence
- Combines in one volume two of the most intriguing topics in machine learning: ensemble learning and classification trees
Written for researchers from many fields such as biostatistics, economics, environment, zoology, as well as students of data mining and machine learning, Ensemble Classification Methods with Applications in R puts the focus on two topics in machine learning: classification trees and ensemble learning.
ESTEBAN ALFARO, MATÍAS GÁMEZ AND NOELIA GARCÍA are Associate Professors at the Applied Economics Department (Statistics), Faculty of Economics and Business of Albacete, and researchers at the Regional Development Institute (IDR), University of Castilla-La Mancha. Together they have published several papers in prestigious journals on topics such as applications of ensemble trees to corporate bankruptcy, credit scoring and statistical quality control with the most notable in Journal of Statistical Software, Vol 54.
1
Introduction
Esteban Alfaro, Matías Gámez, and Noelia García
1.1 Introduction
Classification as a statistical task is present in a wide range of real‐life contexts, as diverse as the mechanical sorting of letters based on automatic reading of postal codes, decisions on credit applications from individuals, or the preliminary diagnosis of a patient's condition to enable immediate treatment while waiting for the final test results.
In its most general form, the term classification can cover any context in which a decision is taken or a prediction is made based on the information available at that time, and a classification procedure is, then, a formal method to repeat the arguments that led to that decision for new situations.
This work focuses on a more specific interpretation. The problem is to build a procedure that will be applied to a set of cases in which each new case has to be assigned to one of a set of predefined classes or subpopulations on the basis of observed characteristics or attributes.
The construction of a classification system from a set of data for which the actual classes are known has been given several names, such as pattern recognition, discriminant analysis, or supervised learning. The last name distinguishes it from unsupervised learning, or clustering, in which the classes are not defined a priori but are inferred from the data. This work focuses on the first type of classification task.
1.2 Definition
The most traditional statistical technique applied to supervised classification is linear discriminant analysis, but in recent decades a wider set of new methods has been developed, partly thanks to the growth in computing capabilities. Generally, the performance of a classification procedure is analysed in terms of its accuracy, that is, the percentage of correctly classified cases. The existence of a correct classification implies the existence of an expert or supervisor capable of providing it, so why would we want to replace this exact system with an approximation? Among the reasons for this replacement we could mention:
- Speed. Automatic procedures are usually quick and they can help to save time. For instance, automatic readers of postal codes are able to read most letters, leaving only some very complex cases to human experts.
- Objectivity. Important decisions have to be taken based on objective criteria, under the same conditions for all cases. Objectivity is sometimes difficult to ensure with human decision makers, whose decisions can be affected by external factors, leading to biased outcomes.
- Explanatory capabilities. Some classification methods allow us not only to classify observations but also to explain the reasons for the decision in terms of a set of statistical features.
- Economy. Having an expert who makes decisions can be much more expensive than developing an effective classification system from accumulated experience, so that it can be applied by anyone, not necessarily an expert on the subject, following the guidelines given by the classifier.
1.3 Taxonomy of Supervised Classification Methods
There is no single taxonomy of classification methods; rather, several can be found depending on the criterion of division, for example between parametric and non‐parametric methods, or between methods that attempt to estimate probability densities, posterior probabilities, or just decision borders. If we consider the first criterion, classification methods can be divided into:
- Parametric methods. These methods are based on the assumption of knowing the shape of the underlying density functions, generally the normal distribution. Then the problem is the parameter estimation, which is performed either by maximizing the likelihood or through Bayesian methods. Such methods include Fisher linear discriminant analysis, multiple discriminant analysis, quadratic discriminant, and the expectation–maximization algorithm, among others.
- Non‐parametric methods. These methods do not require any hypothesis about the underlying density functions, so they are appropriate when the probability distribution of the data is unknown. These methods include Parzen window estimation, K‐nearest neighbours, classification trees, and artificial neural networks.
On the other hand, Lippmann (1991) recognizes five basic types of classifiers:
- Probabilistic methods. These are functional and parametric techniques and therefore indicated when the functional form fits well with the actual distribution of data and there is a sufficient number of examples to estimate parameters. As examples we can point to Gaussian or linear discriminant classifiers based on mixtures of normal distributions.
- Global methods. These are methods that build the discriminant function by internal nodes using sigmoid or polynomial functions that have high non‐zero responses over a large part of the input space. These methods include multilayer perceptron, Boltzmann machines, and high‐order polynomial networks.
- Local methods. Unlike the previous methods, these techniques build the discriminant function using nodes having nonzero responses only on localized regions of the input space. Examples of such methods are radial basis functions networks and the Kernel discriminant. The advantage of these methods is that they do not require assumptions about the underlying distributions.
- Nearest neighbour. These methods are based on the distance between a new element and the set of stored elements. Among the best‐known techniques are learning vector quantization (LVQ) and K‐nearest neighbours. These are non‐parametric methods but they require a lot of computing time.
- Rule‐based methods. These methods divide the input space into labelled regions through rules or logical thresholds. These techniques include classification trees.
The first three types of methods provide continuous outputs that can estimate either the likelihood or Bayes posterior probabilities, while the last two blocks provide binary outputs. Because of this difference, the first methods respond to a strategy of minimizing a cost function such as the sum of squared errors or entropy, while the second block will aim to minimize the number of misclassified items.
In this work we will focus on the last type of classifiers, to be used as base classifiers in the ensembles. Therefore, the accuracy (error) of the classification system will be measured by the percentage of successes (failures) in the classified elements.
1.4 Estimation of the Accuracy of a Classification System
In the development of a classification system three stages can be set up: selection, training, and validation. In the first stage, both the technique and the set of potential features must be selected. Once the first stage has been completed, it is time to start the learning process through a set of training examples. In order to check the performance of the trained classifier, that is to say its ability to classify new observations in the correct way, the accuracy has to be estimated.
Once the classifier has been validated, the system will be ready to be used. Otherwise, it will be necessary to return to the selection or the training stages, for example modifying the number and type of attributes, the number of rules and/or conjunctions, etc. or even looking for another more appropriate classification method.
To measure the goodness of fit of a classifier the error rate can be used. The true error rate of a classifier is defined as the error percentage when the system is tested on the distribution of cases in the population. This error rate can be empirically approximated by a test set consisting of a large number of new cases collected independently of the examples used to train the classifier. The error rate is defined as the ratio between the number of mistakes and the number of classified cases.
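The ratio just described can be sketched in a few lines of R. The label vectors below are invented purely for illustration and do not come from the book's case studies.

```r
# Minimal sketch: empirical error rate of a classifier on a test set,
# i.e. the number of mistakes divided by the number of classified cases.
true_labels <- c("A", "A", "B", "B", "A", "B")   # known classes (illustrative)
predictions <- c("A", "B", "B", "B", "A", "A")   # classifier output (illustrative)

# A logical comparison yields TRUE for each mistake; its mean is the error rate.
error_rate <- mean(predictions != true_labels)
error_rate  # 2 mistakes out of 6 cases
```

The complementary quantity, `mean(predictions == true_labels)`, is the accuracy used throughout the text.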
For the sake of simplicity, all errors are assumed to have the same importance, although this might not be true in a real case.
The true error rate could be computed only if the number of examples tended to infinity. In a real case, however, the number of available examples is always finite and often relatively small. Therefore, the true error rate has to be estimated from the error rate calculated on a small sample, or by means of statistical sampling techniques (random resampling, bootstrap, etc.). The estimate will usually be biased, and this bias has to be analysed in order to detect non‐random errors. Its variance is important too, since the greatest possible stability is sought.
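As a hedged sketch of the bootstrap idea mentioned above, the following R code repeatedly resamples the training data and evaluates a classification tree on the cases left out of each resample. The use of `rpart` trees and the built-in `iris` data is an assumption for illustration only, not the book's own example.

```r
# Bootstrap (out-of-bag) estimation of the true error rate.
library(rpart)  # recursive partitioning trees, shipped with R

set.seed(1)
B <- 25                      # number of bootstrap replicates
n <- nrow(iris)
errors <- numeric(B)

for (b in 1:B) {
  idx <- sample(n, n, replace = TRUE)   # bootstrap sample: training cases
  oob <- setdiff(1:n, unique(idx))      # out-of-bag cases: independent test set
  fit <- rpart(Species ~ ., data = iris[idx, ])
  pred <- predict(fit, iris[oob, ], type = "class")
  errors[b] <- mean(pred != iris$Species[oob])  # error rate on unseen cases
}

mean(errors)   # averaged over replicates: the bootstrap error estimate
sd(errors)     # its spread, relevant to the stability discussed above
```

Because each tree is tested only on cases it never saw during training, the averaged error rate is a far less optimistic estimate than the training-set error discussed in the next subsection.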
1.4.1 The Apparent Error Rate
The apparent error rate of a classifier is the error rate calculated from the examples of the training set. If the training set is unlimited, the rate of apparent error will coincide with the true error rate, but as already noted, this does not happen in the real world and, in general, samples of limited size will have to be used to build and evaluate a classifier system.
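The gap between the apparent and the true error rate can be made concrete with a small simulation. The setup below (a deliberately overgrown `rpart` tree on noisy synthetic data) is an illustrative assumption, not an example from the book.

```r
# Sketch: the apparent (training-set) error rate underestimates the
# error rate measured on an independent test set.
library(rpart)

set.seed(2)
n <- 200
x <- matrix(rnorm(n * 5), n, 5)                      # 5 numeric attributes
y <- factor(ifelse(x[, 1] + rnorm(n) > 0, "pos", "neg"))  # noisy class labels
d <- data.frame(y, x)
train <- 1:100
test  <- 101:200

# Grow the tree with no complexity penalty so it over-adjusts to the sample.
fit <- rpart(y ~ ., data = d[train, ],
             control = rpart.control(cp = 0, minsplit = 2))

apparent <- mean(predict(fit, d[train, ], type = "class") != d$y[train])
test_err <- mean(predict(fit, d[test, ],  type = "class") != d$y[test])
c(apparent = apparent, test = test_err)  # apparent error is far lower
```

The over-fitted tree reproduces the training sample almost perfectly, so its apparent error is close to zero, while its error on fresh cases remains substantial.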
Overall, the rate of apparent error will be biased downwards so the apparent error rate will underestimate the true error rate (Efron, 1986). This usually happens when the classifier has been over‐adjusted to the particular characteristics of the sample instead of discovering the underlying structure in the population. This problem results in classifiers with a very low rate of...
| Publication date (per publisher) | 15 August 2018 |
|---|---|
| Language | English |
| Subject | Mathematics / Computer Science ► Mathematics ► Statistics |
| | Mathematics / Computer Science ► Mathematics ► Probability / Combinatorics |
| ISBN-10 | 1-119-42155-1 / 1119421551 |
| ISBN-13 | 978-1-119-42155-9 / 9781119421559 |