Multivariate Density Estimation (eBook)
John Wiley & Sons (Publisher)
978-1-118-57553-6 (ISBN)
Clarifies modern data analysis through nonparametric density estimation for a complete working knowledge of the theory and methods
Featuring a thoroughly revised presentation, Multivariate Density Estimation: Theory, Practice, and Visualization, Second Edition maintains an intuitive approach to the underlying methodology and supporting theory of density estimation. Including new material and updated research in each chapter, the Second Edition presents additional clarification of theoretical opportunities, new algorithms, and up-to-date coverage of the unique challenges presented in the field of data analysis.
The new edition focuses on the various density estimation techniques and methods that can be used in the field of big data. Defining optimal nonparametric estimators, the Second Edition demonstrates the density estimation tools to use when dealing with various multivariate structures in univariate, bivariate, trivariate, and quadrivariate data analysis. Continuing to illustrate the major concepts in the context of the classical histogram, Multivariate Density Estimation: Theory, Practice, and Visualization, Second Edition also features:
- Over 150 updated figures to clarify theoretical results and to show analyses of real data sets
- An updated presentation of graphic visualization using computer software such as R
- A clear discussion of selections of important research during the past decade, including mixture estimation, robust parametric modeling algorithms, and clustering
- More than 130 problems to help readers reinforce the main concepts and ideas presented
- Boxed theorems and results allowing easy identification of crucial ideas
- Figures in color in the digital versions of the book
- A website with related data sets
Multivariate Density Estimation: Theory, Practice, and Visualization, Second Edition is an ideal reference for theoretical and applied statisticians and practicing engineers, as well as for readers interested in the theoretical aspects of nonparametric estimation and the application of these methods to multivariate data. The Second Edition is also useful as a textbook for introductory courses in kernel statistics, smoothing, advanced computational statistics, and general forms of statistical distributions.
David W. Scott, PhD, is Noah Harding Professor in the Department of Statistics at Rice University. The author of over 100 published articles, papers, and book chapters, Dr. Scott is also Fellow of the American Statistical Association (ASA) and the Institute of Mathematical Statistics. He is recipient of the ASA Founder's Award and the Army Wilks Award. His research interests include computational statistics, data visualization, and density estimation. Dr. Scott is also coeditor of Wiley Interdisciplinary Reviews: Computational Statistics and previous Editor of the Journal of Computational and Graphical Statistics.
"The book is an ideal reference for theoretical and applied statisticians, practicing engineers, as well as readers interested in the theoretical aspects of nonparametric estimation and the application of these methods to multivariate data. The second edition is also useful as a textbook for introductory courses in kernel statistics, smoothing, advanced computational statistics, and general forms of statistical distributions." (Zentralblatt MATH, 1 June 2015)
1 REPRESENTATION AND GEOMETRY OF MULTIVARIATE DATA
A complete analysis of multidimensional data requires the application of an array of statistical tools—parametric, nonparametric, and graphical. Parametric analysis is the most powerful. Nonparametric analysis is the most flexible. And graphical analysis provides the vehicle for discovering the unexpected.
This chapter introduces some graphical tools for visualizing structure in multidimensional data. One set of tools focuses on depicting the data points themselves, while another relies on displaying functions estimated from those points. Visualization and contouring of functions in more than two dimensions are introduced. Some mathematical aspects of the geometry of higher dimensions are reviewed. These results have consequences for nonparametric data analysis.
1.1 INTRODUCTION
Classical linear multivariate statistical models rely primarily on analysis of the covariance matrix. So powerful are these techniques that analysis is almost routine for datasets with hundreds of variables. While the theoretical basis of parametric models lies with the multivariate normal density, these models are applied in practice to many kinds of data. Parametric studies provide neat inferential summaries and parsimonious representation of the data.
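As a concrete aside, that entire second-order summary reduces to a mean vector and a covariance matrix. The sketch below in base R (the iris measurements and the dmvn helper are illustrative choices, not the book's) shows how those two statistics determine a fitted multivariate normal density:

```r
# A minimal sketch (not from the book): the sample mean vector and
# covariance matrix completely determine a fitted multivariate normal.
# Base R only; the built-in iris measurements serve as stand-in data.
X  <- as.matrix(iris[, 1:4])   # n x d data matrix
mu <- colMeans(X)              # sample mean vector
S  <- cov(X)                   # sample covariance matrix

# Illustrative helper: density of the fitted multivariate normal at x
dmvn <- function(x, mu, S) {
  d <- length(mu)
  q <- drop(t(x - mu) %*% solve(S) %*% (x - mu))  # squared Mahalanobis distance
  exp(-q / 2) / sqrt((2 * pi)^d * det(S))
}
dmvn(mu, mu, S)  # density at the sample mean, the mode of the fit
```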
For many problems second-order information is inadequate. Advanced modeling or simple variable transformations may provide a solution. When no simple parametric model is forthcoming, many researchers have opted for fully “unparametric” methods that may be loosely collected under the heading of exploratory data analysis. Such analyses are highly graphical; but in a complex non-normal setting, a graph may provide a more concise representation than a parametric model, because a parametric model of adequate complexity may involve hundreds of parameters.
There are some significant differences between parametric and nonparametric modeling. The focus on optimality in parametric modeling does not translate well to the nonparametric world. For example, the histogram might be proved to be an inadmissible estimator, but that theoretical fact should not be taken to suggest histograms should not be used. Quite to the contrary, some methods that are theoretically superior are almost never used in practice. The reason is that the ordering of algorithms is not absolute, but is dependent not only on the unknown density but also on the sample size. Thus the histogram is generally superior for small samples regardless of its asymptotic properties. The exploratory school is at the other extreme, rejecting probabilistic models, whose existence provides the framework for defining optimality.
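The sample-size point is easy to see directly in R. In the hedged sketch below (sample size, binning, and bandwidth are arbitrary defaults, not recommendations from the book), a histogram of thirty observations tells much the same story as a theoretically superior kernel estimate:

```r
# On a small sample, the histogram is a serious competitor to a
# kernel density estimate; n, bins, and bandwidth are arbitrary here.
set.seed(1)
x <- rnorm(30)                        # small sample from a standard normal
hist(x, freq = FALSE,
     main = "Histogram vs. kernel estimate, n = 30")
lines(density(x), lwd = 2)            # kernel estimate, default bandwidth
curve(dnorm(x), add = TRUE, lty = 2)  # true density, for reference
```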
In this book, an intermediate point of view is adopted regarding statistical efficacy. No nonparametric estimate is considered wrong; only different components of the solution are emphasized. Much effort will be devoted to the data-based calibration problem, but nonparametric estimates can be reasonably calibrated in practice without too much difficulty. The “curse of optimality” might suggest that this is an illogical point of view. However, if the notion that optimality is all-important is adopted, then the focus becomes matching the theoretical properties of an estimator to the assumed properties of the density function. Is it a gross inefficiency to use a procedure that requires only two continuous derivatives when the curve in fact has six continuous derivatives? This attitude may have some formal basis but should be discouraged as too heavy-handed for nonparametric thinking. A more relaxed attitude is required. Furthermore, many “optimal” nonparametric procedures are unstable in a manner that slightly inefficient procedures are not. In practice, when faced with the application of a procedure that requires six derivatives, or some other assumption that cannot be verified in practice, it is more important to be able to recognize the signs of estimator failure than to worry too much about assumptions. Detecting failure at the level of a discontinuous fourth derivative is a bit extreme, but certainly the effects of simple discontinuities should be well understood. Thus the best assumptions are given only for purposes of illustration.
The notions of efficiency and admissibility are related to the choice of a criterion, which can only imperfectly measure the quality of a nonparametric estimate. Unlike optimal parametric estimates that are useful for many purposes, nonparametric estimates must be optimized for each application. The extra work is justified by the extra flexibility. As the choice of criterion is imperfect, so then is the notion of a single optimal estimator. This attitude reflects not sloppy thinking, but rather the imperfect relationship between the practical and theoretical aspects of our methods. Too rigid a point of view leads one to a minimax view of the world where nonparametric methods should be abandoned because there exist difficult problems.
Visualization is an important component of nonparametric data analysis. Data visualization is the focus of exploratory methods, ranging from simple scatterplots to sophisticated dynamic interactive displays. Function visualization is a significant component of nonparametric function estimation, and can draw on the relevant literature in the fields of scientific visualization and computer graphics. The focus of multivariate data analysis on points and scatterplots has meant that the full impact of scientific visualization has not yet been realized. With the new emphasis on smooth functions estimated nonparametrically, the fruits of visualization will be attained. Banchoff (1986) has been a pioneer in the visualization of higher dimensional mathematical surfaces. Curiously, the surfaces of interest to mathematicians contain singularities and discontinuities, all producing striking pictures when projected to the plane. In statistics, visualization of the smooth density surface in four, five, and six dimensions cannot rely on projection, as projections of smooth surfaces to the plane show nothing. Instead, the emphasis is on contouring in three dimensions and slicing of surfaces beyond. The focus on three and four dimensions is natural because one and two are so well understood. Beyond four dimensions, the ability to explore surfaces carefully decreases rapidly due to the curse of dimensionality. Fortunately, statistical data seldom display structure in more than five dimensions, so guided projection to those dimensions may be adequate. It is these threshold dimensions from three to five that are, and deserve to be, the focus of our visualization efforts.
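As a small, hedged example of contour-based function visualization (the dataset and grid size below are illustrative, not the book's examples), a bivariate kernel density estimate can be drawn as level curves over the raw scatterplot:

```r
# Level curves of a bivariate kernel density estimate over the data.
# kde2d() comes from the MASS package shipped with R; the geyser data
# and the grid size are illustrative choices.
library(MASS)
dens <- kde2d(geyser$duration, geyser$waiting, n = 50)
plot(geyser$duration, geyser$waiting, pch = 20, col = "grey",
     xlab = "duration (min)", ylab = "waiting (min)")
contour(dens, add = TRUE)  # level curves of the estimated density surface
```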
There is a natural flow among the parametric, exploratory, and nonparametric procedures that represents a rational approach to statistical data analysis. Begin with a fully exploratory point of view in order to obtain an overview of the data. If a probabilistic structure is present, estimate that structure nonparametrically and explore it visually. Finally, if a linear model appears adequate, adopt a fully parametric approach. Each step conceptually represents a willingness to more strongly smooth the raw data, finally reducing the dimension of the solution to a handful of interesting parameters. With the assumption of normality, the mind's eye can easily imagine the d-dimensional egg-shaped elliptical data clusters. Some statisticians may prefer to work in the reverse order, progressing to exploratory methodology as a diagnostic tool for evaluating the adequacy of a parametric model fit.
There are many excellent references that complement and expand on this subject. In exploratory data analysis, references include Tukey (1977), Tukey and Tukey (1981), Cleveland and McGill (1988), and Wang (1978).
In density estimation, the classic texts of Tapia and Thompson (1978), Wertz (1978), and Thompson and Tapia (1990) first indicated the power of the nonparametric approach for univariate and bivariate data. Silverman (1986) has provided a further look at applications in this setting. Prakasa Rao (1983) has provided a theoretical survey with a lengthy bibliography. Other texts are more specialized, some focusing on regression (Müller, 1988; Härdle, 1990), some on a specific error criterion (Devroye and Györfi, 1985; Devroye, 1987), and some on particular solution classes such as splines (Eubank, 1988; Wahba, 1990). A discussion of additive models may be found in Hastie and Tibshirani (1990).
1.2 HISTORICAL PERSPECTIVE
One of the roots of modern statistical thought can be traced to the empirical discovery of correlation by Galton in 1886 (Stigler, 1986). Galton's ideas quickly reached Karl Pearson. Although best remembered for his methodological contributions such as goodness-of-fit tests, frequency curves, and biometry, Pearson was a strong proponent of the geometrical representation of statistics. In a series of lectures a century ago in November 1891 at Gresham College in London, Pearson spoke on a wide-ranging set of topics (Pearson, 1938). He discussed the foundations of the science of pure statistics and its many divisions. He discussed the collection of observations. He described the classification and representation of data using both numerical and geometrical descriptors. Finally, he emphasized statistical methodology and discovery of statistical laws. The syllabus for his lecture of November 11, 1891, includes this cryptic note:
Erroneous opinion that Geometry is only a means of popular representation: it is a fundamental method of investigating and analysing statistical...
| Publication date (per publisher) | 12 March 2015 |
|---|---|
| Series | Wiley Series in Probability and Statistics |
| Language | English |
| Subject areas | Mathematics / Computer Science ► Mathematics ► Statistics |
| | Mathematics / Computer Science ► Mathematics ► Probability / Combinatorics |
| | Technology |
| Keywords | Big Data • Clustering • Computational & Graphical Statistics • Computational Statistics • Computer Graphics • Data Analysis • Density Estimation • density estimators • Engineering statistics • frequency polygons • histogram • kernel theory • mixture estimation • multivariate analysis • multivariate data • nonparametric estimation • quadrivariate data • robust parametric modeling algorithms • smoothing • Statistical Theory • Statistics • trivariate data • univariate data • Visualization |
| ISBN-10 | 1-118-57553-9 / 1118575539 |
| ISBN-13 | 978-1-118-57553-6 / 9781118575536 |
Copy protection: Adobe DRM
Adobe DRM is a copy-protection scheme intended to protect the eBook against misuse. The eBook is authorized to your personal Adobe ID when it is downloaded; you can then read it only on devices that are also registered to your Adobe ID.
File format: EPUB (Electronic Publication)
EPUB is an open standard for eBooks and is particularly well suited to fiction and general non-fiction. The flowing text adapts dynamically to the display and font size, which also makes EPUB a good fit for mobile reading devices.
System requirements:
PC/Mac: This eBook can be read on a PC or Mac.
eReader: This eBook can be read on (almost) all eBook readers; it is not, however, compatible with the Amazon Kindle.
Smartphone/Tablet: This eBook can be read on both Apple and Android devices.