
Modeling and Analysis of Compositional Data (eBook)

eBook Download: EPUB
2015
John Wiley & Sons (publisher)
9781119003137 (ISBN)

Modeling and Analysis of Compositional Data - Vera Pawlowsky-Glahn, Juan José Egozcue, Raimon Tolosana-Delgado

Modeling and Analysis of Compositional Data presents a practical and comprehensive introduction to the analysis of compositional data along with numerous examples to illustrate both theory and application of each method. Based upon short courses delivered by the authors, it provides a complete and current compendium of fundamental to advanced methodologies along with exercises at the end of each chapter to improve understanding, as well as data and a solutions manual which is available on an accompanying website.

Complementing Pawlowsky-Glahn's earlier collective text, which provides an overview of the state of the art in this field, Modeling and Analysis of Compositional Data fills a gap in the literature for a much-needed manual for teaching, self-learning, or consulting.



VERA PAWLOWSKY-GLAHN Department of Computer Science, Applied Mathematics, and Statistics, University of Girona, Spain

JUAN JOSÉ EGOZCUE Department of Applied Mathematics III, Technical University of Catalonia, Barcelona, Spain

RAIMON TOLOSANA-DELGADO Helmholtz Institute Freiberg for Resource Technology, Germany


Chapter 1
Introduction


Compositional data describe parts of some whole. They are commonly presented as vectors of proportions, percentages, concentrations, or frequencies. As proportions are expressed as real numbers, one is tempted to interpret, or even analyze, them as real multivariate data. This practice can lead to paradoxes and/or misinterpretations, some of them well known even a century ago, but mostly forgotten and neglected over the years. Some simple examples illustrate the anomalous behavior of proportions when analyzed without taking into account the special characteristics of compositional data.

Example 1.1 (Intervals covering negative proportions)


Daily measurements of an air pollutant are reported as . The given interval of concentration covers a nonsensical range of concentrations that includes negative values. It is probably generated by an average of concentrations which contain some values much higher than . For instance, the following is a set of rounded random percentages: . Their mean is , while their standard deviation is . Thus a typical -interval for the mean value would be an interval covering negative proportions, namely, . A frequent procedure is to cut this interval at zero, but then the question arises on what happens to the probability assigned to the eliminated part of the interval, , and to the probability assigned to the retained part, .
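
The effect is easy to reproduce. The following is a minimal sketch in Python, using made-up, strongly skewed percentages (not the values of the example) and assuming that the interval is taken as the mean plus or minus two sample standard deviations:

```python
import statistics

# Hypothetical, strongly skewed percentages (illustrative only, not the book's data).
percentages = [0.5, 1.0, 1.5, 2.0, 30.0]

mean = statistics.mean(percentages)
sd = statistics.stdev(percentages)  # sample standard deviation (n - 1 denominator)

# Assumed interval: mean plus/minus two standard deviations.
lower, upper = mean - 2 * sd, mean + 2 * sd

print(f"mean = {mean:.2f}, sd = {sd:.2f}")
print(f"interval = ({lower:.2f}, {upper:.2f})")
print("covers negative percentages:", lower < 0)
```

A single large value is enough to inflate the standard deviation so that the lower end of the interval falls below zero, although no percentage can be negative.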

Example 1.2 (Small proportions: Are they important?)


Frequently, when some components or parts of a composition are very small, they are eliminated on the grounds that they are negligible. In such a case, it helps to think about the salt in a soup. Consider a soup that is perfectly seasoned to your taste, and imagine that somebody, thinking it was not yet seasoned, adds the same amount of salt you used. Doubling the amount of salt will probably spoil it completely. To our understanding, this is a perfect example of how important a small proportion can be and why a relative scale gives better information in this case than an absolute one. Sometimes, small proportions are added to other parts, for example, salt to other spices, but that leads to a loss of information, making the recipe insufficiently specified.

Example 1.3 (Reporting changes in proportions)


In the election to the German Bundestag, the German Liberal Party (FDP) obtained of the votes. Eleven years later, in the elections, they obtained a share of . This could be reported as an increment of percentage points. We are more used to reading that FDP increased its proportion of votes a (). In the following election, just 4 years later, the party decreased its votes by a significant , but still half of the increment that occurred between and . Nevertheless, that meant that the FDP was no longer represented in the Bundestag, because its share () dropped below the threshold of required by German electoral law. How can it be that increasing and decreasing gives a negative balance? Perhaps this is a bad way of reporting changes in proportions (data extracted from Wikipedia (2014)).

Reporting increments of shares as differences in percentage points also has disappointing properties, as the relative scale of proportions is ignored. In fact, an increment of percentage points represents a very important change from the result of FDP . It would not be so important if the previous result were, for instance, .
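
The two ways of reporting a change can be compared with a small Python sketch. The shares used here are hypothetical (they are not the FDP figures), and the helper names `point_change` and `relative_change` are introduced only for illustration:

```python
def point_change(old: float, new: float) -> float:
    """Difference in percentage points between two shares given in percent."""
    return new - old


def relative_change(old: float, new: float) -> float:
    """Relative change of a share, in percent of the old value."""
    return 100.0 * (new - old) / old


# Hypothetical vote shares in percent for three elections (not the FDP figures).
first, second, third = 6.0, 15.0, 4.5

up = relative_change(first, second)    # relative increase
down = relative_change(second, third)  # relative decrease (negative)

print(f"points: {point_change(first, second):+.1f}, {point_change(second, third):+.1f}")
print(f"relative: {up:+.0f}%, {down:+.0f}%")

# Relative changes compose by multiplication, not by addition.
net_factor = (1 + up / 100) * (1 + down / 100)
print(f"sum of relative changes = {up + down:+.0f}%, net factor = {net_factor:.2f}")
```

With these numbers, the relative increase exceeds the relative decrease, so their sum is positive, yet the final share is smaller than the initial one, because the second percentage refers to a larger base than the first.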

Example 1.4 (The scale of proportions)


In a given year, the annual proportion of rainy days in a desert region is , and near a mountain range it is . Some years later, these proportions have changed to and , respectively. To summarize the situation, one can assert that the rainy days in both regions have increased by . Such a statement suggests the idea of a homogeneous change in the two different regions, ignoring that the rainy days in the desert have been doubled, while in the mountain range the proportion is almost the same. Using the increment of ratios typical of election results or economic reports, the rainy days would have increased a critical in the desert, and a slightly relevant in the mountains.

Furthermore, if the evolution of the rainy days is analyzed in both regions, it should be guaranteed that equivalent results are obtained if the nonrainy days are analyzed instead. In the desert region, the annual proportion of nonrainy days has changed from to and near the mountain range from to . This means that nonrainy days have decreased, respectively, and , which suggests almost no difference between the mountain and the desert. How can it then be that rainy days change so dramatically in the desert and nonrainy days do not change at all? A proper analysis should ensure that no paradoxical results are obtained when analyzing one type of day and its complement.
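
The asymmetry between a part and its complement can be checked numerically. The sketch below uses hypothetical proportions of rainy days (not the values of the example), chosen so that both regions gain one percentage point:

```python
def relative_change(old: float, new: float) -> float:
    """Relative change in percent of the old value."""
    return 100.0 * (new - old) / old


# Hypothetical proportions of rainy days (before, after); not the book's values.
rainy = {"desert": (0.01, 0.02), "mountains": (0.20, 0.21)}

results = {}
for region, (before, after) in rainy.items():
    results[region] = (
        relative_change(before, after),          # rainy days
        relative_change(1 - before, 1 - after),  # complementary nonrainy days
    )
    print(f"{region}: rainy {results[region][0]:+.1f}%, nonrainy {results[region][1]:+.2f}%")
```

The relative change of the rainy days differs by a factor of about 20 between the two regions, whereas the relative change of the nonrainy days is close to one percent in both, even though both statements describe exactly the same data.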

Example 1.5 (The Simpson's paradox)


The lectures on statistics started very early this morning. Students (men and women) are divided into two classrooms. Some of them arrived on time and some of them were late. Academia was interested in knowing about punctuality according to the gender of the students. Therefore, data were collected this morning during the statistics lectures. The data set is reported in Table 1.1. The paradoxical result is that, for both classrooms, the proportion of women arriving on time is greater than that of men. In contrast, if the individuals of both classrooms are joined in a single population, the proportion of punctual men is larger than that of the women. Paradoxical results of this kind are known as Simpson's paradox (Simpson, 1951; Julious and Mullee, 1994; Zee Ma, 2009). The paradox can be viewed from different points of view. The simplest one, the arithmetic perspective, is to look at the way in which proportions are aggregated: to find the proportion of on-time women in the joint population, the per-classroom proportions , are averaged as , where is the number of on-time women in the classroom and is the corresponding total of women. This kind of average is ill-behaved for proportions, as shown by Simpson's paradox.

Table 1.1 Number of students in two classrooms, arriving on time and being late, classified by gender. Proportions are reported under the number of students. For easy comparison, the larger on-time proportion of men and women in each classroom and in the total is marked with an asterisk

                Classroom 1          Classroom 2          Total
                On time   Late       On time   Late       On time   Late
Men             53        9          12        6          65        15
                0.855     0.145      0.667     0.333      0.813*    0.188
Women           20        2          50        18         70        20
                0.909*    0.091      0.735*    0.265      0.778     0.222

A second point of view is to look at the total proportion of on-time women as a mean value of this proportion in the two classrooms. Each classroom is treated as a sample individual and is taken as the sample mean of the proportions. The paradoxical result suggests that mean values of proportions should be redefined carefully to get consistent results.
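
The reversal can be verified directly from the counts in Table 1.1. The following Python sketch computes the on-time proportions per classroom and for the joint population; the function and variable names are chosen here for illustration only:

```python
# Counts from Table 1.1: (on time, late) per classroom.
counts = {
    "men": {"classroom 1": (53, 9), "classroom 2": (12, 6)},
    "women": {"classroom 1": (20, 2), "classroom 2": (50, 18)},
}


def on_time_proportion(on_time: int, late: int) -> float:
    """Proportion of students arriving on time."""
    return on_time / (on_time + late)


per_classroom = {
    gender: {room: on_time_proportion(*cell) for room, cell in rooms.items()}
    for gender, rooms in counts.items()
}

# Pooling adds the counts first, which weights each classroom by its size.
pooled = {
    gender: on_time_proportion(
        sum(cell[0] for cell in rooms.values()),
        sum(cell[1] for cell in rooms.values()),
    )
    for gender, rooms in counts.items()
}

for room in ("classroom 1", "classroom 2"):
    print(room, {g: round(per_classroom[g][room], 4) for g in counts})
print("pooled", {g: round(p, 4) for g, p in pooled.items()})
```

Women have the larger on-time proportion in each classroom, but most men sit in the punctual classroom 1 and most women in classroom 2, so the size-weighted pooling reverses the order.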

Example 1.6 (Spurious correlation)


The Spanish Government publishes the number of affiliations to the Social Security on a monthly basis, which is classified into the following categories depending on the type of company: agricultural, industrial, construction, and service. The 144 data, corresponding to a monthly series going from to , were downloaded from the corresponding web site (Gobierno de España, 2014). A version, prepared for processing, is available in (www.wiley.com/go/glahn/practical). First, to obtain proportions between the different types of company, the data were normalized to add to in the full composition comprising the four categories. Then, the correlation matrix was computed (see Table 1.2). Next, to analyze the behavior of the companies excluding construction, a subcomposition of three categories was obtained, suppressing the category construction and converting the three-part vector to proportions, so that the three components add up to . Again, the correlation matrix was computed (see Table 1.3). When analyzing correlations in the full composition with four parts and the subcomposition with three parts, the correlation between the proportion of agricultural and industrial companies only changed slightly, actually from to , whereas the correlation between the service companies and either agricultural or industrial companies changed dramatically, from to in the first case and from to in the second. This is a typical effect when analyzing a set of parts adding up to a constant, or a subset of the same parts, closed to any constant.

Table 1.2 Correlation of proportion of affiliations to social security in Spain according to the type of company (four-part composition: agricultural, industrial, construction, and service)

Agricultural Industrial Construction Service
Agricultural 1.0000 0.9201 0.1699
Industrial ...
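
The dependence of correlations on the chosen subcomposition can be explored with simulated data. The sketch below does not use the Spanish affiliation series. It assumes independent, lognormally distributed positive amounts for 144 months, with arbitrary parameters, and the helper functions `closure` and `pearson` are written here only for illustration:

```python
import math
import random


def closure(parts):
    """Rescale positive parts so that they add up to 1."""
    total = sum(parts)
    return [p / total for p in parts]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so that reruns give the same numbers
names = ["agricultural", "industrial", "construction", "service"]
sub_names = ["agricultural", "industrial", "service"]

# Synthetic, independent positive amounts for 144 months (not the Spanish data).
raw = [[rng.lognormvariate(mu, 0.3) for mu in (1.0, 2.0, 2.0, 3.0)] for _ in range(144)]

full = [closure(row) for row in raw]               # four-part composition
sub = [closure([r[0], r[1], r[3]]) for r in raw]   # construction removed, then reclosed

# Correlate the same pair of parts in the full composition and in the subcomposition.
for a, b in [("agricultural", "service"), ("industrial", "service")]:
    i, j = names.index(a), names.index(b)
    k, m = sub_names.index(a), sub_names.index(b)
    r_full = pearson([row[i] for row in full], [row[j] for row in full])
    r_sub = pearson([row[k] for row in sub], [row[m] for row in sub])
    print(f"{a} vs {b}: full {r_full:+.3f}, subcomposition {r_sub:+.3f}")
```

Although the raw amounts are generated independently, the closed parts are correlated, and the correlation for a given pair generally differs between the full composition and the subcomposition. The exact numbers depend on the simulated data and on the assumed parameters.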

Publication date (per publisher) 17.2.2015
Series Statistics in Practice
Language English
Subject areas Mathematics / Computer Science > Computer Science > Databases
Mathematics / Computer Science > Mathematics > Statistics
Mathematics / Computer Science > Mathematics > Probability / Combinatorics
Engineering
Keywords Applied Probability & Statistics • Compositional Data Analysis, compositional data theory and methods • Data Analysis • earth sciences • Environmental Geoscience • Compositional data • Model (math.) • Statistics
ISBN-13 9781119003137