Statistical Hypothesis Testing with SAS and R - Dirk Taeger, Sonja Kuhnt

Statistical Hypothesis Testing with SAS and R (eBook)

Dirk Taeger, Sonja Kuhnt (Authors)

2014
John Wiley & Sons (Publisher)
9781118762608 (ISBN)

This book provides a reference guide to statistical tests and their application to data using SAS and R. A general summary of statistical test theory is presented, along with a general description for each test, together with necessary prerequisites, assumptions, and the formal test problem. The test statistic is stated together with annotations on its distribution, along with examples in both SAS and R. Each example contains the code to perform the test, the output, and remarks that explain necessary program parameters.

A comprehensive guide to statistical hypothesis testing with examples in SAS and R

When analyzing datasets the following questions often arise: Is there a shorthand procedure for a statistical test available in SAS or R? If so, how do I use it? If not, how do I program the test myself? This book answers these questions and provides an overview of the most common statistical test problems in a comprehensive way, making it easy to find and perform an appropriate statistical test. A general summary of statistical test theory is presented, along with a basic description for each test, including the necessary prerequisites, assumptions, the formal test problem and the test statistic. Examples in both SAS and R are provided, along with program code to perform the test, resulting output and remarks explaining the necessary program parameters.

Key features:

- Provides examples in both SAS and R for each test presented.
- Looks at the most common statistical tests, displayed in a clear and easy to follow way.
- Supported by a supplementary website http://www.d-taeger.de featuring example program code.

Academics, practitioners and SAS and R programmers will find this book a valuable resource. Students using SAS and R will also find it an excellent choice for reference and data analysis.

Dirk Taeger, Institute for Prevention and Occupational Medicine of the German Social Accident Insurance, Institute of the Ruhr-Universität Bochum (IPA), Bochum, Germany

Sonja Kuhnt, Department of Computer Science, Dortmund University of Applied Sciences and Arts, Dortmund, Germany

Preface

Part I INTRODUCTION
1 Statistical hypothesis testing
1.1 Theory of statistical hypothesis testing
1.2 Testing statistical hypotheses with SAS and R
1.3 Presentation of the statistical tests
References

Part II NORMAL DISTRIBUTION
2 Tests on the mean
2.1 One-sample tests
2.2 Two-sample tests
References
3 Tests on the variance
3.1 One-sample tests
3.2 Two-sample tests
References

Part III BINOMIAL DISTRIBUTION
4 Tests on proportions
4.1 One-sample tests
4.2 Two-sample tests
4.3 K-sample tests
References

Part IV OTHER DISTRIBUTIONS
5 Poisson distribution
5.1 Tests on the Poisson parameter
References
6 Exponential distribution
6.1 Test on the parameter of an exponential distribution
Reference

Part V CORRELATION
7 Tests on association
7.1 One-sample tests
7.2 Two-sample tests
References

Part VI NONPARAMETRIC TESTS
8 Tests on location
8.1 One-sample tests
8.2 Two-sample tests
8.3 K-sample tests
References
9 Tests on scale difference
9.1 Two-sample tests
References
10 Other tests
10.1 Two-sample tests
References

Part VII GOODNESS-OF-FIT TESTS
11 Tests on normality
11.1 Tests based on the EDF
11.2 Tests not based on the EDF
References
12 Tests on other distributions
12.1 Tests based on the EDF
12.2 Tests not based on the EDF
References

Part VIII TESTS ON RANDOMNESS
13 Tests on randomness
13.1 Run tests
13.2 Successive difference tests
References

Part IX TESTS ON CONTINGENCY TABLES
14 Tests on contingency tables
14.1 Tests on independence and homogeneity
14.2 Tests on agreement and symmetry
14.3 Test on risk measures
References

Part X TESTS ON OUTLIERS
15 Tests on outliers
15.1 Outlier tests for Gaussian null distribution
15.2 Outlier tests for other null distributions
References

Part XI TESTS IN REGRESSION ANALYSIS
16 Tests in regression analysis
16.1 Simple linear regression
16.2 Multiple linear regression
References
17 Tests in variance analysis
17.1 Analysis of variance
17.2 Tests for homogeneity of variances
References

Appendix A Datasets
Appendix B Tables
Glossary
Index

Chapter 1 Statistical hypothesis testing

1.1 Theory of statistical hypothesis testing

Hypothesis testing is a key tool in statistical inference next to point estimation and confidence sets. All three concepts make an inference about a population based on a sample taken from it. Hypothesis testing aims at a decision on whether or not a hypothesis on the nature of the population is supported by the sample.

In the following we briefly run through the steps of a statistical test procedure and introduce the notation used throughout this book. For a detailed mathematical explanation please refer to the book by Lehmann (1997).

We denote a sample of size $n$ by $x_1, \ldots, x_n$, where the $x_i$ are observations of independent and identically distributed random variables $X_i$, $i = 1, \ldots, n$. Usually some further assumptions are needed concerning the nature of the mechanism generating the sample. These can be rather general assumptions like a symmetric continuous distribution. Often a parametric distribution is assumed with only the parameter values unknown, for example, the Gaussian distribution with unknown mean, unknown variance, or both. In this case hypothesis tests deal with statements on the unknown population parameters. We exemplify our general discussion by this situation.

Each of the statistical tests presented in the following chapters is introduced by a verbal description of the type of conjecture to be decided upon, together with the assumptions made. Next the test problem is formalized by the null hypothesis $H_0$ and the alternative hypothesis $H_1$. If a statement on population parameters is of interest, the parameter space $\Theta$ is often partitioned into disjoint sets $\Theta_0$ and $\Theta_1$ with $\Theta = \Theta_0 \cup \Theta_1$, corresponding to $H_0$ and $H_1$, respectively.
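As an illustration of this formalization, a two-sided test problem on the mean of a Gaussian distribution could be written as follows. The symbols $\mu$ and $\mu_0$ are chosen here for the example and are not fixed by the text above, and the variance is assumed known so that the mean is the only unknown parameter.

```latex
\[
H_0\colon \mu = \mu_0
\quad\text{versus}\quad
H_1\colon \mu \neq \mu_0,
\qquad
\Theta_0 = \{\mu_0\},\quad
\Theta_1 = \mathbb{R}\setminus\{\mu_0\},\quad
\Theta = \Theta_0 \cup \Theta_1 = \mathbb{R}.
\]
```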

As the next building block of a statistical test, the test statistic, which is a function of the random sample, is stated. This function fulfills two criteria. First, its value must provide insight on whether or not the null hypothesis might be true. Second, the distribution of the test statistic must be known, given that the null hypothesis is true. Table 1.1 shows the four possible outcomes of a statistical test. In two of the cases the result of the test is a correct decision, namely, a true null hypothesis is not rejected and a false null hypothesis is rejected. If the null hypothesis is true but is rejected as a result of the test, a type I error occurs. In the opposite situation, where $H_1$ is true in nature but the test does not reject the null hypothesis, a type II error occurs.

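Table 1.1 Possible results in statistical testing.

Test decision           $H_0$ true          $H_1$ true
Do not reject $H_0$     Correct decision    Type II error
Reject $H_0$            Type I error        Correct decision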

Generally, unless the sample size or the hypotheses are changed, a decrease in the probability of a type I error causes an increase in the probability of a type II error and vice versa. With the significance level $\alpha$ the maximal probability of a type I error is fixed, and the critical region of the test is chosen according to this condition. If the observed value of the test statistic lies in the critical region, the null hypothesis is rejected. Hence, the error probability is under control when a decision is made against $H_0$ but not when the decision is for $H_0$, which needs to be kept in mind when drawing conclusions from test results. If possible, the researcher's conjecture is formulated as the alternative hypothesis, because it is primarily the type I error that is controlled. However, in goodness-of-fit tests one is forced to formulate the researcher's hypothesis, that is, the specific distribution of interest, as the null hypothesis, as it is otherwise usually infeasible to derive the distribution of the test statistic.
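The decision rule based on the critical region can be sketched in R for a two-sided one-sample t-test. This is a minimal illustration only: the data values, the hypothesized mean mu0 and the variable names are made up for the example and do not come from the book.

```r
# Hypothetical sample; the values are invented for illustration only.
x <- c(5.1, 4.9, 5.6, 5.8, 4.7, 5.3, 5.9, 5.2)
mu0 <- 5        # assumed value of the mean under H0: mu = mu0
alpha <- 0.05   # significance level

n <- length(x)

# Observed value of the t statistic, which follows a t distribution
# with n - 1 degrees of freedom if H0 is true.
t_obs <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# Boundary of the two-sided critical region {t : |t| > t_crit}.
t_crit <- qt(1 - alpha / 2, df = n - 1)

# H0 is rejected if the observed statistic lies in the critical region.
reject_h0 <- abs(t_obs) > t_crit
```

The same decision is obtained by comparing the p-value returned by t.test(x, mu = mu0) with alpha.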

The power function measures the quality of a test. It yields the probability of rejecting the null hypothesis for a given true parameter value $\theta$. The test with the greatest power among all tests with a given significance level is called the most powerful test.
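As a sketch of what a power function looks like, consider a one-sided Gauss test of H0: mu <= mu0 against H1: mu > mu0 with known standard deviation. The function name power_gauss and all default values below are assumptions made for this illustration.

```r
# Power function of the one-sided Gauss test (known sigma).
# H0 is rejected if Z = (mean(x) - mu0) * sqrt(n) / sigma exceeds the
# (1 - alpha) quantile of the standard normal distribution. If the true
# mean is mu, Z is normal with mean (mu - mu0) * sqrt(n) / sigma and
# variance 1, which gives the rejection probability below.
power_gauss <- function(mu, mu0 = 0, sigma = 1, n = 25, alpha = 0.05) {
  z_crit <- qnorm(1 - alpha)
  shift <- (mu - mu0) * sqrt(n) / sigma
  1 - pnorm(z_crit - shift)
}

# Power at the boundary of H0 and at some values in the alternative.
power_at_null <- power_gauss(0)
power_curve <- power_gauss(c(0, 0.2, 0.4, 0.6))
```

At the boundary value mu = mu0 the power equals the significance level, and it increases as the true mean moves further into the alternative.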

Traditionally a pre-specified significance level of $\alpha = 0.05$ or $\alpha = 0.01$ is selected. However, there is no reason why a different value should not be chosen.
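That the significance level bounds the probability of a type I error can be checked by simulation. The following sketch repeatedly draws Gaussian samples for which H0: mu = 0 is true and counts how often a two-sided t-test rejects. The seed, sample size and number of repetitions are arbitrary choices for this illustration.

```r
set.seed(2014)   # arbitrary seed, only for reproducibility
alpha <- 0.05    # significance level
n <- 20          # size of each simulated sample
n_sim <- 5000    # number of simulated samples

# Boundary of the two-sided critical region of the t-test.
t_crit <- qt(1 - alpha / 2, df = n - 1)

# Each repetition draws a sample under H0 (true mean 0) and records
# whether the test statistic falls into the critical region.
rejections <- replicate(n_sim, {
  x <- rnorm(n, mean = 0, sd = 1)
  t_obs <- mean(x) / (sd(x) / sqrt(n))
  abs(t_obs) > t_crit
})

# Relative frequency of false rejections, an estimate of the
# type I error probability.
type1_rate <- mean(rejections)
```

The estimate type1_rate should lie close to alpha, up to simulation error.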

Up to here we are in the context of the Neyman–Pearson test theory. Most statistical computer programs do not report whether the calculated test statistic lies within the critical region or not. Instead the p-value (probability value) is given. This is the probability, calculated under the assumption that $H_0$ is true, of obtaining the observed value of the test statistic or a value that is more extreme in the direction of the alternative hypothesis. If the p-value is smaller than $\alpha$, $H_0$ is rejected; otherwise $H_0$ is not rejected.
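A minimal sketch of the p-value approach in R, using the base function t.test() on a made-up sample. The data values and the hypothesized mean are assumptions for this example only.

```r
# Hypothetical sample; the values are invented for illustration only.
x <- c(5.1, 4.9, 5.6, 5.8, 4.7, 5.3, 5.9, 5.2)
alpha <- 0.05   # significance level

# Two-sided one-sample t-test of H0: mu = 5 against H1: mu != 5.
result <- t.test(x, mu = 5, alternative = "two.sided")
p_value <- result$p.value

# The same p-value computed by hand from the t distribution: the
# probability under H0 of a statistic at least as extreme as observed.
t_obs <- unname(result$statistic)
p_manual <- 2 * pt(-abs(t_obs), df = length(x) - 1)

# Decision rule: reject H0 if the p-value is smaller than alpha.
decision <- if (p_value < alpha) "reject H0" else "do not reject H0"
```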

As already mentioned in the introduction this is the common approach. For further reading on the differences between the two approaches please refer to Goodman (1994), Hubbard and Bayarri (2003), Johnstone (1987), and Lehmann (1993).

1.2 Testing statistical hypotheses with SAS and R

Testing statistical hypotheses with SAS and R is very convenient. Many tests are already integrated in these software packages. In SAS tests are invoked via procedures, while R uses functions. Although many test problems are handled in this way, situations may occur where a SAS procedure or an R function is not available. The reasons are manifold. The SAS Institute decides which statistical tests to include in SAS. Even if a newly developed test is accepted for inclusion in SAS, it takes some time to develop a new procedure or to incorporate the test in an existing SAS procedure. If a test is not implemented in a SAS procedure or in the R standard packages, it is likely that the test can be found as a SAS macro or in R user packages which are available through the World Wide Web.

However, in this book we have refrained from presenting tests from SAS macros or R user packages for several reasons. We do not know how long macros, program code, or user packages are supported by the programmer and are therefore available for newer versions of SAS or R. In addition, it is not possible to verify that the code is correct. If a statistical test is not implemented in the SAS software as a procedure or in the R standard packages, we provide an algorithm with short SAS and R code to circumvent these problems. All presented statistical tests are accompanied by an example of their use on a given dataset, so it is easy to retrace the example and to translate the code to your own datasets. Sometimes more than one SAS procedure or R function is available to perform a statistical test. We only present one way to do so.

1.2.1 Programming philosophy of SAS and R

Testing statistical hypotheses in SAS is not the same as in R. While R is a matrix-oriented language, SAS follows a different philosophy (except for SAS/IML). With a matrix-oriented language some calculations are easier. For instance, the average of a few observations, for example, the ages 1, 4, 2, and 5 of four children in a family, can be calculated with one line of code in R by applying the function mean() to the vector containing the values, c(1,4,2,5).

mean(c(1,4,2,5))

Here the numeric vector of data values to be analyzed is inserted directly in the R function. However, it is also possible to call data from a previously defined object, for example, a data frame:

children<-data.frame(age=c(1,4,2,5))
mean(children$age)

In SAS a little more effort is necessary due to the required division into data and proc steps.

data children;
 input age;
 datalines;
1
4
2
5
;
run;

proc means;
 var age;
run;

The dataset children holds the variable age with the observed values 1, 4, 2, and 5. The SAS procedure proc means calculates the mean value. This type of programming philosophy need not be a disadvantage. It can save a lot of time, because the SAS procedures are very powerful and incorporate many statistical calculations in one go.

We assume that the reader is familiar with the basic programming features of SAS or R, such as data input and output, and only remark on some important points related to conducting statistical tests. Concerning the data format, one entry per observation and one column for each variable are usually suitable. However, in some cases it may be necessary to reorganize the dataset for test procedures. We accompany our examples with small datasets (see Appendix A), so that it is easy to see how data need to be arranged for the specific test.
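As a sketch of such a reorganization in R, the following example converts a dataset with one column per group into the layout with one row per observation and a grouping variable, which the formula interface of t.test() expects. The data values and the column names group_a, group_b, value and group are made up for this illustration.

```r
# Hypothetical measurements stored with one column per group.
wide <- data.frame(
  group_a = c(5.1, 4.9, 5.6, 5.8),
  group_b = c(4.4, 4.7, 5.0, 4.3)
)

# stack() returns one row per observation: the measured value and a
# factor naming the column the value came from.
long <- stack(wide)
names(long) <- c("value", "group")

# Two-sample t-test using the grouping variable.
result <- t.test(value ~ group, data = long)
```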

In SAS most statistical tests are performed with procedures, which usually follow the schema:

proc proc-name data=dataset-name options;
 var variable-names options;
 options;
run;

The data= statement identifies the dataset to be analyzed. If it is missing, the most recent dataset is taken. In some procedures it is necessary to fix some options to set up the statistical test, for example, to define the value to test against, or whether the test is one- or two-sided. The var statement is followed by the variables on which the test shall be performed. Sometimes further options can be stated in separate command lines, for instance requesting an exact test. Note that some procedures differ from this general set-up. The procedure proc freq, for example, has no var statement but a tables statement. Occasionally the statement class class-variable is needed, indicating a grouping variable which assigns each observation to a specific group. As the options of procedures can be numerous and not all of them may be needed for the treated test, we restrict our exposition to the indispensable options. The same applies to the output we present for the examples.

Conducting a statistical test in the program R usually only requires one line of code. The common layout of R functions...

Publication date (per publisher)	9 January 2014
Language	English
Subject area	Mathematics / Computer Science ► Mathematics ► Computer Programs / Computer Algebra
	Mathematics / Computer Science ► Mathematics ► Statistics
	Mathematics / Computer Science ► Mathematics ► Probability / Combinatorics
	Technology
Keywords	appropriate • BASIC • Book • Common • Comprehensive • Computational & Graphical Statistics • Description • Examples • following • General • Guide • hypothesis • often • Overview • presented • Problems • questions • SAS • Statistical • Statistical Software / R • Statistical Software / SAS • Statistics • theory • Way
ISBN-13	9781118762608
