Information Quality (eBook)
John Wiley & Sons (Publisher)
978-1-118-89065-3 (ISBN)
Provides an important framework for data analysts in assessing the quality of data and its potential to provide meaningful insights through analysis
Analytics and statistical analysis have become pervasive topics, mainly due to the growing availability of data and analytic tools. Technology, however, fails to deliver insights with added value if the quality of the information it generates is not assured. Information Quality (InfoQ) is a tool developed by the authors to assess the potential of a dataset to achieve a goal of interest, using data analysis. Whether the information quality of a dataset is sufficient is of practical importance at many stages of the data analytics journey, from the pre-data collection stage to the post-data collection and post-analysis stages. It is also critical to various stakeholders: data collection agencies, analysts, data scientists, and management.
This book:
- Explains how to integrate the notions of goal, data, analysis and utility that are the main building blocks of data analysis within any domain.
- Presents a framework for integrating domain knowledge with data analysis.
- Provides a combination of both methodological and practical aspects of data analysis.
- Discusses issues surrounding the implementation and integration of InfoQ in both academic programmes and business/industrial projects.
- Showcases numerous case studies in a variety of application areas such as education, healthcare, official statistics, risk management and marketing surveys.
- Presents a review of software tools from the InfoQ perspective along with example datasets on an accompanying website.
This book will be beneficial for researchers in academia and industry, analysts, consultants, and agencies that collect and analyse data, as well as for undergraduate and postgraduate courses involving data analysis.
Ron S. Kenett, KPA Ltd. and University of Torino, Turin, Italy
Ron S. Kenett is Chairman and CEO of the KPA Group and KPA Ltd., Research Professor at the University of Turin, Italy, International Professor Associate at the Center for Research in Risk Engineering, NYU-Poly, New York, USA, and Visiting Professor at the Faculty of Economics, University of Ljubljana, Slovenia. He has over 25 years of experience in restructuring and improving the competitive position of organizations by integrating statistical methods, process analysis, supporting technologies, and modern human resource management systems. Ron Kenett is Editor in Chief of the Wiley Encyclopedia of Statistics in Quality and Reliability, a Fellow of the Royal Statistical Society, a Senior Member of the American Society for Quality, Past President of the Israeli Statistical Association, Past President of ENBIS (the European Network for Business and Industrial Statistics), and the 2013 Greenfield Medalist of the Royal Statistical Society.
Galit Shmueli, Indian School of Business, India
Galit Shmueli is SRITNE Chaired Professor of Data Analytics and Associate Professor of Statistics & Information Systems at the Indian School of Business. She is best known for her research and teaching in business analytics, with a focus on statistical and data mining methods for contemporary data and applications in information systems and healthcare. Dr. Shmueli's research has been published in the statistics, management, information systems, and marketing literature. She has authored over seventy journal articles, books, textbooks, and book chapters, including the popular textbooks Data Mining for Business Intelligence and Practical Time Series Forecasting. Dr. Shmueli is an award-winning teacher and speaker on data analytics. She has taught at Carnegie Mellon University, University of Maryland, the Israel Institute of Technology, Statistics.com and the Indian School of Business.
Foreword ix
About the authors xi
Preface xii
Quotes about the book xv
About the companion website xviii
PART I THE INFORMATION QUALITY FRAMEWORK 1
1 Introduction to information quality 3
2 Quality of goal, data quality, and analysis quality 18
3 Dimensions of information quality and InfoQ assessment 31
4 InfoQ at the study design stage 53
5 InfoQ at the postdata collection stage 67
PART II APPLICATIONS OF InfoQ 79
6 Education 81
7 Customer surveys 109
8 Healthcare 134
9 Risk management 160
10 Official statistics 181
PART III IMPLEMENTING InfoQ 219
11 InfoQ and reproducible research 221
12 InfoQ in review processes of scientific publications 234
13 Integrating InfoQ into data science analytics programs, research methods courses, and more 252
14 InfoQ support with R 265
15 InfoQ support with Minitab 295
16 InfoQ support with JMP 324
Index 351
1 Introduction to information quality
1.1 Introduction
Suppose you are conducting a study on online auctions and consider purchasing a dataset from eBay, the online auction platform, for the purpose of your study. The data vendor offers you four options that are within your budget:
- Data on all the online auctions that took place in January 2012
- Data on all the online auctions, for cameras only, that took place in 2012
- Data on all the online auctions, for cameras only, that will take place in the next year
- Data on a random sample of online auctions that took place in 2012
Which option would you choose? Perhaps none of these options are of value? Of course, the answer depends on the goal of the study. But it also depends on other considerations such as the analysis methods and tools that you will be using, the quality of the data, and the utility that you are trying to derive from the analysis. In the words of David Hand (2008):
Statisticians working in a research environment… may well have to explain that the data are inadequate to answer a particular question.
While those experienced with data analysis will find this dilemma familiar, the statistics literature and related literatures offer little methodical guidance on how to approach this question or how to evaluate the value of a dataset in such a scenario.
Statistics, data mining, econometrics, and related areas are disciplines that are focused on extracting knowledge from data. They provide a toolkit for testing hypotheses of interest, predicting new observations, quantifying population effects, and summarizing data efficiently. In these empirical fields, measurable data is used to derive knowledge. Yet a clean, exact, and complete dataset, analyzed professionally, might contain no useful information for the problem under investigation. In contrast, a very “dirty” dataset, with missing values and incomplete coverage, can contain useful information for some goals. In some cases, available data can even be misleading (Patzer, 1995, p. 14):
Data may be of little or no value, or even negative value, if they misinform.
The focus of this book is on assessing the potential of a particular dataset for achieving a given analysis goal by employing data analysis methods and considering a given utility. We call this concept information quality (InfoQ). We propose a formal definition of InfoQ and provide guidelines for its assessment. Our objective is to offer a general framework that applies to empirical research. Such a framework has not received much attention in the body of knowledge of the statistics profession, so it can be considered a contribution to both the theory and the practice of applied statistics (Kenett, 2015).
A framework for assessing InfoQ is needed both when designing a study to produce findings of high InfoQ and at the postdesign stage, after the data have been collected. Questions regarding the value of data to be collected, or of data that have already been collected, have important implications in both academic research and practice. With this motivation in mind, we construct the concept of InfoQ and then operationalize it so that it can be implemented in practice.
In this book, we tackle a high-level issue at the core of any data analysis. Rather than concentrate on a specific set of methods or applications, we consider a general concept that underlies any empirical analysis. The InfoQ framework therefore contributes to the literature on statistical strategy, also known as metastatistics (see Hand, 1994).
1.2 Components of InfoQ
Our definition of InfoQ involves four major components that are present in every data analysis: an analysis goal, a dataset, an analysis method, and a utility (Kenett and Shmueli, 2014). The discussion and assessment of InfoQ require examining and considering the complete set of its components as well as the relationships between the components. In such an evaluation we also consider eight dimensions that deconstruct the InfoQ concept. These dimensions are presented in Chapter 3. We start our introduction of InfoQ by defining each of its components.
Before describing each of the four InfoQ components, we introduce the following notation and definitions to help avoid confusion:
- g denotes a specific analysis goal.
- X denotes the available dataset.
- f is an empirical analysis method.
- U is a utility measure.
We use subscript indices to indicate alternatives. For example, to convey K different analysis goals, we use g1, g2,…, gK; J different methods of analysis are denoted f1, f2,…, fJ.
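These four components feed into a single formal expression. Previewing the definition developed later in the book (Kenett and Shmueli, 2014), InfoQ is the utility of applying an analysis method f to a dataset X, conditional on the goal g:

$$\mathrm{InfoQ}(f, X, g) = U\big(f(X \mid g)\big)$$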
Following Hand’s (2008) definition of statistics as “the technology of extracting meaning from data,” we can think of the InfoQ framework as one for evaluating the application of a technology (data analysis) to a resource (data) for a given purpose.
1.2.1 Goal (g)
Data analysis is used for a variety of purposes in research and in industry. The term “goal” can refer to two levels of goals: the high-level goal of the study (the “domain goal”) and the empirical goal (the “analysis goal”). One starts from the domain goal and then translates it into an analysis goal. A classic example is translating a hypothesis driven by a theory into a set of statistical hypotheses.
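As a minimal illustration of this translation, consider the intelligence example discussed below: the domain hypothesis “parents’ intelligence affects their children’s intelligence” might be operationalized as a test about a slope coefficient in a linear regression (standard statistical notation, not notation from the book):

$$\text{childIQ} = \beta_0 + \beta_1\,\text{parentIQ} + \varepsilon, \qquad H_0: \beta_1 = 0 \quad \text{vs.} \quad H_1: \beta_1 \neq 0$$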
There are various classifications of study goals; some classifications span both the domain and analysis goals, while other classification systems focus on describing different analysis goals.
One classification approach divides the domain and analysis goals into three general classes: causal explanation, empirical prediction, and description (see Shmueli, 2010; Shmueli and Koppius, 2011). Causal explanation is concerned with establishing and quantifying the causal relationship between inputs and outcomes of interest. Lab experiments in the life sciences are often intended to establish causal relationships, and academic research in the social sciences is typically focused on causal explanation. In the social science context, the causality structure is based on a theoretical model that establishes the causal effect of some constructs (abstract concepts) on other constructs. The data collection stage is therefore preceded by a construct operationalization stage, where the researcher establishes which measurable variables can represent the constructs of interest. An example is investigating the causal effect of parents’ intelligence on their children’s intelligence; the construct “intelligence” can be measured in various ways, such as via IQ tests. The goal of empirical prediction differs from causal explanation: typical examples are forecasting future values of a time series and predicting the output value for new observations given a set of input variables. Recommendation systems on various websites, which aim to predict the services or products a user is most likely to be interested in, fall into this class, as do forecasts of particular economic measures or indices. Finally, descriptive goals include quantifying and testing for population effects by using data summaries, graphical visualizations, statistical models, and statistical tests.
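To make the explanation-versus-prediction distinction concrete, here is a minimal sketch on synthetic data (the variable names and values are invented for illustration, and the libraries are standard Python tools, not software discussed in the book): the same linear model family is used once for inference about a coefficient and once for out-of-sample prediction.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=1)
n = 200
parent_iq = rng.normal(100, 15, size=n)                        # hypothetical input
child_iq = 40 + 0.6 * parent_iq + rng.normal(0, 10, size=n)    # synthetic outcome

# Explanatory goal: estimate and test the effect of parent_iq on child_iq.
X = sm.add_constant(parent_iq)
fit = sm.OLS(child_iq, X).fit()
print(fit.params)   # estimated intercept and slope (the "effect")
print(fit.pvalues)  # evidence against H0: slope = 0

# Predictive goal: same data and model family, but success is judged
# by accuracy on observations the model has never seen.
x_tr, x_te, y_tr, y_te = train_test_split(
    parent_iq.reshape(-1, 1), child_iq, random_state=1)
pred = LinearRegression().fit(x_tr, y_tr).predict(x_te)
rmse = np.sqrt(np.mean((y_te - pred) ** 2))
print(rmse)         # out-of-sample root mean squared error
```

The point of the contrast is that the explanatory analysis is evaluated by the precision and validity of the estimated effect, while the predictive analysis is evaluated purely by holdout accuracy; the same dataset can have high InfoQ for one goal and low InfoQ for the other.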
A different but related goal classification approach (Deming, 1953) introduces the distinction between enumerative studies, aimed at answering the question “how many?”, and analytic studies, aimed at answering the question “why?”
A third classification (Tukey, 1977) classifies studies into exploratory and confirmatory data analysis.
Our use of the term “goal” includes all these different types of goals and goal classifications. For examples of such goals in the context of customer satisfaction surveys, see Chapter 7 and Kenett and Salini (2012).
1.2.2 Data (X)
Data is a broadly defined term that includes any type of data intended to be used in the empirical analysis. Data can arise from different collection instruments: surveys, laboratory tests, field experiments, computer experiments, simulations, web searches, mobile recordings, observational studies, and more. Data can be primary, collected specifically for the purpose of the study, or secondary, collected for a different reason. Data can be univariate or multivariate, discrete, continuous, or mixed. Data can contain semantic, unstructured information in the form of text, images, audio, and video. Data can have various structures, including cross-sectional data, time series, panel data, networked data, geographic data, and more. Data can include information from a single source or from multiple sources. Data can be of any size (from a single observation in case studies to “big data” with zettabytes) and any dimension.
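As an illustration of a few of these data structures, here is a minimal sketch using pandas (the column names and values are hypothetical, invented for illustration):

```python
import pandas as pd

# Cross-sectional data: one row per unit, observed at a single point in time.
cross_sectional = pd.DataFrame(
    {"auction_id": [101, 102, 103], "closing_price": [120.0, 85.5, 40.0]})

# Time series data: observations indexed by time.
time_series = pd.Series(
    [14, 18, 11, 20],
    index=pd.date_range("2012-01-01", periods=4, freq="D"),
    name="daily_auctions")

# Panel data: repeated observations of the same units over time (long format).
panel = pd.DataFrame({
    "seller": ["a", "a", "b", "b"],
    "month":  [1, 2, 1, 2],
    "sales":  [5, 7, 3, 4],
})
```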
1.2.3 Analysis (f)
We use the general term data analysis to encompass any empirical analysis applied to data. This includes statistical models and methods (parametric, semiparametric, nonparametric, Bayesian and classical, etc.), data mining algorithms, econometric models, graphical methods, and operations research methods (such as simplex optimization). Methods can be as simple as summary statistics or complex multilayer models, computationally simple or...
| Publication date (per publisher) | 13.10.2016 |
|---|---|
| Language | English |
| Subject areas | Mathematics / Computer Science ► Computer Science ► Databases; Mathematics / Computer Science ► Mathematics ► Statistics; Mathematics / Computer Science ► Mathematics ► Probability / Combinatorics; Technology |
| Keywords | Analytic tools • Computer Science • Data Analysis • Data Mining & Knowledge Discovery • Dataset • Design • Framework • Goal • InfoQ • JMP • Minitab • Quality • Statistics • Statistics for Finance, Business & Economics |
| ISBN-10 | 1-118-89065-5 / 1118890655 |
| ISBN-13 | 978-1-118-89065-3 / 9781118890653 |
Copy protection: Adobe DRM
Adobe DRM is a copy-protection scheme intended to protect the eBook against misuse. The eBook is authorized to your personal Adobe ID at download time, and you can then read it only on devices that are also registered to your Adobe ID.
File format: EPUB (Electronic Publication)
EPUB is an open standard for eBooks and is particularly well suited to fiction and general non-fiction. The text reflows dynamically to match the display and font size, which also makes EPUB a good fit for mobile reading devices.
System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You will need reading software registered to your Adobe ID.
eReader: This eBook can be read on (almost) all eBook readers. However, it is not compatible with the Amazon Kindle.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You will need a reading app registered to your Adobe ID.
Buying eBooks from abroad
For tax law reasons we can sell eBooks only within Germany and Switzerland. Unfortunately, we cannot fulfill eBook orders from other countries.