A Practical Guide to Data Mining for Business and Industry (eBook)
John Wiley & Sons (publisher)
978-1-118-76372-8 (ISBN)
Data mining is well on its way to becoming a recognized discipline in the overlapping areas of IT, statistics, machine learning, and AI. Practical Data Mining for Business presents a user-friendly approach to data mining methods, covering the typical uses to which it is applied. The methodology is complemented by case studies to create a versatile reference book, allowing readers to look for specific methods as well as for specific applications. The book is formatted to allow statisticians, computer scientists, and economists to cross-reference from a particular application or method to sectors of interest.
Andrea Ahlemeyer-Stubbe, Director Strategic Analytics, DRAFTFCB München GmbH, Germany
Shirley Coleman, Principal Statistician, Industrial Statistics Research Unit, School of Maths and Statistics, Newcastle University, UK
A Practical Guide to Data Mining for Business and Industry
Copyright
Contents
Glossary of Terms
Part I Data Mining Concept
1 Introduction
1.1 Aims of the Book
1.2 Data Mining Context
1.2.1 Domain Knowledge
1.2.2 Words to Remember
1.2.3 Associated Concepts
1.3 Global Appeal
1.4 Example Datasets Used in This Book
1.5 Recipe Structure
1.6 Further Reading and Resources
2 Data Mining Definition
2.1 Types of Data Mining Questions
2.1.1 Population and Sample
2.1.2 Data Preparation
2.1.3 Supervised and Unsupervised Methods
2.1.4 Knowledge-Discovery Techniques
2.2 Data Mining Process
2.3 Business Task: Clarification of the Business Question behind the Problem
2.4 Data: Provision and Processing of the Required Data
2.4.1 Fixing the Analysis Period
2.4.2 Basic Unit of Interest
2.4.3 Target Variables
2.4.4 Input Variables/Explanatory Variables
2.5 Modelling: Analysis of the Data
2.6 Evaluation and Validation during the Analysis Stage
2.7 Application of Data Mining Results and Learning from the Experience
Part II Data Mining Practicalities
3 All about Data
3.1 Some Basics
3.1.1 Data, Information, Knowledge and Wisdom
3.1.2 Sources and Quality of Data
3.1.3 Measurement Level and Types of Data
3.1.4 Measures of Magnitude and Dispersion
3.1.5 Data Distributions
3.2 Data Partition: Random Samples for Training, Testing and Validation
3.3 Types of Business Information Systems
3.3.1 Operational Systems Supporting Business Processes
3.3.2 Analysis-Based Information Systems
3.3.3 Importance of Information
3.4 Data Warehouses
3.4.1 Topic Orientation
3.4.2 Logical Integration and Homogenisation
3.4.3 Reference Period
3.4.4 Low Volatility
3.4.5 Using the Data Warehouse
3.5 Three Components of a Data Warehouse: DBMS, DB and DBCS
3.5.1 Database Management System (DBMS)
3.5.2 Database (DB)
3.5.3 Database Communication Systems (DBCS)
3.6 Data Marts
3.6.1 Regularly Filled Data Marts
3.6.2 Comparison between Data Marts and Data Warehouses
3.7 A Typical Example from the Online Marketing Area
3.8 Unique Data Marts
3.8.1 Permanent Data Marts
3.8.2 Data Marts Resulting from Complex Analysis
3.9 Data Mart: Do’s and Don’ts
3.9.1 Do’s and Don’ts for Processes
3.9.2 Do’s and Don’ts for Handling
3.9.3 Do’s and Don’ts for Coding/Programming
4 Data Preparation
4.1 Necessity of Data Preparation
4.2 From Small and Long to Short and Wide
4.3 Transformation of Variables
4.4 Missing Data and Imputation Strategies
4.5 Outliers
4.6 Dealing with the Vagaries of Data
4.6.1 Distributions
4.6.2 Tests for Normality
4.6.3 Data with Totally Different Scales
4.7 Adjusting the Data Distributions
4.7.1 Standardisation and Normalisation
4.7.2 Ranking
4.7.3 Box–Cox Transformation
4.8 Binning
4.8.1 Bucket Method
4.8.2 Analytical Binning for Nominal Variables
4.8.3 Quantiles
4.8.4 Binning in Practice
4.9 Timing Considerations
4.10 Operational Issues
5 Analytics
5.1 Introduction
5.2 Basis of Statistical Tests
5.2.1 Hypothesis Tests and P Values
5.2.2 Tolerance Intervals
5.2.3 Standard Errors and Confidence Intervals
5.3 Sampling
5.3.1 Methods
5.3.2 Sample Sizes
5.3.3 Sample Quality and Stability
5.4 Basic Statistics for Pre-analytics
5.4.1 Frequencies
5.4.2 Comparative Tests
5.4.3 Cross Tabulation and Contingency Tables
5.4.4 Correlations
5.4.5 Association Measures for Nominal Variables
5.4.6 Examples of Output from Comparative and Cross Tabulation Tests
5.5 Feature Selection/Reduction of Variables
5.5.1 Feature Reduction Using Domain Knowledge
5.5.2 Feature Selection Using Chi-Square
5.5.3 Principal Components Analysis and Factor Analysis
5.5.4 Canonical Correlation, PLS and SEM
5.5.5 Decision Trees
5.5.6 Random Forests
5.6 Time Series Analysis
6 Methods
6.1 Methods Overview
6.2 Supervised Learning
6.2.1 Introduction and Process Steps
6.2.2 Business Task
6.2.3 Provision and Processing of the Required Data
6.2.4 Analysis of the Data
6.2.5 Evaluation and Validation of the Results (during the Analysis)
6.2.6 Application of the Results
6.3 Multiple Linear Regression for Use When Target is Continuous
6.3.1 Rationale of Multiple Linear Regression Modelling
6.3.2 Regression Coefficients
6.3.3 Assessment of the Quality of the Model
6.3.4 Example of Linear Regression in Practice
6.4 Regression When the Target is Not Continuous
6.4.1 Logistic Regression
6.4.2 Example of Logistic Regression in Practice
6.4.3 Discriminant Analysis
6.4.4 Log-Linear Models and Poisson Regression
6.5 Decision Trees
6.5.1 Overview
6.5.2 Selection Procedures of the Relevant Input Variables
6.5.3 Splitting Criteria
6.5.4 Number of Splits (Branches of the Tree)
6.5.5 Symmetry/Asymmetry
6.5.6 Pruning
6.6 Neural Networks
6.7 Which Method Produces the Best Model? A Comparison of Regression, Decision Trees and Neural Networks
6.8 Unsupervised Learning
6.8.1 Introduction and Process Steps
6.8.2 Business Task
6.8.3 Provision and Processing of the Required Data
6.8.4 Analysis of the Data
6.8.5 Evaluation and Validation of the Results (during the Analysis)
6.8.6 Application of the Results
6.9 Cluster Analysis
6.9.1 Introduction
6.9.2 Hierarchical Cluster Analysis
6.9.3 K-Means Method of Cluster Analysis
6.9.4 Example of Cluster Analysis in Practice
6.10 Kohonen Networks and Self-Organising Maps
6.10.1 Description
6.10.2 Example of SOMs in Practice
6.11 Group Purchase Methods: Association and Sequence Analysis
6.11.1 Introduction
6.11.2 Analysis of the Data
6.11.3 Group Purchase Methods
6.11.4 Examples of Group Purchase Methods in Practice
7 Validation and Application
7.1 Introduction to Methods for Validation
7.2 Lift and Gain Charts
7.3 Model Stability
7.4 Sensitivity Analysis
7.5 Threshold Analytics and Confusion Matrix
7.6 ROC Curves
7.7 Cross-Validation and Robustness
7.8 Model Complexity
Part III Data Mining in Action
8 Marketing: Prediction
8.1 Recipe 1: Response Optimisation: To Find and Address the Right Number of Customers
8.2 Recipe 2: To Find the x% of Customers with the Highest Affinity to an Offer
8.3 Recipe 3: To Find the Right Number of Customers to Ignore
8.4 Recipe 4: To Find the x% of Customers with the Lowest Affinity to an Offer
8.5 Recipe 5: To Find the x% of Customers with the Highest Affinity to Buy
8.6 Recipe 6: To Find the x% of Customers with the Lowest Affinity to Buy
8.7 Recipe 7: To Find the x% of Customers with the Highest Affinity to a Single Purchase
8.8 Recipe 8: To Find the x% of Customers with the Highest Affinity to Sign a Long-Term Contract in Communication Areas
8.9 Recipe 9: To Find the x% of Customers with the Highest Affinity to Sign a Long-Term Contract in Insurance Areas
9 Intra-Customer Analysis
9.1 Recipe 10: To Find the Optimal Amount of Single Communication to Activate One Customer
9.2 Recipe 11: To Find the Optimal Communication Mix to Activate One Customer
9.3 Recipe 12: To Find and Describe Homogeneous Groups of Products
9.4 Recipe 13: To Find and Describe Groups of Customers with Homogeneous Usage
9.5 Recipe 14: To Predict the Order Size of Single Products or Product Groups
9.6 Recipe 15: Product Set Combination
9.7 Recipe 16: To Predict the Future Customer Lifetime Value of a Customer
10 Learning from a Small Testing Sample and Prediction
10.1 Recipe 17: To Predict Demographic Signs (Like Sex, Age, Education and Income)
10.2 Recipe 18: To Predict the Potential Customers of a Brand New Product or Service in Your Databases
10.3 Recipe 19: To Understand Operational Features and General Business Forecasting
11 Miscellaneous
11.1 Recipe 20: To Find Customers Who Will Potentially Churn
11.2 Recipe 21: Indirect Churn Based on a Discontinued Contract
11.3 Recipe 22: Social Media Target Group Descriptions
11.4 Recipe 23: Web Monitoring
11.5 Recipe 24: To Predict Who is Likely to Click on a Special Banner
12 Software and Tools: A Quick Guide
12.1 List of Requirements When Choosing a Data Mining Tool
12.2 Introduction to the Idea of Fully Automated Modelling (FAM)
12.2.1 Predictive Behavioural Targeting
12.2.2 Fully Automatic Predictive Targeting and Modelling Real-Time Online Behaviour
12.3 FAM Function
12.4 FAM Architecture
12.5 FAM Data Flows and Databases
12.6 FAM Modelling Aspects
12.7 FAM Challenges and Critical Success Factors
12.8 FAM Summary
13 Overviews
13.1 To Make Use of Official Statistics
13.2 How to Use Simple Maths to Make an Impression
13.2.1 Approximations
13.2.2 Absolute and Relative Values
13.2.3 % Change
13.2.4 Values in Context
13.2.5 Confidence Intervals
13.2.6 Rounding
13.2.7 Tables
13.2.8 Figures
13.3 Differences between Statistical Analysis and Data Mining
13.3.1 Assumptions
13.3.2 Values Missing Because ‘Nothing Happened’
13.3.3 Sample Sizes
13.3.4 Goodness-of-Fit Tests
13.3.5 Model Complexity
13.4 How to Use Data Mining in Different Industries
13.5 Future Views
Bibliography
Index
"A Practical Guide to Data Mining for Business and Industry gives practical tools on how information can be extracted from masses of data. The book is very well written, in a conversational tone that makes it enjoyable to read. The authors are excellent communicators. If you are interested in learning about data mining, learning to do a particular task in data mining, looking for a textbook to use in a data mining or analytics course, or have a problem or data analytic task you are working on, this book would be an excellent place to start."
(Mathematical Association of America, 23 August 2014)
| Publication date (per publisher) | 21 March 2014 |
|---|---|
| Language | English |
| Subject areas | Computer Science ► Databases ► Data Warehouse / Data Mining |
| | Mathematics / Computer Science ► Mathematics ► Financial / Business Mathematics |
| | Mathematics / Computer Science ► Mathematics ► Statistics |
| | Mathematics / Computer Science ► Mathematics ► Probability / Combinatorics |
| | Technology |
| | Economics ► Business Administration / Management |
| Keywords | Computer Science • Database & Data Warehousing Technologies • Data Mining • Data Mining Statistics • Databases and Data Warehousing • Financial and Business Statistics • practical data mining, machine learning, business intelligence, applied data mining, data mining applications, data mining methods, data mining how to, data mining for business • Statistics • Statistics for Finance, Business & Economics |
| ISBN-10 | 1-118-76372-6 / 1118763726 |
| ISBN-13 | 978-1-118-76372-8 / 9781118763728 |
Copy protection: Adobe DRM
Adobe DRM is a copy-protection scheme intended to protect the eBook against misuse. The eBook is authorised to your personal Adobe ID at download. You can then read the eBook only on devices that are also registered to your Adobe ID.
File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly suited to specialist books with columns, tables, and figures. A PDF can be displayed on almost all devices but is of only limited suitability for small displays (smartphone, eReader).
System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You need a
eReader: This eBook can be read on (almost) all eBook readers. It is not compatible with the Amazon Kindle, however.
Smartphone/tablet: Whether Apple or Android, you can read this eBook. You need a
Buying eBooks from abroad
For tax reasons, we can sell eBooks only within Germany and Switzerland. Regrettably, we cannot fulfil eBook orders from other countries.