Data Mining and Data Visualization (eBook)
800 pages
Elsevier Science (publisher)
978-0-08-045940-0 (ISBN)
Key Features:
- Distinguished contributors who are international experts in aspects of data mining
- Includes data mining approaches to non-numerical data, including text data, Internet traffic data, and geographic data
- Highly topical discussions reflecting current thinking on contemporary technical issues, e.g. streaming data
- Discusses a taxonomy of dataset sizes, computational complexity, and scalability, topics that most discussions ignore
- Thorough discussion of data visualization issues blending statistical, human factors, and computational insights
Data Mining and Data Visualization focuses on dealing with large-scale data, a field commonly referred to as data mining. The book is divided into three sections. The first introduces the statistical aspects of data mining and machine learning and includes applications to text analysis, computer intrusion detection, and the hiding of information in digital files. The second covers a variety of statistical methodologies that have proven effective in data mining applications: clustering, classification, multivariate density estimation, tree-based methods, pattern recognition, outlier detection, genetic algorithms, and dimensionality reduction. The third focuses on data visualization and covers the visualization of high-dimensional data, novel graphical techniques with a focus on human factors, interactive graphics, and data visualization using virtual reality. The book presents a thorough cross section of work by internationally renowned thinkers who are inventing methods for dealing with a new data paradigm.
Front cover
Copyright
Front matter
Preface
Table of contents
Contributors
Body
1. Statistical Data Mining
Introduction
Computational complexity
Order of magnitude considerations
Feasibility limits due to CPU performance
Feasibility limits due to file transfer performance
Feasibility limits due to visual resolution
The computer science roots of data mining
Knowledge discovery in databases and data mining
Association rules
Data preparation
Missing values and outliers
Quantization
Databases
SQL
Data cubes and OLAP
Statistical methods for data mining
Density estimation
Cluster analysis
Hierarchical clustering
The number of groups problem
Artificial neural networks
The biological basis
Functioning of an artificial neural network
Back propagation
Visual data mining
The four stages of data graphics
Graphics constructs for visual data mining
Example 1 - PRIM 7 data
Example 2 - iterative denoising with hyperspectral data
Streaming data
Recursive analytic formulations
Counts, moments and densities
Evolutionary graphics
Waterfall diagrams and transient geographic mapping
Block-recursive plots and conditional plots
A final word
Acknowledgements
References
2. From Data Mining to Knowledge Mining
Introduction
Knowledge generation operators
Discovering rules and patterns via AQ learning
Types of problems in learning from examples
Clustering of entities into conceptually meaningful categories
Automated improvement of the search space: constructive induction
Reducing the amount of data: selecting representative examples
Integrating qualitative and quantitative methods of numerical discovery
Predicting processes qualitatively
Knowledge improvement via incremental learning
Summarizing the logical data analysis approach
Strong patterns vs. complete and consistent rules
Ruleset visualization via concept association graphs
Integration of knowledge generation operators
Summary
Acknowledgements
References
3. Mining Computer Security Data
Introduction
Basic TCP/IP
Overview of networking
The threat
Probes and scans
Denial of service attacks
Gaining access
Network monitoring
TCP sessions
Signatures versus anomalies
User profiling
Program profiling
Conclusions
References
4. Data Mining of Text Files
Introduction and background
Natural language processing at the word and sentence level
Hidden Markov models
Probabilistic context-free grammars
Word sense disambiguation
Supervised disambiguation
Unsupervised disambiguation
Approaches beyond the word and sentence level
Information retrieval
Vector space model
Generic implementation
Using term weights
Latent Semantic Indexing (LSI)
Other approaches
The bigram proximity matrix
Measures of semantic similarity
Matching coefficient
Jaccard coefficient
Ochiai measure (also called cosine)
L1 distance
Information radius measure (IRad)
Document classification via supervised learning
Document classification via model-based clustering
Towards knowledge discovery
WEBSOM
Summary
References
5. Text Data Mining with Minimal Spanning Trees
Introduction
Approach
Results
Datasets
Feature extraction
Automated serendipity extraction on the Science News data set with no user driven focus of attention
Automated serendipity extraction on the ONR ILIR data set with no user driven focus of attention
Automated serendipity extraction on the Science News data set with user driven focus of attention
Clustering results on the ONR ILIR dataset
Clustering results on the Science News dataset
Conclusions
Acknowledgements
References
6. Information Hiding: Steganography and Steganalysis
Introduction
Image formats
Steganography
Embedding by modifying carrier bits
Embedding using pairs of values
Steganalysis
Relationship of steganography to watermarking
Literature survey
Conclusions
References
7. Canonical Variate Analysis and Related Methods for Reduction of Dimensionality and Graphical Representation
Introduction
Canonical coordinates
Mahalanobis space
Computation of SVD
Canonical coordinates
Graphical display of profiles and variables
Canonical coordinates for profiles
Canonical coordinates for variables
Loss of information due to dimensionality reduction
An example
Typical (or eigen) profiles
Principal component analysis
Preprocessing of data
Individual and biplots
P (profile) plot
V (variable) plot
P0 (unweighted profile) plot
V0 (unweighted variable) plot
PV0 biplot
V0n (unweighted normalized variable) plot
Ps (standardized profile) plot
Vn (normalized variable) plot
P0Vn biplot
Vs (standardized variable) plot
PVs biplot
Two-way contingency tables (correspondence analysis)
Discussion
References
8. Pattern Recognition
Background
Basics
Practical classification rules
Linear discriminant analysis
Logistic discrimination
The naive Bayes model
The perceptron
Tree classifiers
Local nonparametric methods
Neural networks
Support vector machines
Other approaches
Other issues
Further reading
References
9. Multidimensional Density Estimation
Introduction
Classical density estimators
Properties of histograms
Maximum likelihood and histograms
L2 theory of histograms
Practical histogram rules
Frequency polygons
Multivariate frequency curves
Kernel estimators
Averaged shifted histograms
Kernel estimators
Multivariate kernel options
Locally adaptive estimators
Balloon estimators
Sample point estimators
Parameterization of sample-point estimators
Estimating bandwidth matrices
Other estimators
Mixture density estimation
Fitting mixture models
An example
Visualization of densities
Higher dimensions
Curse of dimensionality
Discussion
References
10. Multivariate Outlier Detection and Robustness
Introduction
Multivariate location and scatter
The need for robustness
Description of the MCD
The C-step
Computational improvements
The FAST-MCD algorithm
Examples
Multiple regression
Multivariate regression
Classification
Principal component analysis
Classical PCA
Robust PCA
Diagnostic plot
Example
Principal component regression
Computation
Selecting the number of components
Example
Partial Least Squares Regression
Some other multivariate frameworks
Availability
Acknowledgements
References
11. Classification and Regression Trees, Bagging, and Boosting
Introduction
Classification and regression trees
Bagging and boosting
Using CART to create a classification tree
Classification trees
Overview of how CART creates a tree
Determining the predicted class for a terminal node
Selection of splits to create a partition
Estimating the misclassification rate and selecting the right-sized tree
Alternative approaches
Using CART to create a regression tree
Other issues pertaining to CART
Interpretation
Nonoptimality
Missing values
Bagging
Motivation for the method
When and how bagging works
Boosting
AdaBoost
Some related methods
When and how boosting works
References
12. Fast Algorithms for Classification Using Class Cover Catch Digraphs
Introduction
Class cover catch digraphs
CCCD for classification
Cluster catch digraph
Fast algorithms
Further enhancements
Streaming data
Examples using the fast algorithms
Sloan Digital Sky Survey
Text processing
Discussion
Acknowledgements
References
13. On Genetic Algorithms and their Applications
Introduction
History
Genetic algorithms
Calculus-based schemes
Enumerative-based optimization schemes
Genetic algorithms - an example
Operational functionality of genetic algorithms
The reproduction operator
The crossover operator
The mutation operator
Encryption and other considerations
Schemata
Generalized penalty methods
Multi-objective optimization
Fuzzy logic controller
Mathematical underpinnings
Mathematical analysis
Schema Theorem
Hybridization
Genetic algorithm fitness
Scaled fitness
Windowing technique
Linear normalization technique
Fitness technique
High penalty
Moderate penalty
Elimination
Techniques for attaining optimization
Elitism
Linear probability
Steady-state technique
Advanced crossover techniques
Two-point crossover
Uniform crossover
Partially mixed crossover
Uniform order-based crossover
Mutation
Uniform order-based mutation
Advanced mutation
Genetic algorithm parameters
Multi-parameters
Concatenated, multi-parameter, mapped, fixed-point coding
Exploitable techniques
Inversion operator
Addition operator
Deletion operator
Closing remarks
Acknowledgements
References
Further reading
14. Computational Methods for High-Dimensional Rotations in Data Visualization
Introduction
Applications
Terminology
Tools for constructing plane and frame interpolations: orthonormal frames and planar rotations
Minimal subspace restriction
Planar rotations
Calculation and control of speed
Outline of an algorithm for interpolation
Interpolating paths of planes
Interpolating paths of frames
Orthogonal matrix paths and optimal paths for full-dimensional tours
Givens paths
Householder paths
Conclusions
References
15. Some Recent Graphics Templates and Software for Showing Statistical Summaries
Introduction
Background for quantitative graphics design
General guidance
Challenging convention
Templates and GUIs
The template for linked micromap (LM) plots
Micromap variations
Statistical panel variations
Name panel variations
Interactive extensions
Dynamically conditioned choropleth maps
Self-similar coordinates plots
Closing remarks
Acknowledgements
References
16. Interactive Statistical Graphics: the Paradigm of Linked Views
Graphics, statistics and the computer
Literature review
Software review
The interactive paradigm
Data displays
General definition
Sample population
Model operations
Identity model
Variable transformations
Pair operator
Split operator
Weight operator
Categorize operator
Projection operator
Linear models
Smoothing models
Models with missing values
Types of graphics
Style: point
Dotplot
Scatterplot
Trace plot
Style: area
Bar charts and pie charts
Spine plot
Histogram
Mosaic plot
Polygon map
Style: curves, lines
Style: hybrid
Boxplot
Parallel coordinate plots
3D rotating plot
Biplot (PCA)
Panel plots
Scatterplot matrix
Conditional plots
Style: lists
Text list or variable list
Style: tables
Extensions for missing values
Direct object manipulation
Selection
Selection tools
Zero-dimensional selection tools
One-dimensional selection tools
Two-dimensional selection tools
Selection memory
Selection operation
Graphical selection
Axes based selection
Data queries
Interaction at the frame level
Changing frame
Resizing frame
Changing frame color
Interaction at the type level
Operations on the graphical elements
Changing graphical elements
Changing attributes of graphical elements
Adding or removing graphical elements
Axes operations
Zooming
Changing brightness
Changing color schemes
Reformatting type
Changing aspect ratio
Sorting data representing objects
Interactions at the model level
Changing the model
Model parameters
Inclusion/exclusion of variables
Reordering variables
Grouping categories
Weighting
Adding model information
Changing scales
Re-ordering scales
Logical zooming
Interaction at sample population level
Selecting individuals
Grouping
Indirect object manipulation
Internal linking structures
1-to-1 linking
1-to-n linking
m-to-1 linking
Querying
Querying a single graphical element
Querying two or more graphical elements
Interrogating axes
External linking structure
Linking frames
Arranging frames
Linking frame size
Linking types
Linking graphical elements
Linking axes
Linking models
Linking observations
Linking scales
Linking sample populations
Identity linking
Hierarchical linking
Distance and neighborhood linking
Visualization of linked highlighting
Attributive highlighting
Overlaying
Proportional highlighting
Juxtaposition
Visualization of grouping
Linking interrogation
Bi-directional linking in the trace plot
Linked low-dimensional views
Conditional probabilities
Detecting outliers
Clustering and classification
Geometric structure
Relationships
Models with continuous response
Models with discrete response
Independence models
Conclusion
Future work
References
17. Data Visualization and Virtual Reality
Introduction
Computer graphics
Shape
Transformation
Viewing
Color and lighting
Texture mapping
Transparency
Graphics libraries
Graphics software tools
Visualization
Modeling and rendering
Animation and simulation
File format converters
Graphics user interfaces
Data visualization
Data type
Volumetric data - volume rendering
Vector data - fluid visualization
Large datasets - computation and measurement
Abstract data - information visualization
Interactive visualization
Computational steering
Parallel coordinates
Linked micromap plots
Display panels and study units
Sorting the study units
Linking the related elements of a study unit
Grouping the study units
Micromap magnification
Drill-down and navigation
Overall look of the statistical summaries
Displaying different statistical data sets
Statistical data retrieval
Genetic algorithm data visualization
Virtual reality
Hardware and software
Non-immersive systems
Basic VR system properties
VR tools
VR simulation tools
A list of VR tools
Basic functions in VR tool
Characteristics of VR
Some examples of visualization using VR
References
Back matter
Colour figures
Index
Contents of Previous Volumes
| Publication date (per publisher) | 2 May 2005 |
|---|---|
| Language | English |
| Subject area | Mathematics / Computer Science ► Mathematics ► Statistics |
| Technology | |
| ISBN-10 | 0-08-045940-4 / 0080459404 |
| ISBN-13 | 978-0-08-045940-0 / 9780080459400 |
Copy protection: Adobe DRM
Adobe DRM is a copy-protection scheme intended to protect the eBook against misuse. The eBook is authorized to your personal Adobe ID at download time. You can then read it only on devices that are also registered to your Adobe ID.
File format: EPUB (Electronic Publication)
EPUB is an open standard for eBooks and is particularly well suited to fiction and non-fiction. The body text adapts dynamically to the display and font size, so EPUB also works well on mobile reading devices.
System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You need a …
eReader: This eBook can be read on (almost) all eBook readers. It is not compatible with the Amazon Kindle, however.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You need a …
Buying eBooks from abroad
For tax reasons, we can sell eBooks only within Germany and Switzerland. Unfortunately, we cannot fulfill eBook orders from other countries.