Pro Spark Streaming (eBook)

The Zen of Real-Time Analytics Using Apache Spark

Zubair Nabi (Author)

eBook Download: PDF

2016 | 1st ed.
XIX, 230 pages
Apress (Publisher)
978-1-4842-1479-4 (ISBN)

Learn the right cutting-edge skills and knowledge to leverage Spark Streaming to implement a wide array of real-time, streaming applications. Pro Spark Streaming walks you through end-to-end real-time application development using real-world applications, data, and code. Taking an application-first approach, each chapter introduces use cases from a specific industry and uses publicly available datasets from that domain to unravel the intricacies of production-grade design and implementation. The domains covered in the book include social media, the sharing economy, finance, online advertising, telecommunication, and IoT.

In the last few years, Spark has become synonymous with big data processing. DStreams enhance the underlying Spark processing engine to support streaming analysis with a novel micro-batch processing model. Pro Spark Streaming by Zubair Nabi will enable you to become a specialist in latency-sensitive applications by leveraging the key features of DStreams, micro-batch processing, and functional programming. To this end, the book includes ready-to-deploy examples and actual code. Pro Spark Streaming will act as the bible of Spark Streaming.

What You'll Learn:

Discover Spark Streaming application development and best practices
Understand the application and vitality of streaming analytics to a number of industries and domains
Work with the low-level details of discretized streams
Optimize production-grade deployments of Spark Streaming via configuration recipes and instrumentation using Graphite, collectd, and Nagios
Ingest data from disparate sources including MQTT, Flume, Kafka, Twitter, and a custom HTTP receiver
Integrate and couple with HBase, Cassandra, and Redis
Take advantage of design patterns for side-effects and maintaining state across the Spark Streaming micro-batch model
Implement real-time and scalable ETL using data frames, SparkSQL, Hive, and SparkR
Use streaming machine learning, predictive analytics, and recommendations
Mesh batch processing with stream processing via the Lambda architecture

Who This Book Is For:

The audience includes data scientists, big data experts, BI analysts, and data architects.

Zubair Nabi is one of the very few computer scientists who have solved Big Data problems in all three domains: academia, research, and industry. He currently works at Qubit, a London-based start-up backed by Goldman Sachs, Accel Partners, Salesforce Ventures, and Balderton Capital. Qubit helps retailers understand their customers and provide a personalized customer experience, and it has a rapidly growing client base that includes Staples, Emirates, Thomas Cook, and Topshop. Prior to Qubit, he was a researcher at IBM Research, where he worked at the intersection of Big Data systems and analytics to solve real-world problems in the telecommunication, electricity, and urban dynamics space.

Zubair’s work has been featured in MIT Technology Review, SciDev, CNET, and Asian Scientist, and on Swedish National Radio, among others. He has authored more than 20 research papers, published by some of the top publication venues in computer science including USENIX Middleware, ECML PKDD, and IEEE BigData; and he also has a number of patents to his credit.

Zubair has an MPhil in computer science with distinction from Cambridge.

Contents at a Glance
Contents
About the Author
About the Technical Reviewer
Acknowledgments
Introduction
Chapter 1: The Hitchhiker’s Guide to Big Data
Before Spark
The Era of Web 2.0
From SQL to NoSQL
MapReduce: The Swiss Army Knife of Distributed Data Processing
Word Count a la MapReduce
Hadoop: An Elephant with Big Dreams
Sensors, Sensors Everywhere
Spark Streaming: At the Intersection of MapReduce and CEP
Chapter 2: Introduction to Spark
Installation
Execution
Standalone Cluster
Master
Workers
UI
YARN
First Application
Build
Execution
Local Execution
Standalone Cluster
YARN
SparkContext
Creation of RDDs
Handling Dependencies
Creating Shared Variables
Job execution
RDD
Persistence
Transformations
Actions
Summary
Chapter 3: DStreams: Real-Time RDDs
From Continuous to Discretized Streams
First Streaming Application
Build and Execution
StreamingContext
Creating DStreams
DStream Consolidation
Job Execution
DStreams
The Anatomy of a Spark Streaming Application
Transformations
Mapping
map[U](function): DStream[U]
mapPartitions[U](function): DStream[U]
flatMap[U](function): DStream[U]
filter(function): DStream[T]
transform[U](function): DStream[U]
Variation
union(that: DStream[T]): DStream[T]
repartition(numPartitions: Int): DStream[T]
glom(): DStream[Array[T]]
Aggregation
count(): DStream[Long]
countByValue(): DStream[(T, Long)]
reduce(reduceFunc: (T, T) => T): DStream[T]
Key-value
groupByKey(): DStream[(K, Iterable[V])]
reduceByKey(reduceFunc: (V, V) => V): DStream[(K, V)]
combineByKey[C](createCombiner: (V) => C, mergeValue: (C, V) => C, mergeCombiner: (C, C) => C, partitioner: Partitioner): DStream[(K, C)]
join[W](other: DStream[(K, W)]): DStream[(K, (V, W))]
cogroup[W](other: DStream[(K, W)]): DStream[(K, (Iterable[V], Iterable[W]))]
updateStateByKey[S](updateFunc: (Seq[V], Option[S]) => Option[S]): DStream[(K, S)]
Windowing
window(windowDuration: Duration, slideDuration: Duration): DStream[T]
Actions
print(num: Int): Unit
print(): Unit
saveAsObjectFiles(prefix: String): Unit
saveAsTextFiles(prefix: String): Unit
saveAsHadoopFiles[F <: OutputFormat[K, V]](prefix: String, suffix: String): Unit
saveAsNewAPIHadoopFiles[F <: OutputFormat[K, V]](prefix: String, suffix: String): Unit
foreachRDD(foreachFunc: (RDD[T]) => Unit): Unit
Summary
Chapter 4: High-Velocity Streams: Parallelism and Other Stories
One Giant Leap for Streaming Data
Parallelism
Worker
Executor
Choosing the Number of Executors
Dynamic Executor Allocation
Task
Parallelism, Partitions, and Tasks
Task Parallelism
Batch Intervals
Scheduling
Inter-application Scheduling
Batch Scheduling
Inter-job Scheduling
One Action, One Job
Memory
Serialization
Compression
Garbage Collection
Every Day I’m Shuffling
Early Projection and Filtering
Always Use a Combiner
Generous Parallelism
File Consolidation
More Memory
Summary
Chapter 5: Real-Time Route 66: Linking External Data Sources
Smarter Cities, Smarter Planet, Smarter Everything
ReceiverInputDStream
Sockets
MQTT
Flume
Push-Based Flume Ingestion
Pull-Based Flume Ingestion
Kafka
Receiver-Based Kafka Consumer
Direct Kafka Consumer
Twitter
Block Interval
Custom Receiver
HttpInputDStream
Summary
Chapter 6: The Art of Side Effects
Taking Stock of the Stock Market
foreachRDD
Per-Record Connection
Per-Partition Connection
Static Connection
Lazy Static Connection
Static Connection Pool
Scalable Streaming Storage
HBase
Stock Market Dashboard
SparkOnHBase
Cassandra
Spark Cassandra Connector
Global State
Static Variables
updateStateByKey()
Accumulators
External Solutions
Redis
Summary
Chapter 7: Getting Ready for Prime Time
Every Click Counts
Tachyon (Alluxio)
Spark Web UI
Historical Analysis
RESTful Metrics
Logging
External Metrics
System Metrics
Monitoring and Alerting
Summary
Chapter 8: Real-Time ETL and Analytics Magic
The Power of Transaction Data Records
First Streaming Spark SQL Application
SQLContext
Data Frame Creation
Existing RDDs
Dynamic Schemas
Scala Sequence
RDDs with JSON
External Database
Parquet
Hive Table
SQL Execution
Configuration
User-Defined Functions
Catalyst: Query Execution and Optimization
HiveContext
Data Frame
Types
Query Transformations
select(col: String, cols: String*): DataFrame
select(cols: Column*): DataFrame
filter(conditionExpr: String): DataFrame
drop(colName: String): DataFrame
where(condition: Column): DataFrame
limit(n: Int): DataFrame
withColumn(colName: String, col: Column): DataFrame
groupBy(col1: String, cols: String*): GroupedData
agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame
orderBy(sortCol: String, sortCols: String*): DataFrame
rollup(col1: String, cols: String*): GroupedData
cube(col1: String, cols: String*): GroupedData
dropDuplicates(colNames: Seq[String]): DataFrame
sample(withReplacement: Boolean, fraction: Double): DataFrame
except(other: DataFrame): DataFrame
intersect(other: DataFrame): DataFrame
unionAll(other: DataFrame): DataFrame
join(right: DataFrame, joinExprs: Column): DataFrame
na
stats
Actions
format(source: String): DataFrameWriter
save(path: String): Unit
parquet(path: String): Unit
json(path: String): Unit
saveAsTable(tableName: String): Unit
mode(saveMode: SaveMode): DataFrameWriter
partitionBy(colNames: String*): DataFrameWriter
insertInto(tableName: String): Unit
jdbc(url: String, table: String, connectionProperties: Properties): Unit
RDD Operations
Persistence
Best Practices
SparkR
First SparkR Application
Execution
Streaming SparkR
Summary
Chapter 9: Machine Learning at Scale
Sensor Data Storm
Streaming MLlib Application
MLlib
Data Types
Statistical Analysis
Preprocessing
Feature Selection and Extraction
Chi-Square Selection
Principal Component Analysis
Learning Algorithms
Classification
Clustering
Recommendation Systems
Frequent Pattern Mining
Streaming ML Pipeline Application
ML
Cross-Validation of Pipelines
Summary
Chapter 10: Of Clouds, Lambdas, and Pythons
A Good Review Is Worth a Thousand Ads
Google Dataproc
First Spark on Dataproc Application
PySpark
Lambda Architecture
Lambda Architecture using Spark Streaming on Google Cloud Platform
Streaming Graph Analytics
Summary
Index

Publication date (per publisher)	13 June 2016
Additional info	XIX, 230 p. 68 illus., 61 illus. in color.
Place of publication	Berkeley
Language	English
Subject area	Mathematics / Computer Science ► Computer Science ► Databases
Subject area	Mathematics / Computer Science ► Computer Science ► Networks
Keywords	Spark • Spark Streaming • Spark Streaming Application • Spark Streaming R • Spark Streaming SQL • Streaming Machine Learning
ISBN-10	1-4842-1479-X / 148421479X
ISBN-13	978-1-4842-1479-4 / 9781484214794

PDF (watermark)

DRM: Digital watermark
This eBook contains a digital watermark and is thereby personalized for you. If the eBook is improperly passed on to third parties, it can be traced back to the source.

File format: PDF (Portable Document Format)
With its fixed page layout, PDF is particularly suited to technical books with columns, tables, and figures. A PDF can be displayed on almost any device, but is of limited suitability for small displays (smartphone, eReader).

System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You need a PDF viewer, e.g. Adobe Reader or Adobe Digital Editions.
eReader: This eBook can be read on (almost) all eBook readers. It is, however, not compatible with the Amazon Kindle.
Smartphone/Tablet: Whether Apple or Android, you can read this eBook. You need a PDF viewer, e.g. the free Adobe Digital Editions app.

Buying eBooks from abroad
For tax reasons, we can sell eBooks only within Germany and Switzerland. Regrettably, we cannot fulfill eBook orders from other countries.

Print edition

Book | Softcover

CHF 52,40