Learning PySpark

Build faster data processing applications with Spark 2.3

Tomasz Drabas, Denny Lee (Authors)

Book | Softcover

193 pages

2018 | 2nd edition
Packt Publishing Limited (Publisher)
978-1-78953-884-7 (ISBN)

Build and deploy data-intensive applications at scale using the combined capability of Python and Spark 2.3
About This Book
* Build ETL pipelines with PySpark and Spark MLlib
* Apply Spark Streaming and Spark SQL with Python
* Perform distributed machine learning and work with Gradient Boosted Trees and Random Forests
Who This Book Is For
Learning PySpark is for big data professionals and data scientists who want to accelerate their data tasks and deliver real-time data analytics. This book is also a good starting point for Python programmers who want to enter the data analytics field and get up and running with Apache Spark and its Python interface.
What You Will Learn
* Get to grips with Apache Spark and the Spark 2.3 architecture
* Build and interact with Spark DataFrames using Spark SQL
* Solve graph and deep learning problems using GraphFrames and TensorFrames respectively
* Read, transform, and understand data, and use it to train machine learning models
* Build machine learning models with MLlib and ML
* Submit your applications using the spark-submit command
* Deploy locally built applications to a cluster
* Run Spark on AWS, Azure, Google Cloud Platform
In Detail
Apache Spark is an open source analytics engine for big data processing applications, with built-in modules for streaming, SQL, machine learning, and graph processing. This second edition of Learning PySpark teaches you how to use the PySpark API effectively to handle big data processing and live streaming applications.
To start with, you'll discover how to use Apache Spark capabilities without learning Scala or Java, and execute simple batch and real-time stream processing tasks. The book focuses on performing machine learning tasks using the PySpark API. You'll explore the latest features of PySpark 2.3 and then look at the challenges faced in building real-time data processing applications.
The book also teaches you how to leverage the benefits of Spark DataFrames and address your day-to-day big data problems. You'll also find practical coverage of other Python libraries, such as NumPy, pandas, and Matplotlib, applied in streaming applications.
By the end of this book, you will have established a firm understanding of the Spark Python API and how it can be used to build data-intensive applications.

Tomasz Drabas is a Data Scientist working for Microsoft and currently residing in the Seattle area. He has over 12 years' international experience in data analytics and data science in numerous fields: advanced technology, airlines, telecommunications, finance, and consulting. Tomasz started his career in 2003 with LOT Polish Airlines in Warsaw, Poland, while finishing his Master's degree in strategy management. In 2007, he moved to Sydney to pursue a doctoral degree in operations research at the University of New South Wales, School of Aviation; his research crossed boundaries between discrete choice modeling and airline operations research. During his time in Sydney, he worked as a Data Analyst for Beyond Analysis Australia and as a Senior Data Analyst/Data Scientist for Vodafone Hutchison Australia, among others. He has also published scientific papers, attended international conferences, and served as a reviewer for scientific journals. In 2015, he relocated to Seattle to begin his work for Microsoft. While there, he has worked on numerous projects involving solving problems in high-dimensional feature space.

Denny Lee is a Principal Program Manager at Microsoft for the Azure DocumentDB team (Microsoft's blazing-fast, planet-scale managed document store service). He is a hands-on distributed systems and data science engineer with more than 18 years of experience developing Internet-scale infrastructure, data platforms, and predictive analytics systems for both on-premise and cloud environments. He has extensive experience building greenfield teams and acting as a turnaround and change catalyst. Prior to joining the Azure DocumentDB team, Denny worked as a Technology Evangelist at Databricks; he has been working with Apache Spark since 0.5. He was also the Senior Director of Data Sciences Engineering at Concur, and was on the incubation team that built Microsoft's Hadoop on Windows and Azure service (currently known as HDInsight). Denny also has a Master's in Biomedical Informatics from Oregon Health & Science University and has architected and implemented powerful data solutions for enterprise healthcare customers for the last 15 years.

Publication date	25.02.2021
Place of publication	Birmingham
Language	English
Dimensions	191 x 235 mm
Subject area	Mathematics / Computer Science ► Computer Science ► Databases
	Computer Science ► Graphics / Design ► Digital Image Processing
	Computer Science ► Theory / Studies ► Artificial Intelligence / Robotics
ISBN-10	1-78953-884-X / 178953884X
ISBN-13	978-1-78953-884-7 / 9781789538847
Condition	New