
Automated Data Collection with R (eBook)

A Practical Guide to Web Scraping and Text Mining
eBook Download: EPUB
2014
John Wiley & Sons (publisher)
978-1-118-83480-0 (ISBN)


Automated Data Collection with R - Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis
€ 60.99 incl. VAT (CHF 59.55)
eBook sales are handled by Lehmanns Media GmbH (Berlin) at the price in euros incl. VAT.
  • Download available immediately

A hands-on guide to web scraping and text mining for both beginners and experienced users of R

  • Introduces fundamental concepts of the architecture of the web and of databases, covering HTTP, HTML, XML, JSON, and SQL.
  • Provides basic techniques to query web documents and data sets (XPath and regular expressions).
  • An extensive set of exercises is presented to guide the reader through each technique.
  • Explores both supervised and unsupervised techniques as well as advanced techniques such as data scraping and text management.
  • Case studies are featured throughout, along with examples for each technique presented.
  • R code and solutions to the exercises featured in the book are provided on a supporting website.
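As a taste of the techniques named in the second bullet, here is a minimal sketch, using only base R and an invented HTML snippet, of how regular expressions can pull structured data out of a web document. The snippet, URLs, and variable names are purely illustrative and not taken from the book:

```r
# Minimal sketch: extract the href targets from a small HTML snippet
# using base R regular expressions only (no additional packages).
html <- '<ul>
  <li><a href="http://example.org/a.html">First</a></li>
  <li><a href="http://example.org/b.html">Second</a></li>
</ul>'

# Find every href="..." attribute, then strip the surrounding syntax
matches <- regmatches(html, gregexpr('href="[^"]+"', html))[[1]]
links   <- gsub('^href="|"$', "", matches)
print(links)
```

On real pages, an XPath query against a parsed document tree is generally more robust than pattern matching on raw markup; this sketch only illustrates the regular-expression idea on a toy input.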


Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis are the authors of Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining, published by Wiley.




Preface xv

1 Introduction 1

1.1 Case study: World Heritage Sites in Danger 1

1.2 Some remarks on web data quality 7

1.3 Technologies for disseminating, extracting, and storing web data 9

1.4 Structure of the book 13

Part One A Primer on Web and Data Technologies 15

2 HTML 17

2.1 Browser presentation and source code 18

2.2 Syntax rules 19

2.3 Tags and attributes 24

2.4 Parsing 32

3 XML and JSON 41

3.1 A short example XML document 42

3.2 XML syntax rules 43

3.3 When is an XML document well formed or valid? 51

3.4 XML extensions and technologies 53

3.5 XML and R in practice 60

3.6 A short example JSON document 68

3.7 JSON syntax rules 69

3.8 JSON and R in practice 71

4 XPath 79

4.1 XPath: a query language for web documents 80

4.2 Identifying node sets with XPath 81

4.3 Extracting node elements 93

5 HTTP 101

5.1 HTTP fundamentals 102

5.2 Advanced features of HTTP 116

5.3 Protocols beyond HTTP 124

5.4 HTTP in action 126

6 AJAX 149

6.1 JavaScript 150

6.2 XHR 154

6.3 Exploring AJAX with Web Developer Tools 158

7 SQL and relational databases 164

7.1 Overview and terminology 165

7.2 Relational databases 167

7.3 SQL: a language to communicate with databases 175

7.4 Databases in action 188

8 Regular expressions and essential string functions 196

8.1 Regular expressions 198

8.2 String processing 207

8.3 A word on character encodings 214

Part Two A Practical Toolbox for Web Scraping and Text Mining 219

9 Scraping the Web 221

9.1 Retrieval scenarios 222

9.2 Extraction strategies 270

9.3 Web scraping: Good practice 278

9.4 Valuable sources of inspiration 290

10 Statistical text processing 295

10.1 The running example: Classifying press releases of the British government 296

10.2 Processing textual data 298

10.3 Supervised learning techniques 307

10.4 Unsupervised learning techniques 313

11 Managing data projects 322

11.1 Interacting with the file system 322

11.2 Processing multiple documents/links 323

11.3 Organizing scraping procedures 328

11.4 Executing R scripts on a regular basis 334

Part Three A Bag of Case Studies 341

12 Collaboration networks in the US Senate 343

12.1 Information on the bills 344

12.2 Information on the senators 350

12.3 Analyzing the network structure 353

12.4 Conclusion 358

13 Parsing information from semistructured documents 359

13.1 Downloading data from the FTP server 360

13.2 Parsing semistructured text data 361

13.3 Visualizing station and temperature data 368

14 Predicting the 2014 Academy Awards using Twitter 371

15 Mapping the geographic distribution of names 380

15.1 Developing a data collection strategy 381

15.2 Website inspection 382

15.3 Data retrieval and information extraction 384

15.4 Mapping names 387

15.5 Automating the process 389

16 Gathering data on mobile phones 396

16.1 Page exploration 396

16.2 Scraping procedure 404

16.3 Graphical analysis 406

16.4 Data storage 408

17 Analyzing sentiments of product reviews 416

17.1 Introduction 416

17.2 Collecting the data 417

17.3 Analyzing the data 426

17.4 Conclusion 434

References 435

General index 442

Package index 448

Function index 449

Preface


The rapid growth of the World Wide Web over the past two decades tremendously changed the way we share, collect, and publish data. Firms, public institutions, and private users provide every imaginable type of information and new channels of communication generate vast amounts of data on human behavior. What was once a fundamental problem for the social sciences—the scarcity and inaccessibility of observations—is quickly turning into an abundance of data. This turn of events does not come without problems. For example, traditional techniques for collecting and analyzing data may no longer suffice to overcome the tangled masses of data. One consequence of the need to make sense of such data has been the inception of “data scientists,” who sift through data and are greatly sought after by researchers and businesses alike.

Along with the triumphant entry of the World Wide Web, we have witnessed a second trend, the increasing popularity and power of open-source software like R. For quantitative social scientists, R is among the most important statistical software. It is growing rapidly due to an active community that constantly publishes new packages. Yet, R is more than a free statistics suite. It also incorporates interfaces to many other programming languages and software solutions, thus greatly simplifying work with data from various sources.

On a personal note, we can say the following about our work with social scientific data:

  • our financial resources are sparse;
  • we have little time or desire to collect data by hand;
  • we are interested in working with up-to-date, high quality, and data-rich sources; and
  • we want to document our research from the beginning (data collection) to the end (publication), so that it can be reproduced.

In the past, we frequently found ourselves inconvenienced by the need to manually assemble data from various sources, hoping that the inevitable coding and copy-and-paste errors were unsystematic. Eventually we grew weary of collecting research data in a non-reproducible manner that is prone to errors, cumbersome, and subject to heightened risks of death by boredom. Consequently, we have increasingly incorporated the data collection and publication processes into our familiar software environment that already helps with statistical analyses—R. The program offers a great infrastructure to expand the daily workflow to the steps before and after the actual data analysis.

Although R is not about to collect survey data on its own or conduct experiments any time soon, we do consider the techniques presented in this book as more than the “poor man's substitute” for costly surveys, experiments, and student-assistant coders. We believe that they are a powerful supplement to the portfolio of modern data analysts. We value the collection of data from online resources not only as a more cost-sensitive solution compared to traditional data acquisition methods, but increasingly think of it as the exclusive approach to assemble datasets from new and developing sources. Moreover, we cherish program-based solutions because they guarantee reliability, reproducibility, time-efficiency, and assembly of higher-quality datasets. Beyond productivity, you might find that you enjoy writing code and drafting algorithmic solutions to otherwise tedious manual labor. In short, we are convinced that if you are willing to make the investment and adopt the techniques proposed in this book, you will benefit from a lasting improvement in the ease and quality with which you conduct your data analyses.

If you have identified online data as an appropriate resource for your project, is web scraping or statistical text processing, and therefore an automated or semi-automated data collection procedure, really necessary? While we cannot hope to offer any definitive guidelines, here are some useful criteria. If you find yourself answering several of these affirmatively, an automated approach might be the right choice:

  • Do you plan to repeat the task from time to time, for example, in order to update your database?
  • Do you want others to be able to replicate your data collection process?
  • Do you deal with online sources of data frequently?
  • Is the task non-trivial in terms of scope and complexity?
  • If the task can also be accomplished manually—do you lack the resources to let others do the work?
  • Are you willing to automate processes by means of programming?

Ideally, the techniques presented in this book enable you to create powerful collections of existing but unstructured or unsorted data that no one has analyzed before, at very reasonable cost. In many cases, you will not get far without rethinking, refining, and combining the proposed techniques to fit the specifics of your subject. In any case, we hope you find the topics of this book inspiring and perhaps even eye-opening: The streets of the Web are paved with data that cannot wait to be collected.

What you won't learn from reading this book


When you browse the table of contents, you get a first impression of what you can expect to learn from reading this book. As it is hard to identify parts that you might have hoped for but that are in fact not covered in this book, we will name some aspects that you will not find in this volume.

What you will not get in this book is an introduction to the R environment. There are plenty of excellent introductions—both printed and online—and this book won't be just another addition to the pile. In case you have not previously worked with R, there is no reason to set this book aside in disappointment. In the next section we'll suggest some well-written R introductions.

You should also not expect the definitive guide to web scraping or text mining. First, we focus on a software environment that was not specifically tailored to these purposes. There might be applications where R is not the ideal solution for your task and other software solutions might be more suited. We will not bother you with alternative environments such as PHP, Python, Ruby, or Perl. To find out if this book is helpful for you, you should ask yourself whether you are already using or planning to use R for your daily work. If the answer to both questions is no, you should probably consider your alternatives. But if you already use R or intend to use it, you can spare yourself the effort to learn yet another language and stay within a familiar environment.

This book is not, strictly speaking, about data science either. There are excellent introductions to the topic, like the recently published books by O'Neil and Schutt (2013), Torgo (2010), Zhao (2012), and Zumel and Mount (2014). What is occasionally missing in these introductions is how data for data science applications are actually acquired. In this sense, our book serves as a preparatory step for data analyses but also provides guidance on how to manage available information and keep it up to date.

Finally, what you most certainly will not get is the perfect solution to your specific problem. It is almost inherent in the data collection process that the fields where the data are harvested are never exactly alike, and sometimes rapidly change shape. Our goal is to enable you to adapt the pieces of code provided in the examples and case studies to create new pieces of code to help you succeed in collecting the data you need.

Why R?


There are many reasons why we think that R is a good solution for the problems that are covered in this book. To us, the most important points are:

  1. R is freely and easily accessible. You can download, install, and use it wherever and whenever you want. There are huge benefits to not being a specialist in expensive proprietary programs, as you do not depend on the willingness of employers to pay licensing fees.
  2. For a software environment with a primarily statistical focus, R has a large community that continues to flourish. R is used in a variety of disciplines by social scientists, medical scientists, psychologists, biologists, geographers, and linguists, and also in business. This range allows you to share code with many developers and profit from well-documented applications in diverse settings.
  3. R is open source. This means that you can easily retrace how functions work and modify them with little effort. It also means that program modifications are not controlled by an exclusive team of programmers that takes care of the product. Even if you are not interested in contributing to the development of R, you will still reap the benefits from having access to a wide variety of optional extensions—packages. The number of packages is continuously growing and many existing packages are frequently updated. You can find nice overviews of popular themes in R usage on http://cran.r-project.org/web/views/.
  4. R is reasonably fast in ordinary tasks. You will likely agree with this impression if you have used other statistical software like SPSS or Stata and have gotten into the habit of going on holiday when running more complex models—not to mention the pain that is caused by the “one session, one data frame” logic. There are even extensions to speed up...

Publication date (per publisher): 18.12.2014
Language: English
ISBN-10 1-118-83480-1 / 1118834801
ISBN-13 978-1-118-83480-0 / 9781118834800
