A comprehensive compilation of new developments in data linkage methodology
The increasing availability of large administrative databases has led to a dramatic rise in the use of data linkage, yet the standard texts on linkage are still those which describe the seminal work from the 1950-60s, with some updates. Linkage and analysis of data across sources remains problematic due to lack of discriminatory and accurate identifiers, missing data and regulatory issues. Recent developments in data linkage methodology have concentrated on bias and analysis of linked data, novel approaches to organising relationships between databases and privacy-preserving linkage.
Methodological Developments in Data Linkage brings together a collection of contributions from members of the international data linkage community, covering cutting edge methodology in this field. It presents opportunities and challenges provided by linkage of large and often complex datasets, including analysis problems, legal and security aspects, models for data access and the development of novel research areas. New methods for handling uncertainty in analysis of linked data, solutions for anonymised linkage and alternative models for data collection are also discussed.
Key Features:
- Presents cutting edge methods for a topic of increasing importance to a wide range of research areas, with applications to data linkage systems internationally
- Covers the essential issues associated with data linkage today
- Includes examples based on real data linkage systems, highlighting the opportunities, successes and challenges that the increasing availability of linkage data provides
- Novel approach incorporates technical aspects of both linkage, management and analysis of linked data
This book will be of core interest to academics, government employees, data holders, data managers, analysts and statisticians who use administrative data. It will also appeal to researchers in a variety of areas, including epidemiology, biostatistics, social statistics, informatics, policy and public health.
Editors:
Katie Harron, London School of Hygiene and Tropical Medicine, UK
Harvey Goldstein, University of Bristol and University College London, UK
Chris Dibben, University of Edinburgh, UK
A comprehensive compilation of new developments in data linkage methodology The increasing availability of large administrative databases has led to a dramatic rise in the use of data linkage, yet the standard texts on linkage are still those which describe the seminal work from the 1950-60s, with some updates. Linkage and analysis of data across sources remains problematic due to lack of discriminatory and accurate identifiers, missing data and regulatory issues. Recent developments in data linkage methodology have concentrated on bias and analysis of linked data, novel approaches to organising relationships between databases and privacy-preserving linkage. Methodological Developments in Data Linkage brings together a collection of contributions from members of the international data linkage community, covering cutting edge methodology in this field. It presents opportunities and challenges provided by linkage of large and often complex datasets, including analysis problems, legal and security aspects, models for data access and the development of novel research areas. New methods for handling uncertainty in analysis of linked data, solutions for anonymised linkage and alternative models for data collection are also discussed. Key Features: Presents cutting edge methods for a topic of increasing importance to a wide range of research areas, with applications to data linkage systems internationally Covers the essential issues associated with data linkage today Includes examples based on real data linkage systems, highlighting the opportunities, successes and challenges that the increasing availability of linkage data provides Novel approach incorporates technical aspects of both linkage, management and analysis of linked data This book will be of core interest to academics, government employees, data holders, data managers, analysts and statisticians who use administrative data. It will also appeal to researchers in a variety of areas, including epidemiology, biostatistics, social statistics, informatics, policy and public health.
Editors: Katie Harron, London School of Hygiene and Tropical Medicine, UK Harvey Goldstein, University of Bristol and University College London, UK Chris Dibben, University of Edinburgh, UK
1
Introduction
Katie Harron1, Harvey Goldstein2,3 and Chris Dibben4
1 London School of Hygiene and Tropical Medicine, London, UK
2 Institute of Child Health, University College London, London, UK
3 Graduate School of Education, University of Bristol, Bristol, UK
4 University of Edinburgh, Edinburgh, UK
1.1 Introduction: data linkage as it exists
The increasing availability of large administrative databases for research has led to a dramatic rise in the use of data linkage. The speed and accuracy of linkage have much improved over recent decades with developments such as string comparators, coding systems and blocking, yet the methods still underpinning most of the linkage performed today were proposed in the 1950s and 1960s. Linkage and analysis of data across sources remain problematic due to lack of identifiers that are totally accurate as well as being discriminatory, missing data and regulatory issues, especially concerned with privacy.
In this context, recent developments in data linkage methodology have concentrated on bias in the analysis of linked data, novel approaches to organising relationships between databases and privacy-preserving linkage. Methodological developments in data linkage bring together a collection of chapters on cutting-edge developments in data linkage methodology, contributed by members of the international data linkage community.
The first section of the book covers the current state of data linkage, methodological issues that are relevant to linkage systems and analyses today and case studies from the United Kingdom, Canada and Australia. In this introduction, we provide a brief background to the development of data linkage methods and introduce common terms. We highlight the most important issues that have emerged in recent years and describe how the remainder of the book attempts to deal with these issues. Chapter 2 summarises the advances in linkage accuracy and speed that have arisen from the traditional probabilistic methods proposed by Fellegi and Sunter. The first section concludes with a description of the data linkage environment as it is today, with case study examples. Chapter 3 describes the opportunities and challenges provided by data linkage, focussing on legal and security aspects and models for data access and linkage.
The middle section of the book focusses on the immediate future of data linkage, in terms of methods that have been developed and tested and can be put into practice today. It concentrates on analysis of linked data and the difficulties associated with linkage uncertainty, highlighting the problems caused by errors that occur in linkage (false matches and missed matches) and the impact that these errors can have on the reliability of results based on linked data. This section of the book discusses two methods for handling linkage error, the first relating to regression analyses and the second to an extension of the standard multiple imputation framework. Chapter 7 presents an alternative data storage solution compared to relational databases that provides significant benefits for linkage.
The final section of the book tackles an aspect of the potential future of data linkage. Ethical considerations relating to data linkage and research based on linked data are a subject of continued debate. Privacy-preserving data linkage attempts to avoid the controversial release of personal identifiers by providing means of linking and performing analysis on encrypted data. This section of the book describes the debate and provides examples.
The establishment of large-scale linkage systems has provided new opportunities for important and innovative research that, until now, have not been possible but that also present unique methodological and organisational challenges. New linkage methods are now emerging that take a different approach to the traditional methods that have underpinned much of the research performed using linked data in recent years, leading to new possibilities in terms of speed, accuracy and transparency of research.
1.2 Background and issues
A statistical definition of data linkage is ‘a merging that brings together information from two or more sources of data with the object of consolidating facts concerning an individual or an event that are not available in any separate record’ (Organisation for Economic Co-operation and Development (OECD)). Data linkage has many different synonyms (record linkage, record matching, re-identification, entity heterogeneity, merge/purge) within various fields of application (computer science, marketing, fraud detection, censuses, bibliographic data, insurance data) (Elmagarmid, Ipeirotis and Verykios, 2007).
The term ‘record linkage’ was first applied to health research in 1946, when Dunn described linkage of vital records from the same individual (birth and death records) and referred to the process as ‘assembling the book of life’ (Dunn, 1946). Dunn emphasised the importance of such linkage to both the individual and health and other organisations. Since then, data linkage has become increasingly important to the research environment.
The development of computerised data linkage meant that valuable information could be combined efficiently and cost-effectively, avoiding the high cost, time and effort associated with setting up new research studies (Newcombe et al., 1959). This led to a large body of research based on enhanced datasets created through linkage. Internationally, large linkage systems of note are the Western Australia Record Linkage System, which links multiple datasets (over 30) for up to 40 years at a population level, and the Manitoba Population-Based Health Information System (Holman et al., 1999; Roos et al., 1995). In the United Kingdom, several large-scale linkage systems have also been developed, including the Scottish Health Informatics Programme (SHIP), the Secure Anonymised Information Linkage (SAIL) Databank and the Clinical Practice Research Datalink (CPRD). As data linkage becomes a more established part of research relating to health and society, there has been an increasing interest in methodological issues associated with creating and analysing linked datasets (Maggi, 2008).
1.3 Data linkage methods
Data linkage brings together information relating to the same individual that is recorded in different files. A set of linked records is created by comparing records, or parts of records, in different files and applying a set of linkage criteria or rules to determine whether or not records belong to the same individual. These rules utilise the values on ‘linking variables’ that are common to each file. The aim of linkage is to determine the true match status of each comparison pair: a match if records belong to the same individual and a non-match if records belong to different individuals.
As the true match status is unknown, linkage criteria are used to assign a link status for each comparison pair: a link if records are classified as belonging to the same individual and a non-link if records are classified as belonging to different individuals.
In a perfect linkage, all matches are classified as links, and all non-matches are classified as non-links. If comparison pairs are misclassified (false matches or missed matches), error is introduced. False matches occur when records from different individuals link erroneously; missed matches occur when records from the same individual fail to link.
1.3.1 Deterministic linkage
In deterministic linkage, a set of predetermined rules are used to classify pairs of records as links and non-links. Typically, deterministic linkage requires exact agreement on a specified set of identifiers or matching variables. For example, two records may be classified as a link if their values of National Insurance number, surname and sex agree exactly. Modifications of strict deterministic linkage include ‘stepwise’ deterministic linkage, which uses a succession of rules; the ‘n−1’ deterministic procedure, which allows a link to be made if all but one of a set of identifiers agree; and ad hoc deterministic procedures, which allow partial identifiers to be combined into a pseudo-identifier (Abrahams and Davy, 2002; Maso, Braga and Franceschi, 2001; Mears et al., 2010). For example, a combination of the first letter of surname, month of birth and postcode area (e.g. H01N19) could form the basis for linkage.
Strict deterministic methods that require identifiers to match exactly often have a high rate of missed matches, as any recording errors or missing values can prevent identifiers from agreeing. Conversely, the rate of false matches is typically low, as the majority of linked pairs are true matches (records are unlikely to agree exactly on a set of identifiers by chance) (Grannis, Overhage and McDonald, 2002). Deterministic linkage is a relatively straightforward and quick linkage method and is useful when records have highly discriminative or unique identifiers that are well completed and accurate. For example, the community health index (CHI) is used for much of the linkage in the Scottish Record Linkage System.
1.3.2 Probabilistic linkage
Newcombe was the first to propose that comparison pairs could be classified using a probabilistic approach (Newcombe et al.,...
| Erscheint lt. Verlag | 22.9.2015 |
|---|---|
| Reihe/Serie | Wiley Series in Probability and Statistics |
| Wiley Series in Probability and Statistics | Wiley Series in Probability and Statistics |
| Sprache | englisch |
| Themenwelt | Informatik ► Datenbanken ► Data Warehouse / Data Mining |
| Mathematik / Informatik ► Mathematik ► Statistik | |
| Mathematik / Informatik ► Mathematik ► Wahrscheinlichkeit / Kombinatorik | |
| Studium ► Querschnittsbereiche ► Epidemiologie / Med. Biometrie | |
| Naturwissenschaften | |
| Technik | |
| Schlagworte | administrative data • Census • confidentiality • Data Analysis • data collection • data linkage • Datenanalyse • Medical Statistics & Epidemiology • Medizinische Statistik u. Epidemiologie • Methoden der Daten- u. Stichprobenerhebung • Privacy preserving linkage • Probabilistic matching • Record Linkage • secondary analysis • Statistics • Statistik • Stichprobe • Survey Research Methods & Sampling • trusted third party |
| ISBN-13 | 9781119072485 / 9781119072485 |
| Informationen gemäß Produktsicherheitsverordnung (GPSR) | |
| Haben Sie eine Frage zum Produkt? |
Kopierschutz: Adobe-DRM
Adobe-DRM ist ein Kopierschutz, der das eBook vor Mißbrauch schützen soll. Dabei wird das eBook bereits beim Download auf Ihre persönliche Adobe-ID autorisiert. Lesen können Sie das eBook dann nur auf den Geräten, welche ebenfalls auf Ihre Adobe-ID registriert sind.
Details zum Adobe-DRM
Dateiformat: EPUB (Electronic Publication)
EPUB ist ein offener Standard für eBooks und eignet sich besonders zur Darstellung von Belletristik und Sachbüchern. Der Fließtext wird dynamisch an die Display- und Schriftgröße angepasst. Auch für mobile Lesegeräte ist EPUB daher gut geeignet.
Systemvoraussetzungen:
PC/Mac: Mit einem PC oder Mac können Sie dieses eBook lesen. Sie benötigen eine
eReader: Dieses eBook kann mit (fast) allen eBook-Readern gelesen werden. Mit dem amazon-Kindle ist es aber nicht kompatibel.
Smartphone/Tablet: Egal ob Apple oder Android, dieses eBook können Sie lesen. Sie benötigen eine
Geräteliste und zusätzliche Hinweise
Buying eBooks from abroad
For tax law reasons we can sell eBooks just within Germany and Switzerland. Regrettably we cannot fulfill eBook-orders from other countries.
aus dem Bereich