Concept: Record Linkage / Data Linkage
Last Updated: 2006-04-06
Record linkage, or data linkage, is simply the integration of information from two independent sources. Records from the two sources that are believed to relate to the same individual are matched in such a way that they may then be treated as a single record for that individual. Records brought together in this way are said to be linked.
The principles of record linkage may be applied to any field in which it is necessary to bring together information recorded about persons in different places or at different times.
Record linkage techniques can help identify the same patient in various types of files that contain similar identifiers: hospital discharge abstracts, insurance claims, registries, and Vital Statistics data. Record linkage facilitates cross-checks between separately maintained datasets to highlight quality problems, and assists in the use of administrative records for research purposes.
Records are linked on the basis of common identification data. The data used may vary depending on the purposes of the linkage and the information available for linkage. For example, identification data frequently used in medical record linkage include birth date, death date, marital status, sex, place of residence, diagnosis, surgical procedure, date of hospital admission and/or discharge, and health insurance number.
When a unique identification variable (e.g., Personal Health Identification Number) is present on both of the files involved in the linkage, the process of linking is relatively simple. Such linkages involve matching records from the two data sets on the basis of the unique identifier.
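Linkage on a unique identifier amounts to an exact-match join. The following Python sketch illustrates the idea; the function name, the field name `phin`, and the sample records are hypothetical, and it assumes the identifier really is unique within the second file.

```python
def link_on_unique_id(file_a, file_b, key):
    """Join two lists of records (dicts) on a shared unique identifier."""
    # Index the second file by the identifier (assumed unique in file_b).
    index = {rec[key]: rec for rec in file_b}
    linked = []
    for rec in file_a:
        match = index.get(rec[key])
        if match is not None:
            # Treat the pair as a single record; on a field-name clash,
            # the value from file_a is kept.
            linked.append({**match, **rec})
    return linked


# Hypothetical sample data.
claims = [{"phin": "100", "dx": "I10"}, {"phin": "200", "dx": "E11"}]
registry = [{"phin": "100", "birth_year": 1950}, {"phin": "300", "birth_year": 1971}]
linked_records = link_on_unique_id(claims, registry, "phin")
```

Only the record with identifier 100 appears in both files, so only that record is linked.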
However, such unique identifiers do not always exist in both files. It is therefore necessary to use identifying variables such as surname, given name, date of birth, etc. in order to link records from the two sources. In many cases, such identifying characteristics may not be unique to a particular individual. They may change over time, they may have been recorded incorrectly, or they may be missing in certain records. Thus, the identifying characteristics from the two sources must be compared to estimate the likelihood that a potential link is a "true" one.
Record linkage has become increasingly useful in health care administration, demographic studies, the provision of health statistics, and in medical research. It has enabled health care administrators and researchers to:
Maintain a comprehensive database.
Linkages between administrative data sources have provided the basis for the Manitoba Health database, a comprehensive, multi-file database which permits many different types of health-related research.
Link to local statistical files.
Linkages involving cancer registries, AIDS registries, and mortality data have been used to study mortality from these diseases. Such linkages can also provide the beginnings for a comprehensive database maintained locally. As more information becomes available, linkage assists in expanding and correcting the database.
Link to survey data.
Survey information linked with insurance claims has been used to provide a more accurate picture of the aging process by studying the relationships between functional status, self-reported health status, and the utilization of health care services.
- More information on linking to survey data, based on work with NPHS data, is available in Linkage of Large Survey Data to Other Databases (internal access only).
Link to clinical data.
Linking clinical data with hospital claims and Vital Statistics information has been used to produce a rich dataset on patients' pre-operative status and post-operative outcomes, allowing for the evaluation of a wide range of procedures. Links with clinical information have also proved important in establishing the validity of research which relies on claims data.
Link to social data.
Linking social and health data will produce new perspectives on the relationship between health and social variables.
Before you link, it is important to make sure that the variables to be used as linkage keys are formatted in the same way on each file. Differences in capitalization, justification, and leading zeroes can all make the same value look different to a linkage program.
Names present a special problem, as the same name can be represented in many different ways. Alternate spellings, initials, abbreviations, shortened forms of names, changes in last name due to marriage, and people going by their middle name instead of their 'real' first name can all make linkage difficult.
Some simple strategies for standardizing name fields which proved useful in this project:
- Convert all names to uppercase.
- Remove the period after St. for names starting with 'St.'
- Take the first space-delimited value in each name field. If the name is compound (de Braun, van Dyck, van den Bergh, etc.), take all components and remove spaces. van den Bergh becomes VANDENBERGH.
If done to both files, this can eliminate a lot of mismatch due to formatting.
Soundex coding: Can be useful to identify 'close' matches which fail due to variant spellings of names. The Soundex algorithm associates numbers with different groups of consonants, producing a numeric code following the initial letter which is robust to variations in names that sound alike. For example, Anthony and Antony both have the Soundex code A535, but the slight difference in spelling is enough to cause a linkage to fail when comparing the actual names.
Note: Soundex does not help when the variants do not sound alike, or start with different letters (e.g. Bill has code B4, and William has W45). This may be handled by tokenization, that is, converting all variants of a name to a token representing that name. All variants of Bill, William, Will, Willy, etc. could be converted to the same token (e.g. BILL). This should be used as a last-ditch effort for the truly hard-to-link cases rather than as an initial strategy, since an exact match on name is stronger than a match on token alone.
Prototyping: This is simply developing the program on a small sample of data before running the entire linkage. Usually the size of the files in a typical linkage project makes development on the full files prohibitively expensive, computationally speaking. Anything you can do to reduce the turnaround time of trying a new strategy, running the program, looking at the results, and trying something else is well worth the effort. The time you spend waiting for the program to run is better spent interpreting the results and fine-tuning the linkage. Working on a sample has two benefits:
- Faster turnaround time
- Increases likelihood of trying different strategies
When prototyping, make sure the sub-sample you choose is representative. The whole point is to get an idea of what will work in the actual data. If your sample is not representative, then your strategy may be sub-optimal when applied to the whole file.
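The name standardization steps and the Soundex coding described above can be sketched in Python as follows. This is a minimal illustration, not the program used in the project. The function names, the prefix list `COMPOUND_PREFIXES` (including the choice to treat ST as a compound prefix), and the unpadded Soundex variant, chosen so that the codes match those quoted above (A535, B4, W45), are all assumptions.

```python
# Hypothetical list of prefixes that mark a compound surname.
COMPOUND_PREFIXES = {"DE", "VAN", "DEN", "DER", "ST"}

# Consonant groups and the digit assigned to each.
SOUNDEX_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        SOUNDEX_CODES[ch] = digit


def normalize_name(raw):
    """Uppercase, drop the period after ST., keep the first token
    (or all components of a compound name, with spaces removed)."""
    name = raw.upper().strip()
    if name.startswith("ST."):
        name = "ST " + name[3:]
    parts = name.split()
    if not parts:
        return ""
    kept = [parts[0]]
    i = 1
    # Keep appending components while the last one kept is a prefix.
    while i < len(parts) and kept[-1] in COMPOUND_PREFIXES:
        kept.append(parts[i])
        i += 1
    return "".join(kept)


def soundex(name):
    """Unpadded Soundex: first letter, then digits for coded consonants.
    Uncoded letters are skipped; adjacent letters with the same code
    contribute a single digit."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code
    return result
```

With these definitions, `normalize_name("van den Bergh")` returns `"VANDENBERGH"`, and `soundex("Anthony")` and `soundex("Antony")` both return `"A535"`, while `soundex("Bill")` and `soundex("William")` differ.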
A match in deterministic linkage is made when a sufficient number of identifiers agree between two records. In the simplest and most restrictive case, all identifiers are required to agree. More flexible rules can be used which allow some pre-defined subset of identifiers to "determine" a link, e.g., "match on at least three of five identifiers" or "match on MHSIP number, sex, and two of birth year, birth month and first initial".
See Example SAS® Merge below.
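Separately from the SAS example referenced above, a rule such as "match on at least three of five identifiers" can be sketched in Python. The function names, field names, and sample records below are hypothetical, and every record in one file is compared with every record in the other, which is only practical for small files.

```python
def count_agreements(a, b, fields):
    """Number of identifiers on which two records agree (missing never agrees)."""
    return sum(1 for f in fields
               if a.get(f) is not None and a.get(f) == b.get(f))


def deterministic_links(file_a, file_b, fields, min_agree):
    """Return index pairs (i, j) whose records agree on at least min_agree fields."""
    links = []
    for i, a in enumerate(file_a):
        for j, b in enumerate(file_b):
            if count_agreements(a, b, fields) >= min_agree:
                links.append((i, j))
    return links


# Hypothetical sample data with five identifiers.
FIELDS = ["surname", "initial", "sex", "birth_year", "birth_month"]
file_a = [{"surname": "SMITH", "initial": "J", "sex": "M",
           "birth_year": 1950, "birth_month": 3}]
file_b = [{"surname": "SMYTH", "initial": "J", "sex": "M",
           "birth_year": 1950, "birth_month": 3},
          {"surname": "SMITH", "initial": "A", "sex": "F",
           "birth_year": 1962, "birth_month": 3}]

three_of_five = deterministic_links(file_a, file_b, FIELDS, min_agree=3)
all_five = deterministic_links(file_a, file_b, FIELDS, min_agree=5)
```

Here the first pair agrees on four identifiers and is linked under the three-of-five rule, but the misspelled surname prevents a link when all five are required.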
One of the major limitations of deterministic linkage is that it considers each identifier to be of equal quality. Agreement on one identifier provides no stronger evidence for a link than agreement on any of the others. Consequently, it is impossible to resolve ties, which occur when one record matches with two (or more) others on the same number of identifiers.
In practice, identifiers differ in the amount of information they contain about an individual. Also, real data often contain missing or incorrect values, with some identifiers coded more reliably than others. A single miscoded value can cause a link to fail, even if the evidence for a link based on other identifiers is perfect.
One way to take this difference into account is probabilistic linkage. Here we are not only concerned with how many identifiers match, but also which ones. A match on three strong identifiers will be taken over a match on three weaker ones, whereas in deterministic linkage this would have resulted in a tie.
The strength of an identifier is most commonly measured by calculating the amount of information conveyed by the values of the variable. Variables with many potential values, such as birth day or month, usually contain more information than ones with few, such as sex. It is much less likely, for example, that two records selected at random will have the same birthday than the same sex. A match on day of birth, then, is considered stronger evidence for a link than a match on sex, because of the much higher probability of the match on sex being due entirely to chance.
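The comparison between sex and birth month can be made concrete with a small Python sketch. It assumes that two records are drawn independently, so that the probability of a chance agreement is the sum of the squared relative frequencies of the values; expressing this as bits via a base-2 logarithm is one possible measure, and the function names and sample data are hypothetical.

```python
import math
from collections import Counter


def chance_agreement(values):
    """Probability that two independently drawn records share the same value."""
    counts = Counter(values)
    n = len(values)
    return sum((c / n) ** 2 for c in counts.values())


def information_bits(values):
    """Information conveyed by an agreement, in bits (higher means stronger)."""
    return -math.log2(chance_agreement(values))


# Hypothetical, evenly distributed sample data.
sex = ["F", "M"] * 50
birth_month = list(range(1, 13)) * 10

sex_bits = information_bits(sex)            # two equally likely values
month_bits = information_bits(birth_month)  # twelve equally likely values
```

With evenly distributed values, agreement on sex happens by chance half the time (1 bit), whereas agreement on birth month happens by chance one time in twelve (about 3.58 bits).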
Depending on the type of comparison, probabilistic weights may be either non-specific or value-specific.
General (non-specific) weights are based on the agreement/disagreement of a specific field. For example, using general weights, agreement on birth date may be given a weight of 3.5, while disagreement is given a weight of -2.7.
Value-specific weights are based on the agreement of a specific value of the field being compared. For example, if comparing initials using value-specific weights, a match of the initial B will receive a different weight than a match of the initial A. In general, rare agreements carry higher weights.
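A rough Python sketch of value-specific weights follows. It uses a common simplification that is an assumption here, not a method stated in this concept: the weight for agreement on a value is taken as log2 of the inverse of that value's relative frequency in the file. The function name and the sample initials are hypothetical.

```python
import math
from collections import Counter


def value_specific_weights(values):
    """Agreement weight per value: log2(1 / relative frequency of the value)."""
    counts = Counter(values)
    n = len(values)
    return {value: math.log2(n / count) for value, count in counts.items()}


# Hypothetical sample: the initial A is common, the initial B is rare.
initials = ["A"] * 8 + ["B"] * 2
initial_weights = value_specific_weights(initials)
```

In this sample, agreement on the rare initial B receives a higher weight than agreement on the common initial A.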
Weights for probabilistic matching are computed from log2 of the odds, or frequency ratio, calculated for each variable:

$$\text{weight} = \log_2 \left( \frac{\text{OUTCOME frequency in LINKED pairs}}{\text{OUTCOME frequency in UNLINKABLE pairs}} \right)$$
The OUTCOMES are defined by the user; they can be agreements or disagreements, partial agreements, or value-specific agreements. Because the OUTCOME frequencies are not known in advance, they are estimated from samples, taken from comparable studies, or based on the frequency distribution of the OUTCOMES.
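The weight formula can be applied directly once the two OUTCOME frequencies have been estimated. The Python sketch below uses hypothetical frequencies for a single field: agreement is assumed to occur in 90% of linked pairs and 10% of unlinkable pairs, so disagreement occurs in 10% and 90% respectively. The function name is also an assumption.

```python
import math


def outcome_weight(freq_linked, freq_unlinkable):
    """log2 of the ratio of OUTCOME frequencies in linked vs. unlinkable pairs."""
    if freq_linked <= 0 or freq_unlinkable <= 0:
        raise ValueError("frequencies must be positive")
    return math.log2(freq_linked / freq_unlinkable)


# Hypothetical estimates for one field.
agree_linked, agree_unlinkable = 0.9, 0.1

agreement_weight = outcome_weight(agree_linked, agree_unlinkable)
disagreement_weight = outcome_weight(1 - agree_linked, 1 - agree_unlinkable)
```

With these figures the agreement weight is positive (about 3.17) and the disagreement weight is negative (about -3.17), the same pattern as the birth date weights of 3.5 and -2.7 quoted earlier.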
Separation: Once the strength of each identifier is calculated, each pair of records can be given a linkage score, which measures the probability that those two records refer to the same person. Upper and lower threshold scores are defined for either accepting or rejecting paired records. The records in between the thresholds must be reviewed manually to separate the links from the non-links.
Ties are much less common in probabilistic linkage because records would have to agree and disagree on exactly the same identifiers, not just the same number.
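Scoring and separation can be sketched in Python as follows. The sketch assumes that the linkage score of a pair is the sum of the agreement or disagreement weights over all compared fields, and that a missing value contributes nothing. Apart from the birth date weights quoted earlier, the weights, thresholds, field names, and function names are hypothetical.

```python
def linkage_score(a, b, weights):
    """Sum the agreement or disagreement weight for each compared field."""
    score = 0.0
    for field, (agree_w, disagree_w) in weights.items():
        va, vb = a.get(field), b.get(field)
        if va is None or vb is None:
            continue  # assumption: a missing value adds no evidence either way
        score += agree_w if va == vb else disagree_w
    return score


def classify(score, lower, upper):
    """Accept, reject, or send a pair to manual review based on two thresholds."""
    if score >= upper:
        return "link"
    if score <= lower:
        return "non-link"
    return "manual review"


# (agreement weight, disagreement weight) per field; surname and sex are hypothetical.
WEIGHTS = {"birth_date": (3.5, -2.7), "sex": (1.0, -3.0), "surname": (6.0, -4.0)}
LOWER, UPPER = 0.0, 8.0  # hypothetical thresholds

rec = {"birth_date": "1950-03-01", "sex": "M", "surname": "SMITH"}
same = dict(rec)
variant = dict(rec, surname="SMYTH")
other = {"birth_date": "1962-07-15", "sex": "F", "surname": "JONES"}

results = [classify(linkage_score(rec, r, WEIGHTS), LOWER, UPPER)
           for r in (same, variant, other)]
```

The identical record scores above the upper threshold, the record that differs only on surname falls between the thresholds and goes to manual review, and the unrelated record is rejected.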
Manitoba Centre for Health Policy
Community Health Sciences, Max Rady College of Medicine,
Rady Faculty of Health Sciences,
Room 408-727 McDermot Ave.
University of Manitoba
Winnipeg, MB R3E 3P5 Canada