Concept: LINKS: A Record Linkage Package
Last Updated: 2002-03-26
1. SAS® compatibility: The LINKS programs are written in SAS Macro language and can be used on any mainframe, workstation, or microcomputer which has the SAS system installed. This permits drawing on capabilities already present in SAS. Users who already know SAS will find the package especially easy to learn. The SAS technical support, training and documentation are already available, reducing the need for additional software support and specialized training.
2. Portability: LINKS is easily transferred onto a floppy disk or CD for uploading onto different platforms.
3. Flexibility: LINKS provides tools for both deterministic and probabilistic matching. Files of high-quality data with many variables are information rich: more than enough information is available with which to carry out the linkage. Such files may be linked deterministically (i.e. linking records if they agree uniquely on all or almost all variables). On the other hand, when linking datasets with numerous coding errors or few variables, probabilistic methods are likely more appropriate.
4. Safeguards for privacy: Patient confidentiality can be assured by not using names or addresses in the linkage process. Identification numbers and coded values make it nearly impossible for a breach in confidentiality to occur due to a misplaced printout.
5. Cost effectiveness: Computing time and researchers' efforts are used efficiently, thanks to LINKS modules which provide information on the structure of each dataset, the importance of each variable, and the feasibility of linkage. Researchers can assess the amount of information present before deciding whether to proceed with linkage, or to collect more information to improve linkage quality. This feedback allows the selection of a linkage strategy best suited to the information present in each dataset.
1. TESTPW: Prescreens the datasets and tells the user which variables would be most useful for the linkage.
2. TESTPK: Assists in finding combinations of variables that uniquely identify each individual.
The final module carries out the linkage:
3. LINKEX: Performs deterministic or probabilistic linkage using selected variables.
A graphical illustration of the linkage procedure is available here: Linkage Procedure
1. Prepare the SAS datasets to be used as input. Each of the input files should contain all relevant variables for linkage. Variable names should conform to standard SAS naming conventions.
2. Run TESTPW and TESTPK to assess the amount of information available and to obtain a preliminary idea of data structure and quality.
3. Decide on a linkage strategy based on the results of TESTPW and TESTPK. Determine which variables are most appropriate for linkage and which combination of variables will uniquely, or nearly uniquely, identify individuals.
4. Run the LINKEX module to carry out the linkage. The printed output shows the results of the deterministic and probabilistic linkage stages, as well as the calculation of weights. This may suggest a change in strategy, and the modified linkage can be rerun.
5. After satisfactory results are obtained, study the linked and unlinked records and report the results.
The TESTPW module prescreens the datasets to be used in the linkage and indicates which variables are likely to be most useful for linkage.
TESTPW produces an output table which lists the frequency of all listed variables, the number of missing values, and two measures of the relative usefulness of each variable: Discriminating Power and Shannon Entropy. Larger values are better on both of these scores.
The Discriminating Power and Shannon Entropy statistics provide measures of how useful each variable is in distinguishing two individuals. They are related to the amount of information encoded by each variable, which can be thought of as the variable's unpredictability. A variable which has the same value for every person has, both formally and intuitively, an information content of zero; it is of no use in distinguishing two individuals. At the other end of the spectrum, a variable which has a different value for every person is perfectly able to distinguish any two records in the file.
Variables used in record linkage fall between these two extremes, having some but not perfect power to distinguish two individuals. Generally, variables with a large number of potential values, such as birth date, are better than those with few, such as sex. The distribution of the values is also important; if 90% of the records have the same birth date then that likely will not be a very useful variable.
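The idea of information content can be made concrete with a small sketch. The code below computes the standard Shannon entropy (in bits) of a column of values. It is an illustration only: the function name `shannon_entropy` and the sample columns are hypothetical, and the exact formula, logarithm base, and handling of missing values used by TESTPW are not stated here and may differ. Discriminating Power is not computed because its definition is not given in this text.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Entropy in bits of the observed values; missing (None) values are skipped."""
    observed = [v for v in values if v is not None]
    if not observed:
        return 0.0
    counts = Counter(observed)
    total = len(observed)
    # Sum of p * log2(1/p) over the distinct values, with p = count / total.
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Hypothetical columns illustrating the cases discussed above.
constant = ["A"] * 8                              # same value for everyone
sex = ["M", "F"] * 4                              # two equally common values
unique_id = list(range(8))                        # different value for everyone
skewed = ["1950-01-01"] * 9 + ["1961-07-04"]      # 90% share one value

for name, column in [("constant", constant), ("sex", sex),
                     ("unique_id", unique_id), ("skewed", skewed)]:
    print(f"{name:10s} {shannon_entropy(column):.3f} bits")
```

In this sketch the constant column scores 0 bits, the balanced two-value column 1 bit, and the column with eight distinct values 3 bits. The skewed column scores about 0.47 bits even though it has as many distinct values as the two-value column, which is the point made above about the distribution of values.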
The TESTPW procedure is of maximum value when there are a large number of candidate linkage variables. TESTPW assists the user in eliminating those variables which will be of minimal value for linkage, improving the speed and efficiency of the process.
The TESTPK module assists in finding the combination of variables which will uniquely identify each individual. It is normally run in conjunction with TESTPW.
TESTPK splits and sorts the dataset into pockets defined by a set of variables specified by the user.
A pocket is a group of records all having the same value on one or more variables. Using SEX as a pocket variable would split the file into two pockets. Using SEX and urban/rural would split the file into four, smaller, pockets: urban males, urban females, rural males and rural females.
The more variables used to define the pockets, the smaller they become. The goal is to find the minimal group of variables which divides the file into pockets each containing as few records as possible. If each pocket contains one record, then that set of pocket variables is able to uniquely identify individuals.
The TESTPK procedure can be used to check if a given sequence of variables defines records uniquely. A set of variables that does this will have both MAX and MIN statistics equal to 1.
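The pocket idea can be sketched in a few lines. The code below is not TESTPK itself; it is a hypothetical Python illustration that groups records by a chosen set of variables and reports the number of pockets together with the minimum, maximum, and mean pocket size. The function name `pocket_stats`, the variable `AREA` (standing in for urban/rural), and the sample records are assumptions made for the example.

```python
from collections import Counter

def pocket_stats(records, pocket_vars):
    """Group records by their values on pocket_vars and summarise pocket sizes."""
    sizes = Counter(tuple(rec[v] for v in pocket_vars) for rec in records)
    counts = list(sizes.values())
    return {
        "N": len(counts),                      # number of pockets
        "MIN": min(counts),                    # smallest pocket
        "MAX": max(counts),                    # largest pocket
        "MEAN": sum(counts) / len(counts),     # average records per pocket
    }

# Hypothetical records; AREA stands in for the urban/rural variable.
records = [
    {"SEX": "M", "AREA": "urban", "BYR": 1950},
    {"SEX": "M", "AREA": "urban", "BYR": 1962},
    {"SEX": "F", "AREA": "urban", "BYR": 1950},
    {"SEX": "F", "AREA": "rural", "BYR": 1971},
    {"SEX": "M", "AREA": "rural", "BYR": 1950},
    {"SEX": "F", "AREA": "rural", "BYR": 1980},
]

for pocket_vars in (["SEX"], ["SEX", "AREA"], ["SEX", "AREA", "BYR"]):
    print(pocket_vars, pocket_stats(records, pocket_vars))
```

With these sample records, SEX alone gives two pockets of three records, SEX and AREA give four pockets of one or two records, and adding BYR gives six pockets with MIN and MAX both equal to 1, so that combination identifies every record uniquely.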
The information from the TESTPK procedure can be used in a number of ways:
1. Checking to see if there are too few variables to identify records uniquely. If the mean number of records in each pocket is greater than 1, it indicates that additional variables are needed to identify records uniquely.
2. The number of pockets N is a good measure of the relative amount of computing resources needed for the linkage. In the above example, adding DYR to the pocket definition reduced the computing time needed by a factor of four (N=8 as opposed to N=2 for the pocket defined by SEX alone). Similarly, it can be seen that adding DMO to the pocket definition reduces the computing time by a factor of almost 100 (N=96) compared to the time needed if there were no pockets defined.
3. The statistics for the pocket defined by SEX DYR DMO DDA BYR BMO LCA and MST show that the average number of records in the pocket is 1, but that some pockets contain 2 records (MAX=2). While it is not impossible for two different records to agree exactly on all variables, it may indicate the presence of one or more duplicate records. It would be wise to review the data to make sure duplicate records have been eliminated.
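The link between the number of pockets and computing effort (item 2) can be illustrated by counting how many record pairs must be compared when only records in the same pocket are paired. This is a rough sketch under the assumption that cost is proportional to the number of pairs compared; the function `pair_count` and the sample files are hypothetical, and the actual cost of a LINKS run will depend on other factors as well.

```python
from collections import Counter

def pair_count(file1, file2, pocket_vars):
    """Number of record pairs compared when pairing is restricted to matching pockets."""
    def key(rec):
        return tuple(rec[v] for v in pocket_vars)
    sizes1 = Counter(key(r) for r in file1)
    sizes2 = Counter(key(r) for r in file2)
    # A pocket with n1 records in file1 and n2 in file2 yields n1 * n2 pairs.
    return sum(n1 * sizes2[k] for k, n1 in sizes1.items())

# Two hypothetical files of 16 records each, spread evenly over SEX and DYR.
file1 = [{"SEX": s, "DYR": y}
         for s in ("M", "F") for y in range(1990, 1994) for _ in range(2)]
file2 = [dict(rec) for rec in file1]

print(pair_count(file1, file2, []))              # no pockets
print(pair_count(file1, file2, ["SEX"]))         # 2 pockets
print(pair_count(file1, file2, ["SEX", "DYR"]))  # 8 pockets
```

With evenly sized pockets, as in this example, the pair count falls from 256 with no pockets to 128 with 2 pockets and 32 with 8 pockets, that is, by a factor equal to the number of pockets. Unevenly sized pockets give a smaller saving.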
The LINKEX module takes two SAS datasets and performs probabilistic or deterministic linkage.
Linked records, containing variables from both input datasets, are output to a file. Ties, where one record matches equally well to two or more records in the other file, are put in a separate file for manual resolution. Finally, unlinked records from each input dataset are output.
Fine Tuning the Linkage: The LPX and WTX statements
Because the linkage takes place entirely within the macros it is difficult to customize the formation of linkable pairs and weight calculations. Two additional LINKS statements give the user some ability to control these aspects of the linkage: _LPX and _WTX.
Both of these statements are optional. Used carelessly, they can also undermine the validity of the linkage, so apply them with caution.
_LPX ('SAS Statement';);
_WTX ('SAS Statement';);
The _LPX statement takes as its argument a SAS-syntax statement, which will be executed during the formation of linkable pairs.
Normally all pairs of records agreeing on the _BY variables and on at least LMIN _VAR variables are considered potential links. This is accomplished internally by creating a _MATCHED variable, which is the total number of variables agreeing for a given pair, and comparing this to LMIN. Any pair in which the _MATCHED value minus the number of _BY variables (because those are not counted in LMIN) equals or exceeds LMIN is kept for further processing.
A pair can be excluded from the matched group by setting the _MATCHED value to 0. This is useful for setting conditions on the matches which are difficult or impossible to construct based on straight agreement or disagreement of variables.
Example
A simple linkage of two datasets on SEX and AGE. SEX is required to match on all pairs, but agreement on AGE is optional:
_LINKEX DATA1=D1 DATA2=D2;
_BY SEX;
_VAR AGE;
_LPX 'if abs(AGE1-AGE2)>5 then _MATCHED=0;';
_RUN;
Here, _LPX eliminates pairs whose ages differ by more than 5 years (presumably because they represent unacceptably poor links which warrant no further investigation).
Notes
- Note the two semicolons at the end of _LPX: one ends the quoted SAS statement and the other ends the macro statement.
- Internally, AGE is referred to as AGE1 and AGE2, from DATA1 and DATA2, respectively. All _VAR variables are suffixed with 1 or 2 depending on which dataset they came from.
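The pair-selection logic described above can be mimicked outside SAS to see its effect. The following Python sketch is not the LINKS implementation: the function `linkable_pairs`, the `ID` field, the sample records, and the default of LMIN=0 are all assumptions made for illustration. It counts agreeing variables in a `matched` value, optionally resets it to 0 for excluded pairs (as the _LPX example does with _MATCHED), and keeps a pair when `matched` minus the number of _BY variables is at least LMIN.

```python
def linkable_pairs(data1, data2, by_vars, link_vars, lmin=0, exclude=None):
    """Return ID pairs that agree on by_vars and on at least lmin of link_vars."""
    pairs = []
    for r1 in data1:
        for r2 in data2:
            if any(r1[v] != r2[v] for v in by_vars):
                continue  # the _BY variables must always agree
            # Total agreeing variables, counting the _BY variables as well.
            matched = len(by_vars) + sum(r1[v] == r2[v] for v in link_vars)
            if exclude is not None and exclude(r1, r2):
                matched = 0  # mirrors setting _MATCHED=0 in _LPX
            if matched - len(by_vars) >= lmin:
                pairs.append((r1["ID"], r2["ID"]))
    return pairs

# Hypothetical input files.
d1 = [{"ID": "a1", "SEX": "M", "AGE": 30},
      {"ID": "a2", "SEX": "F", "AGE": 52}]
d2 = [{"ID": "b1", "SEX": "M", "AGE": 33},
      {"ID": "b2", "SEX": "M", "AGE": 41},
      {"ID": "b3", "SEX": "F", "AGE": 52}]

def too_far(r1, r2):
    """Same condition as the _LPX example: ages differ by more than 5 years."""
    return abs(r1["AGE"] - r2["AGE"]) > 5

print(linkable_pairs(d1, d2, ["SEX"], ["AGE"]))
print(linkable_pairs(d1, d2, ["SEX"], ["AGE"], exclude=too_far))
```

Without the exclusion rule, record a1 is paired with both b1 and b2 because only SEX has to agree. With the rule, the pair a1 and b2 (ages 30 and 41) is dropped, while a1 and b1 (ages 30 and 33) is kept.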
_LPX is perhaps most useful in eliminating undesirable links based on values or combinations of values of the linkage variables. It could also be used to include more pairs in the linkable group than would be included based on the LMIN criterion by setting _MATCHED equal to some sufficiently high value, but it is difficult to imagine a situation in which this would be useful.
The _WTX statement takes as its argument a SAS-syntax statement, which will be executed during the calculation of probabilistic weights.
Weights are used to resolve ties by choosing pairs which are less likely to have the observed level of agreement by chance alone. The weights are calculated based on the observed percentages of agreement and disagreement among the linked and unlinked pairs.
With some variables, though, not all levels of disagreement are equally likely: small errors are generally more likely than large errors. Someone asked for their birth date is more likely to be off by one year than by 20 years.
The _WTX statement can be used to adjust the calculated weights up or down based on the values of the variables. During the resolution of the links, the pairs with higher weights will be chosen over those with lower weights.
Example
_LINKEX DATA1=D1 DATA2=D2;
_BY SEX;
_VAR AGE;
_WTX 'if abs(AGE1-AGE2)=1 then _WGT=_WGT+.2;
else if abs(AGE1-AGE2)=2 then _WGT=_WGT+.1;';
_RUN;
In this example, _WTX increases the weight for pairs with an age disagreement of one year by 0.2, and of two years by 0.1. The remaining pairs are unaffected. This has the effect of making small disagreements on age preferable to larger disagreements.
The adjustment of the weights is typically small to avoid overpowering the effect of the other variables, and to avoid making small disagreements on age look better than exact matches. The goal is to give LINKS some idea which pairs you consider closer than others so it will make the right decisions during the resolution stage.
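The effect of such an adjustment on tie resolution can be sketched as follows. This Python illustration is not how LINKS computes or resolves weights internally; the base weights, the candidate list, and the names `adjust_weight` and `best_candidate` are hypothetical. It applies the same rule as the _WTX example and then picks the candidate with the highest adjusted weight.

```python
def adjust_weight(weight, age1, age2):
    """Same rule as the _WTX example: a small bonus for near-misses on AGE."""
    diff = abs(age1 - age2)
    if diff == 1:
        return weight + 0.2
    if diff == 2:
        return weight + 0.1
    return weight

def best_candidate(candidates):
    """Pick the candidate pair with the highest adjusted weight."""
    return max(candidates,
               key=lambda c: adjust_weight(c["wgt"], c["age1"], c["age2"]))

# Hypothetical candidates for one DATA1 record aged 40. The base weights of
# b1 and b2 tie because AGE disagrees in both pairs.
tied = [
    {"id2": "b1", "age1": 40, "age2": 41, "wgt": 3.0},
    {"id2": "b2", "age1": 40, "age2": 47, "wgt": 3.0},
]
print(best_candidate(tied)["id2"])

# An exact match on AGE, with a hypothetical higher base weight, still wins
# because the adjustments are kept small.
with_exact = tied + [{"id2": "b3", "age1": 40, "age2": 40, "wgt": 5.0}]
print(best_candidate(with_exact)["id2"])
```

In the first case the one-year disagreement (b1) is preferred over the seven-year disagreement (b2). In the second, the exact match (b3) is still chosen, which is why the adjustments are kept small relative to the base weights.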