Concept: Record Linkage / Data Linkage
Last Updated: 2006-04-06
Note: Soundex does not help when the variants do not sound alike or start with different letters (e.g. Bill has code B4, and William has W45). This can be handled by tokenization, that is, converting all variants of a name to a single token representing that name. All variants of Bill, William, Will, Willy, etc. could be converted to the same token (e.g. BILL). Tokenization should be a last-ditch effort for the truly hard-to-link cases rather than an initial strategy, since an exact match on name is stronger evidence than a match on token alone.

Prototyping: This simply means developing the program on a small sample of the data before running the entire linkage. The files in a typical linkage project are usually large enough that developing against the full data is prohibitively expensive, computationally speaking. Anything you can do to shorten the cycle of trying a new strategy, running the program, looking at the results, and trying something else is well worth the effort. Time spent waiting for the program to run is better spent interpreting the results and fine-tuning the linkage.
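The name tokenization described in the note above can be sketched in Python. This is a minimal illustration, not a definitive implementation: the `NAME_TOKENS` table, the `name_token` and `compare_first_names` helpers, and the "exact" / "token" / "none" labels are all hypothetical names introduced here. The `soundex` function follows the common American Soundex rules and pads codes to four characters, so the codes B4 and W45 quoted above appear as B400 and W450.

```python
# Letter-to-digit table for the common American Soundex variant.
SOUNDEX_DIGITS = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        SOUNDEX_DIGITS[ch] = digit


def soundex(name):
    """Return the four-character Soundex code, or "" if name has no letters."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = SOUNDEX_DIGITS.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_DIGITS.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":  # H and W do not separate repeated codes
            prev = digit
    return (code + "000")[:4]


# Hypothetical variant table; a real project would build a much larger one.
NAME_TOKENS = {"BILL": "BILL", "WILLIAM": "BILL", "WILL": "BILL", "WILLY": "BILL"}


def name_token(name):
    """Map a first name to its token, or None if the name is not in the table."""
    return NAME_TOKENS.get(name.strip().upper())


def compare_first_names(a, b):
    """Classify a pair of first names as "exact", "token", or "none"."""
    a_norm, b_norm = a.strip().upper(), b.strip().upper()
    if a_norm and a_norm == b_norm:
        return "exact"  # strongest evidence
    token_a, token_b = name_token(a), name_token(b)
    if token_a is not None and token_a == token_b:
        return "token"  # weaker evidence, for the hard-to-link cases
    return "none"


print(soundex("Bill"), soundex("William"))     # the Soundex codes differ
print(compare_first_names("Bill", "William"))  # the tokens agree
```

Keeping "exact" and "token" as separate outcomes lets the linkage give a token-only match less weight than an exact match, in line with the note above.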
weight = log2( (OUTCOME frequency in LINKED pairs) / (OUTCOME frequency in UNLINKABLE pairs) )
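As a rough illustration of the weight formula, the following Python sketch computes the weight for a single comparison outcome. The function name `outcome_weight` and the frequencies in the usage lines are made up for illustration and do not come from any real linkage; the sketch also assumes both frequencies are proportions strictly greater than zero.

```python
import math


def outcome_weight(freq_linked, freq_unlinkable):
    """Return log2 of the outcome's frequency ratio (linked over unlinkable)."""
    if freq_linked <= 0 or freq_unlinkable <= 0:
        raise ValueError("both frequencies must be greater than zero")
    return math.log2(freq_linked / freq_unlinkable)


# Illustrative numbers only: suppose agreement on some field is seen in
# 90% of linked pairs but only 1% of unlinkable pairs.
agree_weight = outcome_weight(0.90, 0.01)     # positive, about 6.5
disagree_weight = outcome_weight(0.10, 0.99)  # negative, about -3.3

print(round(agree_weight, 2), round(disagree_weight, 2))
```

An outcome that is more common among linked pairs than unlinkable pairs gets a positive weight, one that is less common gets a negative weight, and one that is equally common in both gets a weight of zero.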