Concept: Merging SAS® Datasets
Last Updated: 2002-03-28
SAS merges two datasets by matching the BY groups on one dataset to the corresponding BY groups on the other. A BY group is a unique combination of merge keys, called BY variables.
Commonly used BY variables are PHIN, name, sex, age, date, or registration number. Since SAS only recognizes exact matches on BY variables, they have to be identical on the following characteristics:
- BY value (the actual value of the variable)
- Variable name
- Type (character or numeric)
- Left or right justification (for character variables)
Date can be unreliable as a merge key because processing delays can result in differences across files. Name can also be difficult to use because a person can go by one name on one file, and another name (middle name, nickname, etc.) on the other.
It is useful to create intermediate variables as merge keys before attempting to merge, making sure that the above criteria have been met.
Keep in mind, however, that numeric variables with decimals must match exactly, even down to the n th decimal place, so avoid using the results of complex mathematical expressions as merge keys.
Types of Merges
The types of merges that SAS can do are listed below. The letters refer to the number of observations in each BY group.
- Few to many A:B, where A < B
- Many to few B:A
- Many to many A:A
- One to many 1:A
- Many to one A:1
- One to one 1:1
The first two are to be avoided because of the undesirable side-effects of the imbalance between the two datasets. Strange things can happen in these merges.
Often what is desired in these cases is an "all pairs" merge, where the product of the two datasets is formed. SAS does not do this automatically (with the exception of SQL), and it takes some ugly code to get the job done in straight SAS. These can be useful in merging medical to hospital claims, but this is not simple.
Without a BY statement, SAS does a one to one merge. While sometimes this is appropriate (eg. merging DRGs to hospital claims, where the two files are known to be 1:1), usually it is not.
The 1:A or A:1 merges are usually okay, since SAS does not have a lot of room to get creative. But remember that a merge may be A:1 in one BY group, and 1:A in another, so even these can be tricky.
- A:1 and 1:A have to be treated as separate merges is because the order matters. Yes, merging is non-commutative! The statement merge A B; is different from merge B A;
The difference lies in the treatment of 'overlapping variables', which are variables that have the same name on both datasets (but are not merge keys). Generally, but not always, the value from the last dataset on the MERGE statement is kept.
Avoid overlapping variables either by dropping or renaming one. Try experimenting with different combinations of these to get a feel for what they do.
Alternatives to Merging
Other ways to get two datasets together without merging are:
The Registry system is based on pointers. They are very fast for static files with known relationships between the records. Each record carries a pointer to the next record in the chain, or to another record on a different file. SAS can then jump directly to the appropriate point in the file.
SQL is more flexible than merging, but it can be tricky to specify exactly the kind of merge you want. It can be very resource intensive if it tries to form all possible pairs of records from two heavyweight datasets.
Interleaving two datasets with multiple SET statements could possibly avoid the need for a merge, depending on your requirements.
Manitoba Centre for Health Policy
Community Health Sciences, Max Rady College of Medicine,
Rady Faculty of Health Sciences,
Room 408-727 McDermot Ave.
University of Manitoba
Winnipeg, MB R3E 3P5 Canada