Concept: Matching Cases to Controls Using a Direct Matching Method

Introduction


Background Information on the General Methodology

Using SAS to Match Cases for Case Control Studies

A Non-SQL Method for Selecting Controls

The following information summarizes a step-by-step non-SQL method developed at MCHP for matching cases to controls and describes how the controls are selected.

Step 1: Identify Cases for Matching

The first task is to create a separate dataset of the cases. This dataset must contain the variables to be directly matched to the controls (i.e. index year, birth year, sex, FSA). These variables must have the same name and the same type on the case and control datasets because they will be used in a merge statement. Other variables to keep are the individual case identifier (e.g. scrambled PHIN) and any other variables that will be used to determine whether the case can be matched to a control. Rename these other variables on the case dataset so that they do not overwrite similarly named variables on the control dataset.
IMPORTANT NOTE: Do not keep extra variables that are not used to establish whether a case can be linked to a control, because doing so increases the risk of accidentally overwriting similarly named variables on the control dataset.
Generally speaking, the case dataset has one record per case ID, and the macro developed in the SAS example code expects this. If you need to allow multiple records per case ID, generate a new ID value that uniquely identifies each case record.

The case dataset is sorted by the direct matching characteristics and a random number to ensure random links. The RANUNI function in SAS is used so that the seed can be controlled and the matches can be replicated, so long as nothing else changes. For more information about the RANUNI function, see the on-line SAS support documentation at http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000202926.htm .
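The sorting idea can be sketched as follows. This is an illustration in Python, not the MCHP SAS code: the field names, the sample records and the helper `sort_for_matching` are hypothetical, and Python's `random.Random` stands in for RANUNI (the two generators do not produce the same numbers).

```python
import random

# Hypothetical case records; the field names are illustrative only.
cases = [
    {"case_id": "C1", "index_year": 2005, "birth_year": 1950, "sex": "F", "fsa": "R3T"},
    {"case_id": "C2", "index_year": 2005, "birth_year": 1950, "sex": "F", "fsa": "R3T"},
    {"case_id": "C3", "index_year": 2004, "birth_year": 1962, "sex": "M", "fsa": "R2C"},
]

MATCH_KEYS = ("index_year", "birth_year", "sex", "fsa")


def sort_for_matching(records, seed):
    """Return a copy sorted by the match keys, then by a seeded random number."""
    rng = random.Random(seed)  # a fixed seed gives the same numbers on every run
    keyed = [dict(r, rand=rng.random()) for r in records]
    return sorted(keyed, key=lambda r: tuple(r[k] for k in MATCH_KEYS) + (r["rand"],))


sorted_cases = sort_for_matching(cases, seed=12345)
```

Records that share all matching characteristics end up next to each other, in an order decided by the random number rather than by their original position.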

Step 2: Identify an Eligible Pool of Controls

The pool of eligible controls must contain the variables to be directly matched to the cases. These variables must have the same name and the same type on the case and control datasets because they will be used in a merge statement. Other variables to keep are the individual identifier and any other variables that will be used to determine whether the case can be matched to a control. Make sure that the other variables not used in the direct matching are named differently from similar variables on the case dataset.

The control dataset may contain more than one record per control ID. This often happens if controls change geographical area over time. To handle this situation, you can generate one record per control per year with the opportunity for geographic area to change once per year.
The control dataset is sorted by the direct matching characteristics and a random number to ensure random links. The RANUNI function in SAS can be used so that the seed can be controlled and the matches can be replicated, as long as nothing else changes.
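One way to picture the one-record-per-control-per-year layout is the Python sketch below. It is not the MCHP code; it assumes, purely for illustration, that each control's residence history is available as whole-year spans, and the names `residence` and `one_record_per_year` are hypothetical.

```python
# Hypothetical residence history: (control_id, start_year, end_year, fsa).
residence = [
    ("K1", 2003, 2004, "R3T"),
    ("K1", 2005, 2006, "R2C"),
    ("K2", 2004, 2005, "R3T"),
]


def one_record_per_year(history):
    """Expand each residence span into one record per control per year."""
    rows = {}
    for control_id, start, end, fsa in history:
        for year in range(start, end + 1):
            # A later span overwrites an earlier one for the same year,
            # so each (control, year) pair keeps exactly one area.
            rows[(control_id, year)] = {
                "control_id": control_id,
                "index_year": year,
                "fsa": fsa,
            }
    return [rows[key] for key in sorted(rows)]


control_years = one_record_per_year(residence)
```

Each resulting record can then be matched on year and area like any other control record, while the control ID still identifies the person.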

Step 3: Matching Controls to Cases Using the Point Method

In this method, an index dataset for the pool of eligible controls is created. This dataset contains one record for each combination of the directly matched characteristics. Additionally, the start and end record numbers of each group within the sorted control dataset are recorded. The index dataset is merged back to the case dataset, and each eligible control record in the matching group is examined one by one, using the point method, to see whether it is an appropriate match.
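The index idea can be sketched in Python as below. This is an illustration only: the name "point method" presumably refers to reading records directly by record number (as the POINT= option of the SAS SET statement does), and here a list position plays that role. Positions are 0-based in this sketch, and the names `build_index` and `candidates_for` are hypothetical.

```python
# Hypothetical control pool, already sorted by the match keys.
controls = [
    {"control_id": "K1", "index_year": 2004, "sex": "M"},
    {"control_id": "K2", "index_year": 2005, "sex": "F"},
    {"control_id": "K3", "index_year": 2005, "sex": "F"},
    {"control_id": "K4", "index_year": 2005, "sex": "M"},
]
KEYS = ("index_year", "sex")


def build_index(sorted_controls, keys):
    """Map each key combination to its (start, end) positions in the sorted pool."""
    index = {}
    for pos, rec in enumerate(sorted_controls):
        key = tuple(rec[k] for k in keys)
        start, _ = index.get(key, (pos, pos))
        index[key] = (start, pos)  # extend the group's end position
    return index


def candidates_for(case, sorted_controls, index, keys):
    """Read the controls in the case's group directly by position."""
    key = tuple(case[k] for k in keys)
    if key not in index:
        return []
    start, end = index[key]
    return [sorted_controls[pos]["control_id"] for pos in range(start, end + 1)]


index = build_index(controls, KEYS)
found = candidates_for({"index_year": 2005, "sex": "F"}, controls, index, KEYS)
```

Because the pool is sorted, each group is a contiguous block, so only the start and end positions need to be stored for each combination.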

In addition to matching on the directly matched characteristics, controls may be tested for Manitoba Health insurance registration qualifications or put through other tests that cannot be done in a direct merge. Up to 10 appropriate controls are selected for each case. After this step of the selection is complete, each control is randomly selected once so that it is linked to only one case. Because up to 10 controls are selected per case, most cases should still have at least one control left after this step, although this is not guaranteed. Then one control is randomly selected per case.
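The three selection stages can be sketched in Python as follows. This is a simplified illustration, not the macro itself: it assumes the candidate lists have already passed any registration or other tests, and the function name `match_round` and its input layout are hypothetical.

```python
import random

MAX_CANDIDATES = 10


def match_round(candidates_by_case, seed):
    """candidates_by_case maps case_id -> list of eligible control_ids."""
    rng = random.Random(seed)
    # 1. Keep at most 10 candidate controls per case.
    links = [
        (case, ctrl)
        for case, ctrls in candidates_by_case.items()
        for ctrl in ctrls[:MAX_CANDIDATES]
    ]
    # 2. Shuffle, then keep the first link seen for each control,
    #    so every control is tied to exactly one case.
    rng.shuffle(links)
    case_for_control = {}
    for case, ctrl in links:
        case_for_control.setdefault(ctrl, case)
    # 3. Keep the first surviving control for each case; the shuffle
    #    above makes that choice random.
    control_for_case = {}
    for ctrl, case in case_for_control.items():
        control_for_case.setdefault(case, ctrl)
    return control_for_case


candidates = {"C1": ["K1", "K2"], "C2": ["K1"], "C3": []}
matches = match_round(candidates, seed=2024)
```

In this example C2 is left unmatched whenever the shuffle gives K1 to C1, which is the situation the next step is designed to handle.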

Step 4: Identify Unmatched Cases and Eligible Controls and Match Again

Because not all possible controls for each case are selected in the first run of the SAS program, some cases may not have matched to a control after the first round. Therefore, after each matching round, the macro identifies unmatched cases and unmatched eligible controls and then performs step 3 again. The macro iterates through this loop up to 10 times and stops when all cases have successfully matched to one control, when a round finds no successful matches, or when the loop has run 10 times.
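The loop and its three stopping conditions can be sketched in Python as below. This is an illustration under simplifying assumptions: `one_round` is a hypothetical stand-in for step 3 in which each case proposes one random eligible control and a control claimed twice goes to the first claimant, and each record carries a single matching key.

```python
import random

MAX_ROUNDS = 10


def one_round(cases, controls, rng):
    """Hypothetical stand-in for step 3; returns {case_id: control_id}."""
    claimed = {}
    for case_id, key in cases.items():
        pool = sorted(c for c, k in controls.items() if k == key)
        if pool:
            claimed.setdefault(rng.choice(pool), case_id)  # first claimant wins
    return {case_id: ctrl for ctrl, case_id in claimed.items()}


def match_all(cases, controls, seed):
    rng = random.Random(seed)
    cases, controls = dict(cases), dict(controls)  # work on copies
    matches = {}
    for _ in range(MAX_ROUNDS):          # stop 3: at most 10 rounds
        new = one_round(cases, controls, rng)
        if not new:                      # stop 2: no successful matches
            break
        matches.update(new)
        for case_id, ctrl in new.items():
            del cases[case_id]           # only unmatched cases go round again
            del controls[ctrl]           # used controls leave the pool
        if not cases:                    # stop 1: every case is matched
            break
    return matches, sorted(cases)


matches, unmatched = match_all(
    {"C1": "A", "C2": "A", "C3": "B"},
    {"K1": "A", "K2": "A", "K3": "A"},
    seed=7,
)
```

Here C3 has no control with its key, so the loop ends on the round that produces no new matches and C3 is reported as unmatched.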

Step 5: Output Dataset Containing 1 Matched Control Per Case

At the end of the macro call, there will be an output dataset containing one matched control per case. If multiple controls per case are needed, the macro must be called again, once for each additional control required. For each additional call, use the full case cohort dataset and the reduced eligible control dataset (with previously matched controls removed) so that the same control is not matched to a case a second time.
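The repeated-call pattern can be sketched in Python as follows. This is not the macro: `match_once` is a deliberately simple, non-random stand-in for one macro call, and both function names are hypothetical. The point of the sketch is only the bookkeeping between calls.

```python
def match_once(cases, controls):
    """Stand-in for one macro call: at most one control per case."""
    available = dict(controls)
    result = {}
    for case_id, key in cases.items():
        ctrl = next((c for c, k in available.items() if k == key), None)
        if ctrl is not None:
            result[case_id] = ctrl
            del available[ctrl]  # a control is used once within a call
    return result


def match_n(cases, controls, n):
    """Call match_once n times, shrinking the control pool between calls."""
    pool = dict(controls)
    matched = {case_id: [] for case_id in cases}
    for _ in range(n):
        round_matches = match_once(cases, pool)  # full case cohort every call
        for case_id, ctrl in round_matches.items():
            matched[case_id].append(ctrl)
            del pool[ctrl]                       # reduced pool for the next call
    return matched


result = match_n(
    {"C1": "A", "C2": "B"},
    {"K1": "A", "K2": "A", "K3": "B"},
    n=2,
)
```

A case can end up with fewer controls than requested when the pool runs out of controls with its characteristics, as C2 does here.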

Cautions

Because controls are matched randomly to cases, you may not be able to replicate your results. The macro allows you to provide the seed so that the random number generator produces the same ordering each time. However, if anything changes, such as the number of cases or controls or the original order of the cases or controls, the subsequent matching process may produce different case-to-control matches than the original process.
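The reason input order matters can be seen in a short Python sketch. It assumes, as an illustration, that random numbers are handed out to records in the order they are read; `random.Random` stands in for RANUNI and the helper `tag_with_random` is hypothetical.

```python
import random


def tag_with_random(ids, seed):
    """Attach seeded random numbers to records in the order they arrive."""
    rng = random.Random(seed)
    return [(record_id, rng.random()) for record_id in ids]


first = tag_with_random(["a", "b", "c"], seed=1)
second = tag_with_random(["c", "b", "a"], seed=1)

# The same seed yields the same sequence of random numbers...
same_numbers = [r for _, r in first] == [r for _, r in second]
# ...but each number lands on a different record once the input order changes.
a_changed = dict(first)["a"] != dict(second)["a"]
```

The seed fixes the sequence of numbers, not which record receives which number, so any change to the input can change the sort order and therefore the matches.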
