Random Sampling Macro version 1.4
Charles Burchill
Manitoba Centre for Health Policy and Evaluation.
October 18, 1995

This program is used to extract a user-defined random sample from a
SAS dataset.

Thanks to Ruth, Shelley, and Randy for their suggestions. Much of the
code was modified from the SAS Applications Guide, 1987 Edition,
pp. 227-231.

Call:
   _random options ;

Options:
   data=     Data set name; default is the last open dataset.
   output=   Output dataset name (required).
   seed=     Seed value for the SAS random number generator; default
             is the system time.
   sample=   Size of the sample, in total or within each by group.
   percent=  Size of the sample as a % of the dataset size, or of each
             by group size.
             One of sample or percent is required; if both are
             provided, sample is used.
   by=       Variable(s) to sample by. Multiple variables must be
             enclosed in quotes (e.g. by="hosp agegrp").
   min=      Minimum sample size within each by group for a % sample.
   debug=    Turn debugging on or off (=debug, =nodebug).

Example Calls:
   * Randomly sample claims within each hospital,
     with at least 5 samples (if there are that many). ;
   _random percent=50 seed=5 min=5 data=test by=hosp output=dump ;

   * Randomly select 100 claims from a dataset;
   _random sample=100 seed=5 data=hosp output=sample ;

Notes:
 - Samples are defined in the following ways:
   1. Normal sample (no by groups).
      The input data set is read with the point option to select
      observations directly from non-compressed SAS datasets. Each
      observation is selected with a probability conditional on the
      number of observations remaining in the data set and the number
      still needed to complete the sample. This method means that the
      dataset does not have to be sorted or loaded into a temporary
      data set first. It is much faster than adding a variable
      containing a random number, sorting, and selecting.
   2. If the dataset is compressed, or if it is a SAS view, the data
      is read into a temporary dataset first. The sample is then
      selected as above.
   3. By variable samples.
      The data set is sorted by the by group list into a temporary
      data set. Each observation is selected with a probability
      conditional on the number of observations remaining in that by
      group and the number still needed to complete the sample.

 - Percent sample sizes are based on the data set size (or the by
   group size). This means a 20% sample of 10000 observations will be
   2000. This program does not use an approximation.
 - Remember that when you sub-sample with by groups, the total size of
   the output dataset may not be a multiple of the sample size, or the
   stated percentage of the original:
   1. Some by groups may not be large enough to draw a complete sample.
   2. The percent sample is selected as a percent of each by group,
      not of the total dataset.
 - If any by group is smaller than the defined sample or the minimum
   sample size, the macro will return a note.
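
Illustration (not part of the macro):
The conditional-probability selection described in the notes can be
sketched as follows. This is a minimal Python sketch of the general
technique (sequential selection sampling), not the macro's SAS code.
The function names select_sample and sample_by_group are hypothetical,
and the rounding of fractional percent sizes is an assumption, since
the notes do not say how the macro handles them.

```python
import random


def select_sample(n_total, k, rng):
    """Return the 0-based positions of min(k, n_total) records, in order.

    Record i is taken with probability needed / remaining, where
    `remaining` counts the records not yet examined and `needed` counts
    the records still required to complete the sample.
    """
    needed = min(k, n_total)
    chosen = []
    for i in range(n_total):
        if needed == 0:
            break  # sample is complete, no need to scan further
        remaining = n_total - i
        # rng.random() is in [0, 1), so a record is always taken once
        # needed == remaining; the sample size is therefore exact.
        if rng.random() * remaining < needed:
            chosen.append(i)
            needed -= 1
    return chosen


def sample_by_group(records, key, rng, sample=None, percent=None, minimum=0):
    """Sample within each by group; `sample` wins if both sizes are given."""
    groups = {}
    for rec in records:
        groups.setdefault(key(rec), []).append(rec)

    out = []
    for group in sorted(groups):  # mirrors sorting by the by variables
        members = groups[group]
        if sample is not None:
            k = sample
        else:
            # Assumption: fractional sizes are rounded with Python's round().
            k = max(round(len(members) * percent / 100), minimum)
        # select_sample caps k at the group size, so small groups
        # contribute every record they have.
        out.extend(members[i] for i in select_sample(len(members), k, rng))
    return out


if __name__ == "__main__":
    rng = random.Random(5)
    picked = select_sample(10000, 2000, rng)  # a 20% sample of 10000
    print(len(picked))
```

Because each record is examined once, in its original order, no random
sort key is needed, and the size of the result is exact rather than
approximate. A by group smaller than the requested size simply returns
all of its records, which is why the total output size can differ from
a simple multiple of the sample size.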