Random Sampling Macro
version 1.4
Charles Burchill
Manitoba Centre for Health Policy and Evaluation.
October 18, 1995
This program is used to extract a user defined random sample from a
SAS dataset.
Thanks to Ruth, Shelley, and Randy for their suggestions. Much of the
code was modified from the SAS Applications Guide, 1987 Edition pp. 227-231.
Call:
_random options ;
Options:
data= Data set name, default is the last open dataset.
output= Output dataset name (required).
seed= Seed value for SAS random number lookup, default system time.
sample= Size of the sample, total or within in each by group.
percent= Size of sample based on % of dataset size, or by group size.
One of Sample or Percent is required, if both are
provided sample is used.
by= Variable(s) to sample by.
Multiple variables must be enclosed in quotes
(e.g. by="hosp agegrp").
min= Minimum size in a % sample by group.
debug= Turn on or off debuging (=debug, =nodebug).
Example Calls:
* Randomly sample claims within each hospital
With at least 5 samples (if there are that many). ;
_random percent=50 seed=5 min=5
data=test by=hosp
output=dump ;
* Randomly select 100 claims from a dataset;
_random sample=100
seed=5
data=hosp
output=sample ;
Notes:
- Samples are defined in the following ways:
1. Normal sample (no by groups). This dataset is used with the point
command to select observations out of non-compressed
SAS datasets. The sample is selected according to the
probabilities conditional on the number of observations
remaining in the data set, and the number needed to complete
the sample.This method means that the dataset does not
have to be sorted or loaded into a temporary data set first.
It is much faster than adding a variable containing a random
number, sorting and selecting.
2. If the dataset is compressed, or if it is a SAS view the
data is read into a temporary dataset first. The sample
is selected as above.
3. By variable samples. The data set is sorted by the by group
list into a temporary data set. The sample is selected
according to the probabilities conditional on the number of
observations remaining in that by group, and the number needed
to complete the sample.
- Percent sample sizes are based on the data set size (or the
by group size). This means a 20% sample of 10000 observations
will be 200. This program does not use an approximation.
- Remember that when you sub-sample with by groups that the total
size of the output dataset may not be a multiple of the sample
size, or the percentage of the original.
1. Some by groups may not be large enough to draw a complete sample
2. The percent sample is selected as a percent of each by group
not the total dataset.
- If any by group is smaller than the defined sample, or minimum
sample size the macro will return a note.