Home
Contents
GENERAL GUIDELINES:
Windows in SAS
File management
The SAS Program
Program syntax
Debugging tips
USING SAS PROGRAMMING TO:
1. Prepare the data set
Types of data
Example programs
2. View the data
SAS Procedures
3. Explore the data
Numeric statistics
Frequency tables
4. Manipulate the data
Basic techniques
New variables
5. Adding Variables and
Observations to Data Sets
The SET Statement
The MERGE Statement
6. Data Processing
ARRAY Statement
Do Loops
By-Group Processing
RETAIN Statement
NON-PROGRAMMING
Alternatives
SAMPLE DATA SETS:
Height/weight
Height/weight/region
Simulated clinical data
Simulated Manitoba Health
|
I. DATA PREPARATION: TYPES OF DATA SETS
This section provides both general
guidelines and specific details on preparing data sets (both SAS and
non-SAS). For the data sets referenced in this manual, detailed instructions
are provided in other sections on how to prepare the htwt
data set, the clinical data
set, and the simulated Manitoba Health
data set.
Data sets can be thought of as a table having columns and rows, and consisting of
three main components:
1. Values |
Numbers and/or letters of the alphabet comprising the information in each cell
(column/row combination).
|
2. Variables |
Names assigned to columns of information; currently,
they can be up to 32 characters long (starting with a letter
or underscore). |
3. Observations (Records) |
Usually one line, or row, of information per person or event; each observation
(also known as a record) consists of a set of values.
|
Illustrated below, on the left, is the raw data
set htwt. The viewtable on the right shows how the data
become meaningful once the appropriate information on the data
has been specified to SAS.
M and F, for example, are the values
for the variable SEX; the next column represents
the values for the variable AGE. Each observation
is now consecutively numbered, in this case, from 1 to 18.
The first observation thus has a value of Aubrey for the variable
called NAME, a value of M for the variable SEX, a value of
41 for the variable AGE, a value of 74 for the variable HEIGHT,
and a value of 170 for the variable WEIGHT.
In other words, the first observation is an individual named
Aubrey who is a 41-year-old male who is 6'2" tall and weighs
170 pounds. (Normally a codebook will specify the units in
which the values are being measured. In this case, height,
for example, is measured in inches and weight is measured
in pounds.) |
|
The values for any given variable will have the following characteristics:
Length. Numeric values are stored in SAS as floating-point, or real
binary, numbers. According to the SAS Language Reference (1990:86), floating
point representation is "a form of storing in scientific notation" ("in which values
are represented as numbers between 0 and 1 times a power of 10") "except that
on most operating systems the base is not 10, but is either 2 or 16".
SAS assigns a default length of 8 bytes of space to numeric variables and, where
space permits, this need not be reduced. Cody and Pass (1995:276) indicate
that "this does not mean 8 significant figures; it means that 8 times 8, or
64 bits (8 bits per byte) are used to store the number". They add that
8 bytes "is equivalent to what used to be
called 'double-precision' in other languages. This will vary not only by
which computer language you are using, but on which computer and under what
operating system you are running".
4 bytes of space is generally sufficient for most numeric variables, but a
SAS Institute manual should generally be consulted before changing the
lengths of numeric variables because of the potential loss of precision.
- Numeric vs character values. If calculations are not necessary (e.g.,
values of 1 and 2 for GENDER), it is recommended that such numeric
values be assigned as character values (in the INPUT
statement when reading a raw data set, or using a PUT
statement within a DATA step when accessing a SAS data set).
Conversion, where possible, to character decreases space requirements.
Values for the variable GENDER, for example, if numeric, can take
up to 8 bytes of space, but assigning it as a character value
decreases its size to 1 byte. References to all character values
in a program must be enclosed in single quotes (e.g., if gender='1';).
- Missing values. Numeric missing values are denoted with a period (.)
while character missing values are denoted with a space in quotes (' ').
|
Analyses using SAS software require that the data be
in the form of a SAS data set. If the data are in this form, no data
preparation is needed; the data set can be easily viewed and explored
using a SAS program to create either: a) a permanent or b) a temporary
SAS data set. A permanent SAS data set is normally created
if it is known which subsets are being used for analysis (e.g., if
only bypass procedure is of interest, only the hospital records that
actually contain this surgery would be needed). Temporary
SAS data sets, on the other hand, last for the duration of the SAS
session, and are useful when developing and debugging programs. Non-SAS
data sets require additional preparation because they need to
be converted to SAS data sets.
A. PREPARING SAS DATA SETS.
To create a temporary SAS data set from another temporary
SAS data set requires only the DATA and SET statements and the one-word
name of the data set. To read or to create a permanent
SAS data set requires a third statement - LIBNAME - to tell SAS
where to find the data set.
- LIBNAME sasref 'c:\sasdir';
tells SAS that the user has chosen sasref as the name to
represent a directory on c: drive called sasdir. That is,
LIBNAME tells SAS where the permanent SAS data set is located
(or to be located).
- DATA two; (to create a temporary
SAS data set) OR
DATA sasref.two; (to create
a permanent SAS data set).
The DATA statement is required to tell SAS to create another,
new, SAS data set from the the data set specified in the SET statement
following.
- SET one; (to read/access
a temporary SAS data set) OR
SET sasref.one; (to read/access
a permanent SAS data set)
For permanent SAS data sets, the SET statement tells SAS: a) the
name assigned to the directory, or location, of the data set (sasref)
and b) the name of the permanent SAS data set (one). For
temporary SAS data sets only the one-word file name need be specified.
(SAS will automatically assign the temporary SAS data set to the
WORK library for the duration of the SAS session, but the temporary
data set can be referred to without specifying this library.)
The SET statement thus tells SAS to read (or access) the data
set, loading the information into memory so that the user can
view or manipulate it. Any changes made to the data set specified
in the SET statement will be reflected in the data set specified
in the DATA statement, NOT in the original data (unless it is
being saved with exactly the same name).
These keywords are illustrated in the accompanying SAS
program examples.
It can be possible to create a large number of temporary SAS data
sets in the course of a SAS session, but generally it is desirable
to conserve space. If the reason for creating the new data set is
to create more variables, for example, the same data set name can
be used (e.g., data one; set one;).
This will simply overwrite the previous data set. Subsetting the
data (e.g., keeping only age 65+) will also conserve space. Assigning
a new name in this case (e.g., data age65p;
set one;) will permit the user to access either data
set during the SAS session.
Note that two options can be useful when creating permanent SAS
data sets; both are placed in the DATA statement; for example: data
sasref.new (compress=yes label='Simulated MB Health data');
- compress - removes
the extra space used by non-filled or partially filled variables.
This option can reduce the size of a data set quite substantially,
but note that in certain cases, it can actually increase its size.
The log will provide this type of information so that any necessary
adjustments can be made.
- label - permits adding
a brief description of the data set; this information would then
be seen in output generated by PROC CONTENTS.
Additional information on reducing the space taken up by SAS data
sets is available from the MCHP document titled
"Saving Space in SAS".
B. PREPARING NON-SAS DATA SETS
The non-SAS data set, which can be converted to a temporary or
permanent SAS data set, can take one of two forms: I) an ASCII file
or II) a file generated by another software package.
I. A file of ASCII (raw) data This may
look something like the following: |
12 38 8 011275
22 18 9 000088
31 0 4 100
These numbers cannot be meaningfully
manipulated unless the user is given additional information
to tell SAS the variable names, their locations, and whether
they should be read in as numeric or character. The
SAS program must use an INPUT statement to provide SAS with
this information, as well as FILENAME and INFILE statements
(the simulated Manitoba Health data
set is an example of this type of data set):
- A FILENAME statement is necessary to specify which
file contains the raw data. The statement FILENAME
rawref 'c:\sasref\rawdata'; provides both the
location (sasref directory) and the file containing
the raw data (rawdata).
- An INFILE statement is used to indicate that a
raw data set should be read in by SAS, as specified
in the FILENAME statement (INFILE is used to read raw data,
while SET is used to read in SAS data). INFILE
rawref; tells SAS to read the raw data set, name
and location as described in the FILENAME statement.
- An INPUT statement is used to
provide SAS with variable names, column numbers, and numeric/character
information. For example, INPUT
age 14-15 gender $20; indicates that age is two
columns, or spaces, in width, starting at column 14 (this
would not accommodate ages with 3 digits, i.e., ages>99),
and that gender is 1 column wide, located at column 20.
The dollar sign denotes that gender is to be read in as
a character variable while the absence of a dollar sign
indicates that age is to be read in as a numeric variable.
Data values can also be located within a SAS program. In
this case, a CARDS keyword is necessary to signal SAS that
raw data values are to be read in within the program. The
Height/Weight and clinical
data sets illustrate this approach.
|
II. A file generated by another software package
While ASCII files and input statements are the most commonly used
method for importing data into SAS, many other programs store information
in a format that SAS can read. These programs often keep information
on variable type, length, and format. SAS can access information
stored in formats saved by other programs in several ways.
1. Using delimiters in a 2-step process
a) Convert the external file to ASCII (text) format.
The external file from the native program is saved in ASCII format,
with a special character or delimiter between each field (variable).
Quote/comma and tab delimited are the most common types of delimiter.
For example, delimited ASCII files can be saved from Excel 2002
by selecting 'Save as' and saving the file with a type of Text (Tab
delimited)(*.txt).
b) Import the file into SAS.
2. Direct Access Using Engines
A format-specific engine on a LIBNAME statement will permit reading
some data file formats directly. Generally files must be saved in
a general or portable format prior to importing them into SAS. SPSS
is used here as an example but other engines are available. Prior
to importing an SPSS file it must first be saved as an SPSS Portable
format (.por) file from within SPSS (note that the SPSS engine does
not work under SAS 6.12 and Solaris (UNIX)). A SAS program could
then be used to read in the file, for example:
LIBNAME IMPORT SPSS 'c:\temp\spssfile.por' ;
** Note: file must be in SPSS portable format ;
DATA TEST ;
SET IMPORT._FIRST_ ;
** _first_ is the first, and only, data set
in the library, or directory. ;
RUN ;
or, PROC CONVERT can be used with the following syntax:
FILENAME IMPTSP 'c:\temp\spssfile.por' ;
PROC CONVERT SPSS=IMPTSP DATA=TEST ;
RUN;
3. Importing data into SAS PROC IMPORT may be used to import
a data set, or a data table, from a variety of different sources.
In some cases, the specific rows and columns will have to be defined;
in other cases whole tables can be imported. PROC IMPORT can be
used to import files created from such programs as MS Access (ACCESS),
DBase (DBF), Lotus (WK1, WK3, WK4), and Excel (EXCEL, EXCEL4, EXCEL5,
EXCEL97, EXCEL 2002). ASCII delimited files can also be imported
as delimited files (DLM, CSV, TAB). The Access component
of SAS software must be installed and available for each file type.
Data can be imported:
PROC IMPORT
OUT= htwt2
DATAFILE= "x:\sasdir\htwt.xls"
DBMS=EXCEL2002 REPLACE;
GETNAMES=YES;
RUN;
- the SAS menu, i.e., this procedure can be used interactively
through the File/Import/Standard Data Source option.
Home
SAS Program: Debugging Tips |
NEXT
Ib. Data Preparation: Example Programs |
|
Contact: Charles Burchill
Telephone: (204) 789-3429
Manitoba Centre
for Health Policy
Department of Community Health Sciences,
University of Manitoba
4th floor Brodie Centre
408 - 727 McDermot Avenue
Winnipeg, Manitoba
R3E 3P5
Fax: (204) 789-3910
|