| This section provides both general
guidelines and specific details on preparing data sets (both SAS and
non-SAS). For the data sets referenced in this manual, detailed instructions
are provided in other sections on how to prepare the htwt
data set, the clinical data
set, and the simulated Manitoba Health
data set.
Data sets can be thought of as a table having columns and rows, and consisting of
three main components:
|
1. Values |
Numbers and/or letters of the alphabet comprising the information in each cell
(column/row combination).
|
|
2. Variables |
Names assigned to columns of information; currently,
they can be up to 32 characters long (starting with a letter
or underscore). |
|
3. Observations (Records) |
Usually one line, or row, of information per person or event; each observation
(also known as a record) consists of a set of values.
|
| Illustrated below, on the left, is the raw data
set htwt. The viewtable on the right shows how the data
become meaningful once the appropriate information on the data
has been specified to SAS.
M and F, for example, are the values
for the variable SEX; the next column represents
the values for the variable AGE. Each observation
is now consecutively numbered, in this case, from 1 to 18.
The first observation thus has a value of Aubrey for the variable
called NAME, a value of M for the variable SEX, a value of
41 for the variable AGE, a value of 74 for the variable HEIGHT,
and a value of 170 for the variable WEIGHT.
In other words, the first observation is an individual named
Aubrey who is a 41-year-old male who is 6'2" tall and weighs
170 pounds. (Normally a codebook will specify the units in
which the values are being measured. In this case, height,
for example, is measured in inches and weight is measured
in pounds.) |
|
| Analyses using SAS software require that the data be
in the form of a SAS data set. If the data are in this form, no data
preparation is needed; the data set can be easily viewed and explored
using a SAS program to create either: a) a permanent or b) a temporary
SAS data set. A permanent SAS data set is normally created
if it is known which subsets are being used for analysis (e.g., if
only bypass procedure is of interest, only the hospital records that
actually contain this surgery would be needed). Temporary
SAS data sets, on the other hand, last for the duration of the SAS
session, and are useful when developing and debugging programs. Non-SAS
data sets require additional preparation because they need to
be converted to SAS data sets.
A. PREPARING SAS DATA SETS.
To create a temporary SAS data set from another temporary
SAS data set requires only the DATA and SET statements and the one-word
name of the data set. To read or to create a permanent
SAS data set requires a third statement - LIBNAME - to tell SAS
where to find the data set.
- LIBNAME sasref 'c:\sasdir';
tells SAS that the user has chosen sasref as the name to
represent a directory on c: drive called sasdir. That is,
LIBNAME tells SAS where the permanent SAS data set is located
(or to be located).
- DATA two; (to create a temporary
SAS data set) OR
DATA sasref.two; (to create
a permanent SAS data set).
The DATA statement is required to tell SAS to create another,
new, SAS data set from the the data set specified in the SET statement
following.
- SET one; (to read/access
a temporary SAS data set) OR
SET sasref.one; (to read/access
a permanent SAS data set)
For permanent SAS data sets, the SET statement tells SAS: a) the
name assigned to the directory, or location, of the data set (sasref)
and b) the name of the permanent SAS data set (one). For
temporary SAS data sets only the one-word file name need be specified.
(SAS will automatically assign the temporary SAS data set to the
WORK library for the duration of the SAS session, but the temporary
data set can be referred to without specifying this library.)
The SET statement thus tells SAS to read (or access) the data
set, loading the information into memory so that the user can
view or manipulate it. Any changes made to the data set specified
in the SET statement will be reflected in the data set specified
in the DATA statement, NOT in the original data (unless it is
being saved with exactly the same name).
These keywords are illustrated in the accompanying SAS
program examples.
It can be possible to create a large number of temporary SAS data
sets in the course of a SAS session, but generally it is desirable
to conserve space. If the reason for creating the new data set is
to create more variables, for example, the same data set name can
be used (e.g., data one; set one;).
This will simply overwrite the previous data set. Subsetting the
data (e.g., keeping only age 65+) will also conserve space. Assigning
a new name in this case (e.g., data age65p;
set one;) will permit the user to access either data
set during the SAS session.
Note that two options can be useful when creating permanent SAS
data sets; both are placed in the DATA statement; for example: data
sasref.new (compress=yes label='Simulated MB Health data');
- compress - removes
the extra space used by non-filled or partially filled variables.
This option can reduce the size of a data set quite substantially,
but note that in certain cases, it can actually increase its size.
The log will provide this type of information so that any necessary
adjustments can be made.
- label - permits adding
a brief description of the data set; this information would then
be seen in output generated by PROC CONTENTS.
Additional information on reducing the space taken up by SAS data
sets is available from the MCHP document titled
"Saving Space in SAS".
B. PREPARING NON-SAS DATA SETS
The non-SAS data set, which can be converted to a temporary or
permanent SAS data set, can take one of two forms: I) an ASCII file
or II) a file generated by another software package.
| I. A file of ASCII (raw) data This may
look something like the following: |
12 38 8 011275
22 18 9 000088
31 0 4 100
These numbers cannot be meaningfully
manipulated unless the user is given additional information
to tell SAS the variable names, their locations, and whether
they should be read in as numeric or character. The
SAS program must use an INPUT statement to provide SAS with
this information, as well as FILENAME and INFILE statements
(the simulated Manitoba Health data
set is an example of this type of data set):
- A FILENAME statement is necessary to specify which
file contains the raw data. The statement FILENAME
rawref 'c:\sasref\rawdata'; provides both the
location (sasref directory) and the file containing
the raw data (rawdata).
- An INFILE statement is used to indicate that a
raw data set should be read in by SAS, as specified
in the FILENAME statement (INFILE is used to read raw data,
while SET is used to read in SAS data). INFILE
rawref; tells SAS to read the raw data set, name
and location as described in the FILENAME statement.
- An INPUT statement is used to
provide SAS with variable names, column numbers, and numeric/character
information. For example, INPUT
age 14-15 gender $20; indicates that age is two
columns, or spaces, in width, starting at column 14 (this
would not accommodate ages with 3 digits, i.e., ages>99),
and that gender is 1 column wide, located at column 20.
The dollar sign denotes that gender is to be read in as
a character variable while the absence of a dollar sign
indicates that age is to be read in as a numeric variable.
Data values can also be located within a SAS program. In
this case, a CARDS keyword is necessary to signal SAS that
raw data values are to be read in within the program. The
Height/Weight and clinical
data sets illustrate this approach.
|
II. A file generated by another software package
While ASCII files and input statements are the most commonly used
method for importing data into SAS, many other programs store information
in a format that SAS can read. These programs often keep information
on variable type, length, and format. SAS can access information
stored in formats saved by other programs in several ways.
1. Using delimiters in a 2-step process
a) Convert the external file to ASCII (text) format.
The external file from the native program is saved in ASCII format,
with a special character or delimiter between each field (variable).
Quote/comma and tab delimited are the most common types of delimiter.
For example, delimited ASCII files can be saved from Excel 2002
by selecting 'Save as' and saving the file with a type of Text (Tab
delimited)(*.txt).
b) Import the file into SAS.
2. Direct Access Using Engines
A format-specific engine on a LIBNAME statement will permit reading
some data file formats directly. Generally files must be saved in
a general or portable format prior to importing them into SAS. SPSS
is used here as an example but other engines are available. Prior
to importing an SPSS file it must first be saved as an SPSS Portable
format (.por) file from within SPSS (note that the SPSS engine does
not work under SAS 6.12 and Solaris (UNIX)). A SAS program could
then be used to read in the file, for example:
LIBNAME IMPORT SPSS 'c:\temp\spssfile.por' ;
** Note: file must be in SPSS portable format ;
DATA TEST ;
SET IMPORT._FIRST_ ;
** _first_ is the first, and only, data set
in the library, or directory. ;
RUN ;
or, PROC CONVERT can be used with the following syntax:
FILENAME IMPTSP 'c:\temp\spssfile.por' ;
PROC CONVERT SPSS=IMPTSP DATA=TEST ;
RUN;
3. Importing data into SAS PROC IMPORT may be used to import
a data set, or a data table, from a variety of different sources.
In some cases, the specific rows and columns will have to be defined;
in other cases whole tables can be imported. PROC IMPORT can be
used to import files created from such programs as MS Access (ACCESS),
DBase (DBF), Lotus (WK1, WK3, WK4), and Excel (EXCEL, EXCEL4, EXCEL5,
EXCEL97, EXCEL 2002). ASCII delimited files can also be imported
as delimited files (DLM, CSV, TAB). The Access component
of SAS software must be installed and available for each file type.
Data can be imported:
PROC IMPORT
OUT= htwt2
DATAFILE= "x:\sasdir\htwt.xls"
DBMS=EXCEL2002 REPLACE;
GETNAMES=YES;
RUN;
- the SAS menu, i.e., this procedure can be used interactively
through the File/Import/Standard Data Source option.
Home
SAS Program: Debugging Tips |
NEXT
Ib. Data Preparation: Example Programs |
|