| SAS provides for many optional statements and keywords that can be used
in SAS programs to facilitate manipulation and display of the data.  
Statements can often (but not always) be entered in any order within DATA and 
PROC steps, while options must usually be placed in a specific 
position within a SAS statement. 
The data= option, for example, can be
added to most procedures to specify the data set on which it should be
run, e.g., proc freq data=test;.  
Unless the data set is specified (in this case, "test"), SAS will automatically
go to the most recently created data set. 
Two broad categories of statements/keywords are 
described here - those that can be used to: a) create subgroups of data, and 
b) customize display of output.  A SAS program incorporating the use of these
statements/keywords follows.
 
A. Create Subgroups of Data
 
Analysis and space requirements will often dictate whether to create separate SAS data sets
(temporary or permanent) for analysis or to simply split the output by the desired values of a variable.
If all analysis is to be conducted on individuals age 65+, for example, it might
be desirable to create a separate, permanent SAS data set, removing all records
having an age of less than 65.
Not only is the data set being tailored to meet analysis needs,
program efficiency is also enhanced by reducing the amount of time and space
being used to carry out SAS runs - extremely important if computer resources
are limited in terms of both physical and memory space. 
  
Three approaches to creating subgroups of data are described here, specifying
1) data values, 2) variable names, or 3) number of observations to reduce the 
data set.
  
              Specify data values with the WHERE 
                or IF statements to keep a 
                subset of records, or observations. Although only the observations 
                containing the specified value(s) will be kept, all other variables 
                associated with these observations will also be kept. 
                 where age>=65; 
                  This can be used within a PROC or DATA step to keep only those 
                  records having an age of 65 or older within a DATA step. 
                  if age>=65; This 
                  can only be used in a DATA step; however, the advantage of using 
                  IF is that a number 
                  of conditions can be specified: 
                 
                   if age>=65 and gender='F'; 
                    This statement tells SAS to keep all females who are 65 years 
                    of age and older. 
                   if age>=65 or gender='F'; 
                    This statement tells SAS to keep all females and all people 
                    (both male and female) who are 65 years of age and older.
                 IF statements can be used on variables 
                  created in the current data step. WHERE 
                  statements can only be applied to variables that pre-exist in 
                  the data.  
              Specify variable names using the KEEP 
                keyword/statement (to keep selected variables) or the DROP 
                keyword/statement (to drop selected variables) to create a subset 
                of variables. 
                e.g., data new; 
                  set test 
                  (keep=regionre los);
 run;
 This above is an an example of using KEEP as a keyword; it keeps 
                  2 variables from test in the "new" data set as the data 
                  set is being read in (the other variables, however, are still 
                  accessible for other data steps). (Alternatively, the KEEP statement 
                  can be placed at the end of the DATA statement, but this is 
                  less efficient because all variables in the test data 
                  set will be processed.)
 e.g., data test2; 
                  set test;
 drop 
                  drg drgrgn drgw;
 run;
 The KEEP and DROP keywords can also be used as a statement. 
                  In this case, only the DROP keyword is used and 3 variables 
                  are dropped from the new test2 data set. This statement 
                  is often used when new variables have been created from existing 
                  variables, and the existing variables are no longer necessary 
                  for analysis.
 e.g., input age 7-9 regionre 51 
                  deathsep 55-58;  When reading in raw data (e.g., the simulated  
                  Manitoba Health data), only the variables necessary to the 
                  analysis need to be read in.
 Note that use of the DROP and KEEP keywords only affect variables; 
                  they do not affect the number of observations. 
                  
              Specify number of observations with the (OBS=) 
                option to keep a subset of records. 
                e.g., data test; 
                  set test 
                  (obs=10);
 run;
 This option is very useful when testing/debugging SAS programs 
                  or portions of code. It is easily removed, and the program can 
                  be re-submitted to obtain output for all the observations. The 
                  following illustrates how the option can also be used when reading 
                  in raw 
                  data.
 e.g., data test; 
                  infile 
                  rawfile (obs=10);
 
B. Customize Display of Output
 
Output can be enhanced in a number of ways, only a few of which are presented
here:  1) LABEL, 2) FORMAT, 3) TITLE, and 4) FOOTNOTE statements.
 
  Change how variable information is displayed by using a 
      LABEL statement.
      e.g.,  label height = 'Height in inches'
      weight = 'Weight in inches';
 This code will attach labels to height and weight 
                  so that the labels rather than just the variable names will 
                  be displayed in any output. Labels currently can be up to 256 
                  characters long, and must be enclosed in single quotes (double 
                  quotes if there is an apostrophe in the label). The LABEL statement 
                  usually goes in the DATA step, near the end, and is very helpful 
                  for explaining what the variables represent, particularly if 
                  other users will be accessing the data.
 
  Change how data values are displayed by using the 
      FORMAT statement.
      e.g., format gender $genderl. regionre regionh $regionl.;
      This FORMAT statement assumes that formats called  $gender 
                  and $regionl have been created (using PROC 
                  FORMAT). The variable(s) is specified first, and then the 
                  format, which always has a period at the end of it (at least 
                  one space between each). The $ denotes a character format; it 
                  can only be used with character variables. The FORMAT statement 
                  will result, for example, in the values of the variables regionre 
                  and regionh being displayed as full region names rather 
                  than '1' through '8' although certain SAS procedures may truncate 
                  the formatted values from the maximum allowable 32 characters 
                  long to 8 or 16 characters in length).
 The FORMAT statement can be used within a PROC or DATA step.
      If used within a DATA step, the format is applied
      in all procedures referencing the data set.  If used within a PROC step
      (generally preferred), the format is only applied for that specific procedure.
      The original, underlying values are NOT permanently changed;
      only how they appear in output is changed. 
       
      Note that SAS also has its own library of formats that are available 
      throughout any SAS session.
       
  Enhance output by adding a title(s) using the 
      TITLE statement.
      e.g., title1 'An example of a title';
      can be used within a PROC or DATA step, or by itself, to place a title
      on each page of subsequent output. Note that:
       Titles are enclosed in single quotes; anything within the quotes
              is not processed by SAS.  Double quotes are used if the title
              contains any apostrophes.
          Up to 10 titles can be added, using up to 10 TITLE statements.
              Each additional title is numbered, e.g.,
             title2 'An example of a 2nd title';  
          TITLE statements are global statements, which means that they are
              displayed on every subsequent page of output in a SAS session.
              They can be either overwritten (by specifying new TITLE statements)
              or cleared (by specifying in the program e.g., title1;, to clear a
              TITLE1 statement, title2; to clear a TITLE2 statement, and so on).
      
 
  Enhance output by adding a footnote(s) using the 
      FOOTNOTE statement.
      e.g., footnote1 'An example of a footnote';
      can be used within a PROC or DATA step, or by itself, to place a footnote
      on each page of subsequent output. The above notes for the TITLE statement also apply
      to the FOOTNOTE statement.
 
 
                | 
*******************************************************
*This program creates two data sets "men" and "women" *
*and generates 2 versions of tables showing the       *
*distribution of name separately for males and for    *
*females over age 40. It assumes that the data set    *
*"htwt" has been previously created. It also assumes  *
*that labels have not yet been created for the        *
*variables.                                           *
*******************************************************;
proc format;          /*Create label format for sex*/
  value $sexl 'M'='Male'
              'F'='Female';
/*Create 2 new temporary SAS data sets*/
data men women;
        /*Read in the "htwt" data set, keeping only
           3 of the variables and the first 10 records*/
  set htwt (keep=sex name age obs=10);
       /*For the "men" data set keep only the records
          that have a value of "M" for sex */
  if sex='M' then output men;
     /*For the "women" data set keep only the records
        that have a value of "F" for sex */
      /*(Note that records missing values for sex
            would not go into either data set) */ 
  else if sex='F' then output women;
  /*Create labels for the variables being kept*/
  label name = 'Name of individual'
        age    = 'Age at admission'
        sex = 'Gender of patient';
run;
 proc freq data=men;
   tables name;
   where age>=40;
   format sex $sexl.
  title1 'Example re use of basic techniques';
  title2 'The Student Project';
  footnote1 'Limiting age to 40+';
run;
 proc freq data=women;
   tables name;
   where age>=40;
   format sex $sexl.;
run;
***********************************************;
*Instead of the 2 PROC steps used above, the  *
*same analysis could be done as follows using *
*PROC SORT and BY variable processing.        *
***********************************************;
/*Sort the data by sex prior to creating 
tables that are split by sex */
proc sort data=htwt;
  by sex; 
run;
proc freq data=htwt;
  /* Create a table of distribution of name */
  tables name;  
  /* Do this separately for each value of sex*/
  by sex;    
   /* Do this only for age 40+*/
 where age>=40;  
    /* Display formatted values*/
  format sex $sexl. ;
  title1 'Example re use of basic techniques';
  title2 'The Student Project';
  footnote1 'Limiting age to 40+';
run;
 
 |  
 
PRACTICE QUESTIONS ON DATA MANIPULATION (BASIC)
              These questions assume that a permanent SAS data set has been created from 
              the sample clinical data. The format 
              file does not need to be included. Examples are given for how program, 
              log, and output 
              might look. 
             
  Create 3 separate temporary SAS data sets for each of the following and 
  carry out any SAS procedures that will indicate only the specified data was kept.
      only pregnant females.
          only data on blood pressure and heart rate data, along with the id number
          (for the whole sample).
          only the first 15 observations.
          
  
  Display the distribution of pregnant by vitamins.  Re-submit this table
  after adding the following:
      A new label for the vitamins variable:  Patient on vitamin therapy.
          New labels for the values of pregnant: label "1" as 3+ mos.pregnant
              and "0" as LT 3 mos. pregnant
          A title showing the source of data and a footnote reflecting
              the type of exercise.
           
 
 
                |  Home 
  IIIb. Data Exploration: Frequency Tables | NEXT 
  IVb. Data Manipulation: New variables | 
 
 
 |