Max Rady College of Medicine

Concept: Generalized Estimating Equations (GEE)

Concept Description

Last Updated: 2002-11-01

Introduction

    The work presented in this concept is based on that carried out by Carriere et al. (2000); for details regarding the programming for that project, please see Across Time & Space.

    Comparing utilization rates across quintile groups or regions is traditionally done using the direct standardization approach that adjusts for confounding discrete factors such as age and sex. A model-based approach, on the other hand, can adjust for continuous as well as discrete factors and the GEE method of parameter estimation specifically is more efficient for statistical hypothesis testing with correlated longitudinal data.

    Carriere et al. (2000) used the GEE method to analyze variation in hospital utilization rates for Winnipeg residents across fiscal years 1989/90 to 1996/97 and across income quintiles. It was also used to analyze variation in health status (morbidity and mortality) measures across three healthiness regions of Winnipeg and non-Winnipeg in the deliverable "Why is the Health of Some Manitobans not Improving?"

    Generalized Estimating Equations (GEE) are methods of parameter estimation for correlated data. When data are collected on the same units across successive points in time, these repeated observations are correlated over time. If this correlation is not taken into account, the standard errors of the parameter estimates will not be valid and the results of hypothesis tests may not be replicable.

    There is often some confusion between estimating equations and models. A regression model is made up of the dependent variable, the independent variables and the random error. Methods of parameter estimation like GEE are different from the models themselves.

    Most researchers who will use GEE are concerned with the issues of model specification and hypothesis testing, not with the methods used to estimate the parameters. The discussion below is not limited to these two issues.

Generalized Linear Models and GEE

    A generalized linear model (GLM) is a regression model in which the probability distribution of the dependent variable is a member of the exponential family. Examples of such probability distributions are the normal, Poisson, binomial, and negative binomial (the last for a fixed dispersion parameter). To define a GLM one needs to define the following:

    • The distribution of the dependent variable, which must be a member of the exponential family
    • The link function
    • The independent variables

    Some examples are provided below:
    Distribution         Link Function
    Normal               Identity
    Poisson              Log
    Negative Binomial    Log
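
    The three ingredients listed above can be written compactly. The notation below is the standard GLM formulation and is an assumption added here for illustration, not notation taken from the source project: $Y_i$ is the dependent variable for unit $i$, $x_i$ its vector of independent variables, $\beta$ the vector of regression parameters, and $g$ the link function.

```latex
\[
  E(Y_i) = \mu_i, \qquad g(\mu_i) = x_i^{\top}\beta
\]
\[
  \text{identity link: } g(\mu) = \mu,
  \qquad
  \text{log link: } g(\mu) = \log \mu
\]
```

    With the log link, a one-unit change in an independent variable multiplies the expected value by $\exp(\beta)$, which is why the log link is the usual choice for counts and rates.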

    GEE was introduced by Liang and Zeger (1986) as a method of estimating regression model parameters when dealing with correlated data. Regression analysis with the GEE methodology is a common choice when the outcome measure of interest is discrete (e.g., binary or count data, possibly from a binomial or Poisson distribution) rather than continuous.

    To define a regression model using the GEE methodology, one needs to define the following:

    • The distribution of the dependent variable (which must be a member of the exponential family)
    • The link function
    • The independent variables
    • The covariance structure of the repeated measurements.
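
    As a sketch of how the fourth ingredient enters the estimation, the estimating equations can be written as below. The symbols are standard notation assumed here for illustration: $y_i$ and $\mu_i$ are the vectors of observed and expected responses for cluster $i$ (of $K$ clusters), $D_i = \partial\mu_i/\partial\beta$, $A_i$ is the diagonal matrix of variance functions, $\phi$ is a scale parameter, and $R(\alpha)$ is the working correlation matrix chosen by the analyst.

```latex
\[
  U(\beta) = \sum_{i=1}^{K} D_i^{\top} V_i^{-1} (y_i - \mu_i) = 0,
  \qquad
  V_i = \phi\, A_i^{1/2} R(\alpha)\, A_i^{1/2}
\]
\[
  \text{exchangeable working correlation: }
  R(\alpha)_{jk} =
  \begin{cases}
    1      & j = k \\
    \alpha & j \neq k
  \end{cases}
\]
```

    The exchangeable structure, in which every pair of repeated measurements on a cluster shares one correlation $\alpha$, is the one requested by type=exch in the SAS examples in this concept.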

Clustering

    Usually a longitudinal study observes subjects over time and traces person-specific change or growth. However, person-specific correlation and change are less relevant for studying change in health care utilization in the population over time; furthermore, the correlations over time are often found to be quite small and are assumed negligible (Carriere et al., 2000). Given negligible patient-specific correlations, we seek to create clusters whose units are highly homogeneous within each subgroup. For example, utilization by 0-2 year olds in one year is likely to be similar to the utilization by 0-2 year olds in the other years, and so on. This grouping definition creates clusters that are highly alike with respect to utilization across time. Summarizing the data in this way does not result in much loss of information about utilization and allows analysis of very large data files.

    In applying the GEE method to hospital utilization data (Carriere et al., 2000), individuals were stratified by age (0, 1, 2, ..., 85+), sex, income quintile, and time of measurement (FY 89/90 - 96/97), recording the average number of events (average number of hospital discharges, average hospital days, and so forth). Hence, the data formed 86 x 2 x 5 = 860 clusters identified by age group/sex/income quintile, which were assumed to be independent of each other.
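
    The stratification described above can be sketched in SAS as follows. This is only an illustration under stated assumptions, not the program used in the original project: the data set persons (one record per person per fiscal year) and the variables age (already top-coded at 85+), sex, quintile, fyear and discharges are hypothetical names.

```sas
/* Hypothetical input: one record per person per fiscal year.          */
/* All data set and variable names here are illustrative assumptions.  */
proc summary data=persons nway;
  class age sex quintile fyear;          /* stratification variables    */
  var discharges;                        /* events recorded per person  */
  output out=clusters (drop=_type_ rename=(_freq_=pop))
         mean=avg_discharges;            /* average events per stratum  */
run;
```

    Each age/sex/quintile combination in the output then forms one cluster, with one record per fiscal year serving as its repeated measurements.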

    In applying the GLM framework with GEE estimation to health status measures, individuals were stratified by age, sex, region, and year. The total number of events (e.g. total deaths, hospitalizations for AMI) was obtained for age group/sex/region clusters. Age was in 5-year groupings.

    This approach goes beyond simple age-sex adjustment in supporting analyses of appropriate age-sex specific data, dealing with changes in the overall size of the population served, and accommodating potential changes in its health status over time. The data need only be divided into subgroups fine enough to include such key characteristics as age, sex, income level or region of residence and to satisfy the large sample assumption in the approach. The approach can easily be extended to include other important individual-level variables. This strategy adjusts for both continuous and dichotomous covariates while accommodating longitudinal measurements for each cluster.

Analyzing Longitudinal Data Using SAS

    The GENMOD procedure in SAS uses GEE methodology to estimate the regression parameters.

    Before this procedure can be implemented, the data set needs to be structured in such a way that SAS recognizes that repeated observations are present for each unit. Each record corresponds to the measure(s) for a single unit at only one point in time. For example, five repeated measurements for the i-th unit are stored as five different records. An identification variable (id) is needed to link measurements to units, and a time variable (year) is used to order the successive measurements for each unit.
    Example:

    This fictitious data set contains counts of the number of deaths and the population size in each Regional Health Authority (RHA) across successive years. The purpose of the subsequent analysis is to model mortality as a function of age and sex. Each unit comprises an age/sex/RHA sub-segment of the population.

    id   year  agegrp  sex  rha      deaths  pop
    1    1985  1       1    Central  15      4110
    1    1986  1       1    Central  10      4141
    1    1987  1       1    Central  4       4095
    1    1988  1       1    Central  8       4146
    1    1989  1       1    Central  9       4021
    240  1995  10      2    Win      39      23305
    240  1996  10      2    Win      47      23660
    240  1997  10      2    Win      46      23794
    240  1998  10      2    Win      40      24029
    240  1999  10      2    Win      37      24778
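
    A minimal sketch of how records in this layout could be read into SAS is shown below. It uses four of the fictitious records from the example; the data set name geedata matches the later examples, and everything else is illustrative.

```sas
/* Read a few of the fictitious records into a long-format data set:  */
/* one record per unit (id) per year.                                  */
data geedata;
  input id year agegrp sex rha $ deaths pop;
  datalines;
1 1985 1 1 Central 15 4110
1 1986 1 1 Central 10 4141
240 1998 10 2 Win 40 24029
240 1999 10 2 Win 37 24778
;
run;

/* Keep the records for each unit together and in time order. */
proc sort data=geedata;
  by id year;
run;
```

    Sorting by id and year keeps the repeated measurements for each unit contiguous, which makes the structure easy to check before modelling.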

Hypothesis Testing and Model Specification

    One of the goals of inferential statistics is to define research questions that will enable us to draw conclusions about the population. For example, in a longitudinal study of mortality, one might wish to investigate whether the average mortality rate in one RHA is different from the average mortality rate in another RHA, and whether that difference has changed over time. This involves more than just looking at the regression parameters that are produced by the selected analysis procedure; it involves specifying a hypothesis of interest and identifying the regression parameters needed to test that hypothesis.

    When models are expressed within the GLM framework, hypotheses can be tested on linear functions of the parameters. For example, suppose $B_1$ represents the relative rate (risk) of mortality for Central RHA in 1999 and $B_2$ represents the relative rate (risk) of mortality for Winnipeg RHA in the same year. Testing for a difference in mortality risk between the two RHAs in this single year requires specifying the hypothesis $H_0: B_1 = B_2$, or equivalently $H_0: B_1 - B_2 = 0$. In general, the coefficients for a linear hypothesis are some set of $L$s such that $H_0: L_0 B_0 + L_1 B_1 + \dots + L_k B_k = 0$.
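
    One common way to test a linear hypothesis of this kind is a Wald test, sketched below for a single hypothesis. The notation is assumed for illustration: $\hat{B}$ is the vector of estimated parameters, $\hat{\Sigma}$ its estimated covariance matrix, and $L$ the row vector of coefficients.

```latex
\[
  H_0: LB = 0,
  \qquad
  \chi^2 = \frac{(L\hat{B})^2}{L\,\hat{\Sigma}\,L^{\top}}
  \;\approx\; \chi^2_1 \quad \text{under } H_0 \text{ in large samples}
\]
\[
  \text{e.g. } B = (B_0, B_1, B_2)^{\top},\;
  L = (0,\; 1,\; -1)
  \;\Rightarrow\; LB = B_1 - B_2
\]
```

    The second line shows how the two-RHA comparison described above corresponds to a particular choice of $L$.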

    In SAS, hypothesis testing can be carried out using either CONTRAST or ESTIMATE statements. ESTIMATE statements produce both an estimate and a statistical test of linear combinations of fixed effects. CONTRAST statements only produce the statistical test results. In order to use these statements, one needs to specify an L vector for testing the hypothesis $H_0: LB = 0$.

    We will use ESTIMATE statements in the following examples because they provide more complete information for estimating the magnitude of differences between parameter estimates; this is often of great interest when presenting the results of statistical analyses in a policy environment.
    EXAMPLE

    To describe annual changes in the mortality of the RHAs in Manitoba after controlling for age and sex differences in the population, we use Poisson regression to model counts of the numbers of deaths in each unit (i.e., age/sex/RHA sub-segment) as a function of the independent variables year, age, sex, and RHA. The log of the population size for each cluster is included as an offset (a term whose coefficient is fixed at 1) in the model, so that the model describes mortality rates rather than raw counts.
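
    The role of the offset can be seen by writing the model out. In this sketch the notation is assumed for illustration: $i$ indexes units, $t$ indexes years, and $x_{it}$ is the vector of indicator variables for year, age group, sex, RHA and the year-by-RHA interaction.

```latex
\[
  \log E(\text{deaths}_{it}) = \log(\text{pop}_{it}) + x_{it}^{\top} B
  \quad\Longleftrightarrow\quad
  \log\!\left( \frac{E(\text{deaths}_{it})}{\text{pop}_{it}} \right) = x_{it}^{\top} B
\]
```

    Because the offset moves to the left-hand side as a denominator, each parameter describes a difference in log mortality rates, and its exponential is a rate ratio.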

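    proc genmod data=geedata;
      class year agegrp sex rha id;
      /* Poisson regression with a log link; lpop = log(pop) is the offset */
      model deaths = year agegrp sex rha year*rha / dist=p
                                                    link=log
                                                    offset=lpop;
      /* repeated measures within each id, exchangeable working correlation */
      repeated subject=id / type=exch;
    run;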

    Note: lpop = log(pop), the natural logarithm of the population size.
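
    The offset variable has to exist in the data set before PROC GENMOD is run. A minimal sketch, assuming geedata already contains a strictly positive population count in the variable pop:

```sas
/* Add the offset variable to the analysis data set. */
data geedata;
  set geedata;
  lpop = log(pop);   /* LOG is the natural logarithm in SAS */
run;
```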

    Now let us test the following hypotheses for the above model:

    Hypothesis 1:
    Is there a difference in the average mortality rate between 1995 to 1996 and 1998 to 1999 across the province? In other words, when age, sex, and region are held constant (or controlled for), are there differences in mortality between these two two-year time periods?

    This corresponds to the following hypothesis:
    $H_0: B_{99} + B_{98} = B_{96} + B_{95}$ or $H_0: (B_{99} + B_{98}) - (B_{96} + B_{95}) = 0$

    The coefficients that will be used to specify the L vector are therefore:

                 1999  1998  1997  1996  1995
    Year=99         1     0     0     0     0
    Year=98         0     1     0     0     0
    A=(99+98)       1     1     0     0     0
    Year=96         0     0     0     1     0
    Year=95         0     0     0     0     1
    B=(96+95)       0     0     0     1     1
    C=(A-B)         1     1     0    -1    -1

    The row C=(A-B) gives the coefficients of the year variable (main effect) for the ESTIMATE statement, which is added to the above SAS program just before the RUN statement. Thus, we have:

    proc genmod data=geedata;
      class year agegrp sex rha id;
      model deaths = year agegrp sex rha year*rha / dist=p
                                                    link=log
                                                    offset=lpop;
      repeated subject=id / type=exch;
      /* coefficients from row C=(A-B); divisor=2 averages over the two years */
      estimate 'Average diff. in rates between (98-99) and (95-96)' year 1 1 0 -1 -1 / divisor=2;
    run;

    Note that we have used the option divisor=2 because we are interested in the average over the two years in each period. In the table above, the years are arranged in descending order because we have assumed that the input data set has been programmed such that PROC GENMOD treats 1995 as the reference category.
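
    As a hedged variation on the program above, the same contrast can be written with explicit halves instead of divisor=2, and the EXP option of the ESTIMATE statement can be added to report the exponentiated estimate. The sketch assumes the variable names of the example data set and the same descending ordering of the year levels; the coefficient order must match the order in which PROC GENMOD lists the levels of year in its class level information.

```sas
proc genmod data=geedata;
  class year agegrp sex rha id;
  model deaths = year agegrp sex rha year*rha / dist=p link=log offset=lpop;
  repeated subject=id / type=exch;
  /* Same contrast written with explicit halves instead of divisor=2.  */
  /* EXP additionally reports the exponentiated estimate, which under  */
  /* the log link can be read as a rate ratio.                         */
  estimate 'Avg. diff. (98-99) vs (95-96)' year 0.5 0.5 0 -0.5 -0.5 / exp;
run;
```

    Reporting the exponentiated value is often convenient because a ratio of rates is easier to communicate than a difference in log rates.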

    Hypothesis 2:
    Is there a difference in mortality rates for males and females for 1995 to 1996 and 1998 to 1999 across the province? In other words, when age and region are held constant (or controlled for), are the differences in mortality for males and females different between these two two-year time periods?

    This corresponds to the following hypothesis:
    $H_0: (B_{M*99} + B_{M*98}) - (B_{M*96} + B_{M*95}) = (B_{F*99} + B_{F*98}) - (B_{F*96} + B_{F*95})$ or
    $H_0: (B_{M*99} + B_{M*98}) - (B_{M*96} + B_{M*95}) - (B_{F*99} + B_{F*98}) + (B_{F*96} + B_{F*95}) = 0$

    M*(Year=xx) is the male by the given year interaction and F*(Year=xx) is for the female by year interaction. The coefficients that will be used to specify the L vector are therefore:

                  M*99  F*99  M*98  F*98  M*97  F*97  M*96  F*96  M*95  F*95
    M                1     0     1     0     0     0    -1     0    -1     0
    F                0     1     0     1     0     0     0    -1     0    -1
    Diff (M-F)       1    -1     1    -1     0     0    -1     1    -1     1

    The following programming statements are used to create a test of this hypothesis:

    proc genmod data=geedata;
      class year agegrp sex id;
      model deaths = year agegrp sex year*sex / dist=p
                                                link=log
                                                offset=lpop;
      repeated subject=id / type=exch;
      /* coefficients from row Diff (M-F) of the table above */
      estimate 'Male to Female ratios (1998-1999) vs (1995-1996)' year*sex 1 -1 1 -1 0 0 -1 1 -1 1 / divisor=2;
    run;
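
    If only the test result is needed, the same L vector can be supplied to a CONTRAST statement instead. The sketch below assumes the variable names of the example data set and the same ordering of the year and sex levels as in the table of coefficients; the label text is illustrative.

```sas
proc genmod data=geedata;
  class year agegrp sex id;
  model deaths = year agegrp sex year*sex / dist=p link=log offset=lpop;
  repeated subject=id / type=exch;
  /* Test only, no estimate. WALD requests a Wald statistic rather    */
  /* than the score statistic used by default for GEE models.         */
  contrast 'M-F difference, (98-99) vs (95-96)'
           year*sex 1 -1 1 -1 0 0 -1 1 -1 1 / wald;
run;
```

    No divisor is needed here because rescaling the coefficients does not change the test of whether the linear combination is zero.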

References 

  • Carriere KC, Roos LL, Dover DC. Across time and space: Variations in hospital use during Canadian health reform. Health Services Research 2000;35(2):467-487.
  • Horrocks J. Generalized linear models (Generalized estimating equations notes from workshop, MCHP, Winnipeg, MB, January 5, 1997).
  • Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika 1986;73(1):13-22.

Keywords 

  • health status
  • longitudinal studies
  • morbidity
  • mortality
  • statistics


Contact us

Manitoba Centre for Health Policy
Community Health Sciences, Max Rady College of Medicine,
Rady Faculty of Health Sciences,
Room 408-727 McDermot Ave.
University of Manitoba
Winnipeg, MB R3E 3P5 Canada

204-789-3819