Concept: Generalized Estimating Equations (GEE)
Last Updated: 2002-11-01
GEE was introduced by Liang and Zeger (1986) as a method for estimating regression model parameters from correlated data. Regression analysis with the GEE methodology is a common choice when the outcome measure of interest is discrete (e.g., binary or count data, possibly from a binomial or Poisson distribution) rather than continuous.
Distribution         Link Function
Normal               Identity
Poisson              Log
Negative Binomial    Log
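As a reading aid, the distribution and link pairings listed above can be written in terms of the mean of the outcome and the linear predictor. The symbols used here ($\mu$ for the mean, $x$ for the covariate vector, $\beta$ for the coefficient vector, $g$ for the link function) are notation assumed for illustration and do not come from the original text.

```latex
\begin{aligned}
\text{Normal (identity link):}       \quad & g(\mu) = \mu = x^{\top}\beta \\
\text{Poisson (log link):}           \quad & g(\mu) = \log \mu = x^{\top}\beta \\
\text{Negative binomial (log link):} \quad & g(\mu) = \log \mu = x^{\top}\beta
\end{aligned}
```

With a log link, each coefficient acts multiplicatively on the mean, which is why exponentiated coefficients are usually read as ratios.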
Example:
These fictitious data contain counts of the number of deaths and the population size in each Regional Health Authority (RHA) across successive years. The purpose of the subsequent analysis is to model mortality as a function of age and sex. Each unit comprises an age/sex/RHA sub-segment of the population.
id    year  agegrp  sex  rha      deaths  pop
1     1985  1       1    Central  15      4110
1     1986  1       1    Central  10      4141
1     1987  1       1    Central  4       4095
1     1988  1       1    Central  8       4146
1     1989  1       1    Central  9       4021
…     …     …       …    …        …       …
240   1995  10      2    Win      39      23305
240   1996  10      2    Win      47      23660
240   1997  10      2    Win      46      23794
240   1998  10      2    Win      40      24029
240   1999  10      2    Win      37      24778
EXAMPLE
To describe annual changes in the mortality of the RHAs in Manitoba after controlling for age and sex differences in the population, we use Poisson regression to model the number of deaths in each unit (i.e., age/sex/RHA sub-segment) as a function of the independent variables year, age, sex, and RHA. The log of the population size for each unit is included as an offset (a term whose coefficient is fixed at 1) in the model.
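The role of the offset can be made explicit with a short worked equation. This is a sketch under assumed notation: $i$ indexes the unit (id), $t$ the year, $x_{it}$ is the covariate vector, and $\beta$ the coefficient vector; none of these symbols appear in the original text.

```latex
\begin{aligned}
\log \mathrm{E}[\text{deaths}_{it}] &= \log(\text{pop}_{it}) + x_{it}^{\top}\beta \\
\Longleftrightarrow \quad
\log\!\left(\frac{\mathrm{E}[\text{deaths}_{it}]}{\text{pop}_{it}}\right) &= x_{it}^{\top}\beta
\end{aligned}
```

Because the log population enters with its coefficient fixed at 1, the model for death counts is equivalent to a model for the log mortality rate, and the coefficients can be read as log rate ratios.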
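    proc genmod data=geedata;
        class year agegrp sex rha id;
        /* Poisson model with log link; lpop = log(pop) is the offset */
        model deaths = year agegrp sex rha year*rha / dist=poisson link=log offset=lpop;
        /* exchangeable working correlation within each unit (id) */
        repeated subject=id / type=exch;
    run;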
Note: Lpop = log(population).
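The data listing shows pop but not lpop, so the offset variable may need to be created before PROC GENMOD is run. A minimal sketch, assuming the dataset is named geedata and the population count is stored in pop as in the listing; if lpop already exists on the dataset, this step is unnecessary.

```sas
/* Assumption: geedata holds the population count in the variable pop. */
data geedata;
    set geedata;
    lpop = log(pop);   /* natural log of population, used as the offset */
run;
```

SAS sets lpop to missing (with a note in the log) for any row where pop is zero or negative, and PROC GENMOD would then drop that row from the analysis.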
Now let us test the following hypotheses for the above model:
Hypothesis 1:
Is there a difference in the average mortality rate between 1995-1996 and 1998-1999 across the province? In other words, when age, sex, and region are held constant (or controlled for), are there differences in mortality between these two two-year time periods?
This corresponds to the following hypothesis:
$H_0: B_{99} + B_{98} = B_{96} + B_{95}$ or $H_0: (B_{99} + B_{98}) - (B_{96} + B_{95}) = 0$
The coefficients that will be used to specify the L vector are therefore:
             1999  1998  1997  1996  1995
Year=99         1     0     0     0     0
Year=98         0     1     0     0     0
A=(99+98)       1     1     0     0     0
Year=96         0     0     0     1     0
Year=95         0     0     0     0     1
B=(96+95)       0     0     0     1     1
C=(A-B)         1     1     0    -1    -1

The row C=(A-B) gives the coefficients of the year variable (main effect) for the ESTIMATE statement, which is added to the above SAS program just before the RUN statement. Thus, we have:
    proc genmod data=geedata;
        class year agegrp sex rha id;
        model deaths = year agegrp sex rha year*rha / dist=poisson link=log offset=lpop;
        repeated subject=id / type=exch;
        /* coefficients follow row C=(A-B): years 1999, 1998, 1997, 1996, 1995 */
        estimate 'Average diff. in rates between (98-99) and (95-96)'
            year 1 1 0 -1 -1 / divisor=2;
    run;
Note that we used the option divisor=2 because we are interested in the average over the two years in each period. Also, the years in the table above are arranged in descending order because we have assumed that the input dataset was set up so that PROC GENMOD treats 1995 as the reference category.
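The effect of divisor=2 can be written out as a worked equation. This sketch assumes the year coefficients are ordered 1999 down to 1995, as in the table above, and uses $L$ for the coefficient row after division.

```latex
\begin{aligned}
B &= (B_{99},\ B_{98},\ B_{97},\ B_{96},\ B_{95})^{\top} \\
L &= \tfrac{1}{2}\,(1,\ 1,\ 0,\ -1,\ -1) \\
L\,B &= \frac{B_{99} + B_{98}}{2} - \frac{B_{96} + B_{95}}{2}
\end{aligned}
```

The estimate is therefore the average coefficient for 1998-1999 minus the average coefficient for 1995-1996. Since the model uses a log link, $\exp(L\,B)$ can be read as a rate ratio comparing the two periods, with the other terms in the model held fixed.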
Hypothesis 2:
Is there a difference in mortality rates for males and females for 1995 to 1996 and 1998 to 1999 across the province? In other words, when age and region are held constant (or controlled for), are the differences in mortality for males and females different between these two two-year time periods?
This corresponds to the following hypothesis:
$H_0: (B_{M*99} + B_{M*98}) - (B_{M*96} + B_{M*95}) = (B_{F*99} + B_{F*98}) - (B_{F*96} + B_{F*95})$ or
$H_0: (B_{M*99} + B_{M*98}) - (B_{M*96} + B_{M*95}) - (B_{F*99} + B_{F*98}) + (B_{F*96} + B_{F*95}) = 0$
Here M*xx denotes the male-by-year interaction for year xx, and F*xx the female-by-year interaction for the same year. The coefficients that will be used to specify the L vector are therefore:
             M*99  F*99  M*98  F*98  M*97  F*97  M*96  F*96  M*95  F*95
M               1     0     1     0     0     0    -1     0    -1     0
F               0     1     0     1     0     0     0    -1     0    -1
Diff (M-F)      1    -1     1    -1     0     0    -1     1    -1     1

The following programming statements are used to create a test of this hypothesis:
    proc genmod data=geedata;
        class year agegrp sex id;
        model deaths = year agegrp sex year*sex / dist=poisson link=log offset=lpop;
        repeated subject=id / type=exch;
        /* coefficients follow row Diff (M-F): M*99 F*99 M*98 F*98 ... M*95 F*95 */
        estimate 'Male to Female ratios (1999 - 1998) vs (1996 - 1995)'
            year*sex 1 -1 1 -1 0 0 -1 1 -1 1 / divisor=2;
    run;
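Both hypotheses depend on the coefficients being listed in the same order that PROC GENMOD uses for the class levels, and the estimates are reported on the log scale. The sketch below repeats the Hypothesis 2 model with two additional ESTIMATE options, E and EXP, which may help with both points. The variable names (agegrp, sex, lpop) are assumed from the data listing and the earlier note, and the label text is illustrative only.

```sas
/* Sketch only: same model as above, with two extra ESTIMATE options.   */
/* Assumes variable names agegrp, sex, and lpop as used in this page.   */
proc genmod data=geedata;
    class year agegrp sex id;
    model deaths = year agegrp sex year*sex / dist=poisson link=log offset=lpop;
    repeated subject=id / type=exch;
    /* e   : display the L vector so the coefficient order can be checked */
    /* exp : also report the exponentiated estimate (ratio scale)         */
    estimate 'M-F difference, (1998-1999) vs (1995-1996)'
        year*sex 1 -1 1 -1 0 0 -1 1 -1 1 / divisor=2 e exp;
run;
```

If the displayed L vector does not match the intended row of the coefficient table, the order of the class levels in the dataset differs from what was assumed, and the coefficients in the ESTIMATE statement need to be rearranged accordingly.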