Concept: Generalized Estimating Equations (GEE)

Introduction

Carriere et al. (2000)

Across Time & Space

Carriere et al. (2000)

Generalized Linear Models and GEE

The distribution of the dependent variable, which must be a member of the exponential family
The link function
The independent variables

Distribution	Link Function
Normal	Identity
Poisson	Log
Negative Binomial	Log

Liang and Zeger (1986)

The distribution of the dependent variable (which must be a member of the exponential family)
The link function
The independent variables
The covariance structure of the repeated measurements.
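
For reference, the four components above can be written in the notation commonly used for GEE. This is a sketch under assumed notation, none of which is defined in this concept: y_ij is the j-th repeated measurement on subject i, mu_ij is its mean, g is the link function, A_i is the diagonal matrix of variance functions, R(alpha) is the working correlation matrix, and phi is a scale parameter.

```latex
\begin{align*}
% Mean model: the link function applied to the mean equals the linear predictor
g(\mu_{ij}) &= x_{ij}^{\top}\beta \\
% Working covariance of the repeated measurements on subject i
V_i &= \phi \, A_i^{1/2} R(\alpha) A_i^{1/2} \\
% Estimating equations solved for beta (Liang and Zeger, 1986)
\sum_{i=1}^{K} D_i^{\top} V_i^{-1} (y_i - \mu_i) &= 0,
\qquad D_i = \frac{\partial \mu_i}{\partial \beta}
\end{align*}
```

Under an exchangeable working correlation (the type=exch option used in the SAS programs in this concept), every pair of distinct measurements on the same subject is assumed to share one correlation, that is, Corr(y_ij, y_ik) = alpha for j not equal to k.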

Clustering

Analyzing Longitudinal Data Using SAS

Example:

This fictitious data contains counts of the number of deaths and population size in each Regional Health Authority (RHA) across successive years. The purpose of the subsequent analysis is to model mortality as a function of age and sex. Each unit comprises an age/sex/RHA sub-segment of the population.

id	year	agegrp	sex	rha	deaths	pop
1	1985	1	1	Central	15	4110
1	1986	1	1	Central	10	4141
1	1987	1	1	Central	4	4095
1	1988	1	1	Central	8	4146
1	1989	1	1	Central	9	4021
…	…	…	…	…	…	…
240	1995	10	2	Win	39	23305
240	1996	10	2	Win	47	23660
240	1997	10	2	Win	46	23794
240	1998	10	2	Win	40	24029
240	1999	10	2	Win	37	24778

Hypothesis Testing and Model Specification

EXAMPLE

To describe annual changes in mortality across the RHAs in Manitoba after controlling for age and sex differences in the population, we use Poisson regression to model the number of deaths in each unit (i.e., age/sex/RHA sub-segment) as a function of year, age group, sex, and RHA. The log of the population size for each unit is included as an offset (a term whose coefficient is fixed at 1), so that the model describes death rates rather than raw counts.
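
As a sketch of why the offset turns a model for counts into a model for rates, the Poisson model can be written as follows. The notation is assumed for illustration: mu_it is the expected number of deaths in unit i in year t, pop_it is the matching population size, and x_it holds the covariates.

```latex
\begin{align*}
\text{deaths}_{it} &\sim \text{Poisson}(\mu_{it}) \\
% The offset log(pop) enters the linear predictor with its coefficient fixed at 1
\log(\mu_{it}) &= \log(\text{pop}_{it}) + x_{it}^{\top}\beta \\
% Moving the offset to the left-hand side gives a model for the log death rate
\log\!\left(\frac{\mu_{it}}{\text{pop}_{it}}\right) &= x_{it}^{\top}\beta
\end{align*}
```

Each coefficient is therefore a difference in log death rates, and its exponential is a rate ratio.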

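proc genmod data=geedata;
   /* Categorical predictors and the subject identifier */
   class year agegrp sex rha id;
   /* Poisson model with log link; lpop is the offset */
   model deaths = year agegrp sex rha year*rha / dist=p
                                                  link=log
                                                  offset=lpop;
   /* Repeated measurements within id, exchangeable working correlation */
   repeated subject=id / type=exch;
run;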

Note: lpop = log(pop), the natural log of the population size. It must exist in the dataset before PROC GENMOD is run.
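
A minimal sketch of how the offset variable could be created, assuming the counts are held in a dataset called rawdata (a hypothetical name, not given in this concept) with the pop column shown in the example data:

```sas
/* Sketch only: "rawdata" is a hypothetical name for the input dataset */
data geedata;
   set rawdata;
   lpop = log(pop);   /* natural log of the population size, used as the offset */
run;
```

In SAS, log() is the natural logarithm, which matches the log link used in the model.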

Now let us test the following hypotheses for the above model:

Hypothesis 1:
Is there a difference in the average mortality rate for 1995 to 1996 and 1998 to 1999 across the province? In other words, when age, sex, and region are held constant (or controlled for), are there differences in mortality between these two two-year time periods?

This corresponds to the following hypothesis:
H₀ : B₉₉ + B₉₈ = B₉₆ + B₉₅ or H₀ : (B₉₉ + B₉₈) - (B₉₆ + B₉₅) = 0

The coefficients that will be used to specify the L vector are therefore:

	1999	1998	1997	1996	1995
Year=99	1	0	0	0	0
Year=98	0	1	0	0	0
A=(99+98)	1	1	0	0	0
Year=96	0	0	0	1	0
Year=95	0	0	0	0	1
B=(96+95)	0	0	0	1	1
C=(A-B)	1	1	0	-1	-1

The row C=(A-B) gives the coefficients of the year variable (main effect) for the ESTIMATE statement, which is added to the above SAS program just before the RUN statement. Thus, we have:

proc genmod data=geedata;
   class year agegrp sex rha id;
   model deaths = year agegrp sex rha year*rha / dist=p
                                                  link=log
                                                  offset=lpop;
   repeated subject=id / type=exch;
   /* Row C=(A-B): coefficients for years 1999, 1998, 1997, 1996, 1995 */
   estimate 'Average diff. in rates between (98-99) and (95-96)' year 1 1 0 -1 -1 / divisor=2;
run;

Note that we have used the option divisor=2 because we are interested in the average of the two years in each period. In the table above, the years are arranged in descending order because we assume that the input dataset has been set up so that PROC GENMOD treats 1995 as the reference category.
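
To make the role of divisor=2 concrete, the quantity tested by the ESTIMATE statement can be sketched as below. The symbol theta is introduced here for illustration only; the B terms are the year coefficients from Hypothesis 1, which are on the log-rate scale because the model uses a log link.

```latex
\begin{align*}
% Row C=(A-B) applied to the year coefficients, divided by 2
\theta &= \tfrac{1}{2}\left[(B_{99} + B_{98}) - (B_{96} + B_{95})\right] \\
% Equivalently, the mean of the two later years minus the mean of the two earlier years
       &= \frac{B_{99} + B_{98}}{2} - \frac{B_{96} + B_{95}}{2}
\end{align*}
```

H₀ corresponds to theta = 0. Because theta is a difference of averaged log rates, exp(theta) can be read as a rate ratio comparing 1998 to 1999 with 1995 to 1996, and a value of 1 indicates no difference.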

Hypothesis 2:
Is there a difference in mortality rates for males and females for 1995 to 1996 and 1998 to 1999 across the province? In other words, when age and region are held constant (or controlled for), are the differences in mortality for males and females different between these two two-year time periods?

This corresponds to the following hypothesis:
H₀ : (B_M*99 + B_M*98 ) - (B_M*96 + B_M*95 ) = (B_F*99 + B_F*98 ) - (B_F*96 + B_F*95 ) or
H₀ : (B_M*99 + B_M*98 ) - (B_M*96 + B_M*95 ) - (B_F*99 + B_F*98 ) + (B_F*96 + B_F*95 ) = 0

M*(Year=xx) denotes the male-by-year interaction for the given year, and F*(Year=xx) the corresponding female-by-year interaction. The coefficients that will be used to specify the L vector are therefore:

	M*99	F*99	M*98	F*98	M*97	F*97	M*96	F*96	M*95	F*95
M	1	0	1	0	0	0	-1	0	-1	0
F	0	1	0	1	0	0	0	-1	0	-1
Diff (M-F)	1	-1	1	-1	0	0	-1	1	-1	1

The following programming statements are used to create a test of this hypothesis:

proc genmod data=geedata;
   class year agegrp sex id;
   model deaths = year agegrp sex year*sex / dist=p
                                              link=log
                                              offset=lpop;
   repeated subject=id / type=exch;
   /* Row Diff (M-F): coefficients in the order M*99 F*99 M*98 F*98 ... M*95 F*95 */
   estimate 'Male to Female ratios (1999 - 1998) vs (1996 - 1995)' year*sex 1 -1 1 -1 0 0 -1 1 -1 1 / divisor=2;
run;
