Concept: Statistics Canada Linked Data: Notes from the Field
Last Updated: 1999-11-10
The most important part of working with this particular linked dataset, and probably datasets in general, is understanding what the variables mean and how they are coded. This is aided by studying the codebook, where available, and by running frequency tables of categorical and ordinal variables and means/medians of continuous variables.
The codebook describes (or should describe) the name of each variable, what it is supposed to measure, and the number of levels or range of the values the variable takes on in the dataset. This will tell you, for example, if sex is coded as M and F, or 0 and 1, or 1 and 2, or 1, 2 and 9 etc. The codebook for the linked Census data tells you that the income variables actually refer to 1985 income, even though the Census was taken in June of 1986. This is important to keep in mind when analyzing the data.
One-way or two-way frequency tables not only give information on how the variables are distributed, but also, like the codebook, can tell you how many levels each categorical variable has (and if this differs from the codebook). The presence of missing values can be identified at this stage. Further investigation can reveal if these are "true" missing values, i.e. the data are simply not available, or if the value is missing because the question is not applicable to the situation (e.g. an occupation code for a retired person). It is hoped that a category such as "Not Applicable" is reserved for the latter case.
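As a minimal sketch of this kind of check, the following builds a one-way frequency table and compares the observed levels against the codebook. The variable, its codes (1, 2, 9) and the `CODEBOOK_LEVELS` set are hypothetical, chosen only to illustrate the idea.

```python
from collections import Counter

# Hypothetical coding: 1 = male, 2 = female; None marks an absent value.
# The code 9 is deliberately NOT in the (hypothetical) codebook.
sex = [1, 2, 2, 1, 9, 2, None, 1]
CODEBOOK_LEVELS = {1, 2}

def frequency_table(values):
    """Return {value: (count, percent)}, counting missing values too."""
    counts = Counter(values)
    n = len(values)
    return {value: (count, 100.0 * count / n) for value, count in counts.items()}

table = frequency_table(sex)

# Levels that appear in the data but not in the codebook.
undocumented = {v for v in table if v is not None and v not in CODEBOOK_LEVELS}
n_missing = table.get(None, (0, 0.0))[0]

for value, (count, pct) in sorted(table.items(), key=lambda kv: str(kv[0])):
    print(f"{value!s:>5} {count:3d} {pct:5.1f}%")
print("levels not in codebook:", undocumented)
print("missing values:", n_missing)
```

A level that shows up here but not in the codebook (9 in this made-up example) is a prompt to go back to the documentation before analysis.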
For the Statistics Canada dataset, another source of information is the Census form itself. Having access to the actual questions asked during the Census can be helpful in interpreting more complex concepts, such as employment status or living arrangements. Several questions related to employment were asked, each one revealing a different aspect of the respondent's situation. Identifying those who met our definition of "unemployed" (not currently working, but able to work and actively seeking employment) required combining the responses to two questions into one composite variable. Some concepts, such as "living with someone employed full-time", required a search of all records in each household before the variable could be defined.
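The two derived variables described above might be built along the following lines. This is only a sketch: the field names (`worked_last_week`, `looked_for_work`, `full_time`) and the choice of which two questions are combined are assumptions, not the actual Census items.

```python
from collections import defaultdict

# Hypothetical person records; "hh" is the household identifier.
people = [
    {"hh": 1, "id": 1, "worked_last_week": False, "looked_for_work": True,  "full_time": False},
    {"hh": 1, "id": 2, "worked_last_week": True,  "looked_for_work": False, "full_time": True},
    {"hh": 2, "id": 3, "worked_last_week": False, "looked_for_work": False, "full_time": False},
]

def is_unemployed(person):
    """Composite of two (assumed) questions: not working, but seeking work."""
    return (not person["worked_last_week"]) and person["looked_for_work"]

# Group the records by household so each household can be searched once.
by_household = defaultdict(list)
for person in people:
    by_household[person["hh"]].append(person)

for person in people:
    person["unemployed"] = is_unemployed(person)
    # True if any OTHER member of the same household is employed full-time.
    person["lives_with_full_time"] = any(
        other["full_time"]
        for other in by_household[person["hh"]]
        if other["id"] != person["id"]
    )

for person in people:
    print(person["id"], person["unemployed"], person["lives_with_full_time"])
```

Note that the household search excludes the person's own record, so a full-time worker living alone is not flagged as living with one.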
The information from the Census was split into several files. Household-specific information was stored in a separate file, and person-specific information in another. This was done to avoid needless repetition of the household variables on each record in the person-based dataset. In practice, the two files are merged together using common identifiers (Enumeration Area and household number) before use. After the merge, concepts such as Urban/Rural residence, owner-occupied vs. rental property and type of heating can be used.
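The merge can be sketched as a lookup of each person record's household by the pair of identifiers. The identifier values and variable names below are hypothetical; in practice this step would be done in whatever package holds the files.

```python
# Hypothetical household file, keyed by (Enumeration Area, household number).
households = {
    ("EA0101", 1): {"urban": True, "owner_occupied": True},
    ("EA0101", 2): {"urban": True, "owner_occupied": False},
}

# Hypothetical person file; each record carries the same two identifiers.
person_file = [
    {"ea": "EA0101", "hh_no": 1, "person_no": 1, "age": 41},
    {"ea": "EA0101", "hh_no": 1, "person_no": 2, "age": 39},
    {"ea": "EA0101", "hh_no": 2, "person_no": 1, "age": 67},
    {"ea": "EA0202", "hh_no": 5, "person_no": 1, "age": 23},  # no household record
]

merged, unmatched = [], []
for person in person_file:
    household = households.get((person["ea"], person["hh_no"]))
    if household is None:
        unmatched.append(person)  # keep these aside for inspection
    else:
        # Copy the household variables onto the person record.
        merged.append({**person, **household})

print(len(merged), "merged;", len(unmatched), "unmatched")
```

Keeping the unmatched records in a separate list, rather than silently dropping them, makes it easy to check that the merge behaved as expected.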
Sample Weights and Clustered Data
The Census information comes from the long form of the Census, the so-called 2B form. Because the 2B is only distributed to one in five households, it is technically a sample, though a very large one. From this sample, another sample was drawn, which was linked to the Manitoba Health data. For this reason, sampling weights are present on the linked dataset. These weights reflect the different probabilities of selection of each individual in the province. Any analysis done on the data should use these weights so the resulting means and percentages reflect the whole population, not just this particular sample. Over-sampling of certain groups likely to be under-represented in a simple random sample is often done to give a sample size large enough for analysis. If the data are analyzed unweighted, these over-sampled groups will have more influence on the estimates than their share of the population warrants.
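A small made-up example shows why the weights matter. Here group A is over-sampled, so each of its records represents fewer people than a group B record; the values and weights are invented for illustration.

```python
# Hypothetical records: (has_condition, sampling_weight).
# The weight is the number of people in the population each record represents.
records = [
    (1, 10.0), (1, 10.0),   # group A, over-sampled: 10 people per record
    (0, 40.0), (0, 40.0),   # group B: 40 people per record
]

def unweighted_mean(recs):
    """Treat every record as if it represented the same number of people."""
    return sum(value for value, _ in recs) / len(recs)

def weighted_mean(recs):
    """Weight each record by the number of people it represents."""
    total_weight = sum(weight for _, weight in recs)
    return sum(value * weight for value, weight in recs) / total_weight

print("unweighted:", unweighted_mean(records))
print("weighted:  ", weighted_mean(records))
```

Unweighted, half the sample has the condition; weighted, the estimate falls to one fifth, because group A makes up only 20 of the 100 people represented.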
In practice, a "normalized" version of the sampling weights is used. The weights are normalized by dividing the weight for each person by the mean weight over the whole sample. This ensures that when the weights are applied, the weighted total N equals the actual sample size (because the mean of the normalized weights is one). If the actual sampling weights were used, the denominator would approximate the population from which the sample was drawn (in this case Manitoba), and the variances would be calculated based on a "sample size" of roughly 1 million. This would clearly wreak havoc with the standard errors, making them far too small.
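The normalization itself is a one-line calculation. The sketch below, using invented weights, checks the two properties described above: the normalized weights sum to the sample size, and weighted means are unchanged.

```python
# Hypothetical raw sampling weights and a 0/1 outcome for four records.
raw_weights = [10.0, 10.0, 40.0, 40.0]
outcome = [1, 1, 0, 0]

# Divide each weight by the mean weight over the whole sample.
mean_weight = sum(raw_weights) / len(raw_weights)
normalized = [w / mean_weight for w in raw_weights]

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

print("sum of raw weights:       ", sum(raw_weights))   # approximates the population
print("sum of normalized weights:", sum(normalized))    # equals the sample size
print("mean with raw weights:       ", weighted_mean(outcome, raw_weights))
print("mean with normalized weights:", weighted_mean(outcome, normalized))
```

Because every weight is divided by the same constant, point estimates such as means and percentages do not change; only the implied sample size used in variance calculations does.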
The sampling techniques used also violate one of the central assumptions of standard statistical packages: the simple random sample (SRS). Hard-coded into the variance formulae of these packages is the assumption that each member of the population has the same probability of being selected into the sample. When this is not the case, as it is not when cluster sampling and stratification are used, the variances must be computed differently, usually with more complicated methods. Packages such as SUDAAN and WesVarPC can be used to analyze data under such circumstances.
For the Statistics Canada dataset, early tests of the impact of the sampling scheme showed that it did not affect the variances enough to justify giving up the greater flexibility of a standard package like SAS. This is not just a fortunate accident. This particular sample was deliberately chosen to be "self-weighting", that is, to have weights that roughly correspond to simple random sampling. Further, the sample was large enough (~16,000 households) to alleviate concerns about limited degrees of freedom.
In a simple random sample, each individual represents one degree of freedom. This arises from the assumption of an equal probability of selection for each member of the population. More complicated sample designs first subdivide the population into geographic strata, and then a number of Primary Sampling Units (PSUs) are defined within each stratum. A number of people (some or all) are chosen from each PSU to make up the final sample. Because of the stratification and clustering within PSUs, the resulting sample is more likely to contain groups of people similar to each other on important attributes related to health. It is this similarity that causes intra-cluster correlation, which effectively reduces the degrees of freedom, and thus the analytical power, of a complex survey design relative to an equal number of people selected under SRS. Even in national surveys of thousands of people the number of PSUs can be less than 100, which greatly affects the variance of the estimates.
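One common way to put a number on this loss of power, not taken from the notes above, is Kish's design effect approximation for equal-sized clusters, deff = 1 + (m - 1) * rho, where m is the number of people sampled per PSU and rho is the intra-cluster correlation. The survey size, number of PSUs and value of rho below are invented for illustration.

```python
def design_effect(cluster_size, icc):
    """Kish approximation for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1.0 + (cluster_size - 1) * icc

def effective_sample_size(n, cluster_size, icc):
    """Size of a simple random sample with roughly the same precision."""
    return n / design_effect(cluster_size, icc)

# Hypothetical survey: 5,000 people drawn from 100 PSUs (50 per PSU),
# with an assumed intra-cluster correlation of 0.05.
n_people = 5000
n_psus = 100
people_per_psu = n_people // n_psus
rho = 0.05

deff = design_effect(people_per_psu, rho)
n_eff = effective_sample_size(n_people, people_per_psu, rho)

print(f"design effect: {deff:.2f}")
print(f"effective sample size: {n_eff:.0f}")
```

Under these assumed values, even a modest intra-cluster correlation leaves the 5,000 respondents carrying roughly the information of fewer than 1,500 selected under SRS.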
It should be noted that the latest release of SUDAAN offers many more analytical options than previous versions, so there is less incentive to use standard packages to analyze clustered data. On the other hand, SAS is adding limited ability to handle clustered data in its next release, so the debate between these two packages is not over.
Characteristics of the Data
Distributional assumptions are often violated in practice. Employment income, for instance, has a very peculiar distribution. Even after selecting out those not in the labour force and those outside the age range of employed persons, there are still problems with the distribution that make it difficult to work with in, for example, a linear regression.
Tricks like log-transforming the data to make the distribution more symmetrical do not seem to work: the resulting distribution is still non-Normal. The distribution is, for the most part, truncated at zero, although a few percent of records show negative income. These were handled by taking the absolute value of income, on the reasoning that only the truly well off could afford to lose $50,000 in a single year.
A special type of regression known as Tobit regression (named after its developer, Tobin) was used to handle the pile-up of values at zero. This was very important when analyzing social support income, defined as the sum of Canada Pension and "income from other Government sources", because few people reported having any of this income in 1985. Tobit regression assumes the data are censored from above or below at known values, in this case from below at zero. The technique is similar to survival analysis in that it uses a censoring value to stand in for an indeterminate quantity. In fact, it is run using one of the procedures in SAS designed for analysis of survival data.
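The link to survival analysis is easiest to see in the likelihood. In the standard Tobit model with censoring from below at zero, an observation at zero contributes the probability that the underlying (latent) value is at or below zero, exactly as a censored survival time contributes a probability rather than a density. The sketch below evaluates that log-likelihood for a single predictor; the function and parameter names are illustrative, and this is not the SAS procedure used for the actual analysis.

```python
import math

def norm_pdf(z):
    """Standard Normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tobit_loglik(y, x, intercept, slope, sigma):
    """Log-likelihood of a Tobit model censored from below at zero."""
    loglik = 0.0
    for yi, xi in zip(y, x):
        mu = intercept + slope * xi          # mean of the latent variable
        if yi <= 0:
            # Censored: only know the latent value is at or below zero.
            loglik += math.log(norm_cdf(-mu / sigma))
        else:
            # Observed: ordinary Normal density, as in linear regression.
            loglik += math.log(norm_pdf((yi - mu) / sigma) / sigma)
    return loglik

# Hypothetical data: income in thousands of dollars against age.
income = [0.0, 0.0, 3.5, 0.0, 6.2]
age = [30, 45, 66, 52, 71]
print(tobit_loglik(income, age, intercept=-10.0, slope=0.2, sigma=3.0))
```

Fitting the model means choosing the intercept, slope and sigma that maximize this quantity, which a statistical package does numerically.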
The parameter estimates resulting from a Tobit regression are not interpretable in the same way as those from a linear regression. Techniques are available which decompose the parameter estimates into two parts: the effect on the probability of the income being above zero, and the effect on the mean income given that it is above zero. Such a decomposition was used to estimate the effect of specific medical conditions on social support income.
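One such technique is the decomposition commonly attributed to McDonald and Moffitt; the notes do not say which method was actually used, so the following is only a sketch of the idea for a Tobit model censored at zero. The values of `xb`, `beta` and `sigma` are invented, and the function name is illustrative.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tobit_decomposition(xb, beta, sigma):
    """Split the effect of a predictor on expected income into two parts.

    xb    : linear predictor at the point of evaluation
    beta  : Tobit coefficient of the predictor of interest
    sigma : Tobit scale parameter
    Returns (part via probability of income above zero,
             part via mean income given that it is above zero).
    """
    z = xb / sigma
    prob_positive = norm_cdf(z)                      # P(y > 0)
    mills = norm_pdf(z) / prob_positive              # inverse Mills ratio
    mean_if_positive = xb + sigma * mills            # E[y | y > 0]

    d_prob_positive = beta * norm_pdf(z) / sigma     # change in P(y > 0)
    d_mean_if_positive = beta * (1.0 - z * mills - mills ** 2)

    part_probability = mean_if_positive * d_prob_positive
    part_conditional_mean = prob_positive * d_mean_if_positive
    return part_probability, part_conditional_mean

# Hypothetical values for illustration only.
via_prob, via_mean = tobit_decomposition(xb=1.0, beta=2.0, sigma=4.0)
print("effect through probability of any income:", via_prob)
print("effect through mean income if positive:  ", via_mean)
print("total effect on expected income:         ", via_prob + via_mean)
```

The two parts add up to beta times the probability that income is above zero, which is why the raw Tobit coefficient overstates the effect on observed income whenever many values sit at zero.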
The linked sample contains about 44,000 people, or about 4% of the 1986 Manitoba population, which is more than sufficient for most analyses. When the sample is finely subdivided, however, such as when studying individual occupational categories, the number in each group may be too small. It can be difficult, in other words, to determine if working-age male farmers have more digestive disorders than those in other occupational groups.
A Powerful Dataset
In spite of these problems, the dataset is very powerful. Medical utilization over time can be correlated with demographic characteristics only available from the Census. This allows study of, among other things, the link between unemployment and health, suicidal behaviour, and treatment for mental health conditions. Other study designs commonly used to look at unemployment, such as factory closure studies, are usually limited to a very specific subset of the population, and it can be difficult or impossible to control for pre-closure health status or level of health care utilization.