# Methodological research (LEMMA 3)

Methodological research is being carried out in the following areas:

- Longitudinal models: a review and synthesis
- Multilevel models for within-subject variability and autocorrelation
- Causal longitudinal data analysis
- Missing data and data linkage

## Longitudinal models: a review and synthesis

We will review and synthesise a wide array of statistical methods for analysing longitudinal data. While there are many excellent textbooks in this area, their presentation can be quite technical and is often not focussed on the needs of quantitative social scientists. On the other hand, books more suitable for social scientists tend to focus solely on the models used within a particular discipline. In our review, we adopt a multi-disciplinary perspective to bring out clearly the links between the different approaches and, where two approaches differ, explain these differences in terms that are substantively meaningful. This review-synthesis will be illustrated by applications from our substantive projects. Ultimately, we aim to guide users’ choices of longitudinal data analysis (LDA) methods so that they reflect their particular study designs and research questions.

## Multilevel models for within-subject variability and autocorrelation

Multilevel models for longitudinal repeated measures use one or more latent variables to account for subject-specific differences in the *conditional mean* of the response given covariates. A standard assumption is that the within-subject variance is the same for all subjects. However, this assumption will often be unrealistic. For example, in a study of individuals’ annual incomes, individuals will vary not only in their mean incomes, but also in their year-to-year income variability; some individuals have more unstable incomes than others. The assumption can be relaxed by modelling the variance as a function of covariates (Goldstein, 2003). For example, the variability in individuals’ income levels might differ between individuals working in different industries. A natural extension is to include a second latent variable to account for subject-specific differences in the within-subject variance of the response. In many social science applications, the within-subject variance is substantively interesting, but the approach has rarely been applied. A notable exception is a study of ecological momentary assessment (EMA) data on adolescents’ mood variability, which examines the degree to which adolescents are heterogeneous in both their average mood levels and the variation in their moods, along with how both dimensions of mood variability relate to covariates such as smoking behaviour (Hedeker et al., 2008). A second exception is a study of school effects on the mean and dispersion of mathematics achievement (Kasim & Raudenbush, 1998).
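
One way to write a model of this kind, in the spirit of the mixed-effects location scale model applied by Hedeker et al. (2008), is sketched below. The notation is introduced here purely for illustration and is an assumption, not the project's own specification: $y_{ij}$ is the response at occasion $i$ for subject $j$, $\mathbf{x}_{ij}$ and $\mathbf{w}_{ij}$ are covariate vectors for the mean and the variance, $u_j$ is the latent variable for the mean, and $v_j$ is the latent variable for the within-subject variance.

$$
\begin{aligned}
y_{ij} &= \mathbf{x}_{ij}'\boldsymbol{\beta} + u_j + e_{ij}, \qquad e_{ij} \mid v_j \sim N\!\left(0, \sigma^2_{e,ij}\right), \\
\log \sigma^2_{e,ij} &= \mathbf{w}_{ij}'\boldsymbol{\alpha} + v_j, \\
\begin{pmatrix} u_j \\ v_j \end{pmatrix} &\sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma^2_u & \sigma_{uv} \\ \sigma_{uv} & \sigma^2_v \end{pmatrix} \right).
\end{aligned}
$$

Under this sketch, the standard constant-variance model is the special case in which $\mathbf{w}_{ij}$ contains only an intercept and $\sigma^2_v = 0$. Modelling the variance as a function of covariates alone corresponds to $\sigma^2_v = 0$ with a non-trivial $\mathbf{w}_{ij}$.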

We will review this class of model for LDA, and provide guidelines for when these models not only address substantively interesting questions, but are also feasible to estimate. We will further develop this class of model by extending the principle of modelling the variance in terms of one or more latent variables to models allowing additional levels of clustering, cross-classified designs, and also to generalised linear multilevel models. Because EMA data consist of tightly spaced repeated measures, autocorrelated residuals are a common feature, so we will additionally investigate extending this approach to modelling covariance and correlation parameters by building on previous work on correlated random effects (Browne & Goldstein, 2010). Another novel extension we will consider is plugging within-subject variance estimates into a ‘second stage’ model for predicting individual outcomes. Such an approach parallels structural equation modelling (Muthén & Muthén, 2007), where a measurement model is defined for the latent constructs, which are then used in the structural model to predict a distal outcome.
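
The ‘second stage’ idea can be illustrated with a small simulation. This is a minimal sketch under assumptions made for illustration only: the data are simulated, the function name `two_stage_fit`, the parameter values and the distal outcome `z` are hypothetical, and the first stage uses crude per-subject sample means and variances rather than estimates from a fitted multilevel model.

```python
import numpy as np

def two_stage_fit(n_subjects=2000, n_occasions=50, seed=1):
    """Simulate repeated measures, then regress a distal outcome on
    per-subject estimates of the mean and the log within-subject variance."""
    rng = np.random.default_rng(seed)
    u = rng.normal(0.0, 1.0, n_subjects)   # subject-specific mean
    v = rng.normal(0.0, 0.5, n_subjects)   # subject-specific log variance

    # Row j holds the n_occasions responses of subject j.
    noise = rng.normal(size=(n_subjects, n_occasions))
    y = u[:, None] + np.exp(v / 2.0)[:, None] * noise

    # Hypothetical distal outcome that depends on both latent quantities
    # (true coefficients 0.5 on the mean and 1.0 on the log variance).
    z = 0.5 * u + 1.0 * v + rng.normal(size=n_subjects)

    # Stage 1: crude plug-in estimates for each subject.
    mean_hat = y.mean(axis=1)
    logvar_hat = np.log(y.var(axis=1, ddof=1))

    # Stage 2: ordinary least squares of z on the plug-in estimates.
    X = np.column_stack([np.ones(n_subjects), mean_hat, logvar_hat])
    coef, *_ = np.linalg.lstsq(X, z, rcond=None)
    return coef  # [intercept, coefficient on mean, coefficient on log variance]

if __name__ == "__main__":
    print(two_stage_fit())
```

Because the plug-in estimates are measured with sampling error, the second-stage coefficients in a sketch like this tend to be attenuated towards zero relative to the true values, which is one reason a joint model may be preferable to a naive two-stage fit.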

### References

- Browne, W.J. & Goldstein, H. (2010) MCMC sampling for a multilevel model with non-independent residuals within and between cluster units, *Journal of Educational and Behavioral Statistics*, 35, pp. 453-473.
- Goldstein, H. (2003) *Multilevel Statistical Models*, 3rd edition (London, Arnold).
- Hedeker, D., Mermelstein, R.J. & Demirtas, H. (2008) An application of a mixed-effects location scale model for analysis of Ecological Momentary Assessment (EMA) data, *Biometrics*, 64, pp. 627-634.
- Kasim, R.M. & Raudenbush, S.W. (1998) Application of Gibbs sampling to nested variance components models with heterogeneous within-group variance, *Journal of Educational and Behavioral Statistics*, 23, pp. 93-116.
- Muthén, L.K. & Muthén, B.O. (2007) *MPlus User's Guide* (Los Angeles, Muthén and Muthén).

## Causal longitudinal data analysis

Causal modelling of complex processes is essential to obtain a deeper understanding of these processes and how to effect change. More specifically, causal LDA is important for policy because causal estimates tell us what would happen if we implemented a policy, for example, the effect of extending tax credits on childbearing. The main impediment to causal LDA of survey data is selection bias. For a simple example of selection bias, suppose that we have two schools, A and B, and that we wish to estimate the effects of these schools on pupils’ examination performance; however, as parents with high socio-economic status are more likely to select school A than school B, the observed difference between the two schools confounds the school effect with the difference between the parents’ socio-economic status, and so the estimate is biased. Methods for causal LDA are all about adjusting for the effects of selection bias. In recent years, there have been many developments for causal inference from econometrics and biostatistics. We consider the use of these methods in mainstream social science research.
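
The two-school example can be made concrete with a small simulation. This is a sketch under assumptions chosen purely for illustration: socio-economic status is reduced to a binary indicator, all effect sizes and selection probabilities are hypothetical, and adjustment is done by simple stratification rather than by any of the longitudinal methods discussed in this section.

```python
import numpy as np

def simulate_schools(n=100_000, school_effect=2.0, ses_effect=10.0, seed=0):
    """Simulate pupils whose parents' SES affects both school choice and scores."""
    rng = np.random.default_rng(seed)
    high_ses = rng.binomial(1, 0.5, n)
    # High-SES parents are more likely to select school A (hypothetical rates).
    school_a = rng.binomial(1, np.where(high_ses == 1, 0.8, 0.3))
    score = (50.0 + school_effect * school_a + ses_effect * high_ses
             + rng.normal(0.0, 5.0, n))
    return high_ses, school_a, score

def naive_difference(school_a, score):
    """Observed mean difference between school A and school B."""
    return score[school_a == 1].mean() - score[school_a == 0].mean()

def stratified_difference(high_ses, school_a, score):
    """Average the within-SES school differences, weighted by stratum size."""
    diffs, weights = [], []
    for level in (0, 1):
        in_stratum = high_ses == level
        diffs.append(naive_difference(school_a[in_stratum], score[in_stratum]))
        weights.append(in_stratum.mean())
    return float(np.average(diffs, weights=weights))

if __name__ == "__main__":
    ses, school, score = simulate_schools()
    print("naive difference:     ", naive_difference(school, score))
    print("stratified difference:", stratified_difference(ses, school, score))
```

In this setup the naive difference mixes the school effect with the SES difference between the two intakes, whereas the stratified difference should be close to the simulated school effect. Stratification only works here because SES is observed; with unobserved selection, the methods discussed in this section are needed.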

- Compare and develop methods for causal LDA from the interface between econometrics and statistics. This will involve a comparison between econometric fixed effects panel data models (e.g. Wooldridge, 2002) and structural/simultaneous equations models with random effects (e.g. Bollen & Curran, 2006). First, we will compare the application of both methods to scenarios involving few longitudinal measures (i.e. ‘short panels’), where theory suggests that the Generalized Method of Moments (GMM) for fixed effects panel models will be superior (Blundell & Bond, 1998). Second, we will extend this comparison to multi-process models for the outcome *and* time-varying covariates, where the parameters of the covariate process are of substantive interest. Random effects multi-process models have already been applied (e.g. Steele et al., 2007), but these models cannot be estimated using traditional GMM. As such, we will additionally develop a fixed effects GMM estimator for multi-process models using vector autoregression (VAR) theory (Arellano, 2003). Third, we will compare how each approach handles the ‘initial conditions’ problem (e.g. Kazemi & Crouchley, 2006).
- Recent developments from biostatistics concern marginal models for causal LDA. Marginal models are sometimes preferred to random effects models because they are less sensitive to distributional assumptions about the random effects (e.g. Robins et al., 1999). Marginal structural models (MSMs) were originally developed for the analysis of randomised controlled clinical trials, and are parameterised explicitly in terms of causal effects (Robins et al., 1999). MSMs are estimated using inverse probability weighted (IPW) estimators, under assumptions that relate directly to longitudinal data with time-varying covariates (e.g. Hernan et al., 2002). We will investigate the use of MSMs for our examples, with a view to these models being used more widely in quantitative social research. We will emphasise how these models can be used to clarify and weaken the usual assumptions required for causal analysis, and compare MSMs to the conditional modelling approaches introduced above.
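
The IPW idea behind MSMs can be sketched for a toy two-period design. Everything in the example is an assumption made for illustration: the binary covariates `L0` and `L1`, the binary treatments `A0` and `A1`, the data-generating probabilities, and the use of unstabilised weights estimated from cell frequencies. In this simulated design the treatment `A0` lowers the later confounder `L1`, and the true marginal effects are 0.6 for `A0` and 1.0 for `A1`.

```python
import numpy as np

def prob_of_observed(treat, *history):
    """Estimate P(treat = its observed value | history) by cell frequencies."""
    cell = np.zeros(len(treat), dtype=int)
    for h in history:                  # encode each binary history as an integer
        cell = 2 * cell + h
    p_treated = np.zeros(len(treat))
    for c in np.unique(cell):
        in_cell = cell == c
        p_treated[in_cell] = treat[in_cell].mean()
    return np.where(treat == 1, p_treated, 1.0 - p_treated)

def weighted_ls(X, y, w):
    """Weighted least squares: solve (X'WX) b = X'Wy."""
    XtW = X.T * w
    return np.linalg.solve(XtW @ X, XtW @ y)

def fit_msm(n=200_000, seed=3):
    rng = np.random.default_rng(seed)
    L0 = rng.binomial(1, 0.5, n)                      # baseline confounder
    A0 = rng.binomial(1, 0.3 + 0.4 * L0)              # first treatment
    L1 = rng.binomial(1, 0.3 + 0.3 * L0 - 0.2 * A0)   # confounder affected by A0
    A1 = rng.binomial(1, 0.2 + 0.5 * L1 + 0.1 * A0)   # second treatment
    Y = A0 + A1 + 2.0 * L1 + 2.0 * L0 + rng.normal(size=n)

    # Inverse probability of the observed treatment history given the past.
    w = 1.0 / (prob_of_observed(A0, L0) * prob_of_observed(A1, L0, A0, L1))

    X = np.column_stack([np.ones(n), A0, A1])
    naive = weighted_ls(X, Y, np.ones(n))   # unweighted regression of Y on A0, A1
    msm = weighted_ls(X, Y, w)              # IPW estimate of the MSM parameters
    return naive, msm

if __name__ == "__main__":
    naive, msm = fit_msm()
    print("naive [intercept, A0, A1]:", naive)
    print("IPW   [intercept, A0, A1]:", msm)
```

The weighting creates a pseudo-population in which treatment at each period is unrelated to the measured past, so a simple regression on the treatment history recovers the marginal effects. Conditioning on `L1` in an ordinary regression would not achieve this, because `L1` lies on the pathway from `A0` to the outcome.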

### References

- Arellano, M. (2003) *Panel Data Econometrics* (Oxford, Oxford University Press).
- Blundell, R. & Bond, S. (1998) Initial conditions and moment restrictions in dynamic panel data models, *Journal of Econometrics*, 87, pp. 115-143.
- Bollen, K.A. & Curran, P.J. (2006) *Latent Curve Models: A Structural Equation Perspective* (Hoboken, New Jersey, John Wiley & Sons, Inc.).
- Hernan, M.A., Brumback, B.A. & Robins, J.M. (2002) Estimating the causal effect of zidovudine on CD4 count with a marginal structural model for repeated measures, *Statistics in Medicine*, 21, pp. 1689-1709.
- Kazemi, I. & Crouchley, R. (2006) Modelling the initial conditions in dynamic regression models of panel data with random effects, in: B.H. Baltagi (Ed) *Panel data econometrics: theoretical contributions and empirical applications* (Amsterdam, Elsevier).
- Robins, J.M., Greenland, S. & Hu, F.-C. (1999) Estimation of the causal effect of a time-varying exposure on the marginal mean of a repeated binary outcome, *Journal of the American Statistical Association*, 94, pp. 687-700.
- Steele, F., Vignoles, A. & Jenkins, A. (2007) The effect of school resources on pupil attainment: a multilevel simultaneous equation modelling approach, *Journal of the Royal Statistical Society, Series A*, 170(3), pp. 801-824.
- Wooldridge, J.M. (2002) *Econometric Analysis of Cross Section and Panel Data* (Cambridge, Massachusetts, The MIT Press).

## Missing data and data linkage

Missing data is a major challenge to the analysis of longitudinal data. Restricting analysis to the subsample for which all the time-varying and fixed variables in the analysis are observed (i.e. listwise deletion) has well-known disadvantages (Little & Rubin, 2002). The particular problems posed by longitudinal data come from low response rates in later waves and unstructured missing value patterns. Methods for handling missing data are based on assumptions about the unknown non-response mechanism through which the data came to be missing: the data are ‘missing at random’ (MAR) if the true probability of non-response depends only on a subject’s observed characteristics (this includes the special case where the non-response process is completely random, namely, ‘missing completely at random’); otherwise the data are ‘missing not at random’ (MNAR) (Little & Rubin, 2002).
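
The practical difference between these mechanisms can be seen in a small simulation. This is a sketch under assumptions chosen for illustration: a single fully observed covariate `x`, a partially observed response `y` with true mean zero, hypothetical logistic non-response probabilities, and a simple regression adjustment standing in for a full missing-data method.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def complete_case_means(n=50_000, seed=2):
    """Mean of y among respondents under three non-response mechanisms."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)             # always observed
    y = x + rng.normal(size=n)         # true population mean of y is 0

    observed = {
        "MCAR": rng.random(n) > 0.5,              # unrelated to x and y
        "MAR": rng.random(n) > sigmoid(2 * x),    # depends on observed x only
        "MNAR": rng.random(n) > sigmoid(2 * y),   # depends on y itself
    }
    results = {name: float(y[obs].mean()) for name, obs in observed.items()}

    # Under MAR, the regression of y on x among respondents is still valid,
    # so predicting y for everyone from x corrects the complete-case mean.
    obs = observed["MAR"]
    slope, intercept = np.polyfit(x[obs], y[obs], 1)
    results["MAR, regression-adjusted"] = float((intercept + slope * x).mean())
    return results

if __name__ == "__main__":
    for name, value in complete_case_means().items():
        print(f"{name}: {value:.3f}")
```

In this setup the complete-case mean should be close to zero only when non-response is completely random. Under the MAR mechanism it is biased but can be corrected using the observed covariate, while under the MNAR mechanism no adjustment based on `x` alone removes the bias.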

A highly flexible multiple imputation procedure has been developed that is suitable for imputing MAR data with unstructured missing value patterns, and for mixtures of continuous, binary and ordinal variables with multilevel structure (Goldstein et al., 2009; Carpenter et al., 2011). We will explore its use in LDA in two ways.

First, we will investigate the use of this multiple imputation procedure to handle missing longitudinal data in all of our examples. In particular, for each application, we will compare its performance to the missing data adjustments implemented within the standard software used to fit the longitudinal model.

Second, we will use it to address the problem of record linkage. The linking of records from disparate sources is becoming increasingly important with the availability of large, typically administrative, databases holding valuable additional information; this applies to longitudinal as well as cross-sectional data. Without unique and reliable subject identifiers, data linkage must be probabilistic and reflect the uncertainty associated with any given match (Jaro, 1995). We will use the multiple imputation method described above to account for linkage uncertainty. This work will be carried out in collaboration with, among other groups, the Institute of Child Health at UCL. Important methodological issues about the quality of matched data will be considered.
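
To show what ‘probabilistic’ means here, the sketch below scores a candidate record pair in the standard Fellegi-Sunter style, using agreement probabilities among true matches (m) and among non-matches (u). It illustrates conventional probabilistic linkage scoring, not the multiple imputation approach proposed above. The field names, the m and u values and the prior match probability are all hypothetical.

```python
import math

# Hypothetical (m, u) probabilities: chance that a field agrees among
# true matches (m) and among non-matching pairs (u).
FIELDS = {
    "surname": (0.95, 0.01),
    "birth_year": (0.98, 0.05),
    "postcode": (0.90, 0.02),
}

def match_weight(agreements):
    """Sum of log2 likelihood ratios over the compared fields."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = FIELDS[field]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1.0 - m) / (1.0 - u))
    return total

def match_probability(agreements, prior=0.001):
    """Posterior probability that the pair is a true match, assuming
    conditional independence of fields and a given prior match rate."""
    odds = prior / (1.0 - prior) * 2.0 ** match_weight(agreements)
    return odds / (1.0 + odds)

if __name__ == "__main__":
    pair = {"surname": True, "birth_year": True, "postcode": False}
    print(match_weight(pair), match_probability(pair))
```

A pair that agrees on every field receives a high match probability and a pair that disagrees on every field a negligible one. The intermediate cases are where linkage uncertainty matters for the subsequent analysis.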

### References

- Carpenter, J., Goldstein, H. & Kenward, M. (2011) REALCOM-IMPUTE for multilevel multiple imputation with mixed response types, *Journal of Statistical Software* (in press; download from http://missingdata.lshtm.ac.uk/preprints/Carpenter2011.pdf).
- Goldstein, H. (2010) *Multilevel Statistical Models*, 4th edition (Wiley).
- Goldstein, H., Carpenter, J., Kenward, M.G. & Levin, K.A. (2009) Multilevel models with multivariate mixed response types, *Statistical Modelling*, 9(3), pp. 173-197.
- Jaro, M. (1995) Probabilistic linkage of large public health data files, *Statistics in Medicine*, 14(5-7), pp. 491-498.
- Little, R.J.A. & Rubin, D.B. (2002) *Statistical Analysis with Missing Data* (Hoboken, NJ, Wiley).
- Schafer, J.L. (1997) *Analysis of Incomplete Multivariate Data* (London, Chapman & Hall/CRC Press).
