Reporting and Handling Missing Data in Longitudinal Studies of the Elderly Is Suboptimal: A Methodological Survey of Geriatric Journals | BMC Medical Research Methodology


This review shows that the reporting and handling of missing data in longitudinal studies of older adults are suboptimal. Insufficient and unclear reporting, exclusion of participants with missing data, and failure to assess the robustness of results to missing data assumptions remain common practices. In general, the recommended guidelines for reporting and handling missing data are poorly adhered to. This is consistent with other reviews of missing data across different research designs and clinical areas [12, 21,22,23,24]. Since all articles included in this review were published at least 5 years after these guidelines, reporting standards were expected to have improved over time. Journal endorsement of guidelines can improve compliance [25], yet only four of the ten included journals mention any of the guidelines in their instructions to authors.

In some of the included studies, there was no indication of whether data were missing or fully observed. As in previous reviews [13, 23], how the analytic cohort was selected and the amount of missing data were often unclear, particularly in retrospective cohort studies. In the absence of any comment about missing data, readers may assume that the data were complete, which may or may not be true. Leaving room for such speculation is not transparent reporting and undermines critical appraisal and reproducibility of the study. Given the average proportion of missing data of 14% among the studies that reported it, longitudinal studies of older adults appear likely to contain a substantial, non-negligible amount of missing data.

When data are missing, the most common way of dealing with them is complete case analysis, in which individuals with incomplete observations are removed. Methodological reviews of missing data since 2004 have consistently reported similar findings [14, 16,17,18, 21, 23, 26]. The continued use of this method may reflect its ease and simplicity, as well as the fact that it is the default approach in most mainstream statistical software [5, 23]. Since these packages have no built-in mechanism for reporting missing data, missingness can go unnoticed. Exploratory analysis to understand the extent of missing data is therefore an important first step in addressing the missing data problem.
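As a sketch of such an exploratory first step, the extent of missingness can be tabulated per variable before any modeling. The toy DataFrame and variable names below are illustrative, not drawn from any reviewed study.

```python
# Exploratory missing-data summary on a hypothetical cohort (illustrative values).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":     [71, 80, 76, np.nan, 84, 69],
    "frailty": [0.2, np.nan, 0.5, 0.4, np.nan, 0.1],
    "outcome": [1, 0, np.nan, 1, 0, 1],
})

# Per-variable proportion of missing values.
missing_by_var = df.isna().mean()

# Participants with fully observed records,
# i.e. those a complete case analysis would retain.
n_complete = df.dropna().shape[0]

print(missing_by_var)
print(f"complete cases: {n_complete} of {len(df)}")
```

Even this small summary makes visible what default software behavior hides: here only 2 of 6 participants would survive complete case deletion.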

When complete case analysis is used, the underlying assumption is that the data are MCAR, meaning that missingness is unrelated to both observed and unobserved data [5, 7]. Put simply, the fully observed sample is assumed to be representative of the population under study [5]. This assumption is plausible when the amount of missing data is minimal [13]. With a large proportion of missing data, the resulting estimates will not only be inefficient but may also be biased [7, 15]. In some of the studies, participants with missing data were excluded during the initial phase of study inclusion; that is, the fully observed dataset reported in these studies resulted from eligibility criteria that defined the sample by the completeness of the data, potentially to avoid the problem of missing data altogether. Most of these were retrospective cohort studies in which a subset of the original population was used. Excluding participants because of missing data at any phase carries the same potential for bias if the groups with and without complete data differ systematically [15].

In the context of longitudinal studies of older adults, complete case analysis may therefore produce biased estimates. With extended observation time and multiple waves of data collection, the missing data are unlikely to be MCAR. Older participants are at increased risk of events such as poor or deteriorating health, hospitalization, institutionalization, and death, which limit their ability to return for follow-up assessments or respond to surveys over time [3, 27]. Selective attrition may therefore occur, whereby healthier older people are more likely to remain until the end of the study [4]. For example, frail older people are vulnerable to adverse events [28] and are thus less likely to be available for study assessments, including frailty measures. In this case, missing data for frailty or other measures may be a function of a participant's frailty itself, making MAR or MNAR the more plausible assumptions.
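The bias that selective attrition induces is easy to demonstrate in a toy simulation (all numbers below are invented for illustration): when the probability of dropout rises as health declines, the complete-case mean at follow-up overstates the cohort's health.

```python
# Toy simulation of selective attrition: frailer participants are more
# likely to drop out, so the complete-case follow-up mean is biased
# upward relative to the full cohort. All numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
health = rng.normal(50, 10, n)           # latent health score at follow-up

# Dropout probability increases as health decreases (a MAR-type mechanism).
p_drop = 1 / (1 + np.exp((health - 50) / 5))
observed = rng.random(n) > p_drop

true_mean = health.mean()
cc_mean = health[observed].mean()        # what complete case analysis sees

print(f"true mean {true_mean:.1f}, complete-case mean {cc_mean:.1f}")
```

About half the cohort drops out, and the complete-case mean overstates average health by several points, even though every individual measurement is error-free.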

Although it is not possible to categorically verify the missing data mechanism at play in a dataset [5], a few assessments can guide our assumptions. Comparing the baseline characteristics of participants with and without complete data can indicate whether missingness depends on observed variables, i.e. whether the two groups differ systematically [13]. Other methods include Little's MCAR test [29] or logistic regression to identify variables associated with missing data indicators [30]. Under MNAR, however, no such assessment is possible for the unobserved data; assumptions are usually based on a priori biological, clinical or epidemiological knowledge and on the reasons for the missing data [15]. Such assessments were infrequent in this review, as in other reviews of observational studies [13, 23]. Regardless of the assumed missing data mechanism or the methods used, it is important to examine the robustness of results to different assumptions and alternative methods [5, 11]. We found that such sensitivity analyses were performed in only a limited number of studies.
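The baseline comparison described above can be sketched as follows; the variables and values are hypothetical.

```python
# Compare baseline characteristics between participants with and without
# complete follow-up data; a marked difference suggests missingness depends
# on observed variables (MAR rather than MCAR). Illustrative data only.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "baseline_age":   [70, 72, 85, 88, 74, 90, 71, 86],
    "followup_score": [30, 28, np.nan, np.nan, 27, np.nan, 31, 25],
})

df["complete"] = df["followup_score"].notna()
group_means = df.groupby("complete")["baseline_age"].mean()

print(group_means)
```

In this toy example the participants lost to follow-up are markedly older at baseline, which would argue against an MCAR assumption and against a naive complete case analysis.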

In some of the studies reviewed, the primary analysis involved methods, such as survival analysis, that treat incomplete outcome data differently. In most of these studies, there was no mention of missing data or how they were handled. When participants have unobserved outcome data in a survival analysis, this is usually handled through censoring, whereby the available data are used up to the last observation [31]. This approach can bias results when censoring is informative, i.e. when censored participants have a higher or lower risk of experiencing the outcome [3]. It can also be problematic when dealing with missing covariate values in the presence of time-dependent variables and time-varying effects, or when evaluating the proportional hazards assumption [23]. Carol et al. [23] provide detailed guidance on dealing with missing covariate data when using survival analysis in observational studies.
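To make the role of censoring concrete, a minimal product-limit (Kaplan-Meier) estimator can be written in a few lines: censored participants contribute to the risk set only up to their last observation. Note that this implicitly assumes noninformative censoring, which is exactly the assumption the text warns about. The event times below are illustrative.

```python
# Minimal product-limit (Kaplan-Meier) sketch; events=1, censored=0.
import numpy as np

def km_survival(times, events):
    """Return (event_times, S(t)) for the product-limit estimator."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    order = np.argsort(times)
    times, events = times[order], events[order]
    surv, s = [], 1.0
    event_times = np.unique(times[events == 1])
    for t in event_times:
        at_risk = np.sum(times >= t)                 # still under observation
        d = np.sum((times == t) & (events == 1))     # events at time t
        s *= 1 - d / at_risk                         # censored cases only shrink the risk set
        surv.append(s)
    return event_times, np.array(surv)

t, s = km_survival([2, 3, 3, 5, 8], [1, 1, 0, 0, 1])
```

If the two censored participants here were in fact near death (informative censoring), the estimated survival curve would be too optimistic, which is the bias described above.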

Multiple imputation was used in very few studies despite its popularity and availability in mainstream statistical software packages [5, 18]. The method rests on the MAR assumption, which is considered valid in many longitudinal data contexts [32]. It involves creating multiple completed versions of the original dataset by replacing missing observations with plausible values, analyzing each, and then combining the results into a single estimate [7, 33]. Multiple imputation reflects the uncertainty around the prediction of missing data, in contrast to single imputation, which ignores the variability around imputed estimates [6]. Unlike complete case analysis, this method allows all available data to be used, minimizing the loss of precision and power [6, 13].
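A simplified multiple imputation under MAR might look like the sketch below. This is illustrative only: a proper implementation would also draw the regression parameters from their posterior (here only residual noise is added, which understates between-imputation variance), and Rubin's rules are applied only to the point estimate, not the variance.

```python
# Hand-rolled sketch of multiple imputation under MAR, estimating the mean
# of y when some y are missing; x is fully observed and predictive of y.
import numpy as np

rng = np.random.default_rng(1)
n, m = 2000, 20
x = rng.normal(0, 1, n)
y = 2 * x + rng.normal(0, 1, n)
miss = rng.random(n) < 1 / (1 + np.exp(-x))   # MAR: missingness depends on x
y_obs = np.where(miss, np.nan, y)

obs = ~np.isnan(y_obs)
# Regression of y on x among observed cases.
beta, alpha = np.polyfit(x[obs], y_obs[obs], 1)
resid_sd = np.std(y_obs[obs] - (alpha + beta * x[obs]))

estimates = []
for _ in range(m):
    y_imp = y_obs.copy()
    # Draw imputations from the predictive distribution; the added noise
    # keeps between-imputation variability from being understated entirely.
    y_imp[~obs] = alpha + beta * x[~obs] + rng.normal(0, resid_sd, (~obs).sum())
    estimates.append(y_imp.mean())

pooled = np.mean(estimates)      # Rubin's rules: average the point estimates
cc_mean = y_obs[obs].mean()      # complete-case comparison
```

Because missingness depends on x (and hence on y), the complete-case mean is clearly biased while the pooled multiply-imputed estimate is close to the true value of zero, illustrating the precision and bias advantages cited above.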

When multiple imputation is used, existing guidelines recommend describing the elements of the procedure to facilitate review [6], but these details were sparsely presented in the reviewed studies. The only study [34] in which the method was described in detail used an online supplementary file for this purpose, allowing a comprehensive assessment of how missing data were handled. Online supplementary files provide ample space to report additional study details that cannot be presented in the main article because of word or page limits. However, their use for presenting information about missing data is rare, with only 3% of studies referring to one in the main text.

In situations where data are MNAR, i.e. the probability of missingness depends on the unobserved data [7], modeling missing data becomes more difficult and requires more sophisticated techniques. For example, a study examining the association between cognitive decline and life-space mobility among community-dwelling older adults used a pattern-mixture model to account for potentially nonignorable missingness [35]. In that study, participants who dropped out had lower scores on the predictor, mediator and outcome variables than those who remained, suggesting nonrandom missingness. The pattern-mixture model addressed this by modeling the response separately within each pattern of missing data [36]. Selection models can also be used to handle nonrandom missing data by jointly modeling the outcome and the probability that it is observed [32].
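A full pattern-mixture or selection model is beyond a short sketch, but the underlying idea of probing MNAR can be illustrated with a simple delta-adjustment sensitivity analysis: impute under a MAR-style assumption, then shift the imputed values by a range of offsets to see how conclusions change if dropouts in fact scored systematically lower. Everything below is illustrative and is a simplified stand-in, not the model used in the cited study.

```python
# Delta-adjustment sensitivity sketch for possible MNAR dropout.
import numpy as np

rng = np.random.default_rng(2)
n = 1000
y = rng.normal(25, 4, n)          # e.g. a cognitive score at follow-up
obs = rng.random(n) < 0.7         # ~30% dropout

mar_fill = y[obs].mean()          # single MAR-style mean imputation
results = {}
for delta in (0.0, -2.0, -4.0):   # dropouts assumed delta points lower
    y_imp = np.where(obs, y, mar_fill + delta)
    results[delta] = y_imp.mean()

for delta, est in results.items():
    print(f"delta = {delta:+.1f}: estimated mean = {est:.2f}")
```

If the substantive conclusion survives across a plausible range of delta, it is robust to moderate departures from MAR; if it flips, the MNAR concern is material. This is the kind of sensitivity analysis the review found to be rare.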


Our review has several limitations related to the search strategy. Some studies may have been missed because the search was limited to a few general geriatric journals. In addition, we randomly selected studies for data abstraction because it was impractical to include all eligible articles. Nonetheless, we expect the practices described in this review to provide a representative snapshot of actual practice across the field. We also did not exclude studies drawn from the same cohort of participants, so there may be duplication or overlap of data or reporting, particularly among retrospective cohort studies. Finally, because this review was limited to observational studies of older adults, current practice in handling and reporting missing data in other research designs, such as randomized controlled trials, may differ.
