Long COVID burden and risk factors in 10 UK longitudinal studies and electronic health records



The UK National Core Studies – Longitudinal Health and Wellbeing program (https://www.ucl.ac.uk/covid-19-longitudinal-health-wellbeing/) combines data from several UK population-based longitudinal studies (LSs) and electronic health records (EHRs) to answer questions related to the pandemic. In this analysis, we pooled results from parallel analyses within individual LSs and then compared them with population-based results from EHRs, which capture people who were actively seeking health care.

Sample


Data were drawn from 10 UK LSs that had conducted surveys before and during the COVID-19 pandemic. These comprised five age-homogeneous cohorts: the Millennium Cohort Study (MCS)28; the Avon Longitudinal Study of Parents and Children (ALSPAC (Generation 1, “G1”))29; Next Steps (NS)30; the 1970 British Cohort Study (BCS)31; and the National Child Development Study (NCDS)32; and five age-heterogeneous samples: the Born in Bradford (BiB) study33; Understanding Society (USOC)34; Generation Scotland: The Scottish Family Health Study (GS)35; the parents of the ALSPAC-G1 cohort, which we call ALSPAC-G036; and the UK Adult Twin Registry (TwinsUK)37. Study details and references are presented in Supplementary Table 1. Minimum inclusion criteria were pre-pandemic health measures, age, sex, and ethnicity, plus self-reported COVID-19 and self-reported duration of COVID-19 symptoms. Ethics statements are presented in Supplementary Table 2.

Electronic Health Records (EHR)

Working on behalf of NHS England, we conducted a population-based cohort study to measure long COVID recording in EHR data from primary care practices using TPP SystmOne software, linked to data from the Secondary Uses Service (SUS) (containing hospital records) via OpenSAFELY (https://www.opensafely.org/). OpenSAFELY is a data analytics platform developed on behalf of NHS England during the COVID-19 pandemic to enable near real-time analysis of pseudonymised primary care records within the highly secure data environment of the EHR provider, to protect patient privacy. Details on information governance for the OpenSAFELY platform can be found in Supplementary Note 1. From a population of all people alive and registered with a general practice as of December 1, 2020, we selected all patients who had evidence of a COVID-19 related code: a positive SARS-CoV-2 test, a hospitalization with an associated COVID diagnosis code, or a recorded diagnosis code for COVID in primary care.


Outcomes: COVID-19 and long COVID definitions

LS: COVID-19 cases were defined by self-report, including confirmation by testing and diagnosis by healthcare professionals (see Supplementary Data 1 for details of the questions and coding used in each study). Long COVID was defined according to NICE categories using the duration of self-reported symptoms1. Based on these categories, we defined two primary outcomes: (i) symptoms lasting longer than 4 weeks (with symptoms lasting 0 to 4 weeks as the reference) and (ii) symptoms lasting longer than 12 weeks (with symptoms lasting 0 to 12 weeks as the reference). Some studies recorded the duration of symptoms of any severity, while others referred only to symptoms affecting daily function (Table 2). In addition, two studies (BiB, TwinsUK) derived alternative estimates of long COVID based on the number of individual symptoms lasting more than 4 or 12 weeks over at least six months (Supplementary Note 2). All data used to derive these outcomes were collected between April and November 2020.
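
As an illustration of the two LS outcomes described above, the following minimal sketch maps a self-reported symptom duration to the two binary indicators. The function name, the use of a duration in weeks, and the treatment of exactly 4 or 12 weeks as belonging to the reference group are assumptions made for this example; individual studies recorded duration in their own formats (Supplementary Data 1).

```python
def long_covid_outcomes(symptom_weeks):
    """Map a self-reported symptom duration (in weeks) to the two binary outcomes.

    Assumption: durations of exactly 4 or 12 weeks fall in the reference group,
    matching "0 to 4 weeks" and "0 to 12 weeks" as reference categories.
    """
    if symptom_weeks < 0:
        raise ValueError("symptom duration must be non-negative")
    return {
        "symptoms_over_4_weeks": symptom_weeks > 4,
        "symptoms_over_12_weeks": symptom_weeks > 12,
    }


# Example: 8 weeks of symptoms counts for outcome (i) but not outcome (ii).
example = long_covid_outcomes(8)
```

Note that the reference group differs between the two outcomes: a participant with 8 weeks of symptoms is a case for outcome (i) and part of the reference group for outcome (ii).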

EHR: Any record of long COVID in the primary care record was coded as a binary variable. This was defined using a list of 15 UK SNOMED codes, categorized as diagnosis (2 codes), referral (3 codes) and evaluation (10 codes). SNOMED is an international structured clinical coding system for use in EHRs38. These clinical codes were designed based on the guidance on long COVID issued by NICE1. The outcome was measured between the start date of the study (February 1, 2020) and the end date (May 9, 2021).


Sociodemographic factors

All studies included age, sex, ethnicity (white or non-white minority ethnic group, if applicable) and the index of multiple deprivation (IMD; divided into quintiles, with 1 representing the most deprived and 5 the least deprived). Area-level deprivation was measured using the IMD 2019, a composite of different domains including income, employment, access to education, and area-level crime, for the postal code where a participant lived at the time of sample collection39. The LS included additional measures of socioeconomic position: education (degree, no degree) and occupational category of own current or most recent job (Supplementary Data 1). The EHR analyses also included geographic region40.

Mental Health

LS: Pre-pandemic measures used validated continuous scales of anxiety and depression symptoms, dichotomized using established thresholds to indicate distress (see Supplementary Data 1).

EHR: Evidence of a pre-existing mental health condition was defined using previous codes for one of the following: psychosis; schizophrenia; bipolar disorder; or depression.

Self-rated general health status

LS: Pre-pandemic self-rated health on a 5-point scale, dichotomized to compare excellent-to-good health (scores 1-3) with fair-to-poor health (scores 4-5).

Overweight and obesity

LS: body mass index (BMI; kg/m2) obtained before the pandemic, coded to compare a BMI between 0 and 24.9 (underweight/normal weight) to a BMI ≥25 (overweight/obese).

EHR: classified as obese or non-obese using the most recent BMI measurement, with obese people further classified as obese I (BMI 30 to 34.9), obese II (BMI 35 to 39.9) or obese III (BMI 40+). A BMI ≥25 was used in the LS because the percentage of people in the obese category (i.e., BMI >30) was relatively low (for example, 8.9% for TwinsUK), while EHR obesity codes were used because they are more reliable and valid indicators of obesity in general practice.
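
The two BMI codings above can be sketched as follows. This is an illustrative example only: the function names are hypothetical, and treating the published bands as continuous cut-points at 25, 30, 35 and 40 kg/m2 is an assumption, since the text gives the bands to one decimal place.

```python
def ls_bmi_group(bmi):
    """LS coding: BMI >= 25 versus BMI below 25."""
    return "overweight/obese" if bmi >= 25 else "underweight/normal weight"


def ehr_obesity_class(bmi):
    """EHR coding: non-obese versus obese classes I to III.

    Assumption: bands are applied as continuous thresholds (30, 35, 40).
    """
    if bmi >= 40:
        return "obese III"
    if bmi >= 35:
        return "obese II"
    if bmi >= 30:
        return "obese I"
    return "non-obese"
```

A BMI of 27, for instance, would be coded as overweight/obese in the LS scheme but as non-obese in the EHR scheme, which is why the two sets of estimates are not directly interchangeable.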

Health conditions

LS: Pre-pandemic self-report of asthma, diabetes, hypertension and hypercholesterolemia.

EHR: A code recorded 6 months to 5 years prior to March 2020 for one or more of the following: diabetes; cancer; hematological cancer; asthma; chronic respiratory disease; chronic heart disease; chronic liver disease; stroke or dementia; other neurological condition; organ transplant; dysplasia; rheumatoid arthritis, systemic lupus erythematosus or psoriasis; or other immunosuppressive conditions. Those who had no relevant code for a condition were assumed not to have that condition. The number of conditions was classified into “0”, “1” and “2 or more”.
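
The comorbidity count described above can be sketched as below. The condition names shown are an illustrative subset and the function name is hypothetical; the key point is that a missing code is treated as absence of the condition.

```python
# Illustrative subset of the condition list; not the full set used in the study.
CONDITIONS = ("diabetes", "asthma", "chronic heart disease", "chronic liver disease")


def condition_count_category(coded_conditions, conditions=CONDITIONS):
    """Classify a patient by the number of conditions with a relevant code.

    coded_conditions: set of condition names with a code in the lookback window.
    Conditions without a code are assumed absent, so they simply add nothing.
    """
    n = sum(1 for name in conditions if name in coded_conditions)
    return "2 or more" if n >= 2 else str(n)
```

Because absence of a code is read as absence of disease, under-recording in primary care would shift patients towards the “0” category.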

Health behaviors

LS: Current smoking status (dichotomized into “0” = no, “1” = yes).

Statistical analysis: LS

The primary analyses were conducted in studies with a direct self-reported measure of the duration of COVID-19 symptoms. Associations between each factor and the two long COVID outcomes (symptoms for 4+ weeks and symptoms for 12+ weeks) were assessed in separate logistic regression models within each study. We adjusted for a minimal set of covariates in all studies, where relevant: age (entered as a continuous variable when considered a covariate), sex, and ethnicity. We report odds ratios (OR) and 95% confidence intervals (CI). To synthesize the magnitudes of association across studies, a fixed-effects meta-analysis with restricted maximum likelihood was performed and repeated with random-effects modeling for comparison. The I2 statistic was used to quantify heterogeneity between estimates. Meta-analyses were performed using the metafor package41 for R version 4.
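
To make the pooling step concrete, the sketch below shows a standard inverse-variance fixed-effect pooling of study-level log odds ratios together with the I2 statistic. It is written in Python for illustration and is not the metafor code used for the analyses; the function names are hypothetical, and recovering a standard error from a reported 95% CI assumes a symmetric interval on the log scale.

```python
import math


def se_from_ci(lower, upper, z=1.96):
    """Recover the standard error of a log odds ratio from its 95% CI limits."""
    return (math.log(upper) - math.log(lower)) / (2 * z)


def fixed_effect_pool(log_ors, ses, z=1.96):
    """Inverse-variance fixed-effect pooling with Cochran's Q and I2."""
    weights = [1.0 / se ** 2 for se in ses]  # precision of each study estimate
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, log_ors)) / total
    pooled_se = math.sqrt(1.0 / total)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, log_ors))
    df = len(log_ors) - 1
    # I2: share of total variation attributed to heterogeneity, floored at 0.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return {
        "or": math.exp(pooled),
        "ci_low": math.exp(pooled - z * pooled_se),
        "ci_high": math.exp(pooled + z * pooled_se),
        "i2_percent": i2,
    }
```

Larger studies have smaller standard errors and therefore dominate the pooled estimate; a random-effects model would add an estimated between-study variance to each study's variance before weighting.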

Because the LS have different age structures, the direct relationship between age and long COVID risk was examined separately from other risk factors, and we modeled the relationship in two ways. First, in age-heterogeneous samples, we compared long COVID risk across prespecified age categories: 45–69 and 70+ versus 18–44 years in three cohorts (USOC, TwinsUK and GS), and 55–59 and 60–76 versus 45–54 years in one cohort (ALSPAC-G0). Second, in a subset of LS birth cohorts whose participants were of near-identical age within each cohort and who received fully harmonized long COVID questionnaires (MCS, NS, BCS, and NCDS), we analyzed the trend in absolute risk of long COVID with increasing age across studies using meta-regression.

Attrition and survey design were addressed by weighting the estimates to be representative of their target population in each LS (weights were not available for BiB and TwinsUK).

Sensitivity analyses

To mitigate index event bias27, inverse probability weights (IPW) were derived for the risk of COVID-19. These were derived in each LS separately but following a common approach used previously (see Supplementary Note 3 for details)42. The derived weights were then applied in all analysis models as a sensitivity check.
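
The general idea of inverse probability weighting can be sketched as follows. This is a generic illustration, not the procedure in Supplementary Note 3: it assumes each participant with COVID-19 already has a model-estimated probability of having had COVID-19, and the function names and the lower bound on probabilities (used to avoid extreme weights) are assumptions made for this example.

```python
def inverse_probability_weights(p_covid, floor=0.01):
    """Weight each COVID-19 case by 1 / estimated probability of being a case.

    floor: hypothetical lower bound on probabilities to limit extreme weights.
    """
    weights = []
    for p in p_covid:
        if not 0 < p <= 1:
            raise ValueError("probabilities must lie in (0, 1]")
        weights.append(1.0 / max(p, floor))
    return weights


def weighted_proportion(outcomes, weights):
    """Weighted proportion of a binary outcome (1 = long COVID, 0 = not)."""
    return sum(w * y for w, y in zip(weights, outcomes)) / sum(weights)
```

Participants who were unlikely to be observed as COVID-19 cases receive larger weights, so the weighted sample better resembles the population from which the cases arose.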

For studies in which we were able to verify SARS-CoV-2 infection (TwinsUK and ALSPAC-G0 and -G1), analyses were repeated in the subsample of those with results confirming viral exposure: a positive polymerase chain reaction (PCR) test obtained through linkage to testing data and/or a lateral flow antibody test (ALSPAC), or an enzyme-linked immunosorbent assay (ELISA) (TwinsUK)43. These results are shown in Supplementary Figs. 11–14.

Statistical analysis: EHR

We performed logistic regression to assess whether long COVID recorded by a GP was associated with each sociodemographic or pre-pandemic health characteristic. We adjusted for the same set of confounders used in the LS analyses: age (as a categorical variable), sex, and ethnicity.

In further analyses of age as a risk factor for long COVID in EHR data, we assigned individuals in each 10-year age category the midpoint age of that category and then assessed the trend in the frequency of long COVID with age using linear and nonlinear meta-regression.
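
The midpoint assignment and the linear part of the trend analysis can be sketched as follows. This is a simplified weighted least-squares illustration rather than a full meta-regression, which would also model residual heterogeneity; the function names, the inclusive definition of a band (for example 40 to 49), and the choice of weights are assumptions made for this example.

```python
def band_midpoint(lower, upper):
    """Midpoint age of an age band given its inclusive lower and upper limits."""
    return (lower + upper) / 2


def weighted_linear_trend(ages, rates, weights):
    """Weighted least-squares slope and intercept of rate on midpoint age.

    weights: typically inverse variances of the band-level rates (assumption).
    """
    total = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, ages)) / total
    mean_y = sum(w * y for w, y in zip(weights, rates)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, ages))
    sxy = sum(w * (x - mean_x) * (y - mean_y)
              for w, x, y in zip(weights, ages, rates))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

A nonlinear version could add a squared age term or spline terms to the same weighted regression to allow the frequency of long COVID to rise and then fall with age.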

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

