Exploration of ailments related to back pain in the medical history data from clinical studies of healthy subjects using formal hypothesis testing and data mining

Lead Investigator: Harry Southworth, Data Clarity Consulting Ltd
Title of Proposal Research: Exploration of ailments related to back pain in the medical history data from clinical studies of healthy subjects using formal hypothesis testing and data mining
Vivli Data Request: 7598
Funding Source: None
Potential Conflicts of Interest: We see no conflicts of interest. The proposal is to do some basic scientific research. There is no commercial product. For completeness, however, Dr. Schubiner reports the following:
Owner, Mind Body Publishing, Inc. Book royalties for Unlearn Your Pain and Unlearn Your Anxiety and Depression.
Co-Owner, Psychophysiologic Press, Inc. Book royalties for Hidden From View.
Co-owner, Freedom From Chronic Pain, Inc. (Australia). Online pain recovery program.
Co-owner, OVID DX. Mobile educational app for diagnosis and treatment of chronic pain.
Consultant, United Health Group Research and Development. Model pain clinic in Las Vegas.
Consultant, Karuna Labs. Online pain recovery program.
Consultant, Curable Health. Online pain recovery program.

Summary of the Proposed Research:

According to a 2018 report, 20.4% of adults in the USA had chronic pain in the 2016 study period, and the associated economic cost, including additional health care costs, is estimated to be between $560 and $635 billion per annum. A systematic review published in 2016 estimated the prevalence of chronic pain in UK adults to be as high as 40.3%, and the economic cost of back pain alone is estimated to be £21 billion per annum.

There is a growing literature on overlap between common chronic pain syndromes, in which irritable bowel syndrome and chronic fatigue syndrome are included.

The popular literature on pain syndromes includes many claims that chronic pain is usually psychosomatic in nature: that is, that the mind and body interact to create or worsen the condition. Some go so far as to claim that hay fever, acid reflux, migraine, eczema, tinnitus, frequent urination and other common ailments are also psychosomatic in nature and are related to chronic back pain. That is, the number of conditions claimed to overlap is larger than is apparently recognized in the recent scholarly literature and goes considerably beyond just pain syndromes.

Whilst the claim that ailments in such diverse body systems are related and neurological in origin may appear unlikely, a 1998 study found that degree of childhood abuse was predictive of depression, suicidality, heart disease, cancer, lung disease, skeletal fractures and liver disease. It might, therefore, be the case that normal levels of trauma are associated with lesser, but important, morbidities.

The proposed research aims to use existing medical history data from large clinical trials of relatively healthy people to investigate whether back pain is related to hay fever, acid reflux, migraine and other conditions, with the aim of either negating or effectively proving the relationships.

Should it be found that there are relationships between pain, allergies, gastrointestinal and central nervous system conditions, it cannot be proved that the causes of those conditions are sometimes neurological or the result of normal levels of trauma encountered in life. However, in the absence of an alternative credible overarching explanation that links all those body systems together, neurological origin becomes a strong candidate for patients and physicians to consider.

 
Statistical Analysis Plan:

For ease of notation, we label the studies as follows:

Study 1: Herpes zoster, over 50s, N = 16k: NCT01165177
Study 2: Herpes zoster, over 70s, N = 15k: NCT01165229
Study 3: Influenza, over 65s, N = 44k: NCT00753272
Study 4: High C-reactive protein + low low-density lipoprotein cholesterol, over 50s, N = 18k: NCT00239681
Study 5: High cardiovascular risk, N = 31k: NCT00153101

The studies were chosen to be large enough to provide a good chance of finding relationships in the medical history data, and to include relatively healthy subjects. The main author worked on Study 4 in the past and is confident that the medical history data were coded through a medical dictionary, potentially making this study very valuable (we don’t know if the other Studies coded the medical history). The subjects in Study 5 are less healthy than in the other studies and the medical history data might be dominated by recent events, risking dilution of signal. However, the definition of “high risk” in cardiovascular studies is often not that high (for ethical reasons relating to the number of treatments available), and the sample size is very large. As described below, we propose to hold Study 5 out until the other Studies have been analysed.
Irrespective of whether the medical history data have been coded through a medical dictionary, certain terms may be pooled. For example, should the terms “hay fever”, “pollen allergy” and “seasonal rhinitis” all be present, they will be taken to be the same condition. It is not possible to prespecify the precise pooling without having access to the relevant dictionaries, so the pooling will be decided by the study medic, after a first screening by the study statistician.
If the medical history terms have not been coded through a medical dictionary, text strings that apparently refer to similar conditions will be identified and grouped. An approximate global search for regular expressions procedure (grep) will be used to ensure misspellings and variants are identified. For example, an approximate grep of “pollen allergy”, “hay fever” and “seasonal allergy” will be used to identify candidates for grouping, and these candidates will then be screened to identify and exclude inappropriate terms.
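
As a rough illustration of the candidate-identification step, the sketch below flags free-text terms that approximately match any of the search strings. The proposal specifies R; Python is used here purely for illustration, with the standard library's `difflib` similarity ratio standing in for an approximate grep. The helper name `candidate_terms`, the 0.8 cutoff and the example strings are assumptions made for this sketch and are not part of the analysis plan.

```python
from difflib import SequenceMatcher


def candidate_terms(raw_terms, patterns, cutoff=0.8):
    """Return the raw terms that approximately match any search pattern.

    Hypothetical helper: similarity is difflib's ratio (0 to 1), and the
    cutoff of 0.8 is an arbitrary illustrative choice.
    """
    hits = []
    for term in raw_terms:
        # Normalise case and whitespace before comparing
        cleaned = " ".join(term.lower().split())
        best = max(SequenceMatcher(None, cleaned, p).ratio() for p in patterns)
        if best >= cutoff:
            hits.append(term)
    return hits


patterns = ["pollen allergy", "hay fever", "seasonal allergy"]
# Made-up free-text entries, including a misspelling and a variant
raw = ["Hay fever", "hayfever", "Pollen alergy", "seasonal allergies", "hip fracture"]
candidates = candidate_terms(raw, patterns)
print(candidates)
```

The returned list is only a set of candidates; as described above, it would still be screened by hand to exclude inappropriate terms.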
The events of primary interest are migraine, hay fever, acid reflux, eczema, tinnitus and frequent urination.
The data will be reduced to 0 or 1 representing absent or present in each subject’s medical history. The resulting analysis data will be a rectangular array of 0s and 1s with one column for each event of interest and one row for each study subject. The numbers of these events will be tabulated according to whether the subjects had back pain in their medical history.
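
The reduction to a 0/1 array and the tabulation by back pain status might look something like the following. This is a minimal sketch in Python (the proposal itself specifies R), and the subject identifiers, the made-up histories and the helper names `to_binary_rows` and `two_by_two` are all assumptions for illustration. The sketch stores the back pain flag as an extra first column for convenience.

```python
from collections import Counter

EVENTS = ["migraine", "hay fever", "acid reflux", "eczema",
          "tinnitus", "frequent urination"]


def to_binary_rows(histories, events=EVENTS, exposure="back pain"):
    """One row per subject: the back pain flag, then one 0/1 flag per event."""
    rows = []
    for terms in histories.values():
        present = set(terms)
        rows.append([int(exposure in present)] + [int(e in present) for e in events])
    return rows


def two_by_two(rows, event_index):
    """Counts keyed by (back_pain, event) for a single event column."""
    return Counter((row[0], row[1 + event_index]) for row in rows)


# Made-up medical histories, already pooled to the terms of interest
histories = {
    "S001": ["back pain", "migraine"],
    "S002": ["hay fever"],
    "S003": ["back pain", "migraine", "eczema"],
    "S004": [],
    "S005": ["back pain"],
}
rows = to_binary_rows(histories)
migraine_table = two_by_two(rows, EVENTS.index("migraine"))
print(migraine_table)
```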
If some of the frequencies of some events are low, the usual chi-squared test might not be appropriate. Given the large sample sizes, the approximate Gaussian distribution of the marginal estimates of log odds-ratios in a bias-reduced logistic regression (Firth, 1993; Heinze & Schemper, 2002; King & Zeng, 2001) will be used as the test statistic. For the lowest frequency event, the model will be fit to at least 10,000 bootstrap samples, stratified by the event of interest, and the distribution of the test statistic will be examined and compared to the theoretical asymptotic Gaussian distribution. If the Gaussian assumption is grossly violated, the other events will be re-evaluated via a similar stratified bootstrap.
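
The stratified bootstrap could be sketched along the following lines. Python is used only for illustration (the plan specifies R), and the log odds ratio with 1/2 added to each cell is used here as a simple stand-in for the estimate from a bias-reduced logistic regression, which is what the plan actually calls for. The counts are made up, the function names are hypothetical, and the demonstration uses 500 resamples rather than the 10,000 or more specified above, purely to keep the example quick.

```python
import math
import random


def log_odds_ratio(pairs):
    """Log odds ratio from (back_pain, event) pairs, adding 1/2 to each cell."""
    counts = {(bp, ev): 0.5 for bp in (0, 1) for ev in (0, 1)}
    for pair in pairs:
        counts[pair] += 1
    return math.log(counts[1, 1] * counts[0, 0] / (counts[1, 0] * counts[0, 1]))


def stratified_bootstrap(pairs, n_boot=10_000, seed=1):
    """Resample with replacement within each event stratum, keeping its size."""
    rng = random.Random(seed)
    strata = [[p for p in pairs if p[1] == ev] for ev in (0, 1)]
    return [
        log_odds_ratio([p for s in strata for p in rng.choices(s, k=len(s))])
        for _ in range(n_boot)
    ]


# Made-up data: a rare event in 2,000 subjects
pairs = [(1, 1)] * 12 + [(0, 1)] * 8 + [(1, 0)] * 480 + [(0, 0)] * 1500
observed = log_odds_ratio(pairs)
boot = stratified_bootstrap(pairs, n_boot=500)

# Crude checks against a Gaussian: skewness near 0, about 95% within 1.96 SD
mean = sum(boot) / len(boot)
sd = math.sqrt(sum((b - mean) ** 2 for b in boot) / (len(boot) - 1))
skewness = sum(((b - mean) / sd) ** 3 for b in boot) / len(boot)
coverage = sum(abs(b - mean) <= 1.959964 * sd for b in boot) / len(boot)
print(observed, mean, sd, skewness, coverage)
```

In practice a quantile-quantile plot of the bootstrap statistics against the Gaussian would probably be more informative than these two summaries.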
The relative risks of the events of primary interest will be presented with approximate confidence intervals, as well as the odds ratios and approximate confidence intervals, and the associated p-values from the bias-reduced logistic regression. Relative risks and their approximate confidence intervals will be estimated using the Jeffreys prior: 1/2 will be added to each cell in each 2-by-2 table (see Davis and Southworth, 2015, for a discussion).
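
A minimal sketch of the relative risk calculation with 1/2 added to each cell is given below, in Python for illustration only (the plan specifies R with custom code). The interval shown is a Wald interval on the log scale, which is an assumption of this sketch; the method discussed in Davis and Southworth (2015) may differ. The function name and the counts are made up.

```python
import math


def relative_risk(a, n1, c, n0, z=1.959964):
    """Relative risk of an event, back pain versus no back pain.

    a of n1 subjects with back pain have the event; c of n0 subjects
    without back pain have it. Adding 1/2 to each cell of the 2-by-2
    table adds 1 to each row total.
    """
    p1 = (a + 0.5) / (n1 + 1.0)
    p0 = (c + 0.5) / (n0 + 1.0)
    rr = p1 / p0
    # Assumed log-scale standard error, using the adjusted counts
    se = math.sqrt(1 / (a + 0.5) - 1 / (n1 + 1.0) + 1 / (c + 0.5) - 1 / (n0 + 1.0))
    return rr, rr * math.exp(-z * se), rr * math.exp(z * se)


# Made-up counts: 30/500 with back pain versus 40/1500 without
rr, lower, upper = relative_risk(30, 500, 40, 1500)
print(round(rr, 2), round(lower, 2), round(upper, 2))

# The adjustment keeps the estimate finite when a cell is zero
rr_zero, lower_zero, upper_zero = relative_risk(0, 100, 5, 100)
```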
No adjustments for multiple comparisons will be made. The events of primary interest have been chosen as examples in separate body systems and will be treated as representing logically independent questions.
A p-value less than 0.05 will be considered to represent modest evidence of a relationship between back pain and the other events. A p-value less than 0.01 will be considered to represent fairly strong evidence. A p-value less than 0.001 will be considered to be very strong evidence.
We will begin by excluding Study 5 on the grounds that the patients are quite unwell and there is risk of events in their recent medical history drowning out the signal.
These somewhat formal hypothesis tests will be conducted firstly in any studies for which the medical history has been coded through a medical dictionary, starting with the largest such study (excluding Study 5), then the next largest. If none of the studies have coded medical history data, the largest of all studies will be used to test the hypotheses first, then the next largest. The remainder of the studies will not be used for hypothesis testing at this stage.
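
The ordering rule can be written down compactly, as in the Python sketch below (illustrative only; the plan specifies R). The sample sizes are the approximate values from the list of studies above. Which studies turn out to be coded is unknown until the data are seen, so the `coded` argument is hypothetical input. Two further assumptions are made: that Study 5 is also excluded in the fallback case where no study is coded, and that if only one study is coded then only that study is returned, a case the plan does not address.

```python
# Approximate sample sizes from the list of studies
STUDIES = {
    "Study 1": 16_000,
    "Study 2": 15_000,
    "Study 3": 44_000,
    "Study 4": 18_000,
    "Study 5": 31_000,
}


def hypothesis_testing_order(coded, sizes=STUDIES, held_out="Study 5", n_studies=2):
    """Pick the studies for hypothesis testing, largest first.

    coded maps study name to True if its medical history was coded
    through a medical dictionary (hypothetical input).
    """
    eligible = {s: n for s, n in sizes.items() if s != held_out}
    # Prefer coded studies; fall back to all eligible studies if none are coded
    pool = {s: n for s, n in eligible.items() if coded.get(s, False)} or eligible
    return sorted(pool, key=pool.get, reverse=True)[:n_studies]


print(hypothesis_testing_order({"Study 1": True, "Study 4": True}))
print(hypothesis_testing_order({}))
```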

Next, following Southworth and O’Connell (2009), all medical history data for each study in turn will be structured as a binary array with one row for each subject and one column for each event. A gradient boosted logistic regression (Hastie et al., 2009) will be used to model the probability of back pain as a function of all other events. Ten-fold cross-validation will be used to determine when to stop iterating, and the model will then be refit to all the data. Provided the cross-validation error goes down, the relative influence of each term will be estimated from the model. The relative risks of the most influential medical history terms in the subjects with and without back pain will then be computed, possibly after discarding apparent junk and after pooling of similar terms, together with approximate 95% confidence intervals, using the Jeffreys prior.
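
To make the sequence of steps concrete (boost, choose the number of iterations by 10-fold cross-validation, refit, then read off relative influence), a deliberately simplified pure-Python sketch follows. It is not xgboost and not the planned R code: it boosts single-split stumps on 0/1 features under logistic loss, and measures influence as the accumulated squared-error improvement of each feature's splits. The simulated data, the learning rate, the 30-round cap and all function names are assumptions for illustration.

```python
import math
import random


def sigmoid(f):
    return 1.0 / (1.0 + math.exp(-f))


def log_loss(y, f):
    # Negative Bernoulli log-likelihood, with f on the log-odds scale
    return math.log1p(math.exp(f)) - y * f


def fit_boost(X, y, n_rounds, lr=1.0):
    """Boost one-split stumps on 0/1 features; returns (f0, stumps, influence)."""
    n, p = len(X), len(X[0])
    prior = (sum(y) + 0.5) / (n + 1.0)
    f0 = math.log(prior / (1.0 - prior))
    scores = [f0] * n
    stumps, influence = [], [0.0] * p
    for _ in range(n_rounds):
        resid = [yi - sigmoid(fi) for yi, fi in zip(y, scores)]
        overall = sum(resid) ** 2 / n
        best = (0.0, 0, 0.0, 0.0)  # (gain, feature, value if 0, value if 1)
        for j in range(p):
            total, count = [0.0, 0.0], [0, 0]
            for row, r in zip(X, resid):
                total[row[j]] += r
                count[row[j]] += 1
            if 0 in count:
                continue  # constant feature: nothing to split on
            gain = total[0] ** 2 / count[0] + total[1] ** 2 / count[1] - overall
            if gain > best[0]:
                best = (gain, j, total[0] / count[0], total[1] / count[1])
        gain, j, v0, v1 = best
        stumps.append((j, v0, v1))
        influence[j] += gain
        scores = [f + lr * (v1 if row[j] else v0) for f, row in zip(scores, X)]
    return f0, stumps, influence


def cv_best_round(X, y, max_rounds=30, k=10, lr=1.0, seed=1):
    """Held-out log-loss after each round, summed over k folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    loss = [0.0] * (max_rounds + 1)
    for fold in (idx[i::k] for i in range(k)):
        held = set(fold)
        train = [i for i in idx if i not in held]
        f0, stumps, _ = fit_boost([X[i] for i in train], [y[i] for i in train],
                                  max_rounds, lr)
        for i in fold:
            f = f0
            loss[0] += log_loss(y[i], f)
            for r, (j, v0, v1) in enumerate(stumps, start=1):
                f += lr * (v1 if X[i][j] else v0)
                loss[r] += log_loss(y[i], f)
    return min(range(max_rounds + 1), key=loss.__getitem__), loss


# Simulated data: column 0 is related to the outcome, columns 1 to 4 are noise
rng = random.Random(0)
X = [[int(rng.random() < 0.3) for _ in range(5)] for _ in range(400)]
y = [int(rng.random() < (0.6 if row[0] else 0.15)) for row in X]

best_round, cv_loss = cv_best_round(X, y)
relative_influence = None
if best_round > 0:  # cross-validation error went down
    _, _, influence = fit_boost(X, y, best_round)
    relative_influence = [100.0 * v / sum(influence) for v in influence]
print(best_round, relative_influence)
```

If the best round is 0, the cross-validation error never improved on the intercept-only model and no influence would be reported, mirroring the condition stated above.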

The data mining part of the analysis will be kept open to other approaches. For example, elastic net with stability selection might be considered.

The first of the 2 studies used for hypothesis testing will be used for data mining first. Any terms which appear to be related to back pain but might be diluted due to synonyms will be identified and pooled using the approach described for the predefined terms, and the model will be refit.
Any candidates for additional hypothesis testing will be identified, pooling of terms will be performed in the remaining studies, the data mining approach will be run on those studies, and the results across studies will be compared.
The data mining will also be performed for males and females separately because any terms that can only be reported by one sex will obviously be diluted in the data and might not be identified when data mining includes all subjects.
Finally, Study 5 will be used to compute the relative risk (and p-values, etc.) for all terms that appear to be related to back pain. Should there be noticeable disagreement with the other studies, explanations will be sought and proposed.

Software

R version 4.1.1 or higher, as provided by Vivli, will be used.

For bias-reduced logistic regression, the brglm2 package will be used if made available. Otherwise brglm will be used.

For gradient boosting, the xgboost package will be used.

Custom code will be used for computing relative risks and associated confidence intervals.

The tidyverse suite of packages will be used for various aspects of data restructuring and plotting results.

The rmarkdown package will be used for writing up the results.

If it is possible to install packages from GitHub, personal code libraries will be used to simplify the coding of some parts of the analysis: specifically, harrysouthworth/xgbm (some wrappers providing a more R-like interface to xgboost) and possibly some code to aid in restructuring the data.

Requested Studies:

Efficacy, Safety, and Immunogenicity Study of GSK Biologicals’ Herpes Zoster Vaccine GSK1437173A in Adults Aged 50 Years or Older
Data Contributor: GlaxoSmithKline
Study ID: NCT01165177
Sponsor ID: 110390

Efficacy, Safety and Immunogenicity Study of GSK Biologicals’ Herpes Zoster Vaccine GSK1437173A in Adults Aged 70 Years or Older
Data Contributor: GlaxoSmithKline
Study ID: NCT01165229
Sponsor ID: 113077

Observer-blind Superior Efficacy Trial With GlaxoSmithKline Biologicals’ Influenza Vaccine GSK2186877A in Elderly Subjects
Data Contributor: GlaxoSmithKline
Study ID: NCT00753272
Sponsor ID: 106372

JUPITER – Crestor 20mg Versus Placebo in Prevention of Cardiovascular (CV) Events
Data Contributor: AstraZeneca
Study ID: NCT00239681

Public Disclosures:

Southworth H., Schubiner H. Statistical investigation of comorbidities of back pain in the medical history data from clinical trials of healthy subjects. 2024. Open Science Framework (OSF). DOI: 10.17605/OSF.IO/APXUT