Does scoring method impact estimation of significant individual changes using Patient-Reported Outcomes Measurement Information System (PROMIS) measures? Comparing Classical Test Theory versus Item Response Theory

Lead Investigator: Xiaodan Tang, Northwestern University
Title of Proposal Research: Does scoring method impact estimation of significant individual changes using Patient-Reported Outcomes Measurement Information System (PROMIS) measures? Comparing Classical Test Theory versus Item Response Theory
Vivli Data Request: 8325
Funding Source: None
Potential Conflicts of Interest: None

Summary of the Proposed Research:

The importance of collecting patient-reported outcome measures (PROMs) has been recognized in medical research and clinical trials. In clinical research and care, researchers and clinicians increasingly use PROMs to track how individual patients' health-related quality of life (HRQoL) changes in response to treatment over time, in order to evaluate treatment effectiveness for individual patients.

Many PROMs use raw summed scores of item responses as scale scores, following a classical test theory (CTT) framework. In recent decades, item response theory (IRT), a modern measurement framework, has gained popularity as an alternative to simple summation of scores. IRT can capture different response patterns and address skewness of the population distribution, problems that cannot be easily solved within CTT.

Although the advantages and weaknesses of Classical Test Theory (CTT) and Item Response Theory (IRT) scores have been discussed thoroughly in previous literature, few studies have examined whether the two frameworks differ in their ability to detect individual change over time using observed (empirical) data. This study will provide evidence-based guidance for detecting individual changes based on CTT and IRT scores under various measurement conditions and will lead to recommendations for identifying responders to treatment among clinical trial participants.

We will use a modeled (simulated) dataset and a clinical trial dataset, the STAR trial data, to examine the ability of classical test theory (CTT) and item response theory (IRT) scores to identify significant individual changes. CTT computes a measure's scale score as the raw sum of its item scores, so each item contributes equally to the scale score. IRT estimates parameters for each item and uses them to compute the scale score, so items can carry different weights and contribute differently to the scale score. IRT can capture different response patterns and address skewness of the population distribution, which cannot be easily handled within CTT, and it has been applied in the construction and validation of several patient-reported outcome measures. This study further examines the benefits of IRT in terms of its capability to identify responders to treatment, which is important to all clinical trials. Doing so would give researchers and practitioners more confidence to apply IRT rather than CTT, because IRT provides more information for correctly identifying responders to treatment.
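As an illustration of the contrast between the two scoring approaches, the sketch below computes a CTT summed score (equal item weights) and an IRT expected-a-posteriori (EAP) score with its standard error under a graded response model (a common IRT model for polytomous PROM items). The item parameters here are invented for illustration only; they are not calibrated PROMIS Fatigue parameters.

```python
import numpy as np

# Hypothetical graded response model (GRM) parameters for a 3-item scale
# with 5 response categories each: a = discrimination, b = category thresholds.
# Illustrative values only, NOT calibrated PROMIS item parameters.
a = np.array([2.1, 1.4, 2.8])
b = np.array([[-2.0, -1.0, 0.0, 1.0],
              [-1.5, -0.5, 0.5, 1.5],
              [-2.5, -1.2, 0.2, 1.3]])

def grm_category_probs(theta, a_i, b_i):
    """P(X = k | theta) for one item under the GRM."""
    # Cumulative probabilities P(X >= k), padded with boundaries 1 and 0.
    cum = 1.0 / (1.0 + np.exp(-a_i * (theta - b_i)))
    cum = np.concatenate(([1.0], cum, [0.0]))
    return cum[:-1] - cum[1:]          # category probabilities sum to 1

def eap_score(responses, grid=np.linspace(-4, 4, 81)):
    """EAP theta estimate and posterior SD for one response pattern."""
    prior = np.exp(-grid**2 / 2)       # standard normal prior (unnormalized)
    like = np.ones_like(grid)
    for i, x in enumerate(responses):
        like *= np.array([grm_category_probs(t, a[i], b[i])[x] for t in grid])
    post = prior * like
    post /= post.sum()
    theta_hat = (grid * post).sum()
    se = np.sqrt(((grid - theta_hat) ** 2 * post).sum())
    return theta_hat, se

responses = [3, 2, 4]                  # item responses coded 0..4
ctt_score = sum(responses)             # CTT: every item weighted equally
theta_hat, se = eap_score(responses)   # IRT: items weighted by their parameters
```

Note that the IRT score comes with a pattern-specific standard error, which is what the IRT-based Reliable Change Index in the analysis plan exploits; the CTT summed score instead relies on a single scale-level standard error of measurement.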

Statistical Analysis Plan:

The Reliable Change Index (RCI) for Classical Test Theory (CTT) and Item Response Theory (IRT) scores will first be computed in both the simulated and empirical datasets using the traditional formula, which was derived within the CTT framework. We will then use an IRT-based RCI formula to compute an IRT-specific RCI that accounts for the IRT standard error. After the RCI values of CTT and IRT scores have been calculated for each respondent, respondents will be classified as improved when their RCI exceeds 1.96, 1.00, or 0.67, thresholds corresponding to the 95%, 68%, and 50% confidence levels, respectively. Under the traditional formula, improvement and decline change scores of the same magnitude yield RCI values of the same magnitude, because the standard error of measurement is constant across all change scores; the thresholds for identifying significant individual changes are therefore symmetric for the improvement and decline groups. However, when the IRT-based formula is applied to IRT scores, improvement and decline change scores of the same magnitude might not yield the same RCI values, because the IRT standard error can differ across IRT scores.
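The contrast between the two formulas can be sketched as follows. The traditional (Jacobson-Truax) RCI divides the change score by a constant standard error of the difference, whereas the IRT-based RCI divides the change in theta by a standard error built from the two score-specific standard errors. All numeric values below are illustrative, not drawn from the STAR trial.

```python
import numpy as np

def rci_ctt(score1, score2, sd_baseline, reliability):
    """Traditional (Jacobson-Truax) RCI. The standard error of measurement
    is constant, so equal-magnitude improvement and decline always yield
    equal-magnitude RCI values."""
    sem = sd_baseline * np.sqrt(1.0 - reliability)
    se_diff = sem * np.sqrt(2.0)
    return (score2 - score1) / se_diff

def rci_irt(theta1, theta2, se1, se2):
    """IRT-based RCI. Each theta estimate carries its own conditional
    standard error, so equal-magnitude improvement and decline need not
    yield symmetric RCI values."""
    return (theta2 - theta1) / np.sqrt(se1**2 + se2**2)

def classify(rci, z=1.96):
    """Flag a reliable change at a chosen confidence level
    (z = 1.96, 1.00, 0.67 for 95%, 68%, and 50%)."""
    if rci >= z:
        return "reliable increase"
    if rci <= -z:
        return "reliable decrease"
    return "no reliable change"

# Illustrative inputs (hypothetical scores, not STAR trial data):
rci_raw = rci_ctt(score1=30, score2=24, sd_baseline=8.0, reliability=0.90)
rci_theta = rci_irt(theta1=0.0, theta2=0.6, se1=0.30, se2=0.45)
```

In this sketch `rci_raw` falls between 1.00 and 1.96 in magnitude, so the same change would count as reliable at the 68% level but not at the 95% level, which is exactly the kind of threshold-dependent classification the plan compares across frameworks.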

The STAR trial collected the PROMIS Fatigue 10a measure, which is widely used in clinical trials. The STAR study also collected related fatigue measures: the Vitality subscale of the 36-Item Short Form Survey (SF-36) and the Multidimensional Assessment of Fatigue (MAF) scale. We will use these two measures as change-group anchors to evaluate the ability of CTT and IRT scores to detect significant individual changes. Cases with missing values will be excluded from the current study.

Requested Studies:

A Multicenter, Randomized, Double-Blind, Placebo-Controlled Study of the Safety of Human Anti-TNF Monoclonal Antibody D2E7 in Patients with Active Rheumatoid Arthritis
Data Contributor: AbbVie
Study ID: DE031
Sponsor ID: DE031

Public Disclosures:

Tang, X., Schalet, B.D., Peipert, J.D. and Cella, D., 2023. Does scoring method impact estimation of significant individual changes assessed by patient-reported outcome measures? Comparing Classical Test Theory versus Item Response Theory. Value in Health. doi:10.1016/j.jval.2023.06.002