What are the re-identification risk scores of publicly available anonymised clinical trial datasets?

Lead Investigator: Aryelly Rodriguez, The University of Edinburgh
Title of Proposal Research: What are the re-identification risk scores of publicly available anonymised clinical trial datasets?
Vivli Data Request: 7400
Funding Source: None
Potential Conflicts of Interest: None

Summary of the Proposed Research:

There is increasing pressure for anonymised datasets from clinical trials to be shared across the scientific community. Some anonymised datasets are now publicly available for secondary research. However, we do not know whether they pose a privacy risk to the patients involved. A re-identification risk score is the estimated probability of any given individual being re-identified from an anonymised/de-identified dataset. It depends on the variables available in the dataset, the number of observations in the dataset and the strategy used to attack the dataset (prosecutor or journalist scenario). El-Emam’s three derived risk metrics (equations) can be used to calculate re-identification risk scores for an entire anonymised dataset under the prosecutor and journalist scenarios, using information in the anonymised dataset. These equations only generate numbers; they do not aim to actually re-identify individuals in the datasets. We aim to collect a broad random sample of publicly available, anonymised clinical trial datasets and calculate their re-identification risk scores. Step 1: We will contact data holders and request access to their anonymised datasets following the data owners’ local procedures. Step 2: Re-identification risk scores will be calculated for each dataset, using the 3 equations. Step 3: We will investigate which characteristics of the datasets are associated with an increased or decreased risk score, compare the risk scores and their usability, and discuss our findings. To the best of our knowledge, this will be the first study to use these re-identification risk scores across a range of clinical trial datasets.

Statistical Analysis Plan:

Aryelly Rodriguez will search for datasets in Vivli and collect all available metadata about the potentially eligible datasets. This will be recorded in an Excel spreadsheet. A random selection of five datasets will then be drawn from the eligible datasets in Vivli.
Datasets will be excluded if:
1. They are not explicitly declared as anonymised/de-identified and suitable for sharing
2. They are not from a randomized controlled trial (RCT)
3. They are not from human participants
4. They are in a language that is not English or Spanish
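The screening and random draw described above can be sketched as follows. This is a minimal illustration only: the metadata field names (`anonymised`, `is_rct`, `human_participants`, `language`), the toy catalogue, and the seed value are all hypothetical and are not taken from the study protocol.

```python
import random

def eligible(meta):
    """Apply the four exclusion criteria to one metadata record (hypothetical field names)."""
    return (
        meta["anonymised"]                            # declared anonymised and shareable
        and meta["is_rct"]                            # from a randomised controlled trial
        and meta["human_participants"]                # human participants only
        and meta["language"] in {"English", "Spanish"}
    )

def draw_sample(catalogue, k=5, seed=2024):
    """Draw k eligible datasets at random; the fixed seed (an assumption) makes the draw reproducible."""
    pool = [m for m in catalogue if eligible(m)]
    rng = random.Random(seed)
    return rng.sample(pool, min(k, len(pool)))

# Toy catalogue of 20 made-up metadata records, for illustration only.
catalogue = [
    {
        "id": f"DS{i:02d}",
        "anonymised": i % 4 != 0,
        "is_rct": i % 5 != 0,
        "human_participants": True,
        "language": "English" if i % 3 else "French",
    }
    for i in range(1, 21)
]

sample = draw_sample(catalogue)
print([m["id"] for m in sample])
```

Recording the seed alongside the spreadsheet would allow the same five datasets to be re-drawn later, which is one possible way of making the selection auditable.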
For each selected anonymised/de-identified clinical trial dataset we will calculate:
1. Number of indirect identifiers present in the datasets as described by Hrynaszkiewicz et al.
2. Re-identification Risk Score A (Ra) = The proportion of records that have a re-identification probability higher than each of 4 pre-defined thresholds (0.1, 0.2, 0.3 and 0.4), using all indirect identifiers in the dataset.
3. Re-identification Risk Score B (Rb) = The worst-case scenario or weakest point in the dataset: the smallest unique group of participants (with respect to all indirect identifiers in the dataset) generates the highest risk score for the whole dataset.
4. Re-identification Risk Score C (Rc) = The expected value or average risk score across all of the records in the dataset, using all indirect identifiers in the dataset.
Each re-identification risk score (A, B and C) will be estimated under the prosecutor and journalist scenarios. For the latter, a theoretical matching dataset will be generated. The matching datasets will be 15 times bigger than the anonymised/de-identified datasets and will be tailor-made to contain the relevant matching indirect identifiers. For further details regarding the calculation of the re-identification risk scores, please see Appendix 1 of the study protocol.
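To make the three scores concrete, the sketch below computes Ra, Rb and Rc from equivalence-class sizes, i.e. the number of records sharing the same values on all indirect identifiers. It assumes a commonly cited formulation of El-Emam's metrics, where f_j is the class size in the anonymised dataset and F_j is the size of the same class in the matching dataset; Appendix 1 of the study protocol remains the authoritative definition. The study itself plans to use SAS; Python is used here purely for illustration. The identifier names, the toy records, and the way the matching dataset is built (every class scaled by 15) are hypothetical.

```python
from collections import Counter

THRESHOLDS = (0.1, 0.2, 0.3, 0.4)

def class_sizes(records, identifiers):
    """Count records per equivalence class (identical values on all indirect identifiers)."""
    return Counter(tuple(rec[k] for k in identifiers) for rec in records)

def risk_scores(f, F=None, thresholds=THRESHOLDS):
    """Compute Ra (one value per threshold), Rb and Rc.

    f: class -> f_j, class sizes in the anonymised dataset.
    F: class -> F_j, class sizes in the matching dataset (journalist scenario).
       F=None gives the prosecutor scenario, where F_j = f_j.
    """
    if F is None:
        F = f
    n = sum(f.values())
    # Ra: share of records whose class-level probability 1/F_j exceeds the threshold.
    ra = {t: sum(size for cls, size in f.items() if 1 / F[cls] > t) / n
          for t in thresholds}
    # Rb: worst case, driven by the smallest class.
    rb = 1 / min(F[cls] for cls in f)
    # Rc: average risk; under the prosecutor scenario both terms equal |J| / n.
    rc = max(len(f) / sum(F[cls] for cls in f),
             sum(size / F[cls] for cls, size in f.items()) / n)
    return {"Ra": ra, "Rb": rb, "Rc": rc}

# Toy dataset: 20 made-up records in four classes of size 1, 2, 4 and 13.
IDENTIFIERS = ("age_band", "sex", "region")
toy = (
    [{"age_band": "5-7", "sex": "F", "region": "N"}] * 1
    + [{"age_band": "5-7", "sex": "M", "region": "N"}] * 2
    + [{"age_band": "8-11", "sex": "F", "region": "S"}] * 4
    + [{"age_band": "8-11", "sex": "M", "region": "S"}] * 13
)

f = class_sizes(toy, IDENTIFIERS)
prosecutor = risk_scores(f)

# Hypothetical matching dataset: 15 times bigger, same class proportions.
matching = Counter({cls: 15 * size for cls, size in f.items()})
journalist = risk_scores(f, matching)

print("prosecutor:", prosecutor)
print("journalist:", journalist)
```

In this toy case the single record that is unique on all three identifiers drives the prosecutor Rb to 1, while the larger matching dataset lowers every journalist score, which illustrates why the two scenarios are reported separately.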
Appendix 2 of the study protocol lists all the data repositories that we are planning to visit.
There will be no data imputation/manipulation or re-analysis of clinical trial outcomes. All risk scores will be calculated using the SAS version available on Vivli. Datasets from different trials will not be merged in any way, as we are interested in the re-identification risk score for each individual trial dataset. External data (i.e. data from other repositories) will not be brought into the Vivli Research Environment. We will only require that the calculated re-identification risk scores be exported from Vivli.

Requested Studies:

Immunogenicity and Safety Study of GSK Biologicals’ Quadrivalent Influenza Vaccine (GSK2282512A) When Administered in Children
Data Contributor: GlaxoSmithKline
Study ID: NCT01198756
Sponsor ID: 113314

A Dose-ranging Study of Vilanterol (VI) Inhalation Powder in Children Aged 5-11 Years With Asthma on a Background of Inhaled Corticosteroid Therapy
Data Contributor: GlaxoSmithKline
Study ID: NCT01573767
Sponsor ID: 106853

A Clinical Outcomes Study to Compare the Effect of Fluticasone Furoate/Vilanterol Inhalation Powder 100/25mcg With Placebo on Survival in Subjects With Moderate Chronic Obstructive Pulmonary Disease (COPD) and a History of or at Increased Risk for Cardiovascular Disease
Data Contributor: GlaxoSmithKline
Study ID: NCT01313676
Sponsor ID: HZC113782

A Randomised, Double-blind, Placebo-controlled, Incomplete Block, 4-period Crossover, Study to Investigate the Effects of 5-day Repeat Inhaled Doses of Fluticasone Propionate (BID, 50-2000 mcg) on Airway Responsiveness to Adenosine 5-monophosphate (AMP) Challenge When Delivered After the Last Dose in Mild Asthmatic Subjects.
Data Contributor: GlaxoSmithKline
Study ID: NCT00400855
Sponsor ID: SIG103337

A Multi-Center, Open-label, Randomized Study to Evaluate the Long Term Effectiveness of Levetiracetam as Monotherapy in Comparison With Oxcarbazepine in Subjects With Newly or Recently Diagnosed Partial Epilepsy
Data Contributor: UCB
Study ID: NCT01498822
Sponsor ID: N01367

Public Disclosures:

Rodriguez, A., Williams, L.J., Lewis, S.C., Sinclair, P., Eldridge, S., Jackson, T., and Weir, C.J. Using re-identification risk scores on publicly available anonymised clinical trial datasets. ICTMC 2024. Abstract PS.8B-3. 2024.