Version: 1.0.0 | Published: 24 Mar 2026 | Updated: 16 days ago
Summary
Description:
The Comprehensive Patient Records research dataset covers the medical history of cancer patients before their cancer diagnosis, their diagnosis and treatment, their long-term outcomes, and the medical history of matched non-cancer patients who form a comparator cohort.
Access Tier:
Safeguarded
Contact Point:
Health Theme:
Cancer
Health Category:
Electronic Health Records (EHRs)
Number of Unique Individuals:
40000
Documentation
Associated Media:
https://lida.leeds.ac.uk/wp-content/uploads/2017/01/Data-Flow-Protocol-V3.pdf
https://lida.leeds.ac.uk/comprehensive-patient-records-2/
https://datadictionary.nhs.uk/xml_schema_constraints/cancer_outcomes_and_services_data_set_xml_schema_constraints.html
https://lida.leeds.ac.uk/comprehensive-patient-records-2/patient-reported-outcome-measures/
https://www.hra.nhs.uk/planning-and-improving-research/application-summaries/research-summaries/comprehensive-patient-records-for-cancer-outcomes/
https://www.health.org.uk/funding-and-partnerships/programmes/augur-an-open-source-interactive-cancer-analytics-web-app
https://lida.leeds.ac.uk/adam-glaser/
https://www.researchgate.net/publication/336720985_Comprehensive_Patient_Records_CPR_Study_Workstream_5_Protocol_and_Methods_-_Patient_reported_costs_of_living_with_and_beyond_cancer
https://www.data-can.org.uk/datasets
Documentation:
The data is derived from linked primary, secondary and tertiary care electronic health records and participant survey responses. Data is de-identified at source (Leeds Teaching Hospitals NHS Trust (LTHT) and ResearchOne) and linked using matching pseudonymous digests. University of Leeds IT re-pseudonymises these digests upon linkage, producing irreversibly pseudonymous data that is processed into a research dataset. The data covers the medical history of cancer patients before their cancer diagnosis, during diagnosis and treatment, and through their long-term outcomes, together with the medical history of matched non-cancer patients who form a comparator cohort.
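The linkage flow described above can be illustrated with a minimal sketch. It is not the CPR implementation: the hashing scheme (SHA-256 at source, HMAC-SHA-256 for re-pseudonymisation), the salt, the key, the patient identifier and the record fields are all assumptions made for illustration only.

```python
import hashlib
import hmac


def source_digest(identifier: str, shared_salt: bytes) -> str:
    """Digest computed independently at each source (assumed scheme).

    The same identifier and salt always give the same digest, so records
    from different sources can be matched without exchanging identifiers.
    """
    return hashlib.sha256(shared_salt + identifier.encode("utf-8")).hexdigest()


def repseudonymise(digest: str, linkage_key: bytes) -> str:
    """Keyed hash applied after linkage (assumed scheme).

    Without linkage_key, the new digest cannot be recomputed from the
    source digest.
    """
    return hmac.new(linkage_key, digest.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical secrets and records, for illustration only.
shared_salt = b"hypothetical-shared-salt"
linkage_key = b"hypothetical-linkage-key"

hospital = {source_digest("patient-001", shared_salt): {"diagnosis": "C50"}}
primary_care = {source_digest("patient-001", shared_salt): {"prescriptions": 3}}

# Link on the matching source digests, then replace each digest.
linked = {}
for digest in hospital.keys() & primary_care.keys():
    merged = {**hospital[digest], **primary_care[digest]}
    linked[repseudonymise(digest, linkage_key)] = merged
```

In this sketch the research dataset holds only the re-pseudonymised digest, so its rows cannot be matched back to either source's digests without the linkage key.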
The data relates to 431,352 patients in the UK with whom LTHT has a ‘legitimate patient relationship’ and whom LTHT determined to have had a cancer diagnosis between 2004 and 2018 or to be a matched non-cancer patient. Where available, data from ResearchOne provides primary care information for these patients. Where a patient was invited to participate in a patient-reported outcome measures (PROMs) survey, this status is recorded. Where the patient returned a consented PROMs survey, the PROMs data will also be available once it has completed the extract, transform and load process.
The dataset is currently 5.7 GB, and further ResearchOne and PROMs data is anticipated. The dataset is arranged as a relational database, with tables linked at the patient level by a pseudonymous digest. Each table is a comma-separated values (CSV) file and relates to an event type, such as prescription cost, address history or diagnosis. All patients have an entry (row) in the demographics table; the number of entries a patient has in each other table depends on how many events of that type were recorded for the patient.
The dataset is split into two files, each with a similar table structure: the main dataset contains all patients, and the PROMs dataset contains only those in the PROMs cohort (for whom additional PROMs data will be added). Each table has a re-pseudonymised digest field, “Digest2”, and an indicator of whether the patient has data available from ResearchOne, “TPP_Linked” (0 or 1). Additional fields per table are defined in Table 1. No fields contain sensitive information.
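As a minimal sketch of how the table layout described above might be used, the following joins a demographics table to one event table on “Digest2” and counts events per patient. Only “Digest2” and “TPP_Linked” come from the description; the “DiagnosisCode” column, the digest values and the in-memory CSV contents are hypothetical stand-ins for the real files.

```python
import csv
import io
from collections import Counter

# Hypothetical CSV contents; real tables carry the extra fields from Table 1.
demographics_csv = """Digest2,TPP_Linked
a1,1
b2,0
c3,1
"""

diagnosis_csv = """Digest2,TPP_Linked,DiagnosisCode
a1,1,C50
a1,1,C18
c3,1,C61
"""


def read_table(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


demographics = read_table(demographics_csv)
diagnoses = read_table(diagnosis_csv)

# An event table holds zero or more rows per patient.
events_per_patient = Counter(row["Digest2"] for row in diagnoses)

# Every patient has exactly one demographics row, so it drives the join.
summary = {
    row["Digest2"]: {
        "tpp_linked": row["TPP_Linked"] == "1",
        "n_diagnoses": events_per_patient.get(row["Digest2"], 0),
    }
    for row in demographics
}

# Patients with ResearchOne (primary care) data available.
linked_patients = [d for d, s in summary.items() if s["tpp_linked"]]
```

Driving the join from the demographics table keeps patients with no recorded events of a given type in the result, with a count of zero.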
Contains patients in the UK with whom LTHT has a ‘legitimate patient relationship’ and whom LTHT determined to have had a cancer diagnosis between 2004 and 2018 or to be a matched non-cancer patient.
Coverage
Spatial
Spatial Coverage:
- United Kingdom
- England
- Yorkshire and The Humber
- Leeds
Temporal
Start Date:
01 January 2008
End Date:
12 January 2018
Frequency:
STATIC
Date of Latest Release:
12 January 2018
Date of First Release:
08 October 2024
Temporal Aggregation:
Unknown
Provenance
Origin
Purpose:
Study
Collection Situation:
Primary care - Clinic
Image Contrast:
Not stated
Access and Governance
Usage
Data Use Requirements:
- Collaboration required
- Ethics approval required
Access
Access Rights:
In Progress
Data Controller:
Leeds Teaching Hospitals NHS Trust
Data Processor:
Leeds Institute for Data Analytics
Delivery Lead Time:
2-6 months
Legal Basis:
General research use, Genetic studies only, Research-specific restrictions, Research use only, No linkage
Health Data Access Body:
Leeds Teaching Hospitals NHS Trust (LTHT) is the data controller. Professor Geoff Hall and Professor Adam Glaser are the Principal Investigators.
Format and Standards
Language:
English
Format:
- text/csv
- text/xml
Coding System:
ICD10
Data Distribution
Data Status:
Not available
Distribution:
Data access will be to securely linked primary- and secondary-care data: non-identifiable patient information from GP surgeries, community care units and hospital records repositories, e.g. Leeds Teaching Hospitals NHS Trust (LTHT). The data request and access process has not yet been defined and will initially be developed on a case-by-case basis.
Observations
Name:
Population Type:
Persons
Value:
40000
Description:
CPR population
Variable Measured:
count
Unit Code:
Observation Date:
20 October 2017
Number of Records:
40000
Minimum Typical Age:
0
Maximum Typical Age:
100
Origin
Name:
Data Catalogue