by Sharon Rossouw
Data standards support the robust organization, analysis, and reporting of clinical trials, and many regulatory authorities mandate the use of standardized data structures for clinical trial submissions.
The Clinical Data Interchange Standards Consortium (CDISC) Study Data Tabulation Model (SDTM) is a framework used to represent trial data in a standardized format for review. The SDTM structure has been adopted as the required data structure of many regulators.
Mapping raw clinical trial data to the SDTM framework can be tedious since raw data formats are diverse and may change during trial execution, resulting in repetitive conversion validation cycles. The initial mapping of source variables to SDTM domains and variables is a key step in the SDTM conversion process.
This paper introduces a machine-learning approach to generate SDTM domain mapping recommendations for domain and variable targets, discusses the accuracy of the underlying models, and presents refinement steps to improve the accuracy of model predictions.
JETConvert, Bioforum’s next-generation SDTM conversion platform, leverages the machine-learning solutions described in this paper for SDTM mapping.
Clinical trial data are diverse and influenced by collection methods, trial indication, and research objectives. The complexity and amount of data collected during trials are increasing, driven by the wider use of, inter alia, adaptive trial designs, wearable devices, real-world data, and an evolving landscape of guidance documents and therapeutic area user guides.
Many regulatory authorities, such as the Food and Drug Administration, require trial data to be submitted in a standardized format, namely the Clinical Data Interchange Standards Consortium (CDISC) Study Data Tabulation Model (SDTM). However, the structure of collected data files varies and depends on the data-collection system. Contract research organizations, having a wide customer base, typically receive data in dissimilar customer-specific formats, adding complexity to the conversion of raw data to the SDTM framework.
Although the adoption of uniform data collection standards has reduced the effort required to produce a submission-ready SDTM database, the transformation of raw data files to the SDTM standard remains a tedious manual process. For each raw data variable, the mapping process repeats the same steps: determine the target SDTM domains, identify the target variables within each target domain, and apply controlled terminology to the variable values. The current industry practice is to map raw variables to the SDTM standard in a specification and then convert the raw clinical data, per the specification, using project-specific data transformation programs.
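The three mapping steps described above (target domain, target variable, controlled terminology) can be pictured as a small specification that a conversion program applies to each raw record. The following is a minimal sketch under stated assumptions, not the approach used by any particular platform: the raw variable names (`GENDER`, `BIRTH_DT`, `SITE_NOTE`), the `MappingRule` class, and the `convert` function are hypothetical, and the terminology dictionary is an illustrative subset rather than a full CDISC codelist. `DM`, `SEX`, and `BRTHDTC` are standard SDTM names.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MappingRule:
    """One row of a hypothetical source-to-target mapping specification."""
    raw_variable: str   # name of the variable in the raw data file
    domain: str         # step 1: target SDTM domain
    variable: str       # step 2: target variable within that domain
    # step 3: raw value -> controlled term (illustrative subset only)
    terminology: dict = field(default_factory=dict)


# Hypothetical specification for two raw demographic variables.
SPEC = {
    "GENDER": MappingRule("GENDER", "DM", "SEX", {"male": "M", "female": "F"}),
    "BIRTH_DT": MappingRule("BIRTH_DT", "DM", "BRTHDTC"),
}


def convert(record, spec):
    """Apply the specification to one raw record.

    Returns a nested dict {domain: {variable: value}}. Raw variables
    without a rule are skipped here; a real process would flag them
    for review instead of silently dropping them.
    """
    converted = {}
    for raw_name, raw_value in record.items():
        rule = spec.get(raw_name)
        if rule is None:
            continue
        # Normalize the raw value before the terminology lookup and
        # fall back to the raw value when no controlled term applies.
        key = str(raw_value).strip().lower()
        value = rule.terminology.get(key, raw_value)
        converted.setdefault(rule.domain, {})[rule.variable] = value
    return converted


raw_record = {"GENDER": "Female", "BIRTH_DT": "1980-02-01", "SITE_NOTE": "n/a"}
result = convert(raw_record, SPEC)
print(result)
```

Even in this toy form, every entry in `SPEC` encodes a decision that a programmer made by hand, and the specification has to be revisited whenever the raw data format changes.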
Source-to-target SDTM mapping requires repetitive decision making. The manual conversion of raw data from a single clinical trial to the SDTM framework can consume weeks of work by experienced clinical programmers. Is it possible to automate the transformation process, maintain the flexibility of SDTM standards, and capitalize on the experience of SDTM and trial experts?
Artificial intelligence, and machine learning in particular, is a well-established technology for automating repetitive decision making. Can we use machine learning to assist in the preparation of SDTM submission-ready packages?
 U.S. Food & Drug Administration, “Data Standards Catalog,” Rockville, 2022.
 CDISC, “Study Data Tabulation Model,” 29 November 2021. [Online]. Available: https://www.cdisc.org/standards/foundational/sdtm. [Accessed 25 November 2022].
 S. Vijendra, et al., “The Elusive Goal of Automation in SDTM Mapping: A CRO Perspective,” in PHUSE Connect,
 PHUSE, “Industry Experiences Submitting Standardized Study Data to Regulatory Authorities,” 15 September 2020. [Online]. Available: https://phuse.s3.eu-central-1.amazonaws.com/Deliverables/Optimizing+the+Use+of+Data+Standards/Industry+Experiences+Submitting+Standardised+Study+Data+to+Regulatory+Authorities.pdf. [Accessed 25 November 2022].
 J. Fulton, “Ensuring Consistency Across CDISC Dataset Programming Processes,” in PharmaSUG, Virtual Proceedings, 2020.
 J.E. Stuelpner, et al., “Data Transformation: Best Practices for When to Transform Your Data,” in PharmaSUG, Virtual Proceedings, 2020.
 S. Brown, “Machine learning, explained,” MIT, Sloan School of Management, 21 April 2021. [Online]. Available: https://mitsloan.mit.edu/ideas-made-to-matter/machine-learning-explained. [Accessed 13 December 2022].
 CDISC, “Controlled Terminology,” 30 September 2022. [Online]. Available: https://www.cdisc.org/standards/terminology/controlled-terminology. [Accessed 14 December 2022].