Medical Data Science

The innovation area focuses on AI and data science methodologies for digital health in drug discovery, preclinical and clinical research. At Fraunhofer ITMP, medical data science is centered around the four major domains of the Fraunhofer 4D concept in health research: Drugs, Devices, Data, and Diagnostics (see also 4D Clinic). The innovation area Medical Data Science deals with the handling and analysis of various kinds of medical data, such as data from clinics and clinical trials, omics technologies, electronic health records, medical imaging and wearables. Our core competencies include machine learning, knowledge graphs, federated learning, and FAIR (Findable, Accessible, Interoperable, Reusable) handling of medical data.

A special focus lies on the investigation of immune-mediated diseases in cooperation with clinicians, pharmaceutical companies and academic partners. Cutting-edge machine learning algorithms are applied to the diagnosis, prognosis and precision-medicine therapy of immune-mediated diseases. Fraunhofer ITMP has strong expertise in designing software and hardware solutions (including its high-throughput laboratories) for open platforms that serve both research and industry. These platforms facilitate the exploration and practical testing of digital health research concepts and commercial offerings.

 

Core competencies:

  • Machine learning for 4D (Drugs, Devices, Data, and Diagnostics)
  • Knowledge graphs and graph neural networks for medical research
  • FAIR (Findable, Accessible, Interoperable, Reusable) medical data management and knowledge graphs
  • Generative AI and synthetic medical data
  • Biostatistical support of clinical and preclinical studies
  • Federated learning infrastructure and medical data science platform

Federated infrastructure for health

The healthcare sector is opening up to data exchange and is testing many digital solutions, including federated learning infrastructures. Federated learning in healthcare is a machine learning approach that tackles the challenges of medical data management and privacy: algorithms are trained collaboratively without exchanging the data itself. Different stakeholders, such as clinics, pharma companies, academic institutions, and public organizations, can participate.


With federated learning, insights such as a consensus model are obtained via a central aggregation server. Medical or patient data never leave the firewalls of the institutions where they are stored. The machine learning model is trained locally at each participating site, and only model updates, such as parameters or gradients, are shared. In short, the AI model moves between clients, not the data. Fraunhofer ITMP supports the systematic planning of such infrastructures for customers and consortia in practice by setting up and operating prototypical research platforms.
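The training scheme described above can be illustrated with federated averaging. The following is a minimal sketch under stated assumptions, not the infrastructure operated at Fraunhofer ITMP: the three sites, their synthetic data, the linear model and all function names are hypothetical, and a real deployment would add secure communication, authentication and privacy safeguards.

```python
import random

def local_update(weights, data, lr=0.05, epochs=20):
    """Train y ~ w*x + b by gradient descent on one site's private data."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)  # only parameters leave the site, never the data

def federated_average(updates, sizes):
    """Central aggregation: average site parameters weighted by sample count."""
    total = sum(sizes)
    w = sum(u[0] * n for u, n in zip(updates, sizes)) / total
    b = sum(u[1] * n for u, n in zip(updates, sizes)) / total
    return (w, b)

def make_site(n, seed):
    """Hypothetical local dataset drawn from y = 2x + 1 plus small noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1, 1)
        data.append((x, 2 * x + 1 + rng.gauss(0, 0.1)))
    return data

# Three hypothetical sites with different cohort sizes.
sites = [make_site(40, 1), make_site(80, 2), make_site(20, 3)]

global_weights = (0.0, 0.0)
for _ in range(30):  # communication rounds
    updates = [local_update(global_weights, data) for data in sites]
    global_weights = federated_average(updates, [len(d) for d in sites])

print(global_weights)
```

In this sketch the aggregation server only ever sees the tuples returned by `local_update`; the consensus model in `global_weights` should end up close to the slope 2 and intercept 1 used to simulate the data.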


Fraunhofer ITMP drives initiatives that enable research data exchange in line with requirements applicable in Germany (e.g., the GDPR, known in German as DSGVO, and ethics approvals) and at the European level (European Health Data Space, GAIA-X, European Open Science Cloud, and International Data Spaces Association). We provide solutions for federated, scalable, and interoperable data infrastructures. This establishes a new paradigm for heterogeneous health research and fosters collaboration between healthcare providers, researchers, and industry partners (see also Medical Data Space).

Study and cohort analyses

Fraunhofer ITMP supports clinical and preclinical studies and cohort analyses with our expertise in:

  • Analysis and regulation of Phase I to IV studies and proof-of-concept (POC) studies
  • AI and machine learning
  • Identification of suitable target populations and optimized endpoints
  • Mathematical modelling and statistical methods
  • Project-specific input and output formats and dashboards

A broad AI toolbox of commercial and proprietary software tools is used. The competence of Fraunhofer ITMP is based on the integration of data scientists into the clinical routine at its Frankfurt location, with a focus on immune-mediated and inflammatory diseases. In collaboration with Fraunhofer SCAI, Fraunhofer ISST and other Fraunhofer institutes, large amounts of data are analysed with cutting-edge artificial intelligence and machine learning, and these techniques are applied early to improve clinical care and advance knowledge.

Knowledge graphs for drug discovery and drug repurposing

Knowledge Graphs (KGs) are advanced forms of networks that capture the semantics of their constituent entities and the interactions among them. In the context of biomedicine and the life sciences, KGs represent disease-associated biological and pathophysiological phenomena by systematically assembling inter-related entities such as proteins and their biological processes, molecular functions and pathways, and chemicals and their mechanisms of action and adverse effects. They have been deployed in several use cases and downstream analyses in healthcare, pharmaceutical and clinical settings. However, creating KGs is expensive and time-consuming because it requires a lot of manual curation. Moreover, machine-aided methods such as text-mining workflows and Large Language Models (LLMs) have their own shortcomings and are improving only gradually.

We have developed a fully automated workflow, the Knowledge Graph Generator (KGG), for creating KGs that represent the chemotype and phenotype of diseases. KGG embeds the underlying schemas of curated public databases to retrieve relevant knowledge; such databases are regarded as the gold standard for high-quality data. Graph neural networks can then be used for link and node prediction in the KG, supporting preclinical drug discovery, the understanding of disease mechanisms and comorbidities, and drug repurposing.
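To make the idea of link prediction on a KG concrete, the sketch below stores a toy graph as subject-relation-object triples and ranks candidate drug-disease links. It deliberately swaps in a simple common-neighbours heuristic for a graph neural network, and it is not KGG code: the entity names, relations and function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy triples (subject, relation, object); not KGG output.
TRIPLES = [
    ("DrugA", "inhibits", "ProteinX"),
    ("DrugB", "inhibits", "ProteinX"),
    ("DrugB", "inhibits", "ProteinY"),
    ("ProteinX", "participates_in", "PathwayP"),
    ("ProteinY", "participates_in", "PathwayP"),
    ("ProteinX", "associated_with", "DiseaseD"),
    ("ProteinY", "associated_with", "DiseaseD"),
    ("DrugA", "causes", "AdverseEffectE"),
]

def build_adjacency(triples):
    """Collapse typed triples into an undirected neighbour map."""
    adjacency = defaultdict(set)
    for subject, _relation, obj in triples:
        adjacency[subject].add(obj)
        adjacency[obj].add(subject)
    return adjacency

def common_neighbour_score(adjacency, a, b):
    """Number of entities directly linked to both a and b."""
    return len(adjacency[a] & adjacency[b])

def rank_candidates(adjacency, source, candidates):
    """Rank candidates not yet linked to source, highest score first."""
    scored = [
        (c, common_neighbour_score(adjacency, source, c))
        for c in candidates
        if c not in adjacency[source]
    ]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

adjacency = build_adjacency(TRIPLES)
ranking = rank_candidates(adjacency, "DiseaseD", ["DrugA", "DrugB"])
print(ranking)  # [('DrugB', 2), ('DrugA', 1)]
```

DrugB ranks first because it shares two protein neighbours with DiseaseD. A graph neural network would replace the hand-written score with learned node embeddings, but the input (typed triples) and the output (a ranked list of missing links) have the same shape.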

KGG builds on our previous contributions to the BY-COVID project, where we developed workflows for identifying bioactive analogs of fragments identified in COVID-NMR studies (Berg, H et al., 2022) and for representing Mpox biology (Karki, R et al., 2023).

FAIR handling and analysis of medical data

Obtaining reliable information from unstructured or heterogeneous ("dirty") data requires consistently implemented FAIR data management (Findable, Accessible, Interoperable, Reusable). At Fraunhofer ITMP, this principle is operationalised through standardised data models, harmonised system comparisons, binding conventions, established ontologies and controlled vocabularies. This is supplemented by structured exploratory data analysis (EDA) workflows and quality-assured toolchains.  
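As an illustration of how binding conventions and controlled vocabularies can be enforced in practice, the sketch below checks records against a small data model. It is an assumption-laden example rather than a description of the toolchains used at Fraunhofer ITMP: the field names, vocabularies and sample records are hypothetical.

```python
# Hypothetical data model: required fields and controlled vocabularies.
REQUIRED_FIELDS = ("patient_id", "sex", "diagnosis_code", "sample_type")
CONTROLLED_VOCAB = {
    "sex": {"female", "male", "unknown"},
    "sample_type": {"serum", "plasma", "skin_biopsy"},
}

def validate_record(record):
    """Return a list of human-readable problems; empty means the record passes."""
    problems = []
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            problems.append(f"missing: {field}")
    for field, allowed in CONTROLLED_VOCAB.items():
        value = record.get(field)
        if value not in (None, "") and value not in allowed:
            problems.append(f"not in vocabulary: {field}={value!r}")
    return problems

# Hypothetical records: one clean, one "dirty".
records = [
    {"patient_id": "P001", "sex": "female",
     "diagnosis_code": "DX-001", "sample_type": "serum"},
    {"patient_id": "P002", "sex": "F",
     "diagnosis_code": "", "sample_type": "blood"},
]

report = {r["patient_id"]: validate_record(r) for r in records}
for patient_id, problems in report.items():
    print(patient_id, problems or "ok")
```

A check of this kind typically runs before exploratory data analysis, so that free-text variants such as "F" are mapped to the agreed vocabulary instead of silently creating extra categories.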

A central component of our projects is systematic data and method validation, which we transfer into dedicated validation studies. The structured moderation of "questions about the data" is particularly important here – especially in AI projects, right from the project conception phase. This ensures that hypotheses, data availability, data quality and methodological requirements are coordinated at an early stage. 

The clear definition, curated compilation and quality-assured generation of training and validation data sets are essential for the development of robust AI models. We support these processes through the use of real-world evidence data and, where necessary, through the generation of synthetic cohorts. This enables us to ensure reproducible, regulatory-compliant and scientifically robust AI developments.

IDERHA: Integration of heterogeneous data and evidence towards regulatory and HTA acceptance

IDERHA is a European public-private partnership launched in April 2023. This pioneering project addresses the obstacles in accessing, integrating and analyzing health data to maximize their value for patient care and medical research.

An open, disease-agnostic, federated data space will be developed to enable connectivity, access, use and reuse of digital health data. IDERHA also develops consensus policy recommendations on health data access and on the use of heterogeneous health research data, such as real-world evidence (RWE), for regulatory and HTA decision making.

Partners: IDERHA is led by Fraunhofer ITMP and Johnson & Johnson Medical GmbH, in a consortium of 33 academic, clinical, medtech, pharmaceutical, and IT partners, as well as patient advocacy organizations and public authorities, including Fraunhofer institutes SCAI and ISST.


SYNTHIA: Synthetic data generation framework for integrated validation of use cases and AI healthcare applications

SYNTHIA is an ambitious collaboration between public and private institutions to facilitate the responsible use of Synthetic Data (SD) in healthcare applications. The project will improve the methodological and technical aspects of SD Generation (SDG) by developing new techniques and advancing established ones for different data modalities, including genomics and imaging, to improve the generation of realistic multimodal and longitudinal data.

The open SYNTHIA federated platform will facilitate responsible SD use by the health research community, in particular long-term access to extensively validated, reusable synthetic datasets, as well as to SDG workflows and SD assessment frameworks. A multidisciplinary collaboration of SDG developers, FAIR data experts, clinical researchers, developers of therapies and data-based tools, legal experts, socio-economic analysts, regulatory, policy advocacy, and communication experts will provide a 360º vision on how to advance healthcare applications through SD use.

Partners: Consortium of 43 academic, clinical, pharmaceutical, IT and public partners, including Fraunhofer institutes ITMP, SCAI and MEVIS.


FAIRplus (completed)

The vast amounts of data generated in life science research have the potential to add to our understanding of disease and help advance drug development. Yet most data is hidden away in proprietary databases and stored in different formats. The goal of FAIRplus was to deliver guidelines and tools to facilitate the application of FAIR principles (Findable, Accessible, Interoperable, Reusable) to data from certain IMI projects and datasets from pharmaceutical companies. The project thereby aimed to make it easier for other researchers to find the data and integrate it into their own research. It also set out to organise training courses for data scientists in academia, small and medium-sized enterprises (SMEs) and pharmaceutical companies. Ultimately, the project sought to change the culture of data management in the life sciences sector.


Knowledge graph generator (completed)

The Knowledge Graph Generator (KGG) project, which has now been completed, developed an automated workflow for generating knowledge graphs for the life sciences, enabling a comprehensive representation of disease-associated entities such as proteins, signaling pathways, genetic variants, chemicals, mechanisms of action, assays and adverse effects. By integrating curated resources including OpenTargets, UniProt, ChEMBL, the Integrated Interactions Database, and GWAS Central, KGG created FAIR-compliant, interconnected graphs that support complex scientific queries and downstream analyses.

The project demonstrated the practical value of knowledge graphs in translational and application-focused research. Use cases included identifying shared molecular entities to explore comorbidities, discovering putative therapeutic targets, repurposing drug candidates for Parkinson’s disease, and assessing the drug-likeness of chemicals. These outcomes highlight KGG’s potential to accelerate industrial research and development, bridging preclinical findings with actionable insights for drug discovery and bringing innovation closer to market.
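The comorbidity use case, identifying molecular entities shared between diseases, can be sketched as a set comparison over disease-to-protein associations taken from a KG. This is an illustrative simplification, not the KGG implementation: the disease and protein identifiers are placeholders, and the Jaccard index is one possible overlap measure chosen for this example.

```python
def shared_entities(disease_a, disease_b, associations):
    """Return the entities linked to both diseases and their Jaccard overlap."""
    entities_a = associations[disease_a]
    entities_b = associations[disease_b]
    shared = entities_a & entities_b
    union = entities_a | entities_b
    jaccard = len(shared) / len(union) if union else 0.0
    return sorted(shared), jaccard

# Hypothetical disease-to-protein associations extracted from a KG.
associations = {
    "DiseaseA": {"P1", "P2", "P3"},
    "DiseaseB": {"P2", "P3", "P4", "P5"},
}

shared, overlap = shared_entities("DiseaseA", "DiseaseB", associations)
print(shared, overlap)  # ['P2', 'P3'] 0.4
```

Entities in the shared set are natural starting points for follow-up queries, for example asking which chemicals in the graph act on them.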

Resources and source code from KGG are publicly available for the research community: Additional Information

Tanoli Z, Fernández-Torras A, Özcan UO, Kushnir A, Nader KM, Gadiya Y, Fiorenza L, Ianevski A, Vähä-Koskela M, Miihkinen M, Seemab U, Leinonen H, Seashore-Ludlow B, Tampere M, Kalman A, Ballante F, Benfenati E, Saunders G, Potdar S, Gómez García I, García-Serna R, Talarico C, Beccari AR, Schaal W, …, Aittokallio T.
Computational drug repurposing: approaches, evaluation of in silico resources and case studies.
Nat Rev Drug Discov. 2025;24:521–542
doi: 10.1038/s41573-025-00567-8

Kuzikov M, et al.
Experimental and machine learning-based exploration of repurposed drugs reveals chemical features underlying phospholipidosis.
Patterns. 2025;101453
doi: 10.1016/j.patter.2025.10145

Karki R, Gadiya Y, Zaliani A, Pokharel B, Babaiha NS, Ostaszewski M, Hofmann-Apitius M, Gribbon P.
KGG: a fully automated workflow for creating disease-specific knowledge graphs.
Bioinformatics. 2025 Jul;41(7):btaf383
doi: 10.1093/bioinformatics/btaf383

Gadiya Y, Genilloud O, Bilitewski U, Brönstrup M, von Berlin L, Attwood M, Gribbon P, Zaliani A.
Predicting Antimicrobial Class Specificity of Small Molecules Using Machine Learning.
J Chem Inf Model. 2025;65(5):2416-2431
doi: 10.1021/acs.jcim.4c02347

Reinshagen J, Seashore-Ludlow B, Gadiya Y, Gustavsson AL, Tanoli Z, Aittokallio T, Huchting J, Jenmalm-Jensen A, Gribbon P, Zaliani A, Ballante F.
From library to landscape: integrative annotation workflows for compound libraries in drug repurposing.
Database. 2025;2025:baaf081
doi: 10.1093/database/baaf081

Rischke S, Schäfer SMG, König A, Ickelsheimer T, Köhm M, Hahnefeld L, Zaliani A, Scholich K, Pinter A, Geisslinger G, Behrens F, Gurke R.
Metabolomic and lipidomic fingerprints in inflammatory skin diseases - Systemic illumination of atopic dermatitis, hidradenitis suppurativa and plaque psoriasis.
Clin Immunol. 2024 Aug;265:110305
doi: 10.1016/j.clim.2024.110305

Gadiya Y, Shetty S, Hofmann-Apitius M, Gribbon P, Zaliani A.
Exploring SureChEMBL from a drug discovery perspective.
Sci Data. 2024;11:507
doi: 10.1038/s41597-024-03371-4

Gyrard A, Gribbon P, Hussein R, Abedian S, Bonmati LM, Cabornero GL, Manias G, Danciu G, Dalmiani S, Autexier S, Nuland R, Jendrossek M, Avramidis I, Alvarez EG.
Synergies Among Health Data Projects with Cancer Use Cases Based on Health Standards.
Stud Health Technol Inform. 2024;316:1292-1296
doi: 10.3233/SHTI240649

Karki R, Gadiya Y, Gribbon P, Zaliani A.
Pharmacophore-Based Machine Learning Model To Predict Ligand Selectivity for E3 Ligase Binders.
ACS Omega. 2023 Aug 9;8(33):30177-30185
doi: 10.1021/acsomega.3c02803