Risks and Challenges of Synthetic Data Generation in Healthcare

Sunday 09 March 2025


As synthetic data generation becomes increasingly popular, concerns are growing about its potential impact on patient privacy and healthcare disparities. Synthetic data is designed to mimic real-world patient data without revealing identifiable information, but it is not without risks.


One major issue is memorisation, a phenomenon in which AI models reproduce parts of their training corpus verbatim, potentially leaking sensitive private information. This has already happened with widely deployed GenAI models, leading to lawsuits between companies and content owners.
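As a rough illustration of what verbatim memorisation looks like in practice, the sketch below measures how many word n-grams in a generated text also appear word for word in a set of training records. It is a minimal example under stated assumptions, not a method described in the referenced paper: the function names, the choice of five-word windows, and the toy records are all hypothetical.

```python
def word_ngrams(text, n):
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def verbatim_overlap(generated, training_records, n=5):
    """Fraction of n-grams in `generated` that occur verbatim in any training record.

    Assumption: whitespace tokenisation is good enough for a first check.
    """
    generated_ngrams = word_ngrams(generated, n)
    if not generated_ngrams:
        return 0.0  # text shorter than n words: nothing to compare
    training_ngrams = set()
    for record in training_records:
        training_ngrams |= word_ngrams(record, n)
    return len(generated_ngrams & training_ngrams) / len(generated_ngrams)


# Toy, invented records for illustration only (not real patient data).
training = ["patient is a 47 year old woman with fabry disease and chronic kidney injury"]
copied = "the patient is a 47 year old woman with fabry disease"
fresh = "a middle aged man presented with mild seasonal asthma symptoms today"

print(verbatim_overlap(copied, training))  # high overlap suggests copying
print(verbatim_overlap(fresh, training))   # no shared five-word sequences
```

A high score does not prove a privacy breach on its own, since common clinical phrases recur naturally, but it can flag outputs that deserve a closer look.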


In healthcare, the stakes are even higher. Medical data is structured and repetitive, making it easier for models to learn patterns and memorise specific cases. Rare medical conditions are particularly vulnerable, as models struggle to generalise from fewer examples, increasing the risk of sensitive disclosures.


The synthetic data generation market is booming and is projected to reach $2.89 billion by 2028. However, healthcare adoption remains limited due to complex regulations and a lack of open-access data. Synthetic data could help fill this gap by generating additional data for underrepresented populations and specific conditions, but creating realistic and useful medical data is challenging, especially for rare cases.


Moreover, synthetic data may perpetuate or amplify biases from the original data, exacerbating healthcare disparities rather than promoting fairness and equity. Evaluations should comprehensively assess fidelity, utility, and privacy, addressing inherent trade-offs among them.
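To make the privacy side of such an evaluation more concrete, the sketch below computes a simple "distance to closest record" check: for each synthetic row, it finds the nearest real row and reports the share of synthetic rows that sit suspiciously close to a real one. This is one commonly used heuristic, not the evaluation procedure of the referenced paper. The function names, the threshold, and the toy (age, systolic blood pressure) rows are hypothetical, and the code assumes features are numeric and on comparable scales.

```python
import math


def distance_to_closest_record(synthetic_row, real_rows):
    """Euclidean distance from one synthetic row to its nearest real row."""
    return min(math.dist(synthetic_row, real_row) for real_row in real_rows)


def near_copy_rate(synthetic_rows, real_rows, threshold):
    """Share of synthetic rows lying within `threshold` of some real row.

    Assumption: `threshold` is chosen by the evaluator; there is no universal value.
    """
    flagged = sum(
        1
        for row in synthetic_rows
        if distance_to_closest_record(row, real_rows) <= threshold
    )
    return flagged / len(synthetic_rows)


# Toy, invented rows of (age, systolic blood pressure); not real patient data.
real = [(63.0, 140.0), (29.0, 118.0), (71.0, 155.0)]
synthetic = [(63.0, 140.0), (45.0, 130.0), (30.0, 119.0), (55.0, 125.0)]

print(near_copy_rate(synthetic, real, threshold=2.0))
```

A check like this covers only one axis. Pushing synthetic rows further from real ones tends to lower fidelity and utility, which is the trade-off the paragraph above describes.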


Regulatory frameworks will be crucial in unlocking synthetic data’s potential, safeguarding patient safety and privacy, and adapting to rapid technological advancements. The EU’s AI Act promotes responsible development by introducing synthetic data as a privacy-preserving alternative to personal data in high-risk AI systems. However, the legislation does not discuss privacy standards for synthetic data, leaving its legal status unclear.


To mitigate risks, healthcare stakeholders can take key precautions. Clinicians and developers should avoid using sensitive data to train GenAI models, reducing the risk of privacy breaches. Open-source models can offer greater transparency in managing privacy risks compared to closed platforms. Hosting LLMs on private cloud infrastructure can further reduce risks, though it does not fully eliminate the chance of sensitive data leakage.


Combining privacy-enhancing technologies, such as federated learning, synthetic data, and homomorphic encryption, may strengthen data privacy while preserving utility. Educating users and stakeholders about the risks and proper applications of GenAI is crucial, as is establishing clear policies for handling and sharing generated content.


Cite this article: “Risks and Challenges of Synthetic Data Generation in Healthcare”, The Science Archive, 2025.


Synthetic Data, AI, Patient Privacy, Healthcare Disparities, Memorisation, GenAI Models, Medical Data, Rare Medical Conditions, Regulatory Frameworks, EU AI Act


Reference: Gwénolé Abgrall, Xavier Monnet, Anmol Arora, “Synthetic Data and Health Privacy” (2025).

