Limitations of Generative Models in Synthetic Data Generation Revealed

Monday 24 March 2025


A comprehensive evaluation of generative models designed to mimic real-world data has revealed significant limitations in their ability to accurately replicate complex datasets. The study, published in a recent issue of a leading scientific journal, scrutinized the performance of various algorithms used to generate synthetic data, with a focus on tabular transportation data.


The researchers tested nine different generative models, including Gaussian Copula and TabDDPM, against a range of metrics, including downstream task performance, distribution similarity, diversity, and privacy leakage. The results were striking: while some models performed well in certain areas, none excelled across the board.
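To make "distribution similarity" concrete, the sketch below compares one numeric column of real and synthetic data using the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical cumulative distribution functions. This is an illustrative choice, not necessarily the metric used in the study; the function name and the trip-duration values are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Largest absolute gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    if not real_sorted or not synth_sorted:
        raise ValueError("both samples must be non-empty")

    max_gap = 0.0
    # The empirical CDFs are step functions, so the largest gap occurs at an observed value.
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return max_gap


# Hypothetical trip durations in minutes (not taken from the study).
real_minutes = [1, 2, 3, 4]
shifted_minutes = [3, 4, 5, 6]

same = ks_statistic(real_minutes, real_minutes)
partial = ks_statistic(real_minutes, shifted_minutes)
disjoint = ks_statistic(real_minutes, [10, 11, 12, 13])
```

A value near 0 suggests the synthetic column follows the real distribution closely, while a value near 1 means the two samples barely overlap. A single per-column score like this says nothing about relationships between columns, which is one reason an evaluation would combine several metrics.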


One of the most notable findings was the prevalence of mode collapse, in which a model reproduces only a narrow subset of the patterns found in the real data rather than its full variety. This issue was particularly pronounced for categorical variables, such as zone information in transportation datasets. The researchers noted that this limitation may be due to the difficulty of handling large numbers of classes, which can lead to a lack of diversity in generated data.
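A minimal sketch of how mode collapse on a categorical column might be detected is shown here. It computes two simple quantities: the fraction of real categories that appear at all in the synthetic data, and the total variation distance between the two category distributions. Both functions and the zone labels are hypothetical illustrations, not the study's own diversity metrics.

```python
from collections import Counter


def category_coverage(real_values, synthetic_values):
    """Fraction of distinct real categories that appear at least once in the synthetic data."""
    real_categories = set(real_values)
    if not real_categories:
        raise ValueError("real_values must not be empty")
    return len(real_categories & set(synthetic_values)) / len(real_categories)


def total_variation_distance(real_values, synthetic_values):
    """Half the L1 distance between empirical category frequencies (0 = identical, 1 = disjoint)."""
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    n_real = sum(real_counts.values())
    n_synth = sum(synth_counts.values())
    if n_real == 0 or n_synth == 0:
        raise ValueError("both samples must be non-empty")
    categories = set(real_counts) | set(synth_counts)
    # Counter returns 0 for categories missing from one side.
    return 0.5 * sum(
        abs(real_counts[c] / n_real - synth_counts[c] / n_synth) for c in categories
    )


# Hypothetical pickup-zone column: the generator has collapsed onto zones A and B.
real_zones = ["A", "B", "C", "D", "A", "B", "C", "D"]
collapsed_zones = ["A", "A", "A", "B", "A", "A", "B", "A"]

coverage = category_coverage(real_zones, collapsed_zones)
tvd = total_variation_distance(real_zones, collapsed_zones)
```

In this toy case only half of the real zones are ever generated, and half of the probability mass sits in the wrong place. With hundreds of zones, low coverage of rare categories is an easy symptom to check for.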


Another key area of concern was privacy leakage: the models were found to be vulnerable to membership inference attacks. In this type of attack, an adversary tries to determine whether a particular individual’s record was part of the real data used to train the generator, potentially compromising that person’s privacy.
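One simple way to picture a membership inference attack is a distance-to-closest-record heuristic: if some synthetic row lies suspiciously close to a candidate record, the attacker guesses that the candidate was in the training data. The sketch below assumes numeric features on a comparable scale and a hand-picked threshold; it is an illustration of the idea, not the attack evaluated in the study, and all names and values are hypothetical.

```python
import math


def distance_to_closest_record(candidate, synthetic_rows):
    """Euclidean distance from the candidate to its nearest synthetic row."""
    if not synthetic_rows:
        raise ValueError("synthetic_rows must not be empty")
    return min(math.dist(candidate, row) for row in synthetic_rows)


def infer_membership(candidate, synthetic_rows, threshold):
    """Guess 'was in the training data' when a synthetic row lies within the threshold."""
    return distance_to_closest_record(candidate, synthetic_rows) <= threshold


# Hypothetical synthetic rows: (trip distance in km, trip duration in hours).
synthetic_rows = [(1.0, 2.0), (4.0, 6.0)]

near_copy = (1.0, 2.1)     # almost identical to a synthetic row
unrelated = (10.0, 10.0)   # far from every synthetic row

guess_near = infer_membership(near_copy, synthetic_rows, threshold=0.5)
guess_far = infer_membership(unrelated, synthetic_rows, threshold=0.5)
```

A generator that memorizes and replays training rows makes this kind of guess easy, which is why closeness between synthetic and real records is often treated as a warning sign for privacy leakage.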


The study also highlighted the importance of evaluation metrics in assessing generative model performance. The researchers proposed a novel graph-based metric, tailored specifically for transportation datasets, which provided a more nuanced understanding of model quality.


In contrast to previous studies, this comprehensive evaluation did not focus on a single aspect of model performance or a specific use case. Instead, it took a holistic approach, examining the strengths and weaknesses of various models across multiple metrics. This allowed the researchers to identify areas where improvements are needed and to highlight best practices for synthetic data generation.


The findings of this study have significant implications for industries that rely on generative models, such as transportation planning and policy-making. By understanding the limitations of these algorithms, developers can work towards creating more accurate and privacy-preserving synthetic data, ultimately leading to better decision-making and improved outcomes.


The research also underscores the importance of continued evaluation and improvement in the field of synthetic data generation. As datasets become increasingly complex and large-scale, it is essential that models are able to accurately replicate this data while preserving individual privacy. By pushing the boundaries of what is possible with generative models, researchers can create more realistic and useful synthetic data, ultimately benefiting society as a whole.


Cite this article: “Limitations of Generative Models in Synthetic Data Generation Revealed”, The Science Archive, 2025.


Generative Models, Synthetic Data, Transportation Data, Tabular Data, Gaussian Copula, TabDDPM, Mode Collapse, Privacy Leakage, Membership Inference Attacks, Evaluation Metrics.


Reference: Chengen Wang, Alvaro Cardenas, Gurcan Comert, Murat Kantarcioglu, “A Systematic Evaluation of Generative Models on Tabular Transportation Data” (2025).

