LSD Associated Team
(2025-2028)
Leveraging Synthetic Data From Generative Models
Partners:
- UdeM Mila (Canada)
- Inria OCKHAM (France)
Generative models are machine learning models that learn and replicate the underlying structure of data. The quality of the data they produce has reached a level at which humans can no longer reliably distinguish real from synthetic samples. This opens up virtually unlimited access to realistic synthetic data, which can be leveraged for data augmentation, especially in situations where real-world data is scarce. In fields such as physics, clinical applications, and protein design, synthetic data can enrich datasets and improve model generalization.

As synthetic content indistinguishable from real data is increasingly generated and shared online, deployed systems now face the unprecedented challenge of handling synthetic data alongside authentic data. A growing concern in generative AI is “self-consuming” models, which are retrained on their own previously generated data. Over time, this recursive process can produce overfitting artifacts, accumulated biases, and inaccuracies, ultimately causing critical model degradation known as model collapse.

Leveraging the partners’ complementary expertise in generative modelling, optimization, and open-source software, the project aims to systematically investigate the risks and benefits of interactions between learning algorithms and synthetic data. One key objective is to assess when, and to what extent, generative models can improve performance on downstream tasks. Another is to quantify the rate at which self-consuming models collapse and to develop strategies to mitigate it. More broadly, the project will explore how generative models behave and interact when deployed in shared environments with multiple models or agents: how they influence each other’s outputs, how they may cooperate or compete, and how these interactions affect overall system performance.
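The self-consuming dynamic described above can be illustrated with a toy experiment (a minimal sketch for intuition, not part of the project or any partner's codebase). A one-dimensional Gaussian "model" is repeatedly refit on samples drawn from its own previous fit; over many generations its estimated standard deviation shrinks toward zero, a simple caricature of model collapse:

```python
import numpy as np

def self_consuming_gaussian(mu0=0.0, sigma0=1.0, n=10, generations=500, seed=0):
    """Toy self-consuming loop: refit a Gaussian on its own samples.

    Each generation, the current 'model' (mu, sigma) generates n synthetic
    points, and the next model is fit on those points alone. Sampling noise
    plus the biased variance estimator make the fitted spread shrink over
    generations, so the model progressively 'collapses'.
    """
    rng = np.random.default_rng(seed)
    mu, sigma = mu0, sigma0
    stds = [sigma]
    for _ in range(generations):
        synthetic = rng.normal(mu, sigma, size=n)      # data from current model
        mu, sigma = synthetic.mean(), synthetic.std()  # retrain on own output
        stds.append(sigma)
    return stds

stds = self_consuming_gaussian()
# The fitted standard deviation decays sharply across generations.
```

With a small per-generation sample size (here n=10), the decay is rapid; larger samples slow, but do not stop, the collapse, which is one reason the project seeks quantitative rates and mitigation strategies.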