I am hearing about people trying to use LLMs to create synthetic data for training ML models, and I am a little confused. Aren't LLMs just predicting probabilities over words? How would they know the data distribution required for the training data? My understanding is that an LLM alone cannot do that; it needs to be paired with a traditional generative model (like a VAE). Am I thinking about this the wrong way? Are LLMs really that capable? My understanding of the technology says otherwise, so I would like to correct myself if I am wrong.
You can play with the sampling settings (temperature, top-p, etc.) to change the output probabilities.
You can also use prompt engineering to push for uniqueness and variety in the generated records.
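As a rough illustration of both ideas, here is a minimal sketch using the OpenAI Python SDK; the model name, prompt wording, field names, and sampling values are assumptions for illustration, not something from this thread:

```python
# Minimal sketch: sampling settings + prompt engineering for variety.
# Assumes the OpenAI Python SDK (>=1.0); model, prompt, and fields are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

seen = set()  # track previous outputs to discourage repeats
prompt = (
    "Generate one fictional employee record as JSON with fields "
    "name, department, and monthly_salary_tzs. "
    "Do not repeat any of these previously generated records: {previous}."
)

for _ in range(5):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user",
                   "content": prompt.format(previous=sorted(seen) or "none")}],
        temperature=1.2,  # higher temperature flattens the token probabilities -> more variety
        top_p=0.95,       # nucleus sampling cutoff
    )
    record = response.choices[0].message.content
    seen.add(record)  # crude uniqueness check on the raw text
    print(record)
```

This changes how diverse the samples look, but it does not make the outputs follow any particular target distribution, which is the point raised in the answer below.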
You're right: an LLM alone can't produce statistically faithful synthetic data, because it predicts tokens based on patterns in its training corpus, not the actual distribution of your target dataset. For example, if you ask an LLM to generate fake employee records, it might produce realistic-looking names and salaries, but it won't know Tanzania's actual civil service pay structure unless it was specifically trained or conditioned on that data.
LLMs can still help in limited ways, like creating test examples for ESS Utumishi's login system (e.g., sample error messages). But for accurate synthetic data you'd need to combine LLMs with traditional generative models (like VAEs fitted to the real data) and validate the output against real-world statistics. Otherwise you risk plausible-looking outputs that don't match reality.
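To make the "real-world validation" step concrete, here is a minimal sketch (not from the original answer) that compares a salary column parsed from LLM-generated records against a real reference sample using a two-sample Kolmogorov-Smirnov test; the sample values and the 0.05 threshold are illustrative assumptions:

```python
# Minimal sketch: check whether LLM-generated salaries resemble a real sample.
# Both arrays and the 0.05 threshold are illustrative placeholders.
import numpy as np
from scipy.stats import ks_2samp

real_salaries = np.array([530_000, 610_000, 720_000, 845_000, 910_000])           # real reference sample
generated_salaries = np.array([400_000, 1_200_000, 2_500_000, 300_000, 950_000])  # parsed from LLM output

# Two-sample Kolmogorov-Smirnov test: a small p-value suggests the distributions differ.
stat, p_value = ks_2samp(generated_salaries, real_salaries)
print(f"KS statistic={stat:.3f}, p-value={p_value:.3f}")

if p_value < 0.05:
    print("Generated data likely does not match the real distribution; "
          "consider fitting a VAE on the real data or reworking the prompts.")
else:
    print("No significant evidence of mismatch (small samples are weak evidence, though).")
```

The point of a check like this is that the LLM has no built-in notion of your data's distribution, so any claim of realism has to come from an external comparison against real data, not from the model itself.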