The promise and dangers of synthetic data

Is it possible for an AI to be trained on data generated by another AI? It may sound like a harebrained idea. But it’s one that’s been around for a long time — and as new, real data gets harder to come by, it’s been gaining momentum.
Anthropic used synthetic data to help train one of its flagship models, Claude 3.5 Sonnet. Meta fine-tuned its Llama 3.1 models using AI-generated data. And OpenAI is said to be sourcing synthetic training data from o1, its “reasoning” model, for the upcoming Orion.
But why does AI need data in the first place – and what kind of data does it need? And can this data really be replaced by synthetic data?
Importance of annotations
AI systems are statistical machines. Trained on many examples, they learn the patterns in those examples to make predictions, such as that “to whom” in an email often precedes “it may concern.”
Annotations, usually text labeling the meaning or parts of the data these systems ingest, are a key part of these examples. They serve as guideposts, “teaching” a model to distinguish among things, places, and ideas.
Consider an image classification model that’s shown many images of kitchens labeled with the word “kitchen.” As it trains, the model will begin to make associations between “kitchen” and general characteristics of kitchens (e.g. that they contain fridges and countertops). After training, given a photo of a kitchen that wasn’t included in the initial examples, the model should be able to identify it as such. (Of course, if the pictures of kitchens were labeled “cow,” it would identify them as cows, which underscores the importance of good annotation.)
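To make the mechanics concrete, here is a minimal, hypothetical sketch – toy hand-built features and scikit-learn, not real images or any vendor’s actual pipeline – of how labels steer what a model learns:

```python
# A minimal sketch of how annotations drive learning: a classifier fit on
# labeled feature vectors. The "features" are toy 0/1 flags standing in for
# what a real vision model would extract from pixels.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy "image features": [has_fridge, has_countertop, has_bed]
X = np.array([
    [1, 1, 0],  # labeled "kitchen"
    [1, 1, 0],  # labeled "kitchen"
    [0, 0, 1],  # labeled "bedroom"
    [0, 1, 1],  # labeled "bedroom"
])
y = ["kitchen", "kitchen", "bedroom", "bedroom"]

model = LogisticRegression().fit(X, y)

# A new, unseen example with a fridge and countertop is classified as a
# kitchen, because the labels taught the model that association.
print(model.predict([[1, 1, 0]]))  # -> ['kitchen']
# If the same rows had been mislabeled "cow", the model would predict "cow".
```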
The appetite for AI, and the need to supply labeled data for its development, has ballooned the market for annotation services. Dimension Market Research estimates that it’s worth $838.2 million today – and will be worth $10.34 billion in the next ten years. While there are no precise estimates of how many people do labeling work, a 2022 paper puts the number in the “millions.”
Companies large and small rely on workers employed by data annotation firms to create labels for AI training sets. Some of these jobs pay reasonably well, particularly if the labeling requires specialized knowledge (e.g. math expertise). Others can be backbreaking. Annotators in developing countries are paid only a few dollars per hour on average, with no benefits or guarantees of future gigs.
A drying data well
So there are humanitarian reasons to seek out alternatives to human-generated labels. But there are also practical ones.
Humans can only label so fast. Annotators also have biases that can manifest in their annotations, and, in turn, in any models trained on them. Annotators make mistakes, or get tripped up by labeling instructions. And paying humans to do things is expensive.
Data in general is expensive, for that matter. Shutterstock charges AI vendors tens of millions of dollars for access to its archives, while Reddit has made hundreds of millions by licensing data to Google, OpenAI, and others.
Finally, data is also becoming harder to find.
Many models are trained on massive collections of public data – data that owners are increasingly choosing to gate over fears it will be plagiarized, or that they won’t receive credit or attribution for it. More than 35% of the world’s top 1,000 websites now block OpenAI’s web scraper. And around 25% of data from “high-quality” sources has been restricted from the major datasets used to train models, one recent study found.
Should the current trend of restricting access continue, the research group Epoch AI projects that developers will run out of data to train generative AI models between 2026 and 2032. That has forced a reckoning for AI vendors.
Synthetic alternatives
At first glance, synthetic data would appear to be the solution to all these problems. Need annotations? Generate them. More example data? No problem. The sky’s the limit.
And to some extent, this is true.
“If ‘data is the new oil,’ synthetic data pitches itself as biofuel, creatable without the negative externalities of the real thing,” Os Keyes, a PhD candidate at the University of Washington who studies the ethical impact of emerging technologies, told TechCrunch. “You can take a small starting set of data and simulate and extrapolate new entries from it.”
The AI industry has taken this concept and run with it.
This month, Writer, an enterprise-focused generative AI company, released a model, Palmyra X 004, trained almost entirely on synthetic data. Developing it cost just $700,000, Writer claims – compared to estimates of $4.6 million for a comparably sized OpenAI model.
Microsoft’s Phi open models were trained in part using synthetic data. So were Google’s Gemma models. Nvidia this summer unveiled a family of models designed to generate synthetic training data, and AI startup Hugging Face recently released what it says is the largest AI training dataset of synthetic text.
Synthetic data generation has become a business in its own right – one that could be worth $2.34 billion by 2030. Gartner predicts that 60% of the data used for AI and analytics projects this year will be synthetically generated.
Luca Soldaini, a senior research scientist at the Allen Institute for AI, noted that synthetic data techniques can be used to generate training data in a format that isn’t easily obtained through scraping (or even content licensing). For example, in training its Movie Gen video generator, Meta used Llama 3 to create captions for footage in the training data, which humans then refined to add more detail, such as descriptions of the lighting.
Along these same lines, OpenAI says it fine-tuned GPT-4o using synthetic data to build the sketchpad-like Canvas feature for ChatGPT. And Amazon has said that it generates synthetic data to supplement the real-world data it uses to train speech recognition models for Alexa.
“Synthetic data can be used to quickly expand on human intuition about which data is needed to achieve a specific model behavior,” Soldaini said.
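Purely as an illustration of that kind of pipeline – the function names are hypothetical stand-ins, not Meta’s or OpenAI’s actual tooling – a synthetic-captioning loop might look roughly like this:

```python
# Hypothetical sketch of a synthetic-captioning pipeline: a large model drafts
# captions for unlabeled clips, then human annotators review and enrich them
# before the pairs are used as training data. `draft_caption` and
# `human_review` stand in for whatever model call and review step a real
# pipeline would use; they are not real APIs.
from typing import Callable

def build_caption_dataset(
    clips: list[str],
    draft_caption: Callable[[str], str],
    human_review: Callable[[str, str], str],
) -> list[tuple[str, str]]:
    dataset = []
    for clip in clips:
        draft = draft_caption(clip)        # model-generated first pass
        final = human_review(clip, draft)  # annotator adds detail, e.g. lighting
        dataset.append((clip, final))
    return dataset
```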
The risks of synthetic data
Synthetic data is no panacea, however. It suffers from the same “garbage in, garbage out” problem as all AI. Models create synthetic data, and if the data used to train these models has biases and limitations, their outputs will be similarly tainted. For instance, groups poorly represented in the base data will be just as poorly represented in the synthetic data.
“The problem is, you can only do so much,” Keyes said. “Say you only have 30 Black people in a dataset. Extrapolating out might help, but if those 30 people are all middle-class, or all light-skinned, that’s what the ‘representative’ data will all look like.”
To this point, a 2023 study by researchers at Rice University and Stanford found that over-reliance on synthetic data during training can create models “whose quality or diversity progressively decrease.” Sampling bias – poor representation of the real world – causes a model’s diversity to worsen after a few generations of training, according to the researchers (although they also found that mixing in some real-world data helps to mitigate this).
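A toy simulation makes that degradation easy to see. This is purely illustrative – a one-dimensional Gaussian stand-in, not the method used in the cited studies:

```python
# Toy illustration of generational training on synthetic data: each
# "generation" is fit only to samples drawn from the previous fit. Finite
# sampling error compounds, and the spread of the data tends to drift and,
# over many generations, shrink toward collapse.
import numpy as np

rng = np.random.default_rng(0)
real_data = rng.normal(loc=0.0, scale=1.0, size=1000)  # the "real world"

data = real_data
for generation in range(20):
    mu, sigma = data.mean(), data.std()
    print(f"gen {generation}: std = {sigma:.3f}")
    # Next generation trains purely on synthetic samples from the last fit.
    data = rng.normal(mu, sigma, size=50)

# Mixing fresh real data back in at each step damps the degradation, e.g.
#   data = np.concatenate([data, rng.choice(real_data, size=50)])
```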
Keyes sees additional risks in complex models such as OpenAI’s o1, which he thinks could produce harder-to-spot hallucinations in their synthetic data. These, in turn, could reduce the accuracy of models trained on that data – especially if the sources of the hallucinations aren’t easy to identify.
“Complex models hallucinate; data produced by complex models contain hallucinations,” Keyes added. “And with a model like o1, the developers themselves can’t necessarily explain why artifacts appear.”
Compounding hallucinations can lead to gibberish-spewing models. A study published in the journal Nature shows how models trained on error-ridden data generate even more error-ridden data, and how this feedback loop degrades future generations of models. Models lose their grasp of more esoteric knowledge over generations, the researchers found – becoming more generic and often producing answers irrelevant to the questions they’re asked.
Follow-up studies show that other types of models, such as image generators, aren’t immune to this sort of collapse, either.

Soldaini agrees that “raw” synthetic data isn’t to be trusted, at least if the goal is to avoid training forgetful chatbots and homogeneous image generators. Using it “safely,” he says, requires thoroughly reviewing, curating, and filtering it, and ideally pairing it with fresh, real data – just as you would with any other dataset.
Failing to do so could eventually lead to model collapse, where a model becomes less “creative” – and more biased – in its outputs, eventually seriously compromising its functionality. Though this process can be identified and arrested before it gets serious, it is a risk.
“Researchers need to examine the generated data, iterate on the generation process, and identify safeguards to remove low-quality data points,” Soldaini said. “Synthetic data pipelines are not a self-improving machine; their output must be carefully inspected and improved before being used for training.”
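As a purely illustrative sketch of the kind of safeguard Soldaini describes – the quality-scoring function is a hypothetical stand-in, not any particular library’s API – a curation step might look like this:

```python
# Hypothetical curation step: score each synthetic example with some quality
# heuristic or reward model, drop the low-quality tail, and blend the
# survivors with real data so real examples still anchor the training mix.
from typing import Callable

def curate(
    synthetic: list[str],
    real: list[str],
    quality_score: Callable[[str], float],  # stand-in for a real scorer
    threshold: float = 0.7,
    max_synthetic_ratio: float = 0.5,
) -> list[str]:
    kept = [ex for ex in synthetic if quality_score(ex) >= threshold]
    # Cap the synthetic share of the final mix at max_synthetic_ratio.
    cap = int(len(real) * max_synthetic_ratio / (1 - max_synthetic_ratio))
    return real + kept[:cap]
```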
OpenAI CEO Sam Altman once argued that AI will someday produce synthetic data good enough to effectively train itself. But – assuming that’s even feasible – the technology isn’t there yet. No major AI lab has released a model trained on synthetic data alone.
At least for the foreseeable future, it seems we’ll need humans in the loop somewhere to make sure a model’s training doesn’t go awry.