FaceDeer

FaceDeer@fedia.io · 2 hours ago

> Also, what do you mean by synthetic data? If it’s made by AI, that’s how collapse happens.

But that’s exactly my point. Synthetic data is made by AI, but it doesn’t cause collapse. The people who keep repeating this “AI fed on AI inevitably dies!” headline are ignorant of how this actually works, and of the details that actually matter when it comes to what causes model collapse.

If people want to oppose AI and wish for its downfall, fine, that’s their opinion. But they should do so based on real data, not an imaginary story they pass around among themselves. Model collapse isn’t a real threat to the continuing development of AI. At worst, it’s just another box that AI trainers need to tick on their “am I ready to start this training run?” checklist, alongside “have I paid my electricity bill?”

> The problem with curated data is that you have to, well, curate it, and that’s hard to do at scale.

It was, before we had AI. Turns out that that’s another aspect of synthetic data creation that can be greatly assisted by automation.

For example, the Nemotron-4 AI family that NVIDIA released a few months back is specifically intended for creating synthetic data for LLM training. Its data-generation pipeline pairs two LLMs: Nemotron-4 Instruct (which generates the training data) and Nemotron-4 Reward (which curates it). It’s not a fully automated process yet, but the requirement for human labor is drastically reduced.

> the only way to guarantee training data isn’t from its own model is to make it yourself

But that guarantee isn’t needed. AI-generated data isn’t a magical poison pill that kills anything that tries to train on it. Bad data is bad, of course, but that’s true whether it’s AI-generated or not. The same process of filtering good training data from bad training data can work on either.
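To make the "same filter works on either" point concrete, here is a minimal sketch of a generate-then-curate step. Everything in it is a stand-in I made up for illustration: `curate` and `toy_score` are hypothetical names, and the scoring rule (fraction of unique words) is a placeholder for a real reward model, not anything NVIDIA ships.

```python
from typing import Callable, Iterable, List


def curate(candidates: Iterable[str],
           score: Callable[[str], float],
           threshold: float) -> List[str]:
    """Keep only the candidates whose quality score meets the threshold.

    The filter never asks where a sample came from; it only looks at the score.
    """
    return [text for text in candidates if score(text) >= threshold]


def toy_score(text: str) -> float:
    """Hypothetical quality score: fraction of unique words (assumption, not a real reward model)."""
    words = text.split()
    return len(set(words)) / len(words) if words else 0.0


# A mixed pool: it does not matter which lines were written by a person or a model.
samples = [
    "the cat sat on the mat",
    "spam spam spam spam spam",
    "",
]

kept = curate(samples, toy_score, threshold=0.5)
```

In a real pipeline the scoring function would be a trained reward model and the threshold would be tuned, but the shape of the step is the same: score every sample, keep the good ones, regardless of origin.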

FaceDeer@fedia.io · 8 hours ago

It’s not wrong for either to draw inspiration from the other. It’s the hypocrisy that’s wrong.

FaceDeer@fedia.io · 8 hours ago

I’ve made similar points in the past in discussions about robot soldiers going to war. There’s an upside to these things that people insist on overlooking: they follow their programming. If you program a robot soldier to never shoot at an ambulance, then it will never shoot at an ambulance, even if it’s having a really bad day. Same here: if the security robot has been programmed never to leave the public sidewalk, then it’ll never leave the public sidewalk.

It’s always possible for these sorts of things to be programmed to do the wrong things, of course. But at least now we have the ability to audit that sort of thing.

FaceDeer@fedia.io · 8 hours ago

Are you suggesting that the same amount of crime is happening but they’re deciding not to report it because there’s a robot there? That’s the measure they’re touting, the reduction in crime reports.

FaceDeer@fedia.io · 8 hours ago

You joke, but presumably that’s when it recharges.

FaceDeer@fedia.io · 19 hours ago

It’s a common pattern. Something actually bad exists, and a word is invented to describe that bad thing. People want to call the things they don’t like by that bad word, even if it’s not quite right, so the definition starts to widen a bit. It’s a very bad thing so it’s good to call things you don’t like by that word, it makes everyone else hate them too! The word stretches and stretches, and eventually everything vaguely bad is called that word. It loses its meaning.

A new word is invented to describe some specific actually bad thing. Repeat.

FaceDeer@fedia.io · 22 hours ago

Things change. There was a period before this information was easily available; this repository only goes back to 2013. Now there’s a period after this information, too. Things start and eventually they end.

Here’s hoping that some neat new things start up in its place.

FaceDeer@fedia.io · 24 hours ago

They’re not both true, though. It’s actually perfectly fine for a new dataset to contain AI-generated content, especially when it’s mixed in with non-AI-generated content. It can even be better in some circumstances; that’s what “synthetic data” is all about.

The various experiments demonstrating model collapse have to go out of their way to make it happen, by deliberately recycling model outputs over and over without using any of the methods that real-world AI trainers use to ensure that it doesn’t happen. As I said, real-world AI trainers are actually quite knowledgeable about this stuff. Model collapse isn’t some surprising new development that they’re helpless in the face of; it’s just another factor to include in the criteria for curating training data sets. It’s already a “solved” problem.
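A toy illustration of the difference, with heavy caveats: this is not an LLM and not taken from any of those experiments. The "model" is just a Gaussian refitted each generation to a small sample, and `run` and `real_fraction` are names I picked for this sketch. It only shows the general mechanism: recycling pure model output shrinks the spread, while mixing in fresh real data each round keeps it from collapsing.

```python
import random
import statistics


def run(generations: int, n: int, real_fraction: float, seed: int = 0) -> float:
    """Refit a Gaussian "model" to its own samples for several generations.

    real_fraction is the share of each generation's training set drawn from
    the true distribution N(0, 1) instead of from the current model.
    Returns the fitted standard deviation after the last generation.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation zero matches the real distribution
    n_real = round(n * real_fraction)
    for _ in range(generations):
        # Samples produced by the current model (the "AI-generated" data).
        synthetic = [rng.gauss(mu, sigma) for _ in range(n - n_real)]
        # Fresh samples from the real distribution (the "non-AI" data).
        real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        data = synthetic + real
        # "Training": refit the model to this generation's data.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
    return sigma


# Pure recycling: every generation trains only on the previous model's output.
recycled = run(generations=500, n=10, real_fraction=0.0)
# Mixed: half of each generation's data is fresh real data.
mixed = run(generations=500, n=10, real_fraction=0.5)
```

Comparing `recycled` with `mixed` shows the pure-recycling run's spread shrinking toward zero while the mixed run stays on the order of the real distribution's spread. Real training pipelines use far more involved curation than this, so treat it as a cartoon of the idea, not evidence about any particular model.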

The reason these articles keep coming around is that there are a lot of people that don’t want it to be a solved problem, and love clicking on headlines that say it isn’t. I guess if it makes them feel better they can go ahead and keep doing that, but supposedly this is a technology community and I would expect there to be some interest in the underlying truth of the matter.

FaceDeer@fedia.io · 1 day ago

No, researchers in the field knew about this potential problem ages ago. It’s easy enough to work around and prevent.

People who are just on the lookout for the latest “aha, AI bad!” headline, on the other hand, discover this every couple of months.

FaceDeer@fedia.io · 1 day ago

AI long ago stopped being trained on any old random stuff that came along off the web. Training data is carefully curated and processed these days. Much of it is synthetic, in fact.

These breathless articles about model collapse dooming AI are like discovering that the sun sets at night and declaring solar power to be doomed. The people working on this stuff know about it already and long ago worked around it.

FaceDeer@fedia.io · 2 days ago

This is “technology news and articles?”

Seems like this place is increasingly just people yelling at AI-generated clouds.

FaceDeer@fedia.io · 2 days ago

Sometimes headshots develop spontaneously. It’s a rare condition, but convenient. Some claim John F. Kennedy suffered from this condition.

FaceDeer@fedia.io · 2 days ago

Last I heard they hadn’t found the knife yet.

FaceDeer@fedia.io · 3 days ago

I recall seeing a list of the most dangerous jobs in America and “President of the United States” topped it due to the high percentage of people with that job who’ve been shot.

FaceDeer@fedia.io · 4 days ago

But at least that crappy bug-riddled code has soul!

FaceDeer@fedia.io · 4 days ago

> But yeah I mean there probably would be some survivors.

This is literally the whole point I’m making. I really don’t get the downvotes, it seems perfectly straightforward.

FaceDeer@fedia.io · 4 days ago

I’m not Malthusian. What does Malthusianism have to do with this?

FaceDeer@fedia.io · 5 days ago

It’s very straightforward math based on the article you posted. It’s not saying that a nuclear war wouldn’t be bad, or shouldn’t be avoided. Of course that should be avoided.

My issue is with the people who insist that humanity as a species is at risk from nuclear war. That’s the part that’s wrong.

FaceDeer@fedia.io · 5 days ago

In Tyreek’s post-arrest press conference he asked rhetorically “what would have happened if I hadn’t been famous?”

Well, now we see. Wrist-slaps with no actual long-term impact.
