Sunday, April 5, 2026
🤖 ai

Data Poisoning in ML: Why and How People Manipulate Training Data

Do you know where your data has been? Here's what you need to know.

Source: Towards Data Science

What’s Happening

Here’s the thing: Do you know where your data has been?

Data is a sometimes overlooked but hugely important ingredient in making ML, and by extension AI, function. (Shocking, we know.)

Generative AI companies are constantly scouring the world for more data, because models can't be built without large volumes of this raw material.

The Details

Anyone who’s building or tuning a model must first collect a significant amount of data to even begin. But this reality creates some conflicting incentives.

Protecting the quality and authenticity of your data is an important component of security, because these raw materials will make or break the ML models you serve to users. Bad actors can strategically insert, mutate, or remove data from your datasets in ways you may not even notice, but which will systematically alter the behavior of your models.
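To make that concrete, here's a minimal, hypothetical sketch of the "insert" flavor of attack. The classifier (a toy nearest-centroid model), the data, and the target point are all made up for illustration; the point is only to show that adding mislabeled records can silently flip a model's prediction on an input the attacker cares about:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: class 0 clustered near (0, 0), class 1 near (4, 4).
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)),
               rng.normal(4.0, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

def nearest_centroid_predict(X_train, y_train, x):
    """Classify x by which class centroid it is closer to."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    return int(np.linalg.norm(x - c1) < np.linalg.norm(x - c0))

# An input the attacker wants misclassified; it sits in class 0's region.
target = np.array([1.8, 1.8])
print(nearest_centroid_predict(X, y, target))   # clean model predicts 0

# Poison by insertion: add 50 copies of the target labeled as class 1,
# dragging class 1's centroid toward it. The other 200 rows are untouched,
# so a spot check of the dataset would look normal.
X_poisoned = np.vstack([X, np.tile(target, (50, 1))])
y_poisoned = np.concatenate([y, np.ones(50, dtype=int)])
print(nearest_centroid_predict(X_poisoned, y_poisoned, target))  # now 1
```

Real attacks target far more complex training pipelines, but the mechanism is the same: a small, deliberate change to the data shifts what the model learns.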

Why This Matters

At the same time, creators such as artists, musicians, and authors are fighting an ongoing battle against rampant copyright violation and IP theft, driven primarily by companies that need ever more data to toss into the voracious maw of the training process. These creators want actions they can take to prevent or discourage this theft without being left at the mercy of often slow-moving courts. And as companies do their darndest to replace traditional search engines with AI-mediated search, businesses founded on being surfaced through search are struggling.

This adds to the ongoing AI race that’s captivating the tech world.

Key Takeaways

  • All three of these cases point to one concept: "data poisoning".
  • In short, data poisoning is changing the training data used to produce an ML model in some way so that the model's behavior is altered.
  • The impact is specific to the training process, so once a model artifact is created, the damage is done.
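That last takeaway can be illustrated with a toy sketch. Here the "model artifact" is just a pair of class centroids (an assumption for illustration, standing in for saved model weights): repairing the dataset after training does nothing for a model that was already built from poisoned data.

```python
import numpy as np

# Training bakes the data into the artifact; here the artifact is just
# the mean of each class's points.
X = np.array([[0.0, 0.0], [0.2, 0.1], [4.0, 4.0], [1.0, 1.0]])
y = np.array([0, 0, 1, 1])  # last row is poisoned: a class-0-looking
                            # point deliberately labeled class 1

def train(X, y):
    """Return the 'model artifact': one centroid per class."""
    return {c: X[y == c].mean(axis=0) for c in (0, 1)}

artifact = train(X, y)
print(artifact[1])          # poisoned class-1 centroid: [2.5 2.5]

# Later, the poisoned row is discovered and removed. The cleaned dataset
# produces a good model, but the already-shipped artifact is unchanged.
X_clean, y_clean = X[:3], y[:3]
clean_artifact = train(X_clean, y_clean)
print(clean_artifact[1])    # clean class-1 centroid: [4. 4.]
```

In other words, cleaning your data is remediation for the *next* training run; any model already trained on the poisoned set has to be retrained or retired.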

The Bottom Line

Data poisoning means altering the training data behind an ML model so that the model's behavior changes, whether that's an attacker corrupting your pipeline or a creator defending their work. And because the damage happens during training, it's baked into the model artifact once training is done.

Is this a W or an L? You decide.
