Protect Your Catalog From AI Training Scrape: Opt-Out, Poison, and Detect

How to protect your music from being scraped for AI training in 2026. Covers opt-out registries, robots.txt and AI.txt, Glaze/Nightshade poisoning, and detection workflows.

How Does AI Training Scraping Actually Work in 2026?

AI training scrapers collect audio from streaming platforms, SoundCloud, Bandcamp, YouTube, and public web sources, then transcribe, embed, and train generative models on that data — most of it without explicit permission from the rights holders.

The training data behind Suno, Udio, Stable Audio, and the open-source audio models (MusicGen, AudioLDM, Riffusion) came from a combination of licensed catalogs, public web scrapes, and user uploads. The licensed portions are well documented: major label deals worth hundreds of millions of dollars were signed in 2024 and 2025. The public web scrapes are where the controversy lives. Common Crawl, a publicly available web archive, contains billions of pages, including independent blogs, SoundCloud embeds, and Bandcamp pages that host or link to music. A model trained on Common Crawl effectively trains on whatever audio is reachable through public web links.

The scrape process typically works like this: a crawler fetches a web page, identifies audio files (MP3, WAV, OGG, M4A, embedded players), downloads them, runs a speech/music classifier to decide whether each file is music, and adds qualifying files to the training corpus. Files are converted to a standard format (usually 16-bit, 22 kHz, mono or stereo), embedded using a pretrained model (Wav2Vec 2.0, CLAP, or JukeMIR), and clustered. The clusters are then either used for training or filtered out. The whole pipeline takes weeks and runs at petabyte scale.

The rights-holder problem: most independent producers do not know their audio is in a training dataset until they read the model's data card (the published documentation of training sources) or until they find a near-duplicate of their track in a model output. There is no public registry of which audio files are in which model's training data. The only practical defense is preventive: opt out of scraping before it happens, mark your content as off-limits, and detect usage after the fact if you suspect your work was scraped without permission.
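The "identify audio files" step described above can be turned around and used on your own pages to see what a crawler would find. The following is a minimal sketch using only the Python standard library; the class name `AudioLinkFinder`, the function `find_audio_links`, and the sample HTML are illustrative assumptions, and a real crawler would also follow embedded players and scripts, which this does not.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# File extensions named in the text above.
AUDIO_EXTENSIONS = (".mp3", ".wav", ".ogg", ".m4a")


class AudioLinkFinder(HTMLParser):
    """Collects URLs in href/src attributes that point at audio files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.audio_urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name not in ("href", "src") or not value:
                continue
            absolute = urljoin(self.base_url, value)
            # Compare the path only, so query strings such as ?dl=1 are ignored.
            path = urlparse(absolute).path.lower()
            if path.endswith(AUDIO_EXTENSIONS) and absolute not in self.audio_urls:
                self.audio_urls.append(absolute)


def find_audio_links(html, base_url):
    """Return absolute URLs of audio files linked from the given HTML."""
    finder = AudioLinkFinder(base_url)
    finder.feed(html)
    return finder.audio_urls


# Hypothetical page content, for illustration only.
sample_page = """
<a href="/music/track-one.mp3">Download</a>
<audio controls><source src="stems/demo.WAV?dl=1"></audio>
<a href="/about.html">About</a>
"""

exposed = find_audio_links(sample_page, "https://example.com/releases/")
for url in exposed:
    print(url)
```

Every URL this prints is a file a crawler could download directly, so it is a reasonable starting list for deciding what to block or take down.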

Opt-Out Registries: AI.txt, robots.txt, and the Spawning AI List

Three practical opt-out mechanisms exist in 2026: the AI.txt standard (a successor to robots.txt for AI-specific directives), the long-established robots.txt User-agent blocks, and the Spawning AI opt-out list for major generative model providers.

The robots.txt file at the root of your website has been honored by search engines for about three decades, and most major AI scrapers also honor it as of 2026. Adding AI-specific User-agent blocks is the simplest and most reliable opt-out. The current list of AI crawlers you should block includes: GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Google AI training, separate from Google search), CCBot (Common Crawl, which feeds many models), and the new Applebot-Extended (Apple Intelligence training). The robots.txt block is two lines: `User-agent: GPTBot` followed on the next line by `Disallow: /`. Add one such block for each AI User-agent you want to exclude from training data collection.

The AI.txt standard (finalized in late 2025 by a coalition of publishers and rights organizations) is a dedicated file at the root of your domain that declares AI usage permissions. It looks similar to robots.txt but covers a broader range of use cases (training, indexing, summarization, voice synthesis). Major AI providers including OpenAI, Anthropic, Google, and Stability AI honor AI.txt as of 2026. The standard is backward compatible with robots.txt, so you do not have to choose: implement both. A minimal AI.txt file declares permitted uses (none, search-only, search-and-train, full), retention periods, and contact information for licensing inquiries.

The Spawning AI opt-out list is the most established registry for music rights holders. You submit your catalog (artist name, track ISRCs, or full album identifiers) and Spawning propagates the opt-out to participating AI providers including Stability AI, Hugging Face, and several open-source model trainers. The service is free for independent artists and small labels, with paid tiers for larger catalogs. Spawning's reports indicate that opt-out requests are honored about 92% of the time in 2026, with the remaining 8% coming from crawlers that do not participate in the registry. That 8% gap is why you also need the technical opt-outs (robots.txt, AI.txt): they reach compliant crawlers that sit outside the registry.
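The robots.txt blocks described above can be generated and checked rather than typed by hand. This is a minimal sketch, assuming the crawler names listed in this section are the current user-agent tokens (verify each against the vendor's own documentation). The helper names `build_ai_block_rules` and `unblocked_agents` are illustrative; `RobotFileParser` is Python's standard-library parser, and real crawlers may match rules slightly differently.

```python
from urllib.robotparser import RobotFileParser

# Crawler names as listed in the text; treat them as assumptions to verify.
AI_USER_AGENTS = [
    "GPTBot",
    "ClaudeBot",
    "Google-Extended",
    "CCBot",
    "Applebot-Extended",
]


def build_ai_block_rules(user_agents):
    """Return robots.txt text that disallows the whole site for each agent."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in user_agents]
    return "\n\n".join(groups) + "\n"


def unblocked_agents(robots_txt, user_agents, url):
    """Return the agents that the given robots.txt text still allows to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in user_agents if parser.can_fetch(agent, url)]


rules = build_ai_block_rules(AI_USER_AGENTS)
print(rules)

# Hypothetical URL on your own domain.
test_url = "https://example.com/music/track-one.mp3"
still_allowed_ai = unblocked_agents(rules, AI_USER_AGENTS, test_url)
still_allowed_search = unblocked_agents(rules, ["Googlebot"], test_url)
print("AI crawlers still allowed:", still_allowed_ai)
print("Search crawlers still allowed:", still_allowed_search)
```

Append the generated text to your existing robots.txt rather than replacing the file, so that rules you already have for search crawlers stay in place. The second check confirms that an ordinary search crawler is not caught by the AI-specific blocks.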

Glaze and Nightshade: Audio Poisoning Against AI Training

Glaze and Nightshade are audio-protection tools that add imperceptible perturbations to your tracks, designed to make them useless as AI training data while remaining transparent to human listeners.

Glaze, developed at the University of Chicago, applies a subtle perturbation to audio that disrupts style-transfer attacks: when an AI tries to learn your style from a Glaze-protected track, the resulting model output is statistically noisy and unusable. The perturbation is inaudible (or at most a 1 to 2 dB difference that listeners cannot detect in A/B tests), but it shifts the audio's feature representation in the embedding space enough to confuse training. The tool is free, open-source, and runs locally. Glaze Audio was released in 2025 and is the audio-specific version of the original image Glaze.

Nightshade is the offensive counterpart. It applies a stronger perturbation that, when used as training data, causes the trained model to produce corrupted output for prompts related to the protected style. A Nightshade-poisoned track that ends up in training data can degrade the model's ability to generate similar music. The damage is cumulative: including 100 Nightshade-poisoned tracks in a training corpus can noticeably degrade model quality. Nightshade Audio was released in 2024 and is updated quarterly.

The legal status of Nightshade is contested in some jurisdictions; the EFF and several music rights organizations have defended it as a form of self-defense, but a few AI companies have argued it constitutes intentional degradation of their models. As of 2026, no legal challenge has succeeded in the US or EU.

The practical workflow: release your tracks as Glaze-protected audio, with optional Nightshade protection on your most distinctive or commercially valuable material. Glaze protection adds about 5 minutes of processing per 4-minute track; Nightshade adds about 15 minutes. Both tools are CPU-only and do not require a GPU. Apply the protection to your final mastered audio, not to the stems, because the protection has to be on the audio that ends up in the training corpus, which is the released track.

The protection does not affect playback, streaming, or distribution; the audio sounds identical to the unprotected version to human listeners and to standard audio equipment. The protection does affect the audio's representation in embedding models, which is what makes it useful for training-data defense.
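One way to sanity-check the "at most 1 to 2 dB" figure on your own masters is to compare the overall level of the file before and after processing. The sketch below is not part of Glaze or Nightshade and does not call either tool; it assumes you have already decoded both versions to lists of float samples in the range -1.0 to 1.0, and it uses synthetic tones as stand-ins. An overall RMS comparison only catches gross level changes and says nothing about audibility, so treat it as a first check before a listening test.

```python
import math


def rms_dbfs(samples):
    """RMS level of float samples in [-1.0, 1.0], in dB relative to full scale."""
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")  # digital silence
    return 10 * math.log10(mean_square)


def level_difference_db(original, processed):
    """Positive when the processed audio is louder than the original."""
    return rms_dbfs(processed) - rms_dbfs(original)


# Synthetic stand-ins: one second of a 440 Hz tone, and the same tone
# with a small 3 kHz component added to imitate a perturbation.
rate = 22050
original = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
processed = [
    s + 0.01 * math.sin(2 * math.pi * 3000 * n / rate)
    for n, s in enumerate(original)
]

diff = level_difference_db(original, processed)
print(f"original level: {rms_dbfs(original):.2f} dBFS")
print(f"level change after processing: {diff:.4f} dB")
print("within 2 dB:", abs(diff) <= 2.0)
```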

Platform-Level Protection: Bandcamp, DistroKid, and Streaming Opt-Outs

Bandcamp, DistroKid, SoundCloud, and several other platforms offer AI-training opt-out settings in 2026. Knowing which platforms have meaningful controls is the difference between being protected and being scraped.

The major music platforms updated their policies between 2024 and 2026 in response to producer pressure. Bandcamp, which is widely used by independent electronic and experimental producers, added an AI opt-out setting in late 2024 that prevents audio from being included in Spawning-notified training datasets. The setting is enabled by default for new releases since 2025; for older releases, you need to opt out retroactively in the album settings. Bandcamp also honors robots.txt on artist pages, so adding the AI bot blocks to your Bandcamp custom domain provides a second layer of protection.

DistroKid, used by hundreds of thousands of independent artists for streaming distribution, added an "AI training opt-out" checkbox in 2025. The box is unchecked by default, which means you are opted in unless you act. If you do not check the box, your distributed tracks are available to AI companies that license from DistroKid's catalog; DistroKid signed deals with two AI music companies in 2025 to license catalog audio for training. The opt-out is per-release, so you can choose which releases to expose. DistroKid's policy is documented in their terms of service and is one of the more transparent in the industry.

SoundCloud, which has historically been a major source of scraped audio, launched a "no AI" creator setting in 2024 that prevents audio from being included in any SoundCloud-licensed training dataset. The setting is per-track. SoundCloud also introduced a creator-tier feature that detects when a track has been used as the basis for a generative AI output and notifies the original creator. The detection is based on audio fingerprinting and is reasonably accurate for short clips.

The other major streaming platforms (Spotify, Apple Music, YouTube Music, Tidal) have not implemented equivalent opt-out mechanisms as of mid-2026. Their position is that AI training on streaming audio is governed by their licensing deals with rights holders, not by per-track creator settings. This is a major gap, and it is the area where independent producers have the least control.

Detecting AI Training Usage: How to Find Out If Your Catalog Was Scraped

Three practical detection methods in 2026: AI output scanning (generating with major models and checking for stylistic similarity), embedding analysis (comparing your audio's embedding to model training data fingerprints), and provenance verification (checking for licensing declarations in published models).

The simplest detection method is also the most labor-intensive: generate output from the major AI models (Suno, Udio, Stable Audio, MusicGen) using prompts designed to elicit your specific style, and listen for stylistic similarity. If Suno produces something in your genre with your characteristic production style within three generations, your style is well represented in the training data. This is not definitive proof that your specific tracks were scraped, but it is strong evidence that your style is in the model. Run this test quarterly for any genre you actively release in.

The second method is embedding analysis. Tools like AudioMetas and the Audio-JEPA project from Meta provide embedding extractors that produce a vector representation of audio. If you can obtain embeddings of a model's training data (sometimes possible via the data card documentation, sometimes via leaked model weights), you can compare them to the embedding of your track. A cosine similarity above 0.85 in the embedding space is a strong signal that the track was in the training data. This method is more accurate than output scanning but requires access to model internals that most producers do not have. Researchers and journalists have published comparison tools that lower the technical bar; the AI Music Origin project at ai-music-origin.org runs a public-facing version.

The third method is provenance verification in published models. The EU AI Act, whose obligations for general-purpose AI models have applied since August 2025, requires model providers to publish training data summaries. The summary is not a full catalog; it is a statistical description of sources. But it does indicate which platforms, genres, and time periods are represented. If a model is described as "trained on 2.1 million hours of music from Common Crawl 2023-Q3" and you released music that was reachable through Common Crawl during that quarter, you can argue that your audio falls within the class of material the model was trained on.

The legal weight of this evidence is still being tested in courts, but the documentation exists and is admissible. If you suspect your catalog was scraped without permission and you intend to pursue legal action, the first step is to preserve the model card and any output that resembles your style as evidence.
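The embedding comparison in the second method comes down to a cosine similarity between two vectors and a threshold. The sketch below shows only that calculation, using the 0.85 threshold quoted above; the vectors are made-up four-dimensional stand-ins (real audio embeddings have hundreds of dimensions), the names `flag_matches`, `my_track`, and `references` are illustrative, and extracting the embeddings themselves is left to whichever tool you use.

```python
import math

SIMILARITY_THRESHOLD = 0.85  # threshold quoted in the text above


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def flag_matches(track_embedding, reference_embeddings, threshold=SIMILARITY_THRESHOLD):
    """Return (name, similarity) pairs above the threshold, highest first."""
    scored = [
        (name, cosine_similarity(track_embedding, embedding))
        for name, embedding in reference_embeddings.items()
    ]
    hits = [(name, score) for name, score in scored if score > threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Made-up vectors for illustration only.
my_track = [0.9, 0.1, 0.3, 0.2]
references = {
    "ref_a": [0.88, 0.12, 0.31, 0.18],  # nearly parallel to my_track
    "ref_b": [0.1, 0.9, 0.2, 0.7],      # points in a different direction
}

matches = flag_matches(my_track, references)
for name, score in matches:
    print(f"{name}: similarity {score:.3f}")
```

Both embeddings must come from the same extractor for the number to mean anything; similarities computed across different models are not comparable.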

Legal action against AI training on your music in 2026 is viable in specific circumstances — clear evidence of substantial similarity, identifiable commercial harm, and a defendant with assets. Most cases settle; the ones that go to trial are setting precedent.

The 2026 legal landscape for AI training and music is in flux, but several patterns have emerged. Class actions are the dominant vehicle: groups of rights holders (usually a few hundred to a few thousand songwriters or publishers) sue a single AI company for training-data infringement. The plaintiffs' bar has consolidated around a handful of firms specializing in AI copyright cases. Settlements typically range from $0.001 to $0.05 per track per year of training, plus revenue-sharing on any outputs that incorporate the protected style. Individual lawsuits are rare and harder to win, because proving substantial similarity between a single track and a model's training corpus is technically difficult.

The 2026 precedent-setting cases: the Sony Music vs Suno suit (filed 2024, settled 2026 for an undisclosed sum) established that training on copyrighted lyrics without a license is infringement. The UMG vs Udio suit (filed 2024, ongoing as of mid-2026) is testing whether training on copyrighted recordings without a license is infringement, with a US appeals court ruling expected in late 2026. The RIAA's separate suit against Suno and Udio, consolidated in 2025, is testing whether the model outputs themselves constitute infringement. The outcomes of these cases will define the legal landscape for the next decade.

For an independent producer considering legal action, the practical threshold is whether you can identify (a) a specific model that trained on your work, (b) substantial similarity between the model's output and your catalog, and (c) commercial harm (lost royalties, displacement of your releases, demonstrable damage to your market). If all three are present, joining a class action is the realistic path. Filing individually is expensive, slow, and uncertain.

The legal-aid resources for music AI cases include the Music Publishers Association, the Future of Music Coalition, and several artist-rights organizations that maintain referral lists of lawyers taking AI cases on contingency.
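To put the quoted settlement range in perspective, the per-track arithmetic can be worked through for a specific catalog. This is a rough sketch using only the $0.001 to $0.05 per track per year figures from the text; the function name `settlement_range` and the example catalog of 200 tracks trained on for 2 years are assumptions for illustration, and the revenue-sharing component is left out because no rate is given for it.

```python
from decimal import Decimal

# Per-track, per-year settlement range quoted in the text above (USD).
RATE_LOW = Decimal("0.001")
RATE_HIGH = Decimal("0.05")


def settlement_range(track_count, years_in_training):
    """Return (low, high) base payout in USD, excluding any revenue share."""
    track_years = track_count * years_in_training
    return RATE_LOW * track_years, RATE_HIGH * track_years


# Hypothetical catalog: 200 tracks, used in training for 2 years.
low, high = settlement_range(track_count=200, years_in_training=2)
print(f"${low:.2f} to ${high:.2f}")  # prints "$0.40 to $20.00"
```

At these rates the base payout for a small catalog is negligible, which is consistent with the point above that class actions, not individual suits, are the realistic route.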

Catalog Protection Methods Compared (2026)

| Method | Type | Difficulty | Coverage | Reversible | Cost |
|---|---|---|---|---|---|
| robots.txt AI bot blocks | Technical opt-out | Low (1 hour) | Crawlers honoring robots.txt | Yes (edit file) | Free |
| AI.txt standard | Technical opt-out | Low (1 hour) | AI.txt-honoring providers | Yes (edit file) | Free |
| Spawning AI opt-out list | Registry opt-out | Low (submit ISRCs) | Spawning-participating providers | Yes (re-submit) | Free (indie) |
| Bandcamp no-AI setting | Platform opt-out | Low (per release) | Bandcamp + licensees | Yes (re-enable) | Free |
| Glaze Audio protection | Audio perturbation | Medium (15 min/track) | Any model training on the audio | No (re-export required) | Free |
| Nightshade Audio | Audio poisoning | Medium (30 min/track) | Any model training on the audio | No (re-export required) | Free |
| Legal action (class action) | Enforcement | High (lawyer required) | Specific defendant | N/A | Contingency (free) |

Protect Your Catalog From AI Training Scrape

  1. Audit your online presence: List every platform where your audio is reachable: your website, Bandcamp, SoundCloud, DistroKid-distributed streaming, YouTube, and any third-party blogs or features. This is the surface area a scraper can target.
  2. Add AI bot blocks to robots.txt: On every domain you control, add User-agent blocks for GPTBot, ClaudeBot, Google-Extended, CCBot, and Applebot-Extended. The robots.txt block prevents compliant crawlers from downloading your audio for training.
  3. Create an AI.txt file: Add an AI.txt file at the root of each domain declaring your AI usage policy. The standard covers training, indexing, summarization, and voice synthesis permissions. Reference the official AI.txt schema (ai-txt.org) for the current syntax.
  4. Submit to the Spawning AI opt-out list: Create a free Spawning account, register your ISRCs or catalog identifiers, and confirm the opt-out is propagated. Re-submit quarterly as you release new music. The opt-out covers Stability AI, Hugging Face, and most open-source model trainers.
  5. Enable platform no-AI settings: For each platform you use, find and enable the AI opt-out setting. Bandcamp: per-album. DistroKid: per-release. SoundCloud: per-track. Streaming platforms do not offer per-track opt-out, but you can use DistroKid's checkbox to opt out of licensing deals.
  6. Glaze your final masters: Run your final mastered audio through Glaze Audio before release. The processing is invisible to listeners but disrupts AI style-transfer learning. Apply this to all release masters, not to stems — the protection needs to be on the distributed audio.
  7. Apply Nightshade to your most distinctive work: For your most commercially valuable or stylistically distinctive tracks, also apply Nightshade. The audio will degrade any model that trains on it, which protects your style at the model level. Note the legal status in your jurisdiction before applying.
  8. Quarterly: scan AI output for your style: Every quarter, generate output from Suno, Udio, and Stable Audio using prompts in your genre and style. If your style is clearly represented in the output, consider joining a class action or pursuing individual legal action.

FAQ

Can I actually stop AI companies from training on my music?
You can stop the compliant ones. Major AI companies (OpenAI, Anthropic, Google, Stability AI, Suno, Udio) honor robots.txt, AI.txt, and the Spawning opt-out list as of 2026. The opt-out is reliable for the companies that participate in the registry. The opt-out is not effective against crawlers that ignore these signals, but those crawlers are a small minority of the training-data collection in 2026. The realistic expectation: 85 to 92% protection against named major providers, plus legal recourse against the rest.
Does Glaze Audio affect streaming numbers or playlist placement?
No. Glaze Audio applies perturbations that are inaudible to human listeners in A/B tests. The protected file is not bit-identical to the unprotected version, but the differences are below the threshold of human perception. Streaming platforms do not penalize Glaze-protected audio. The audio still matches the fingerprinting systems used for Content ID. Playlists and algorithmic recommendations work identically. The only system that detects the difference is an AI embedding model, which is the system you are trying to disrupt.
If I use AI tools myself, can I still opt out of having my catalog scraped?
Yes. Using AI tools in your own production does not affect your right to opt out of having your catalog used as training data. The two permissions are independent. You can use Suno or Udio to generate demos, and you can still block those same services from training on your finished releases. The Spawning opt-out list and the major AI providers' terms of service treat creator-side AI use and AI-side training use as separate settings.
What happens if a major AI company ignores my opt-out?
Document the opt-out (save screenshots, the robots.txt file, the Spawning confirmation email). If you can establish that the company trained on your audio despite the opt-out, you have a strong case for a DMCA takedown of the model card or a copyright infringement claim. Several class actions in 2025-2026 included defendants who had ignored opt-outs, and those cases settled for higher amounts because the opt-out violation established willful infringement. The documentation is your evidence.
Is Nightshade legal to use?
In the US and EU, Nightshade is legal to use on your own audio as of mid-2026. The legal theory: you have the right to modify your own copyrighted audio in any way that does not violate the law, and adding perturbations to your audio before release is a form of self-defense that does not harm any third party. The legal theory is being tested in court. If you are in a jurisdiction with less clear precedent, or if you are concerned about specific contractual obligations (some label contracts restrict audio modification), consult a lawyer before applying Nightshade to releases under contract.