AI Law · Investment Structuring

Tags: AI acquisition due diligence, AI M&A risk, GDPR AI training, training data IP liability, training data provenance M&A, unlicensed training data

Oleg Prosin · May 21, 2026

Training Data Provenance in M&A: How Acquirers Assess the Risk

AI Law · M&A Due Diligence

An AI company’s training data is its most valuable and most legally opaque asset. Unlicensed data discovered post-acquisition becomes the acquirer’s liability. Here is how sophisticated acquirers and M&A counsel assess training data risk before close.

Post-acquisition IP liability
5 data type risk profiles (customer data = critical risk)
DD checklist + remedy options (escrow, price adjustment, R&W)

Contents

1. Why Provenance Is the Core Risk: the post-acquisition liability chain
2. Risk by Data Type: 5-category risk profiles
3. What to Request in DD: data room checklist, 8 items
4. If You Find a Problem: escrow, price adjustment, reps and warranties
5. FAQ: M&A counsel questions answered
6. Commission a DD Review: provenance audit before close

The Provenance Problem

Why Training Data Provenance Is the Core M&A Risk

Standard IP due diligence checks whether the target company owns its code. Training data provenance asks a more difficult question: did the company have the right to use the data its models were trained on, commercially, at scale, and across all the jurisdictions where the model is deployed? These are fundamentally different questions, and standard IP review does not answer the second one. Our full AI DD framework is outlined in the AI Due Diligence service overview.

Acquirer Liability Warning

When you acquire an AI company, you acquire its training data liability in full. Unlicensed data used to train the model does not become licensed at closing. Third-party rights-holders retain their copyright claims relating to the model, and those claims can now be brought against the acquirer.

Why Training Data Provenance Defines M&A Risk

The liability chain from training data to post-acquisition exposure

4 Reasons

1. IP Liability: The Model Embeds the Data’s Legal Status

A trained model is not independent of its training data. Copyright law in key jurisdictions (the US, EU, and UK) can treat the reproduction of works during training as a copyright-relevant act. If the training data was not lawfully obtained or used, that legal defect carries through to every model version derived from it. You cannot “clean” a model of its provenance problems post-training without retraining from scratch.

Getty Images v. Stability AI (UK): claim alleging mass infringement via training data

New York Times v. OpenAI — memorisation and reproduction as infringement evidence

EU Copyright in the Digital Single Market Directive — TDM exception requires lawful access to source data

2. Successor Liability: Acquisition Does Not Reset Third-Party Claims

When an acquirer takes ownership of an AI company, it takes ownership of all pending and contingent liabilities, including unasserted copyright claims against the training data. The original rights-holders — news publishers, image agencies, book authors, software developers — retain their claims. The only change is that those claims are now enforceable against a better-capitalised defendant: the acquirer.

Acquisition triggers increased claim exposure: larger entity = larger damages calculation

W&I insurers increasingly require a provenance review as a condition of coverage for AI acquisitions

Reps and warranties from seller are only as good as the seller’s ability to fund a claim

3. GDPR Risk: Personal Data in Training Sets Creates GDPR Exposure

If the training dataset contains personal data — and for large internet-scraped datasets it almost certainly does — the acquirer inherits the GDPR compliance status of that data processing. Unlawful processing of personal data for AI training purposes has been the subject of enforcement action across multiple EU jurisdictions. Post-acquisition, the acquirer becomes the data controller for those purposes and is directly exposed to supervisory authority action.

ChatGPT/OpenAI — Italian DPA temporary ban (2023) for GDPR training data violations

Purpose limitation: personal data collected for one purpose cannot be repurposed for AI training without a new lawful basis

GDPR Art. 5(1)(b) purpose limitation — central issue in regulatory investigations of AI training data

4. Valuation Impact: Provenance Gaps Discount Valuation and Block W&I Coverage

Training data provenance gaps are not merely legal inconveniences — they have direct valuation consequences. W&I insurers for AI transactions now routinely exclude coverage for training data IP claims where provenance has not been verified. This means that post-close claims for unlicensed training data sit entirely with the acquirer, without insurance backstop. Sophisticated acquirers use this exposure to negotiate price adjustments or escrow arrangements before close.

W&I insurance exclusion for AI training data is now standard in major underwriting markets

Provenance-verified datasets command a premium; unverified datasets justify escrow holdbacks

Re-training cost is the floor for valuation discount; litigation exposure is the ceiling

Risk Profiles

Risk by Data Type: 5-Category Assessment

Training data risk is not uniform. Each of the five data categories below has its own risk profile, key DD questions, and red flags that require escalation.

Training Data Risk by Category

Proprietary / Collected · Licensed Third-Party · Open Internet Scraping · Synthetic Data · Customer Data

Proprietary / Collected Data: Low Risk

Data collected directly by the company from its own users or operations with appropriate consent and GDPR compliance. Lowest baseline risk, but consent scope and purpose must be verified.

Key DD Questions

What consents were obtained at point of collection?

Was AI training listed as an explicit purpose in the privacy notice?

Are data collection records complete and auditable?

Has any data subject withdrawn consent post-collection?

Is the data stored in a jurisdiction with data localisation requirements?

Red Flags to Watch For

Consent records missing or incomplete for pre-2018 data (Flag)

Privacy notices that do not mention AI training as a processing purpose (Flag)

Children’s data included in training sets — different consent requirements under GDPR Art. 8

Data collected under a different legal entity that was not formally transferred to the target

Licensed Third-Party Data: Medium Risk

Data obtained under formal licence agreements from third-party providers. Risk depends on licence scope, permitted uses, and whether change-of-control clauses apply at acquisition.

Key DD Questions

Does the licence explicitly permit use for AI model training?

Are there change-of-control provisions in the data licence?

Is the licence perpetual or time-limited? What happens at renewal?

Are there restrictions on the markets or verticals where the trained model can be deployed?

Does the licence cover sublicensing to the acquirer group?

Red Flags to Watch For

Licence silent on AI training: “analytics” use rights do not cover model training (Flag)

Change-of-control clause triggers termination or re-pricing on acquisition (Flag)

Licence granted to a specific entity that will be dissolved or restructured post-close

No-sublicensing clause that prevents transfer within the acquirer group

Open Internet Scraping: High Risk

Data collected by automated scraping of publicly accessible web content. Highest litigation exposure category. No “public” website is freely usable for commercial AI training without analysis of ToS and copyright.

Key DD Questions

What sources were scraped and when?

Were robots.txt restrictions observed during collection?

What ToS analysis was conducted for each scraped source?

Did scraping include content from jurisdictions with database rights (EU)?

Has the company received any cease-and-desist notices or legal threats?

Red Flags to Watch For

No ToS review conducted: mass scraping without legal analysis is presumptively infringing (High)

News publisher, stock image, or book content included: active litigation category (High)

Any robots.txt violations: relevant to terms of service breach and potential Computer Fraud and Abuse Act exposure (US)

No opt-out mechanism or removal process for rights-holders who request exclusion (Flag)
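The robots.txt question above can be tested mechanically against a scraping log. The following is a minimal sketch, assuming the target can supply the robots.txt text as it stood at collection time (for example from an archived copy) together with a list of fetched URLs. The function name `disallowed_fetches`, the user agent string and the sample data are hypothetical; the parsing itself uses Python's standard `urllib.robotparser` module.

```python
from urllib.robotparser import RobotFileParser


def disallowed_fetches(robots_txt: str, user_agent: str, urls: list[str]) -> list[str]:
    """Return the logged URLs that the given robots.txt did not allow for this user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt, as archived at the time of collection.
ROBOTS_TXT = """\
User-agent: *
Disallow: /archive/
"""

# Hypothetical extract from the target's scraping log.
LOGGED_URLS = [
    "https://news.example.com/today/story-1",
    "https://news.example.com/archive/2019/story-2",
]

violations = disallowed_fetches(ROBOTS_TXT, "example-crawler", LOGGED_URLS)
print(violations)
```

A non-empty result is a finding to escalate, not a legal conclusion. It shows only that the logged fetch conflicted with the robots.txt text supplied, which is why the version in force at the collection date matters more than the current one.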

Synthetic Data: Low–Medium Risk

Data generated algorithmically or by a model to augment or replace real data. Risk is lower than scraped data but not zero: the provenance of the seed data used to generate synthetic data must be traced independently.

Key DD Questions

What seed data was used to generate the synthetic dataset?

What model was used to generate synthetic data — and under what licence?

Can the synthetic data be shown to not reproduce original works (memorisation risk)?

Are there any terms of service restrictions on the generative model used for synthesis?

Is the synthetic data provably distinct from the seed data at statistical level?

Red Flags to Watch For

Synthetic data generated from a model whose ToS prohibits use of outputs for training downstream models (Flag)

No documentation of which model version generated the data — untraceable provenance chain

Seed data for synthesis was itself unlicensed scraping (inherits the scraping risk category)

High memorisation rate in quality testing — synthetic data that reproduces originals is not insulated from copyright claims
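The point that synthetic data inherits the risk category of its seed data can be expressed as a simple rule: a dataset's effective risk is the highest of its own baseline and the baselines of everything it was generated from. The sketch below is illustrative only. The category keys, the `Risk` ordering and the function name `effective_risk` are assumptions made for the example; the baseline levels are the ones given in this article's five category profiles.

```python
from enum import IntEnum


class Risk(IntEnum):
    LOW = 1
    LOW_MEDIUM = 2
    MEDIUM = 3
    HIGH = 4
    CRITICAL = 5


# Baseline levels taken from the five category profiles in this article.
BASELINE = {
    "proprietary": Risk.LOW,
    "licensed": Risk.MEDIUM,
    "scraped": Risk.HIGH,
    "synthetic": Risk.LOW_MEDIUM,
    "customer": Risk.CRITICAL,
}


def effective_risk(category: str, seed_categories: tuple[str, ...] = ()) -> Risk:
    """Baseline risk for a dataset, raised to the highest risk among its seed data."""
    levels = [BASELINE[category]] + [BASELINE[seed] for seed in seed_categories]
    return max(levels)


print(effective_risk("synthetic").name)                # LOW_MEDIUM
print(effective_risk("synthetic", ("scraped",)).name)  # HIGH
```

Taking the maximum means a synthetic dataset can never be rated below its riskiest seed source, which matches the red flag above about seed data that was itself unlicensed scraping.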

Customer Data: Critical Risk

Data provided by customers under a SaaS or service agreement and repurposed for AI model training. The highest-risk category: requires explicit contractual consent, specific GDPR lawful basis, and strict purpose limitation compliance.

Key DD Questions

Does each customer MSA explicitly permit use of their data for AI training?

Is the training use disclosed in the DPA and privacy notice provided to customers?

Have any customers objected to or opted out of training data use?

What is the GDPR lawful basis for processing customer personal data for training?

Are enterprise customers in regulated sectors (finance, health, legal) where training use creates additional obligations?

Red Flags to Watch For

Standard MSA contains no AI training permission: “service improvement” clauses are insufficient (Critical)

Training use first introduced by ToS update: users who did not affirmatively accept are not covered (Critical)

Healthcare, legal, or financial services customer data used for training without sector-specific compliance review (Critical)

Any customer complaints or data subject access requests citing training use as the basis

Customer Data Guidance

For a detailed analysis of the GDPR conditions that must be satisfied before customer data can be used for AI training, including purpose limitation, lawful basis, and contractual consent requirements, see our guide: Can I Use Customer Data to Train My AI Model?

Data Room Checklist

What to Request in DD: Training Data Provenance Checklist

These are the specific items to request from the target in the data room. Generic IP questionnaires do not cover these. Request them explicitly in the DD questionnaire, in addition to standard IP and contract requests.

Training Data Provenance — DD Request List

8 items to request from the target

Data Inventory & Sources

1. Complete training data register: all datasets used, by name, version, source and collection date (Priority)

Should itemise each dataset separately. Generic “proprietary data” descriptions are insufficient.

2. Source documentation for each dataset: scraping logs, licence agreements, consent records, API terms (Priority)

One document per dataset source. Scraping: logs with URLs, dates, robots.txt compliance. Licensed: the licence agreement itself.

3. Any legal opinions or external counsel advice obtained on training data rights (Important)

Existence of prior counsel review is both substantive evidence and an indicator of risk awareness. Request all memos, opinions and correspondence.
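A training data register of the kind requested above is, in practice, a table with one row per dataset. The sketch below shows one possible shape for such a row and a first-pass check for missing entries. The class `DatasetRecord`, its field names and the function `documentation_gaps` are hypothetical and chosen for illustration; a real register will carry more detail.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class DatasetRecord:
    """One row of a training data register (field names are illustrative)."""
    name: str
    version: str
    source: str                       # e.g. "scraped", "licensed", "proprietary"
    collection_date: Optional[date]   # None when the seller cannot state it
    source_documents: list[str] = field(default_factory=list)  # logs, licences, consent records


def documentation_gaps(register: list[DatasetRecord]) -> dict[str, list[str]]:
    """Map each dataset name to the register fields that are missing or empty."""
    gaps: dict[str, list[str]] = {}
    for record in register:
        missing = []
        if not record.source.strip():
            missing.append("source")
        if record.collection_date is None:
            missing.append("collection_date")
        if not record.source_documents:
            missing.append("source_documents")
        if missing:
            gaps[record.name] = missing
    return gaps


# Hypothetical register entries.
register = [
    DatasetRecord("web-corpus", "v2", "scraped", date(2022, 3, 1), ["scrape_log_2022.csv"]),
    DatasetRecord("legacy-text", "v1", "", None),
]
print(documentation_gaps(register))
```

Datasets that appear in the result are the ones to treat as undocumented when structuring the deal.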

Customer Data & GDPR

4. All customer MSAs and DPAs: review for AI training permission clauses (Priority)

Identify whether AI training use is explicitly permitted, silent, or excluded. Flag any customers who have opted out.

5. Privacy notices, cookie policies and consent records: confirm AI training is a disclosed processing purpose (GDPR)

GDPR purpose limitation requires that training use be disclosed at point of collection. Historic notices are as important as current ones.

Licences, Claims & Synthetic Data

6. All third-party data licence agreements: review for change-of-control, training scope and sublicensing rights (Contracts)

Pay particular attention to whether the licence permits commercial AI training and whether it survives an acquisition.

7. Correspondence, notices or claims from any rights-holders regarding training data use (Claims)

Includes cease-and-desist letters, subject access requests citing training use, opt-out requests, and any pre-litigation correspondence.

8. Synthetic data documentation: seed data sources, generative model used and its ToS, memorisation testing results (Synthetic)

If synthetic data was generated using a third-party model (GPT-4, Claude, etc.), check whether training downstream models on those outputs is permitted under the provider’s terms.

If You Find a Problem

Dealing With a Provenance Gap: Three Structuring Options

Discovering a training data provenance gap during DD does not automatically kill a deal — but it must be addressed before signing. There are three main structuring mechanisms. The right choice depends on the severity and quantifiability of the exposure.

1. Risk Allocation: Escrow Arrangement

A portion of the purchase price is held in escrow post-close and released upon satisfaction of agreed conditions — typically either the passage of a limitation period without claims, or the completion of a data remediation exercise. The escrow amount is sized to cover the estimated cost of re-training the model on a clean dataset, plus reasonable litigation defence costs for any claims that emerge in the escrow period.

Use When

Provenance gap is identified and quantifiable but remediation is feasible post-close. Typical for limited scraping exposure or a discrete dataset with unclear rights.
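The escrow sizing described above is simple arithmetic: estimated re-training cost plus defence costs for claims expected during the escrow period. The sketch below works one example. All figures are hypothetical, and two modelling choices are assumptions rather than anything stated in the article: defence costs are taken as a per-claim figure multiplied by an anticipated number of claims, and the result is capped at the purchase price.

```python
def escrow_amount(
    retraining_cost: float,
    defence_cost_per_claim: float,
    anticipated_claims: int,
    purchase_price: float,
) -> float:
    """Escrow sized as re-training cost plus defence costs, capped at the purchase price."""
    if min(retraining_cost, defence_cost_per_claim, anticipated_claims, purchase_price) < 0:
        raise ValueError("all inputs must be non-negative")
    uncapped = retraining_cost + defence_cost_per_claim * anticipated_claims
    return min(uncapped, purchase_price)


# Hypothetical deal: 2.0m re-training estimate, 0.4m defence cost per claim,
# two anticipated claims, 50m purchase price.
amount = escrow_amount(2_000_000, 400_000, 2, 50_000_000)
print(amount)  # 2800000
```

In a negotiation the inputs matter far more than the formula, so each figure should be traceable to a finding in the provenance review.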

2. Valuation: Price Adjustment

The acquisition price is reduced to reflect the cost of the provenance gap, calculated as a discount to the model’s attributable value. The discount is typically negotiated with reference to: the estimated cost of re-training on a compliant dataset, the probability and quantum of third-party claims, and the premium that a provenance-clean dataset would command in the market. Price adjustments are cleaner than escrow where the parties can agree on the quantum, and avoid the administrative complexity of an escrow agent.

Use When

Exposure is quantifiable and both parties prefer certainty at close over a post-close holdback mechanism. Best for moderate scraping exposure with estimable re-training costs.
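One way to turn the first two reference points above into a number is an expected-value calculation: re-training cost plus, for each anticipated claim, probability multiplied by quantum. The sketch below is a worked example under that assumption. The function name, the cap at the model's attributable value and all figures are hypothetical, and the market premium for provenance-clean data is deliberately left out because the article does not say how it should be combined with the other two.

```python
def price_adjustment(
    retraining_cost: float,
    claims: list[tuple[float, float]],
    model_value: float,
) -> float:
    """Discount = re-training cost + sum(probability * quantum), capped at the model's value.

    Each claim is a (probability, quantum) pair; probabilities must lie in [0, 1].
    """
    for probability, _quantum in claims:
        if not 0.0 <= probability <= 1.0:
            raise ValueError("claim probability must be between 0 and 1")
    expected_claims = sum(probability * quantum for probability, quantum in claims)
    return min(retraining_cost + expected_claims, model_value)


# Hypothetical inputs: 1.5m re-training cost, a 25% chance of a 4m claim,
# a 50% chance of a 1m claim, and a model valued at 20m.
discount = price_adjustment(1_500_000, [(0.25, 4_000_000), (0.5, 1_000_000)], 20_000_000)
print(discount)  # 3000000.0
```

The cap reflects the point made in the FAQ below: once the discount reaches the model's attributed value, the gap is no longer a pricing question.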

3. Legal Protection: Representations, Warranties & W&I Insurance

The SPA includes specific representations and warranties from the seller regarding training data provenance, lawful collection, licence scope and absence of third-party claims. The seller’s liability for breach is backed by its indemnity obligations and, where available, by warranty and indemnity (W&I) insurance. Note that W&I insurers now routinely exclude training data IP claims unless a provenance review has been conducted and disclosed. Reps and warranties alone, without a provenance audit, provide limited practical protection.

Use When

Provenance review has been conducted and confirms good practices, but residual uncertainty remains. Reps and warranties capture the “unknown unknowns” where audit cannot go. Always supplement with, not substitute for, a provenance audit.

See Also

For the full spectrum of AI-specific legal due diligence — beyond training data to model licences, API change-of-control and EU AI Act classification — see AI-Specific Legal Due Diligence: What Standard Tech DD Misses.

Running a training data provenance review on a live deal? WCR Legal delivers structured provenance audits for M&A transactions, with findings mapped to deal structure options.

Frequently Asked Questions

Training data provenance in M&A — 5 questions answered

1. Can a provenance gap discovered in DD kill the deal entirely?

Yes — in some circumstances. A deal-killing provenance gap is typically one where: (a) the training data contains personal data of EU residents processed without lawful basis, exposing the acquirer to immediate regulatory action post-close; (b) the data was sourced from rights-holders who are currently in active litigation against AI companies, creating near-certain claim exposure; or (c) the cost of re-training on a clean dataset exceeds the attributed value of the model, meaning the acquirer is effectively buying a compliance liability rather than an asset.

More commonly, provenance gaps are deal-conditioning rather than deal-killing — they require price adjustment, escrow or additional representations, but the transaction proceeds.

2. Is there a safe harbour for AI training under EU copyright law?

Partially. The EU Copyright in the Digital Single Market Directive (DSM Directive, 2019/790) provides a Text and Data Mining (TDM) exception under Article 4 for commercial uses, subject to rights-holders’ opt-out. However, the safe harbour only applies where the source data was lawfully accessed in the first place. Scraping data in violation of website ToS does not create “lawful access.”

The opt-out mechanism also matters significantly. If major rights-holders — news publishers, image agencies, book publishers — had posted a machine-readable opt-out (per Article 4(3)) before the training data was collected, using that data for commercial training is not covered by the exception. This is a key issue in the current wave of AI copyright litigation and cannot be assumed away in provenance review.
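The two conditions described in this answer, lawful access and the timing of any opt-out, lend themselves to a first-pass screen per data source. The sketch below encodes only those two conditions as stated here and is not a legal test. The function name and parameters are hypothetical, and treating an opt-out posted on the collection date itself as effective is a conservative assumption.

```python
from datetime import date
from typing import Optional


def tdm_exception_arguable(
    lawfully_accessed: bool,
    collected_on: date,
    opt_out_posted_on: Optional[date],
) -> bool:
    """First-pass screen: could the Article 4 TDM exception be argued for this source?"""
    if not lawfully_accessed:
        return False
    # Assumption: a same-day opt-out is treated as effective (conservative).
    if opt_out_posted_on is not None and opt_out_posted_on <= collected_on:
        return False
    return True


# Hypothetical sources.
print(tdm_exception_arguable(True, date(2022, 6, 1), None))               # True
print(tdm_exception_arguable(True, date(2022, 6, 1), date(2021, 1, 15)))  # False
print(tdm_exception_arguable(False, date(2022, 6, 1), None))              # False
```

A `True` result means only that the source is not ruled out on these two grounds; it still needs review by counsel.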

3. How much does a training data provenance audit cost and how long does it take?

Cost and timeline depend on the complexity of the target’s training data stack. A focused provenance review on a single-model AI company with two or three clearly documented datasets can typically be completed within a standard DD timeline (10–15 business days). A multi-model company with large scraped datasets, multiple data vendors and extensive customer data usage requires a longer, more structured audit.

The cost of a provenance audit is almost always a small fraction of the cost of discovering a provenance problem post-close. W&I insurers increasingly require a provenance review as a condition of offering coverage for AI transactions, which means the audit is not optional for properly structured acquisitions.

4. What if the seller cannot produce documentation for historic training data?

The absence of documentation is itself a significant finding. An AI company that cannot produce its training data register, collection logs, or data licences for historic datasets is either disorganised or concealing something. In DD terms, the appropriate response is to treat undocumented datasets as potentially high-risk and structure the deal accordingly — either with escrow coverage for undocumented data, specific SPA representations that the seller warrants the absence of third-party rights in the undocumented datasets, or a price reduction that reflects the uncertainty.

Do not accept verbal assurances that “the data was all properly licensed” without documentary evidence. That assurance is worthless if a third-party claim emerges post-close.

5. Does the risk profile change if the target uses open-source foundation models?

Not entirely. Using an open-source base model (Llama, Mistral, Falcon) transfers some risk from training data provenance to upstream licence terms — but both risks remain. The open-source model itself was trained on data with its own provenance; if that upstream data had provenance problems, those problems are embedded in any model fine-tuned on top. Additionally, open-source AI licences vary significantly: some are permissive (Apache 2.0), some carry commercial use restrictions (Llama community licence), and some impose conditions on how outputs can be used commercially.

The provenance audit for a fine-tuned open-source model must cover: (a) the fine-tuning dataset’s own provenance, (b) the base model’s licence terms and any restrictions on commercial fine-tuning, and (c) whether the base model’s licence survives an acquisition. For the full analysis of who owns AI outputs and the implications for acquirers, see our article on who owns AI outputs.

Training Data Provenance Audit Before You Close

WCR Legal conducts structured training data provenance audits for AI acquisitions — covering all five data type categories, mapped to deal structure options. Findings are delivered in a format suitable for SPA representations and W&I insurance applications.
