Training Data Provenance in M&A: How Acquirers Assess the Risk | WCR Legal

AI Law · M&A Due Diligence

An AI company’s training data is its most valuable and most legally opaque asset. Unlicensed data discovered post-acquisition becomes the acquirer’s liability. Here is how sophisticated acquirers and M&A counsel assess training data risk before close.

Contents
1. Why Provenance Is the Core Risk: the post-acquisition liability chain
2. Risk by Data Type: 5-category risk profiles
3. What to Request in DD: data room checklist, 8 items
4. If You Find a Problem: escrow, price adjustment, reps and warranties
5. FAQ: M&A counsel questions answered
6. Commission a DD Review: provenance audit before close
The Provenance Problem

Why Training Data Provenance Is the Core M&A Risk

Standard IP due diligence checks whether the target company owns its code. Training data provenance asks a more difficult question: does the company have the right to use its data for training, commercially, at scale, and in every jurisdiction where the model is deployed? These are fundamentally different questions, and standard IP review does not answer the second one. Our full AI DD framework is outlined in the AI Due Diligence service overview.

Acquirer Liability Warning
When you acquire an AI company, you acquire its training data liability in full. Unlicensed data used to train the model does not become licensed at closing. Third-party rights-holders retain their copyright claims relating to the model, and those claims can now be brought against the acquirer.
Why Training Data Provenance Defines M&A Risk
The liability chain from training data to post-acquisition exposure
4 Reasons
1. IP Liability: The Model Embeds the Data’s Legal Status
A trained model is not independent of its training data. Copyright law in key jurisdictions (the US, EU, and UK) can treat the copying of works during training as a restricted act of reproduction. If the training data was not lawfully obtained or used, that legal defect carries through to every model version derived from it. You cannot “clean” a model of its provenance problems post-training without retraining from scratch.
Getty Images v. Stability AI (UK) — ongoing claim for mass infringement via training data
New York Times v. OpenAI — memorisation and reproduction as infringement evidence
EU Copyright in the Digital Single Market Directive — TDM exception requires lawful access to source data
2. Successor Liability: Acquisition Does Not Reset Third-Party Claims
When an acquirer takes ownership of an AI company, it takes ownership of all pending and contingent liabilities, including unasserted copyright claims against the training data. The original rights-holders — news publishers, image agencies, book authors, software developers — retain their claims. The only change is that those claims are now enforceable against a better-capitalised defendant: the acquirer.
Acquisition triggers increased claim exposure: larger entity = larger damages calculation
W&I insurance for AI acquisitions requires provenance review as a condition of coverage
Reps and warranties from seller are only as good as the seller’s ability to fund a claim
3. GDPR Risk: Personal Data in Training Sets Creates GDPR Exposure
If the training dataset contains personal data — and for large internet-scraped datasets it almost certainly does — the acquirer inherits the GDPR compliance status of that data processing. Unlawful processing of personal data for AI training purposes has been the subject of enforcement action across multiple EU jurisdictions. Post-acquisition, the acquirer becomes the data controller for those purposes and is directly exposed to supervisory authority action.
ChatGPT/OpenAI — Italian DPA temporary ban (2023) for GDPR training data violations
Purpose limitation: personal data collected for one purpose cannot be repurposed for AI training without a new lawful basis
GDPR Art. 5(1)(b) purpose limitation — central issue in regulatory investigations of AI training data
4. Valuation Impact: Provenance Gaps Discount Valuation and Block W&I Coverage
Training data provenance gaps are not merely legal inconveniences; they have direct valuation consequences. W&I insurers for AI transactions now routinely exclude coverage for training data IP claims where provenance has not been verified. This means that post-close claims for unlicensed training data sit entirely with the acquirer, without an insurance backstop. Sophisticated acquirers use this exposure to negotiate price adjustments or escrow arrangements before close.
W&I insurance exclusion for AI training data is now standard in major underwriting markets
Provenance-verified datasets command a premium; unverified datasets justify escrow holdbacks
Re-training cost is the floor for valuation discount; litigation exposure is the ceiling
Risk Profiles

Risk by Data Type: 5-Category Assessment

Training data risk is not uniform. Each data category below has its own risk profile, key DD questions, and red flags that require escalation.

Training Data Risk by Category
The five data types profiled below: Proprietary / Collected; Licensed Third-Party; Open Internet Scraping; Synthetic Data; Customer Data.
Proprietary / Collected Data: Low Risk. Data collected directly by the company from its own users or operations with appropriate consent and GDPR compliance. Lowest baseline risk, but consent scope and purpose must be verified.
Key DD Questions
What consents were obtained at point of collection?
Was AI training listed as an explicit purpose in the privacy notice?
Are data collection records complete and auditable?
Has any data subject withdrawn consent post-collection?
Is the data stored in a jurisdiction with data localisation requirements?
Red Flags to Watch For
Consent records missing or incomplete for pre-2018 data (Flag)
Privacy notices that do not mention AI training as a processing purpose (Flag)
Children’s data included in training sets — different consent requirements under GDPR Art. 8
Data collected under a different legal entity that was not formally transferred to the target
Licensed Third-Party Data: Medium Risk. Data obtained under formal licence agreements from third-party providers. Risk depends on licence scope, permitted uses, and whether change-of-control clauses apply at acquisition.
Key DD Questions
Does the licence explicitly permit use for AI model training?
Are there change-of-control provisions in the data licence?
Is the licence perpetual or time-limited? What happens at renewal?
Are there restrictions on the markets or verticals where the trained model can be deployed?
Does the licence cover sublicensing to the acquirer group?
Red Flags to Watch For
Licence silent on AI training: “analytics” use rights do not cover model training (Flag)
Change-of-control clause triggers termination or re-pricing on acquisition (Flag)
Licence granted to a specific entity that will be dissolved or restructured post-close
No-sublicensing clause that prevents transfer within the acquirer group
Open Internet Scraping: High Risk. Data collected by automated scraping of publicly accessible web content. Highest litigation exposure category. No “public” website is freely usable for commercial AI training without analysis of ToS and copyright.
Key DD Questions
What sources were scraped and when?
Were robots.txt restrictions observed during collection?
What ToS analysis was conducted for each scraped source?
Did scraping include content from jurisdictions with database rights (EU)?
Has the company received any cease-and-desist notices or legal threats?
Red Flags to Watch For
No ToS review conducted: mass scraping without legal analysis is presumptively infringing (High)
News publisher, stock image, or book content included: an active litigation category (High)
Any robots.txt violations: relevant to terms of service breach and potential Computer Fraud and Abuse Act exposure (US)
No opt-out mechanism or removal process for rights-holders who request exclusion (Flag)
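The robots.txt question above can be partly tested by replaying a sample of the target's logged URLs against a robots.txt file. The following is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt content, crawler name and URLs are hypothetical, and the sketch assumes the reviewer holds a copy of the robots.txt as it stood at collection time (the live file may have changed since).

```python
from typing import List
from urllib.robotparser import RobotFileParser


def disallowed_urls(robots_txt: str, user_agent: str, urls: List[str]) -> List[str]:
    """Return the logged URLs that the given robots.txt rules disallow."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical archived robots.txt and scraping-log sample.
ROBOTS = """\
User-agent: *
Disallow: /archive/
"""
LOGGED = [
    "https://news.example/archive/2021/story.html",
    "https://news.example/about.html",
]

violations = disallowed_urls(ROBOTS, "example-crawler", LOGGED)
print(violations)
```

A non-empty result is a prompt for further legal analysis of the affected sources, not a finding of infringement in itself.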
Synthetic Data: Low–Medium Risk. Data generated algorithmically or by a model to augment or replace real data. Risk is lower than scraped data but not zero: the provenance of the seed data used to generate synthetic data must be traced independently.
Key DD Questions
What seed data was used to generate the synthetic dataset?
What model was used to generate synthetic data — and under what licence?
Can the synthetic data be shown to not reproduce original works (memorisation risk)?
Are there any terms of service restrictions on the generative model used for synthesis?
Is the synthetic data provably distinct from the seed data at a statistical level?
Red Flags to Watch For
Synthetic data generated from a model whose ToS prohibits use of outputs for training downstream models (Flag)
No documentation of which model version generated the data — untraceable provenance chain
Seed data for synthesis was itself unlicensed scraping (inherits the scraping risk category)
High memorisation rate in quality testing — synthetic data that reproduces originals is not insulated from copyright claims
Customer Data: Critical Risk. Data provided by customers under a SaaS or service agreement and repurposed for AI model training. The highest-risk category: requires explicit contractual consent, a specific GDPR lawful basis, and strict purpose limitation compliance.
Key DD Questions
Does each customer MSA explicitly permit use of their data for AI training?
Is the training use disclosed in the DPA and privacy notice provided to customers?
Have any customers objected to or opted out of training data use?
What is the GDPR lawful basis for processing customer personal data for training?
Are enterprise customers in regulated sectors (finance, health, legal) where training use creates additional obligations?
Red Flags to Watch For
Standard MSA contains no AI training permission: “service improvement” clauses are insufficient (Critical)
Training use first introduced by ToS update: users who did not affirmatively accept are not covered (Critical)
Healthcare, legal, or financial services customer data used for training without sector-specific compliance review (Critical)
Any customer complaints or data subject access requests (DSARs) citing training use as the basis
Customer Data Guidance
For a detailed analysis of the GDPR conditions that must be satisfied before customer data can be used for AI training, including purpose limitation, lawful basis, and contractual consent requirements, see our guide: Can I Use Customer Data to Train My AI Model?
Data Room Checklist

What to Request in DD: Training Data Provenance Checklist

These are the specific items to request from the target in the data room. Generic IP questionnaires do not cover these. Request them explicitly in the DD questionnaire, in addition to standard IP and contract requests.

Training Data Provenance — DD Request List
8 items to request from the target
Data Inventory & Sources
1. [Priority] Complete training data register: all datasets used, by name, version, source and collection date.
   Should itemise each dataset separately. Generic “proprietary data” descriptions are insufficient.
2. [Priority] Source documentation for each dataset: scraping logs, licence agreements, consent records, API terms.
   One document per dataset source. Scraping: logs with URLs, dates, robots.txt compliance. Licensed: the licence agreement itself.
3. [Important] Any legal opinions or external counsel advice obtained on training data rights.
   Existence of prior counsel review is both substantive evidence and an indicator of risk awareness. Request all memos, opinions and correspondence.
Customer Data & GDPR
4. [Priority] All customer MSAs and DPAs: review for AI training permission clauses.
   Identify whether AI training use is explicitly permitted, silent, or excluded. Flag any customers who have opted out.
5. [GDPR] Privacy notices, cookie policies and consent records: confirm AI training is a disclosed processing purpose.
   GDPR purpose limitation requires that training use be disclosed at point of collection. Historic notices are as important as current ones.
Licences, Claims & Synthetic Data
6. [Contracts] All third-party data licence agreements: review for change-of-control, training scope and sublicensing rights.
   Pay particular attention to whether the licence permits commercial AI training and whether it survives an acquisition.
7. [Claims] Correspondence, notices or claims from any rights-holders regarding training data use.
   Includes cease-and-desist letters, subject access requests citing training use, opt-out requests, and any pre-litigation correspondence.
8. [Synthetic] Synthetic data documentation: seed data sources, generative model used and its ToS, memorisation testing results.
   If synthetic data was generated using a third-party model (GPT-4, Claude, etc.), check whether training downstream models on those outputs is permitted under the provider’s terms.
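To make the register request concrete, a reviewer might capture each dataset as a structured record and check it for obvious gaps. This is a minimal sketch rather than a prescribed format: the field names follow the register item above (name, version, source, collection date, source documents), while the `GENERIC_NAMES` list, the gap labels and the sample entries are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

# Hypothetical labels treated as too generic to identify a dataset.
GENERIC_NAMES = {"proprietary data", "internal data", "various sources"}


@dataclass
class DatasetEntry:
    name: str
    version: str
    source: str
    collection_date: Optional[date]
    source_documents: List[str] = field(default_factory=list)


def register_gaps(entry: DatasetEntry) -> List[str]:
    """List the register fields that are missing or too generic to rely on."""
    gaps = []
    if entry.name.strip().lower() in GENERIC_NAMES:
        gaps.append("generic description")
    if not entry.version.strip():
        gaps.append("missing version")
    if not entry.source.strip():
        gaps.append("missing source")
    if entry.collection_date is None:
        gaps.append("missing collection date")
    if not entry.source_documents:
        gaps.append("no source documentation")
    return gaps


# Hypothetical register: one well-documented dataset, one undocumented one.
register = [
    DatasetEntry("support-tickets", "v3", "own SaaS platform",
                 date(2022, 5, 1), ["consent-records.pdf"]),
    DatasetEntry("Proprietary data", "", "", None),
]
report = {entry.name: register_gaps(entry) for entry in register}
print(report)
```

An entry with any gaps goes back to the target as a follow-up request before the dataset is risk-rated.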
If You Find a Problem

Dealing With a Provenance Gap: Three Structuring Options

Discovering a training data provenance gap during DD does not automatically kill a deal — but it must be addressed before signing. There are three main structuring mechanisms. The right choice depends on the severity and quantifiability of the exposure.

01. Risk Allocation: Escrow Arrangement
A portion of the purchase price is held in escrow post-close and released upon satisfaction of agreed conditions — typically either the passage of a limitation period without claims, or the completion of a data remediation exercise. The escrow amount is sized to cover the estimated cost of re-training the model on a clean dataset, plus reasonable litigation defence costs for any claims that emerge in the escrow period.
Use When
Provenance gap is identified and quantifiable but remediation is feasible post-close. Typical for limited scraping exposure or a discrete dataset with unclear rights.
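The sizing and release logic described above can be sketched in a few lines. This is an illustration under stated assumptions, not a drafting template: the function names and all figures are hypothetical, and a real escrow agreement would define the release conditions far more precisely.

```python
from datetime import date


def escrow_amount(retraining_cost: float, defence_costs: float) -> float:
    """Escrow sized as clean-dataset re-training cost plus litigation defence costs."""
    return retraining_cost + defence_costs


def escrow_releasable(today: date, limitation_end: date,
                      claims_received: int, remediation_complete: bool) -> bool:
    """Release if the limitation period has passed without claims,
    or if the data remediation exercise has been completed."""
    period_passed_clean = today >= limitation_end and claims_received == 0
    return period_passed_clean or remediation_complete


# Hypothetical figures for illustration only.
amount = escrow_amount(2_000_000, 750_000)
too_early = escrow_releasable(date(2026, 1, 1), date(2027, 1, 1), 0, False)
after_remediation = escrow_releasable(date(2026, 1, 1), date(2027, 1, 1), 0, True)
print(amount, too_early, after_remediation)
```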
02. Valuation: Price Adjustment
The acquisition price is reduced to reflect the cost of the provenance gap, calculated as a discount to the model’s attributable value. The discount is typically negotiated with reference to: the estimated cost of re-training on a compliant dataset, the probability and quantum of third-party claims, and the premium that a provenance-clean dataset would command in the market. Price adjustments are cleaner than escrow where the parties can agree on the quantum, and avoid the administrative complexity of an escrow agent.
Use When
Exposure is quantifiable and both parties prefer certainty at close over a post-close holdback mechanism. Best for moderate scraping exposure with estimable re-training costs.
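One way to frame the negotiation is as an expected-cost calculation over the three reference points named above. The sketch below assumes the components are simply additive and that each claim can be given a probability and a quantum; both are simplifying assumptions, and the names and figures are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Claim:
    label: str
    probability: float  # estimated likelihood the claim is brought and succeeds (0 to 1)
    quantum: float      # estimated cost if it does


def expected_claim_cost(claims: List[Claim]) -> float:
    """Probability-weighted sum of third-party claim exposure."""
    return sum(c.probability * c.quantum for c in claims)


def indicative_discount(retraining_cost: float, claims: List[Claim],
                        clean_data_premium: float = 0.0) -> float:
    """Re-training cost plus expected claims plus any clean-dataset premium."""
    return retraining_cost + expected_claim_cost(claims) + clean_data_premium


# Hypothetical inputs for illustration only.
claims = [
    Claim("news publisher content", 0.25, 4_000_000),
    Claim("stock image content", 0.5, 1_000_000),
]
discount = indicative_discount(2_000_000, claims)
adjusted_price = 20_000_000 - discount
print(discount, adjusted_price)
```

The output is a starting point for negotiation only; the parties still have to agree the inputs, which is where most of the disagreement sits.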
03. Legal Protection: Representations, Warranties & W&I Insurance
The SPA includes specific representations and warranties from the seller regarding training data provenance, lawful collection, licence scope and absence of third-party claims. The seller’s liability for breach is backed by its indemnity obligations and, where available, by warranty and indemnity (W&I) insurance. Note that W&I insurers now routinely exclude training data IP claims unless a provenance review has been conducted and disclosed. Reps and warranties alone, without a provenance audit, provide limited practical protection.
Use When
Provenance review has been conducted and confirms good practices, but residual uncertainty remains. Reps and warranties capture the “unknown unknowns” that an audit cannot reach. Use them to supplement a provenance audit, never to replace one.
See Also
For the full spectrum of AI-specific legal due diligence — beyond training data to model licences, API change-of-control and EU AI Act classification — see AI-Specific Legal Due Diligence: What Standard Tech DD Misses.
Running a training data provenance review on a live deal? WCR Legal delivers structured provenance audits for M&A transactions, with findings mapped to deal structure options.
Frequently Asked Questions
Training data provenance in M&A — 5 questions answered
1. Can a provenance gap discovered in DD kill the deal entirely?

Yes — in some circumstances. A deal-killing provenance gap is typically one where: (a) the training data contains personal data of EU residents processed without lawful basis, exposing the acquirer to immediate regulatory action post-close; (b) the data was sourced from rights-holders who are currently in active litigation against AI companies, creating near-certain claim exposure; or (c) the cost of re-training on a clean dataset exceeds the attributed value of the model, meaning the acquirer is effectively buying a compliance liability rather than an asset.

More commonly, provenance gaps are deal-conditioning rather than deal-killing — they require price adjustment, escrow or additional representations, but the transaction proceeds.
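The three conditions in this answer can be written as a simple first-pass triage. This is a rough sketch: the function and parameter names are hypothetical, and the result is a prompt for counsel's judgement, not a substitute for it.

```python
def classify_gap(eu_personal_data_without_lawful_basis: bool,
                 sourced_from_actively_litigating_rights_holders: bool,
                 retraining_cost: float,
                 attributed_model_value: float) -> str:
    """First-pass triage of a provenance gap using conditions (a) to (c)."""
    if (eu_personal_data_without_lawful_basis
            or sourced_from_actively_litigating_rights_holders
            or retraining_cost > attributed_model_value):
        return "potentially deal-killing"
    return "deal-conditioning"


# Hypothetical examples for illustration only.
limited_gap = classify_gap(False, False, 1_000_000, 10_000_000)
uneconomic_gap = classify_gap(False, False, 12_000_000, 10_000_000)
print(limited_gap, uneconomic_gap)
```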

2. Is there a safe harbour for AI training under EU copyright law?

Partially. The EU Copyright in the Digital Single Market Directive (DSM Directive, 2019/790) provides a Text and Data Mining (TDM) exception under Article 4 for commercial uses, subject to rights-holders’ opt-out. However, the safe harbour only applies where the source data was lawfully accessed in the first place. Scraping data in violation of website ToS does not create “lawful access.”

The opt-out mechanism also matters significantly. If major rights-holders — news publishers, image agencies, book publishers — had posted a machine-readable opt-out (per Article 4(3)) before the training data was collected, using that data for commercial training is not covered by the exception. This is a key issue in the current wave of AI copyright litigation and cannot be assumed away in provenance review.

3. How much does a training data provenance audit cost and how long does it take?

Cost and timeline depend on the complexity of the target’s training data stack. A focused provenance review on a single-model AI company with two or three clearly documented datasets can typically be completed within a standard DD timeline (10–15 business days). A multi-model company with large scraped datasets, multiple data vendors and extensive customer data usage requires a longer, more structured audit.

The cost of a provenance audit is almost always a small fraction of the cost of discovering a provenance problem post-close. W&I insurers increasingly require a provenance review as a condition of offering coverage for AI transactions, which means the audit is not optional for properly structured acquisitions.

4. What if the seller cannot produce documentation for historic training data?

The absence of documentation is itself a significant finding. An AI company that cannot produce its training data register, collection logs, or data licences for historic datasets is either disorganised or concealing something. In DD terms, the appropriate response is to treat undocumented datasets as potentially high-risk and structure the deal accordingly — either with escrow coverage for undocumented data, specific SPA representations that the seller warrants the absence of third-party rights in the undocumented datasets, or a price reduction that reflects the uncertainty.

Do not accept verbal assurances that “the data was all properly licensed” without documentary evidence. That assurance is worthless if a third-party claim emerges post-close.

5. Does the risk profile change if the target uses open-source foundation models?

Not entirely. Using an open-source base model (Llama, Mistral, Falcon) transfers some risk from training data provenance to upstream licence terms — but both risks remain. The open-source model itself was trained on data with its own provenance; if that upstream data had provenance problems, those problems are embedded in any model fine-tuned on top. Additionally, open-source AI licences vary significantly: some are permissive (Apache 2.0), some carry commercial use restrictions (Llama community licence), and some impose conditions on how outputs can be used commercially.

The provenance audit for a fine-tuned open-source model must cover: (a) the fine-tuning dataset’s own provenance, (b) the base model’s licence terms and any restrictions on commercial fine-tuning, and (c) whether the base model’s licence survives an acquisition. For the full analysis of who owns AI outputs and the implications for acquirers, see our article on who owns AI outputs.

Training Data Provenance Audit Before You Close

WCR Legal conducts structured training data provenance audits for AI acquisitions — covering all five data type categories, mapped to deal structure options. Findings are delivered in a format suitable for SPA representations and W&I insurance applications.

Oleg Prosin is the Managing Partner at WCR Legal, focusing on international business structuring, regulatory frameworks for FinTech companies, digital assets, and licensing regimes across various jurisdictions. He works with founders and investment firms on compliance, operating models, and cross-border expansion strategies.
