Training Data Provenance in M&A: How Acquirers Assess the Risk
An AI company’s training data is its most valuable and most legally opaque asset. Unlicensed data discovered post-acquisition becomes the acquirer’s liability. Here is how sophisticated acquirers and M&A counsel assess training data risk before close.
Why Training Data Provenance Is the Core M&A Risk
Standard IP due diligence checks whether the target company owns its code. Training data provenance asks a harder question: does the company have the right to use its training data for that purpose, commercially, at scale, and in every jurisdiction where the model is deployed? These are fundamentally different questions, and standard IP review does not answer the second one. Our full AI DD framework is outlined in the AI Due Diligence service overview.
Risk by Data Type: 5-Category Assessment
Training data risk is not uniform. Each data category has its own risk profile, triggers its own key DD questions, and carries red flags that require escalation.
What to Request in DD: Training Data Provenance Checklist
These are the specific items to request from the target in the data room. Generic IP questionnaires do not cover these. Request them explicitly in the DD questionnaire, in addition to standard IP and contract requests.
Dealing With a Provenance Gap: Three Structuring Options
Discovering a training data provenance gap during DD does not automatically kill a deal — but it must be addressed before signing. There are three main structuring mechanisms. The right choice depends on the severity and quantifiability of the exposure.
A provenance gap can kill a deal in some circumstances. A deal-killing provenance gap is typically one where: (a) the training data contains personal data of EU residents processed without lawful basis, exposing the acquirer to immediate regulatory action post-close; (b) the data was sourced from rights-holders who are currently in active litigation against AI companies, creating near-certain claim exposure; or (c) the cost of re-training on a clean dataset exceeds the attributed value of the model, meaning the acquirer is effectively buying a compliance liability rather than an asset.
More commonly, provenance gaps are deal-conditioning rather than deal-killing — they require price adjustment, escrow or additional representations, but the transaction proceeds.
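The three conditions above can be read as a simple triage rule. The sketch below is illustrative only: the `ProvenanceGap` fields and the `triage` function are hypothetical names, the figures in the usage lines are invented, and a real assessment turns on legal judgment that a boolean flag cannot capture.

```python
from dataclasses import dataclass


@dataclass
class ProvenanceGap:
    # Hypothetical inputs a DD team would fill in after review.
    eu_personal_data_without_lawful_basis: bool   # condition (a)
    rights_holders_in_active_litigation: bool     # condition (b)
    retraining_cost: float                        # estimated cost of a clean re-train
    attributed_model_value: float                 # value attributed to the model in the deal


def triage(gap: ProvenanceGap) -> str:
    """Label a gap using conditions (a)-(c); anything else is deal-conditioning."""
    if gap.eu_personal_data_without_lawful_basis:
        return "potentially deal-killing"
    if gap.rights_holders_in_active_litigation:
        return "potentially deal-killing"
    # Condition (c): re-training costs more than the model is worth to the deal.
    if gap.retraining_cost > gap.attributed_model_value:
        return "potentially deal-killing"
    return "deal-conditioning"


# Invented figures: re-training is affordable relative to the model's value.
example = ProvenanceGap(False, False, retraining_cost=2_000_000, attributed_model_value=10_000_000)
print(triage(example))  # prints "deal-conditioning"
```

A "deal-conditioning" result points to price adjustment, escrow or additional representations rather than to walking away.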
The EU text and data mining exception helps only partially. The EU Copyright in the Digital Single Market Directive (DSM Directive, 2019/790) provides a Text and Data Mining (TDM) exception under Article 4 that extends to commercial uses, subject to rights-holders’ opt-out. However, the exception applies only where the source data was lawfully accessed in the first place. Scraping data in violation of website ToS does not create “lawful access.”
The opt-out mechanism also matters significantly. If major rights-holders — news publishers, image agencies, book publishers — had posted a machine-readable opt-out (per Article 4(3)) before the training data was collected, using that data for commercial training is not covered by the exception. This is a key issue in the current wave of AI copyright litigation and cannot be assumed away in provenance review.
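One common way for a site to express a machine-readable reservation is a robots.txt rule aimed at a crawler, although it is not the only one. As a minimal sketch, assuming the DD team holds an archived copy of a source site's robots.txt as it stood when the data was collected, Python's standard `urllib.robotparser` can check whether a given crawler was disallowed. The crawler name `ExampleTrainingBot`, the domain and the file contents are hypothetical, and whether a robots.txt rule satisfies Article 4(3) is a legal question this check does not answer.

```python
from urllib.robotparser import RobotFileParser


def was_disallowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """True if the archived robots.txt disallowed `user_agent` from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse the snapshot text, no network access
    return not parser.can_fetch(user_agent, url)


# Hypothetical snapshot, assumed to predate the collection date.
ARCHIVED_ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Allow: /
"""

if was_disallowed(ARCHIVED_ROBOTS_TXT, "ExampleTrainingBot", "https://news.example/articles/1"):
    print("opt-out in place at collection time: escalate this source")
```

The check is only as good as the snapshot: the reviewer still needs evidence of what the file said on the collection date, not what it says today.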
Cost and timeline depend on the complexity of the target’s training data stack. A focused provenance review on a single-model AI company with two or three clearly documented datasets can typically be completed within a standard DD timeline (10–15 business days). A multi-model company with large scraped datasets, multiple data vendors and extensive customer data usage requires a longer, more structured audit.
The cost of a provenance audit is almost always a small fraction of the cost of discovering a provenance problem post-close. W&I insurers increasingly require a provenance review as a condition of offering coverage for AI transactions, which means the audit is not optional for properly structured acquisitions.
The absence of documentation is itself a significant finding. An AI company that cannot produce its training data register, collection logs, or data licences for historic datasets is either disorganised or concealing something. In DD terms, the appropriate response is to treat undocumented datasets as potentially high-risk and structure the deal accordingly: through escrow coverage for undocumented data, specific SPA representations in which the seller warrants the absence of third-party rights in the undocumented datasets, or a price reduction that reflects the uncertainty.
Do not accept verbal assurances that “the data was all properly licensed” without documentary evidence. That assurance is worthless if a third-party claim emerges post-close.
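In practice, this means going through the training data register and flagging every dataset for which the seller has produced no documentary evidence. The sketch below shows one way to do that. It assumes a simple register format: `DatasetRecord`, its fields and the dataset names are hypothetical and do not reflect any standard schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetRecord:
    name: str
    licence_document: Optional[str] = None  # data-room reference, if produced
    collection_log: Optional[str] = None    # data-room reference, if produced


def undocumented(register: list[DatasetRecord]) -> list[str]:
    """Names of datasets backed by neither a licence document nor a collection log."""
    return [
        record.name
        for record in register
        if not (record.licence_document or record.collection_log)
    ]


# Hypothetical register entries for illustration.
register = [
    DatasetRecord("vendor_corpus", licence_document="DR/4.2/licence.pdf"),
    DatasetRecord("first_party_logs", collection_log="DR/4.3/collection.csv"),
    DatasetRecord("legacy_scrape"),  # nothing produced by the seller
]

for name in undocumented(register):
    print(f"{name}: treat as potentially high-risk")
```

Here only `legacy_scrape` is flagged. A dataset is treated as documented if either kind of evidence exists, because first-party data will usually have a collection log but no licence. A stricter review could require both.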
Fine-tuning an open-source base model does not remove provenance risk entirely. Using an open-source base model (Llama, Mistral, Falcon) shifts some risk from training data provenance to upstream licence terms, but both risks remain. The open-source model itself was trained on data with its own provenance; if that upstream data had provenance problems, those problems are embedded in any model fine-tuned on top. Additionally, open-source AI licences vary significantly: some are permissive (Apache 2.0), some carry commercial use restrictions (Llama community licence), and some impose conditions on how outputs can be used commercially.
The provenance audit for a fine-tuned open-source model must cover: (a) the fine-tuning dataset’s own provenance, (b) the base model’s licence terms and any restrictions on commercial fine-tuning, and (c) whether the base model’s licence survives an acquisition. For a full analysis of who owns AI outputs and what that means for acquirers, see our article on the topic.
Training Data Provenance Audit Before You Close
WCR Legal conducts structured training data provenance audits for AI acquisitions — covering all five data type categories, mapped to deal structure options. Findings are delivered in a format suitable for SPA representations and W&I insurance applications.


