AI Model Licensing and Legal Risks: What Developers Often Overlook

Most AI developers focus on the code. The legal risks hide in the training data, the model outputs, the upstream licences they inherit, and the regulations they didn't know applied to their deployment. This guide maps the blind spots — and what to do before they become a crisis.

Topics: Training data copyright · Output liability · Licence compliance gaps · GDPR & scraping risks · EU AI Act exposure · Developer accountability

Introduction: The legal risks that catch developers by surprise

The vast majority of AI developers are not trying to break the law. They are building products, shipping models, and solving problems. The legal risks that emerge from AI development are not typically the result of bad intent — they are the result of legal frameworks that were not designed with AI in mind, applied to practices that have become standard in the developer community. Scraping public web data to train a model. Using open-source model weights without reading the licence. Deploying a model in a regulated sector without engaging with the sector-specific rules. Each of these is a common practice. Each carries legal risk that is frequently neither recognised nor managed.

This guide does not assume legal wrongdoing. It assumes that a developer who understands the legal risk is better positioned to address it than one who discovers it in a cease-and-desist letter, a regulatory inquiry, or a due diligence process that stalls a funding round. The goal is to make the legal risks visible before they become legal problems.

🌐 The public data myth

"If it's publicly accessible, we can use it for training."

Public accessibility does not transfer copyright, waive licence conditions, or create a lawful basis for processing personal data. The fact that content can be scraped is legally irrelevant to whether scraping it for training is permitted.

📦 The open-source shield myth

"We're using an open-source model, so we have no licence obligations."

Open-source licences are legal instruments, not permissions without conditions. Most open LLM licences include restrictions on commercial use, derivative works, and downstream deployment that developers routinely overlook or misread.

🏗️ The builder's exemption myth

"We're just building the tool — what users do with it isn't our problem."

Liability for AI outputs is not automatically insulated by the developer's intent. Depending on how the model is designed, deployed, and marketed, developers can face direct liability for outputs that cause harm — independent of the user who prompted them.

1. Training data risks (High exposure): Copyright in scraped data, consent and lawful basis for personal data, database rights, and the legal effect of robots.txt and terms of service on training data collection.

2. Output liability risks (High exposure): Defamation in AI-generated content, copyright infringement in outputs, and liability for harmful, discriminatory, or professionally regulated advice outputs.

3. Upstream licence compliance risks (Medium–high exposure): Copyleft contamination from incorporated open-source models, RAIL clause pass-through failures, CLA gaps, and commercial threshold breaches.

4. Regulatory deployment risks (Emerging & growing): EU AI Act provider obligations, sector-specific rules (healthcare, finance, employment), GDPR obligations at inference, and non-compliance penalties.
Framing: the AI legal risk stack

Unlike a conventional software product, an AI model accumulates legal risk at every stage of its existence: during data collection (copyright, consent, terms of service); during training (GDPR, database rights, trade secret misappropriation); at release (licence obligations, open-source compliance); at deployment (sector regulation, EU AI Act, liability for outputs); and in use (ongoing output liability, downstream licence pass-through, data processing obligations). Each layer is independent — compliance at one layer does not resolve exposure at another.

The sections that follow address each layer in turn, identifying the specific risks that developers most commonly fail to anticipate, the legal basis for those risks, and the practical steps that reduce or manage the exposure.

The legal landscape for AI is still developing. Courts in the US, EU, and UK are actively working through the questions of copyright infringement by training, developer liability for AI outputs, and the boundaries of open-source compliance obligations. The fact that many of these questions are unresolved does not mean the risk is low — it means the risk is uncertain, which is frequently worse from a commercial and investor perspective than a known liability.

1. Training data risks: copyright, scraping, and GDPR

The training dataset is the foundation of every AI model — and for many developers, it is also the primary source of unmanaged legal exposure. Three independent legal frameworks apply to training data simultaneously: copyright law, data protection law, and contract law (in the form of website terms of service and data licence conditions). Compliance with one does not ensure compliance with the others.

©️ Copyright in training data (High — active litigation worldwide)

Copyright protects original expression — text, images, code, music, and other creative works — automatically upon creation, without registration. The owner of copyright in a work has the exclusive right to reproduce it. Training an AI model on copyrighted text involves making copies of that text: downloading it, processing it, and using it to adjust the model's parameters. Whether this constitutes copyright infringement is currently the subject of active litigation in the US (Andersen v. Stability AI, The New York Times v. OpenAI, and others), the UK (Getty Images v. Stability AI), and Europe.

The legal defences available to model developers vary by jurisdiction. The US "fair use" doctrine may protect transformative training uses — but fair use is a case-by-case analysis, not a blanket permission. The EU text and data mining (TDM) exception under the Digital Single Market Directive permits TDM for research purposes and for any purpose where the rightsholder has not expressly opted out. The opt-out mechanism is novel and its scope is disputed. UK law has a research TDM exception but its application to commercial AI training is contested.

Primary risk: A model trained on copyrighted text can regenerate or closely paraphrase that content, creating an infringement claim against both the model developer and downstream users.

Who is affected: All developers who scraped or used large-scale web datasets (Common Crawl, C4, LAION) without verifying rights clearance for commercial AI training.

Jurisdiction variance: US: fair use defence active but uncertain. EU: TDM exception with opt-out risk. UK: commercial use not clearly covered.

Severity: High — active litigation, potential injunctions against model distribution, statutory damages in the US.

Risk reduction steps: Document training data provenance for all datasets used. For commercial models, verify that datasets were assembled under licences that permit commercial AI training (not just research). Monitor for dataset-specific opt-out registrations (e.g., the EU TDM opt-out) and exclude opted-out sources before training or fine-tuning runs.
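The provenance and opt-out exclusion steps can be operationalised in the data pipeline. A minimal sketch, assuming a hypothetical `DatasetProvenance` record whose field names and licence-basis labels are illustrative, not a standard schema:

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical provenance record for one dataset component; adapt fields
# to your own documentation requirements.
@dataclass
class DatasetProvenance:
    name: str                # e.g. a scrape or licensed-corpus identifier
    source_url: str
    collection_method: str   # "scrape", "licensed", "first-party", "synthetic"
    collected_on: date
    licence_basis: str       # "licensed-commercial", "tdm-exception", "research-only"
    tdm_opted_out: bool = False

def commercially_usable(records: list[DatasetProvenance]) -> list[DatasetProvenance]:
    """Keep only components cleared for commercial training: drop TDM
    opt-outs and anything relying on a research-only basis."""
    return [r for r in records
            if not r.tdm_opted_out and r.licence_basis != "research-only"]

corpus = [
    DatasetProvenance("news-site-scrape", "https://example.com", "scrape",
                      date(2024, 3, 1), "tdm-exception", tdm_opted_out=True),
    DatasetProvenance("licensed-books", "https://example.org", "licensed",
                      date(2024, 1, 15), "licensed-commercial"),
]
cleared = commercially_usable(corpus)
```

Running the exclusion before each training run means the provenance record doubles as the audit trail that a due diligence process will ask for.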

🔒 GDPR and personal data in training datasets (High — enforcement increasing across EU)

Personal data — any information that relates to an identifiable individual — is pervasive in large-scale web datasets. Names, email addresses, social media posts, forum contributions, and professional profiles are all personal data. Processing personal data for AI training is a processing activity under GDPR, which means it requires a lawful basis. Legitimate interest is frequently claimed, but its application to large-scale commercial AI training is highly contested and has been challenged by multiple EU data protection authorities, including the Italian DPA (Garante), which temporarily suspended ChatGPT in Italy in 2023.

Even more complex is the ongoing obligation: once personal data has been incorporated into model weights through training, there is no fully satisfactory technical mechanism to "delete" that data in response to a data subject access or erasure request. The model cannot be queried to confirm whether an individual's data is "in" the weights, and retraining to remove a specific data subject is computationally prohibitive at scale. GDPR data subject rights under Articles 15–21 nonetheless apply — and regulators are beginning to enforce compliance at the training data layer.

Lawful basis risk: Legitimate interest for commercial AI training is contested. Consent is impractical at scale for web-scraped data. The research exception does not cover commercial use.

Data subject rights risk: Erasure and access requests cannot be fully honoured for personal data embedded in model weights — a fundamental GDPR compliance gap.

Special category risk: Training data may contain special category personal data (health, political opinions, religion) without the developer's awareness — requiring explicit consent or a research exemption.

Penalty exposure: GDPR fines up to 4% of global annual turnover. The Italian, French, and Irish DPAs have all opened investigations into major AI providers.

Risk reduction steps: Conduct a DPIA (Data Protection Impact Assessment) before training on personal data. Document the lawful basis. Implement filtering to remove personal data from training datasets where possible. Establish a process for handling data subject requests that is honest about the limitations of erasure from trained weights. Consider privacy-preserving training techniques (differential privacy, federated learning) for high-risk deployments.
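A crude illustration of the filtering step, assuming regex patterns for two obvious identifier types. Real personal-data filtering needs NER models and legal review; the patterns here are assumptions for illustration only:

```python
import re

# Illustrative patterns only: regexes catch obvious identifiers, not
# names, addresses, or special category data.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub(text: str) -> str:
    """Replace obvious personal-data patterns before a training run."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

sample = "Contact Jane at jane.doe@example.com or +44 20 7946 0958."
print(scrub(sample))
```

The scrubbed placeholders also make it possible to measure how much personal data a dataset contained, which is useful evidence for the DPIA itself.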

🤖 Scraping, robots.txt, and terms of service (Medium — contractual and technical controls)

Web scraping at scale — the standard method for assembling large text training datasets — interacts with two sets of legal controls that developers frequently overlook. First, website terms of service typically prohibit automated scraping, use of content for machine learning, and commercial exploitation of the site's data. These terms create contractual obligations that are enforceable independently of copyright. Second, robots.txt files signal which parts of a site the site operator does not want crawled — and while robots.txt is not legally binding by default, violating it in conjunction with ToS restrictions strengthens a claim of unauthorised access or computer fraud under laws like the US CFAA.

The EU TDM exception under Article 4 of the DSM Directive specifically states that it applies only where the rightsholder has not expressly reserved the right. A robots.txt disallow combined with a ToS prohibition is increasingly treated as an express reservation of rights for TDM purposes — meaning that scraping a site that has both in place may fall outside the TDM exception even for research purposes, and certainly for commercial purposes.

ToS breach risk: Contract claims for breach of ToS by sites whose data was scraped for training. Remedies include injunctions and damages.

TDM opt-out effect: Sites with both a robots.txt AI-disallow and a ToS prohibition may have effectively opted out of the EU TDM exception — making scraping potentially infringing.

Risk reduction steps: Audit training datasets for sources with explicit AI-scraping prohibitions. For future training runs, implement a scraping policy that respects robots.txt AI directives (increasingly standard practice). Document the legal basis assessment for each major dataset component. Consider licensed datasets from providers who have cleared training rights for commercial AI use.
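The robots.txt check can be automated with Python's standard `urllib.robotparser`. The AI crawler user agents listed below are an assumption about which directives a site uses to signal an AI-training opt-out; adjust the list to whatever identity your crawler actually declares:

```python
from urllib.robotparser import RobotFileParser

# User agents commonly used by AI-training crawlers. Treating a disallow
# for any of them as an express reservation of rights is a conservative
# policy choice, not a legal requirement.
AI_AGENTS = ["GPTBot", "CCBot", "Google-Extended"]

def training_crawl_allowed(robots_txt: str, url: str) -> bool:
    """Return True only if every listed AI agent may fetch the URL."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return all(rp.can_fetch(agent, url) for agent in AI_AGENTS)

robots = """\
User-agent: GPTBot
Disallow: /
"""
print(training_crawl_allowed(robots, "https://example.com/article"))  # → False
```

Logging each decision alongside the dataset provenance record gives you the documented legal basis assessment the step above calls for.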

  • Dataset provenance record — documented source, collection method, and date for every dataset component used in training
  • Copyright assessment — documented legal basis for training use of each dataset (licence, fair use/TDM analysis, consent)
  • Personal data audit — identification of personal data in training datasets and DPIA completed where required
  • ToS and robots.txt review — major scraped sources checked for AI-specific prohibitions; opted-out sources excluded
  • Special category data check — assessment of whether training data contains health, political, religious, or other special category personal data requiring heightened protection
  • Data subject rights process — documented procedure for handling access, erasure, and restriction requests relating to personal data in training datasets

| Data source type | Copyright risk | GDPR risk | ToS/robots risk | EU TDM coverage |
| Scraped public web (general) | High | High (personal data) | Medium–High | Partial (opt-out risk) |
| Licensed dataset (ML-specific) | Low | Medium | Low | N/A (licensed) |
| Social media scrape | High | High | High (ToS prohibit) | Likely blocked |
| Academic / research datasets | Medium | Medium | Low | Research TDM applies |
| Synthetic / generated data | Low | Low (if no PII) | N/A | N/A |
| Proprietary / first-party data | Low | Medium (if personal) | Low | N/A |

2. Output liability: defamation, IP infringement, and harmful content

When an AI model generates content that causes harm — a defamatory statement about a real person, a passage that reproduces copyrighted material, medically dangerous advice — the question of who is legally responsible is not resolved by pointing to the user who entered the prompt. Depending on the design, deployment, and marketing of the model, the developer may face direct or contributory liability for outputs they did not produce and could not predict.

Output liability is one of the most contested areas of AI law precisely because the traditional frameworks — defamation law, copyright law, product liability — were designed for human publishers and product manufacturers, not for probabilistic content generators. Courts and regulators are applying these frameworks to AI systems with results that are not yet fully settled, which means the risk profile is uncertain rather than clearly defined.

🗣️ Defamation and false statements about real people (High risk)

AI models — particularly large language models — can generate confident, plausible, and entirely false statements about named real individuals: fabricated quotes, invented professional histories, false allegations of criminal conduct. This is the hallucination problem as a legal liability. Defamation requires a false statement of fact about an identifiable person, published to a third party, causing damage to their reputation. AI-generated content can satisfy all these elements. The developer's liability depends on whether the developer is treated as a publisher (directly liable), a platform (potentially insulated under Section 230 in the US or the EU DSA in Europe), or a product manufacturer (liable for design defects). None of these frameworks fits AI cleanly, and litigation is in early stages.

Developer exposure: Significant where the model is marketed as a factual information source, the model does not clearly disclaim the possibility of inaccurate outputs, or the model is designed to generate statements about named individuals.

📝 Copyright infringement in AI-generated outputs (High risk)

Generative models trained on copyrighted text can reproduce passages from that training data in response to certain prompts — a phenomenon demonstrated in both research (Carlini et al.) and active litigation. The legal question is whether the output constitutes copyright infringement (reproduction of a substantial part of the original work), and if so, whether the developer is directly, contributorily, or vicariously liable. For code generation models, the risk includes reproducing GPL-licensed code without attribution in outputs delivered to commercial developers — creating a licence compliance obligation for users who may be unaware of the provenance of the generated code. GitHub Copilot is currently the subject of a class action on precisely this basis.

Developer exposure

Contributory liability risk where the developer knows the model can reproduce copyrighted content and does not implement reasonable guardrails. Direct liability risk for code outputs that reproduce GPL or other licensed code without appropriate disclosure.
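A minimal sketch of verbatim-reproduction detection via word n-gram overlap against a known-protected reference corpus. The threshold and in-memory index are illustrative assumptions; production systems typically use suffix arrays or Bloom filters built over the full training set:

```python
def ngrams(text: str, n: int) -> set[str]:
    """All consecutive word n-grams in a text, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def reproduces(output: str, corpus: list[str], n: int = 8) -> bool:
    """Flag an output that shares any n consecutive words with the
    protected corpus. n=8 is an arbitrary illustrative threshold."""
    protected = set()
    for doc in corpus:
        protected |= ngrams(doc, n)
    return bool(ngrams(output, n) & protected)

corpus = ["it was the best of times it was the worst of times it was the age of wisdom"]
print(reproduces("he said it was the best of times it was the worst of times indeed", corpus))
```

When a match is found, the guardrail can suppress the output, shorten it below the threshold, or attach attribution — the mitigation the table below calls "reproduction detection; attribution".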

⚕️ Professional advice outputs in regulated domains (High risk in regulated sectors)

AI models deployed in legal, medical, or financial contexts — or general-purpose models used by users seeking advice in these domains — can generate outputs that constitute professional advice without the safeguards required of human professionals. Medical advice that leads a user to take harmful action, legal advice that is incorrect and causes a user to miss a legal deadline, financial advice that results in material financial loss: each is a scenario where a developer may face claims of negligence, product liability, or breach of applicable sector regulation. The disclaimer "this is not professional advice" in the UI may reduce but does not eliminate this risk — particularly where the model is marketed as capable in these domains.

Developer exposure: Heightened where the model is specifically marketed for professional-domain use, the UI does not provide clear and prominent disclaimers, or the model is deployed without human oversight in a high-stakes context.

⚠️ Discriminatory, harmful, and illegal content outputs (Medium–High risk)

AI models can generate discriminatory content — stereotypical, offensive, or biased outputs that could constitute harassment or discrimination under applicable law in specific deployment contexts. In employment-related applications, discriminatory AI outputs can trigger liability under equal treatment legislation. More broadly, models that can be prompted to produce instructions for illegal activities, content that sexualises minors, or content that incites violence create exposure for both the developer and the deploying organisation. The EU AI Act explicitly prohibits certain AI practices (including real-time remote biometric identification in publicly accessible spaces and social scoring), and the EU DSA requires platform-level content moderation that extends to AI-generated content in some deployment contexts.

Developer exposure: Regulatory risk under the EU AI Act for prohibited output categories. Civil liability for discriminatory outputs in employment and consumer contexts. Criminal risk in jurisdictions where generating certain content is an offence.

1. Output filtering and content moderation — implement output-layer filtering for the highest-risk categories (personal data exfiltration, verbatim reproduction of known copyrighted text, medical/legal/financial advice without qualification, prohibited content categories under the EU AI Act).

2. Clear and prominent disclaimers — ensure UI-level disclaimers are visible, specific (not just "may be inaccurate"), and calibrated to the deployment context; a general disclaimer on a model marketed for medical use is legally insufficient.

3. Copyright reproduction detection — for generative text and code models, implement detection for outputs that closely reproduce training data and add citation or disclosure where reproduction is detected.

4. Sector-specific deployment restrictions — for general-purpose models, consider restricting deployment in high-risk regulated sectors (healthcare, legal, financial services) without a specific compliance assessment and appropriate safeguards for each sector.

5. Document human oversight requirements — for high-stakes deployment contexts, specify in the product documentation and terms of service that human review is required before acting on AI outputs, particularly in professional and decision-making contexts.

The Section 230 / DSA platform defence does not clearly apply to AI developers

Section 230 of the US Communications Decency Act protects platforms from liability for third-party content published on them. The EU Digital Services Act provides analogous protections for hosting providers. It is not established that these protections extend to AI model developers whose models generate content in response to user inputs — as opposed to hosting third-party content. Courts in the US have divided on whether AI-generated content is "third-party content" for Section 230 purposes. Developers who assume they are insulated from output liability by these provisions may be significantly underestimating their exposure.

| Output type | Legal basis for claim | Developer liability risk | Key factors | Primary mitigation |
| False statements about real people | Defamation law | High (if marketed as factual) | Marketing, disclaimers, product design | Disclaimers; hallucination filtering |
| Reproduced copyrighted text/code | Copyright infringement | High — active litigation | Training data, output filters, knowledge | Reproduction detection; attribution |
| Medical advice outputs | Negligence; sector regulation | High in sector deployment | Deployment context; UI disclaimers | Sector restriction; human oversight |
| Legal / financial advice | Negligence; regulated advice rules | High in sector deployment | Marketing; prominent disclaimers | Sector restriction; qualification |
| Discriminatory outputs | Equality law; EU AI Act | Medium–High (employment use) | Deployment context; bias testing | Bias testing; deployment restrictions |
| Prohibited content (CSAM, incitement) | Criminal law; EU AI Act | High — potential criminal | Safeguards; reporting obligations | Content moderation; CSAM detection |

3. Licence compliance failures: the open-source traps developers miss

Open-source AI model licences are not an absence of conditions — they are a set of legal obligations that travel with the model weights whenever they are used, fine-tuned, merged, or deployed. The developer who assumes that "open source" means "no legal strings attached" is in the same position as someone who assumes a Creative Commons licence permits any use. The conditions are real, enforceable, and frequently violated by developers who never read them.

Unlike traditional software copyleft, AI model licence compliance is complicated by the fact that models can be combined in ways that are technically invisible: fine-tuning incorporates the upstream model's weights; distillation transfers knowledge from a licensed model; merging blends multiple models into one. Each of these techniques can create downstream licence obligations that the developer has neither identified nor satisfied.

🔗 Trap 1: Copyleft contamination

"We fine-tuned a GPL-licensed base model but didn't realise we needed to open-source our fine-tuned version."

When a developer fine-tunes a model released under a strong copyleft licence — such as the GNU GPL or AGPL — the resulting fine-tuned model is almost certainly a derivative work subject to the same licence. This means the fine-tuned weights must be released under identical terms, any proprietary layers or adaptations become subject to copyleft, and distribution of the fine-tuned model (including as a hosted service in AGPL's case) triggers disclosure obligations.

The developer who builds a commercial product on a GPL base model and sells API access to it without releasing the fine-tuned weights has, in all likelihood, violated the licence. The fact that this is common practice in the market does not make it compliant.

Risk level: High — commercial viability threat.

🚫 Trap 2: RAIL clause pass-through failures

"We included the RAIL use restrictions in our own licence, but users complained we'd restricted more than the upstream licence required."

The Responsible AI Licence (RAIL) framework — used in variants by Stability AI, BigScience, and others — combines permissive distribution rights with a list of prohibited use cases. These use restrictions are legally binding on all downstream users and must be passed through to any parties who access the model or derivatives of it.

Common failures include: omitting the prohibited use schedule from a fine-tuned model's licence; deploying the model in a use case that falls within the prohibited list (e.g., certain surveillance applications for OpenRAIL-M); and failing to include the required notices when redistributing. Because RAIL licences vary significantly across model families, developers must review the specific prohibited use schedule for each upstream model they incorporate.

Risk level: High — licence breach, injunctive relief exposure.

📋 Trap 3: Attribution and notice failures

"We didn't include the 'Built with Meta Llama' attribution — we didn't realise it was a licence condition rather than a brand guideline."

Many AI model licences require attribution as a condition of use, not merely as a brand guideline. The Meta Llama licence requires prominent display of "Built with Meta Llama" on distributed products and inclusion of "Llama" at the beginning of derivative model names. Apache 2.0 requires reproduction of copyright notices and attribution notices. Failure to include required notices in model cards, documentation, and public-facing materials constitutes a licence breach independent of any other compliance failure.

Attribution obligations are frequently omitted in practice because developers treat them as optional courtesy requirements rather than conditions precedent to the licence grant. Courts have consistently enforced attribution requirements in open-source licences as legally binding conditions, not aspirational guidelines.

Risk level: Medium–high — licence termination possible.

📊 Trap 4: Commercial threshold breaches

"We started with a non-commercial licence but our user base grew past the threshold — we needed a commercial licence retroactively."

Several major AI model licences impose commercial use restrictions that apply above a defined user or revenue threshold. The original Llama 2 licence required a separate commercial licence for deployments with more than 700 million monthly active users. Other models restrict commercial use entirely without a separate agreement, or require prior written consent for commercial deployment.

The practical risk is not just post-threshold exposure — it is the failure to identify that a commercial licence is needed before a product scales. A startup that builds its entire product on a non-commercial model without a path to a commercial licence may find itself unable to legally continue operating at commercial scale.

Risk level: High — operational continuity threat at scale.
Key concept: how copyleft contamination propagates through AI development

In traditional software, copyleft contamination occurs when GPL-licensed code is incorporated into a larger work. In AI development, contamination can occur through fine-tuning (the resulting weights incorporate the upstream model's learned representations), distillation (where the student model is trained to replicate the teacher model's outputs, which may constitute a derivative work under the teacher's licence), and merging (where the merged model combines weight matrices from multiple source models, each of which carries its own licence terms).

The legal analysis for each technique remains unsettled, but the prudent approach is to treat each technique as creating a derivative work subject to the upstream licence conditions — and to select upstream models accordingly. A model trained exclusively on Apache 2.0 or MIT-licensed base models carries materially less copyleft risk than one incorporating AGPL or GPL components.
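The prudent "strongest licence governs" approach can be expressed as a simple rule. The strength ordering and licence labels here are illustrative assumptions for triage, not legal conclusions — actual derivative-work analysis needs counsel:

```python
# Toy model of copyleft "strength" propagating through fine-tuning,
# distillation, and merging. The ordering is an assumption for triage.
STRENGTH = {"mit": 0, "apache-2.0": 0, "openrail-m": 1, "lgpl-3.0": 2,
            "gpl-3.0": 3, "agpl-3.0": 4}

def effective_licence(components: list[str]) -> str:
    """Treat the combined model as a derivative of every component:
    the most restrictive (strongest copyleft) licence governs."""
    return max(components, key=lambda lic: STRENGTH[lic])

print(effective_licence(["apache-2.0", "agpl-3.0", "mit"]))  # → agpl-3.0
```

Running this over a model's full upstream chain makes the worst-case licence visible before the product is built on top of it.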

RAIL (Responsible AI Licence) and its variants — OpenRAIL, OpenRAIL-M, CreativeML-OpenRAIL — represent a distinct category of licence that is neither standard open-source nor fully proprietary. Understanding their three core components is essential to compliance:

🔓 Permissive base

RAIL licences grant broad rights to access, use, copy, modify, merge, publish, and distribute the model — including commercially — subject to the conditions below. The permissive base is comparable to Apache 2.0 in terms of what it allows.

📜 Prohibited use schedule

A legally binding list of prohibited use cases appended to the licence. For OpenRAIL-M (used on many Hugging Face models), this includes uses such as generating illegal content, discriminatory automated decisions, and certain law enforcement applications. These prohibitions must be passed through to all downstream users and recipients.

🔁 Pass-through obligation

Any distribution of the model or derivative models — including fine-tuned versions — must include the full prohibited use schedule. Failure to pass through the restrictions means downstream users are not bound, and the distributing developer may face primary liability for uses they could not contractually prevent.

🚩 CLA and upstream contribution gaps: the due diligence blind spot
  • Contributor Licence Agreements (CLAs) are the mechanism by which open-source projects obtain the right to sublicence contributions under the project licence. A model trained on or incorporating community contributions without a valid CLA may have no valid right to relicence those contributions.
  • Many AI model training datasets were assembled using community contributions — prompts, annotations, RLHF feedback — without a CLA being executed. Where that data was used in training, the resulting model's licence to that data may be defective.
  • Acquiring a company or model that did not implement CLA processes inherits the defect. Buyers in M&A transactions should conduct specific due diligence on CLA coverage for all community-contributed training data and model code.
  • Where a CLA has not been executed, the contributing party retains their original copyright and may be able to assert rights over the contribution — including asserting that it should not be incorporated in a commercially licensed product.
  • Developers releasing models as open source should implement a CLA process before accepting any external contributions to training datasets, model code, or evaluation frameworks.

Beyond copyleft, RAIL clauses, and CLA gaps, the most practically significant licence compliance failures arise from simply not reading the licence. The licences for the most widely used base models — Llama 3, Mistral, Falcon, Gemma, Stable Diffusion — all contain materially different conditions. Treating them as interchangeable because they are "open source" is one of the most consistent compliance failures in the AI development market.

The following audit process provides a minimum baseline for licence compliance before deploying any model that incorporates upstream open-source weights:

1. Identify all upstream model components. List every base model, adapter, fine-tune, or merged component used in your final model. Include intermediate fine-tunes — not just the immediate parent model. Each layer in the training chain may carry separate licence obligations.

2. Read each upstream licence in full. Do not rely on licence type alone (e.g., "Apache 2.0") — the specific version and any amendments or addenda may contain conditions that differ from the standard licence text. AI model licences frequently modify base licences with additional terms.

3. Check copyleft and derivative-work implications. For each GPL, AGPL, LGPL, or copyleft-adjacent licence in the upstream chain, assess whether your use constitutes a derivative work. If uncertain, obtain legal advice — this is the question that most commonly creates undisclosed M&A liability.

4. Identify and pass through RAIL prohibitions. For every model with a RAIL or RAIL-variant licence: extract the prohibited-use schedule, confirm your deployment does not fall within it, and include the same prohibitions in your downstream licence or terms of service for users and API customers.

5. Satisfy all attribution and notice requirements. Prepare a comprehensive attribution list covering all upstream model licences. Include required copyright notices, attribution statements, and licence texts in model cards, documentation, and API documentation as each upstream licence requires.

6. Verify commercial licence eligibility. Confirm whether your current and projected deployment scale requires a commercial licence from any upstream model provider. For models with user or revenue thresholds, establish a monitoring mechanism to detect when thresholds are approached.

7. Implement a CLA for all external contributions. Before accepting any external contributions to training data, model code, or evaluation sets, implement a CLA that grants you the rights needed to sublicence contributions under your chosen model licence. Execute CLAs retroactively for existing contributions where possible.
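The seven steps above lend themselves to a machine-readable licence inventory. The following Python sketch is illustrative only: the field names and checks are assumptions about what such an inventory might track, not a compliance tool, and the step numbers in the messages refer to the audit steps above.

```python
from dataclasses import dataclass, field

@dataclass
class UpstreamComponent:
    """One entry in the licence inventory (audit step 1)."""
    name: str                        # base model, adapter, or fine-tune
    licence: str                     # exact licence name and version (step 2)
    copyleft: bool = False           # GPL/AGPL-style obligations? (step 3)
    rail_prohibitions: list = field(default_factory=list)  # step 4
    attribution_text: str = ""       # required notice, verbatim (step 5)

def audit_gaps(components: list) -> list:
    """Return human-readable gaps found in the inventory."""
    gaps = []
    for c in components:
        if not c.licence:
            gaps.append(f"{c.name}: licence not identified (step 2)")
        if c.copyleft:
            gaps.append(f"{c.name}: copyleft, derivative-work analysis needed (step 3)")
        if c.rail_prohibitions:
            gaps.append(f"{c.name}: RAIL prohibitions must pass through downstream (step 4)")
        if not c.attribution_text:
            gaps.append(f"{c.name}: attribution text missing (step 5)")
    return gaps
```

Run against every component in the training chain, not just the immediate parent model, so intermediate fine-tunes surface in the same report.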
Licence / model family — commercial use, copyleft, RAIL restrictions, attribution, and key conditions at a glance:

- Apache 2.0 (e.g., many Falcon variants): commercial use permitted; permissive, no copyleft; no RAIL restrictions; attribution required (copyright and NOTICE file). Key condition: reproduce notices and state changes to modified files.
- MIT (various smaller models): commercial use permitted; permissive, no copyleft; no RAIL restrictions; attribution required (copyright notice). Key condition: include the MIT licence text in all distributions.
- Llama 3 Community Licence (Meta): commercial use conditional; custom permissive, no copyleft; prohibited uses listed; attribution required ("Built with Meta Llama"). Key condition: a separate commercial licence is required above the scale threshold (>700M MAU); EULA pass-through required.
- Gemma Terms of Use (Google): commercial use conditional; custom permissive, no copyleft; prohibited uses listed; attribution required (include Gemma attribution). Key condition: prohibited uses include certain safety-critical applications; pass-through required.
- OpenRAIL-M (BigScience, Stability variants): commercial use permitted; no copyleft, but conditions pass through; binding RAIL prohibited-use schedule; attribution required (include the licence and use restrictions). Key condition: the prohibited-use schedule must be passed through to all downstream users.
- AGPL v3 (rare in frontier models; some components): commercial use restricted; strong copyleft; no RAIL restrictions; attribution required (copyright notices). Key condition: network use triggers source disclosure, and derivative works must be AGPL.
- Proprietary / research-only (various academic models): commercial use not permitted; copyleft not applicable; RAIL restrictions vary; attribution varies. Key condition: commercial deployment requires separate licence negotiation; fine-tuning may be prohibited.

Licence compliance in AI development is an ongoing obligation, not a one-time check at release. As upstream models release new versions with revised licence terms, as your deployment scales across commercial thresholds, and as your model is fine-tuned or incorporated into other products by downstream users, the compliance picture changes. The developers who manage this well maintain a licence inventory as a living document — not as a spreadsheet that was correct at launch and has not been reviewed since.
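The "living document" point can be made concrete with a small threshold monitor for audit step 6. A hedged sketch: the 700M monthly-active-user figure is the Llama 3 threshold cited in the table above, and the 80% warning margin is an arbitrary illustrative choice; each upstream licence's own limit should be substituted.

```python
def commercial_licence_alert(monthly_active_users: int,
                             threshold: int = 700_000_000,
                             warn_fraction: float = 0.8):
    """Flag when a deployment approaches or crosses a commercial-licence
    threshold. Returns None while safely below the warning margin."""
    if monthly_active_users >= threshold:
        return "threshold crossed: separate commercial licence required"
    if monthly_active_users >= warn_fraction * threshold:
        return "approaching threshold: begin licence negotiation"
    return None
```

Wiring this into routine usage reporting is what turns the licence inventory from a launch-day spreadsheet into a standing control.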

4. Deployment and Regulatory Risks: The Rules Developers Didn't Know Applied

Building an AI model and deploying an AI model are legally distinct activities. The developer who completes training and ships the model to production without engaging with the regulatory environment applicable to that deployment has moved from a development risk profile to an operational one — and the operational risks are, in several jurisdictions, now carrying material penalties. The EU AI Act, sector-specific financial and healthcare regulations, employment law obligations, and GDPR obligations at inference point are not hypothetical future risks. They are current law, and their application to AI developers and deployers is no longer deferred.

The EU AI Act: What Developers Must Understand

The EU AI Act entered into force in August 2024, with provisions applying on a phased basis through 2026 and beyond. It creates a risk-tiered regulatory framework that applies to developers placing AI systems on the EU market, deployers using AI systems within the EU, and providers of general-purpose AI (GPAI) models — regardless of where the developer is incorporated. A US or UK developer whose model is used in the EU is within scope.

Prohibited — unacceptable risk, banned outright. AI systems that deploy subliminal manipulation, exploit vulnerabilities of specific groups, enable social scoring by public authorities, or conduct real-time biometric identification in public spaces (with narrow exceptions). Developers or deployers placing prohibited systems on the EU market face penalties of up to €35 million or 7% of global annual turnover.

High risk — mandatory compliance obligations. AI systems used in critical infrastructure, education, employment, essential services, law enforcement, migration, and justice. High-risk AI systems require: conformity assessment before market placement; registration in the EU database; risk management and quality management systems; technical documentation; human oversight mechanisms; and post-market monitoring. Fines of up to €15 million or 3% of global turnover apply.

Limited risk — transparency obligations. AI systems that interact with users (chatbots), generate synthetic content, or create deepfakes must meet transparency obligations — primarily the obligation to disclose to users that they are interacting with an AI system, and to label AI-generated content as such. Failure to comply carries penalties of up to €7.5 million or 1.5% of global turnover.

Minimal risk — voluntary codes of practice. Most AI applications fall in the minimal-risk category — spam filters, AI in video games, inventory management. These are subject only to voluntary codes of practice. However, many developers misclassify higher-risk systems as minimal risk, and classification errors carry the same penalties as substantive non-compliance.
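The four tiers can be summarised, very roughly, as a lookup. This is a deliberately simplified sketch: the category sets are illustrative extracts from the tier descriptions above, not the Act's full Annexes, and real classification requires legal review.

```python
# Illustrative extracts of the EU AI Act risk tiers described above.
PROHIBITED = {"social_scoring", "subliminal_manipulation",
              "realtime_public_biometric_id"}
HIGH_RISK = {"credit_scoring", "recruitment", "critical_infrastructure",
             "education_assessment", "law_enforcement", "migration", "justice"}
LIMITED_RISK = {"chatbot", "synthetic_content", "deepfake"}

def classify(use_case: str) -> str:
    """Map a use-case tag to a risk tier; anything unmatched defaults to
    minimal, which is exactly where misclassification errors happen."""
    if use_case in PROHIBITED:
        return "prohibited"
    if use_case in HIGH_RISK:
        return "high"
    if use_case in LIMITED_RISK:
        return "limited"
    return "minimal (verify: misclassification carries the same penalties)"
```

Note the default branch: a lookup like this is only as safe as its category lists, which is why the minimal tier should always prompt verification rather than silence.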

A distinct and critical category applies to general-purpose AI (GPAI) model providers — developers who release foundation models or large language models that can be used for a wide range of downstream applications. GPAI providers have separate obligations under the EU AI Act that apply regardless of how their model is classified by downstream deployers:

📄 Technical documentation — GPAI providers must prepare and maintain detailed technical documentation covering training methodology, training data, evaluation results, capabilities and limitations, known risks, and energy consumption. This documentation must be made available to the EU AI Office on request.

⚖️ Copyright compliance summary — GPAI providers must implement a policy to comply with EU copyright law — including the text and data mining exception requirements — and make a summary of the training data publicly available. This is the first regulatory mechanism requiring disclosure of training data practices.

🔥 Systemic risk assessment (high-capability models) — GPAI models with systemic risk (defined by a training compute threshold, currently 10²⁵ FLOPs) must conduct adversarial testing, report serious incidents to the EU AI Office, ensure cybersecurity, and provide information on energy consumption. Systemic-risk GPAI obligations are the most demanding in the Act.

📋 Downstream information obligations — GPAI providers must make available to downstream deployers the information and documentation necessary for those deployers to meet their own obligations under the Act. This creates a compliance chain running from foundation model provider to end deployer.

Sector-Specific Regulatory Obligations

The EU AI Act creates a horizontal layer of AI-specific regulation. Sector-specific law — which predates the AI Act and will continue operating alongside it — creates additional, overlapping obligations for AI deployed in regulated sectors. Developers who build for these sectors without engaging with sector-specific rules face dual exposure: non-compliance with sector regulation and, increasingly, non-compliance with AI Act provisions that cross-reference sector requirements.

Sector-by-sector summary (applicable framework, key AI-specific obligations, risk level, regulator):

- Healthcare / medical devices — very high risk. Frameworks: EU MDR / IVDR; FDA 510(k) / De Novo (US). AI-assisted diagnostic tools may qualify as medical devices requiring conformity assessment, clinical evidence, post-market surveillance, and registration; software-as-a-medical-device (SaMD) classification applies to many AI health tools. Regulators: EMA and national competent authorities; FDA (US).
- Financial services — high risk. Frameworks: EU AI Act (high-risk), DORA, MiFID II, SR 11-7 (US). AI used in credit scoring, AML, fraud detection, and algorithmic trading faces explainability, model validation, fairness testing, and governance requirements; SR 11-7 (US Federal Reserve guidance) requires independent model validation and documentation for models in regulated use. Regulators: ECB / EBA / national regulators; Federal Reserve, OCC (US).
- Employment and HR — high risk. Frameworks: EU AI Act (high-risk), EEOC guidelines (US), local labour law. AI used in recruitment, performance evaluation, promotion, or dismissal decisions is high-risk under the EU AI Act; the US EEOC has issued guidance on AI in hiring under Title VII; NYC Local Law 144 requires bias audits for automated employment decision tools used in NYC. Regulators: EU national regulators; EEOC, NYC DCWP (US).
- Legal and professional advice — medium–high risk. Frameworks: legal services regulation, professional conduct rules. AI tools that provide legal, medical, financial, or accounting advice in regulated jurisdictions may constitute unauthorised practice of a profession; developer liability may arise where the system is marketed in terms that create reliance without adequate disclaimers. Regulators: professional regulatory bodies; courts.
- Education — medium–high risk. Frameworks: EU AI Act (high-risk), FERPA (US), national education law. AI used in student assessment, admissions, or monitoring is classified as high-risk under the EU AI Act; FERPA (US) governs the use of student data; personalisation engines that influence educational outcomes carry additional obligations. Regulators: EU national regulators; Dept of Education (US).
🔐 GDPR obligations at inference — the obligation developers overlook

GDPR obligations do not end at data collection. Every time a deployed model processes a prompt or input that contains or relates to an identifiable natural person, GDPR applies to that processing. This means: a lawful basis for processing is required at inference; data minimisation obligations apply to prompt handling and logging; retention limits apply to inference logs; data subject rights (access, erasure, correction) apply to outputs generated about identifiable individuals; and Article 22 automated decision-making obligations may apply where the model's output is the basis for a significant decision affecting a data subject. The developer who has GDPR-compliant training data but GDPR-non-compliant inference infrastructure has solved half the problem.
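Two of the inference-time obligations above, data minimisation in prompt logging and retention limits on inference logs, can be sketched in code. Both functions are illustrative floors rather than adequate GDPR compliance on their own: the 30-day retention period is an arbitrary example, and a single email regex is far from complete PII detection.

```python
import re
import time

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
RETENTION_SECONDS = 30 * 24 * 3600  # illustrative retention limit

def minimise(prompt: str) -> str:
    """Data minimisation: redact obvious identifiers before logging.
    A regex pass is a starting point, not a PII-detection system."""
    return EMAIL.sub("[REDACTED_EMAIL]", prompt)

def purge_expired(log: list, now: float = None) -> list:
    """Storage limitation: drop inference-log entries past retention.
    Each entry is a dict with a 'ts' (epoch seconds) field."""
    now = time.time() if now is None else now
    return [e for e in log if now - e["ts"] < RETENTION_SECONDS]
```

Running `purge_expired` on a schedule, rather than relying on manual cleanup, is what turns a retention policy from a document into a control.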

The regulatory environment for AI is not static. The EU AI Act's implementation guidance is still being developed; US federal AI regulation is evolving through executive order, agency rulemaking, and emerging state law; the UK is pursuing a sector-led rather than horizontal regulatory approach. The developer who builds regulatory compliance as a one-time exercise is likely to be non-compliant within eighteen months. The developers who manage this well treat regulatory tracking as a standing function, not a project.


Conclusion: A Six-Step Legal Risk Audit for AI Developers

The legal risks covered in this guide — training data liability, output liability, licence compliance failures, and regulatory exposure — each operate independently. Compliance at one layer does not resolve exposure at another. The developer who resolves their GDPR training data obligations but ignores their EU AI Act deployment obligations has not managed their legal risk: they have managed one quarter of it. The following six-step audit provides a practical framework for identifying and addressing exposure across all four layers.

1. Audit your training data. Document the source, licence, and lawful processing basis for every dataset used in training. Flag datasets acquired by scraping and assess copyright and GDPR exposure. Verify that robots.txt instructions and terms of service were complied with at the time of collection.

2. Map your output liability exposure. Identify the categories of output your model generates. Assess defamation risk (factual claims about identifiable individuals), copyright risk (verbatim or near-verbatim reproductions), and professional advice risk (outputs in regulated domains). Document the measures in place to detect and limit each category.

3. Conduct upstream licence due diligence. Identify all upstream model components. Read each licence. Check for copyleft obligations, RAIL prohibited-use schedules, attribution requirements, and commercial thresholds. Document your compliance with each. Implement a CLA process for external contributions before the next release.

4. Classify your AI system under the EU AI Act. Determine whether you are a GPAI provider (general-purpose model), a high-risk AI system provider, or a lower-risk system. Identify your applicable obligations. If your training compute exceeds the 10²⁵ FLOPs threshold, assess systemic risk obligations. Begin the technical documentation process regardless of tier.

5. Assess sector-specific regulation. For each sector your model operates in or is likely to operate in, identify the applicable sector-specific framework. Determine whether your model or application falls within scope. Engage specialist legal advice for healthcare, financial services, and employment deployments before launch.

6. Build a legal compliance monitoring function. Assign ownership of AI legal compliance to a specific role. Establish a process for monitoring changes to EU AI Act implementation guidance, US federal and state AI regulation, and sector-specific rulemaking. Schedule annual legal risk reviews timed to coincide with major model updates or new deployment markets.

The goal of this audit is not to make AI development more cautious — it is to make it more durable. A model with an unresolved copyright claim in its training data is not a finished product; it is a liability waiting to be discovered. A deployment that ignores its EU AI Act classification is not agile; it is non-compliant. Developers who address these risks before they become problems do not lose time — they avoid losing far more time to the enforcement actions, litigation, and due diligence failures that follow undiscovered exposure. Legal risk in AI development is manageable. It is only unmanageable when it is invisible.

Oleg Prosin is the Managing Partner at WCR Legal, focusing on international business structuring, regulatory frameworks for FinTech companies, digital assets, and licensing regimes across various jurisdictions. He works with founders and investment firms on compliance, operating models, and cross-border expansion strategies.