AI Model Licensing and Legal Risks: What Developers Often Overlook
Most AI developers focus on the code. The legal risks hide in the training data, the model outputs, the upstream licences they inherit, and the regulations they didn't know applied to their deployment. This guide maps the blind spots — and what to do before they become a crisis.
Contents

0. Introduction: The Legal Risks That Catch Developers by Surprise. Why technical excellence is not enough — the categories of risk that emerge at the legal and licensing layer.
1. Training Data Risks: Copyright, Scraping, and GDPR. What you trained on, whether you had the right to, and why the answer matters long after training is complete.
2. Output Liability: Defamation, IP Infringement, and Harmful Content. Who is legally responsible for what your AI says, writes, or creates — and what that means for developers.
3. Licence Compliance Failures: The Open-Source Traps Developers Miss. Upstream licence violations, copyleft contamination, RAIL clause breaches, and CLA gaps.
4. Deployment and Regulatory Risks: The Rules Developers Didn't Know Applied. EU AI Act exposure, sector-specific regulation, and the gap between building and deploying.
5. Conclusion: A Six-Step Legal Risk Audit for AI Developers. The practical checklist for identifying and addressing legal exposure before it becomes a liability.
Introduction: The legal risks that catch developers by surprise
The vast majority of AI developers are not trying to break the law. They are building products, shipping models, and solving problems. The legal risks that emerge from AI development are not typically the result of bad intent — they are the result of legal frameworks that were not designed with AI in mind, applied to practices that have become standard in the developer community. Scraping public web data to train a model. Using open-source model weights without reading the licence. Deploying a model in a regulated sector without engaging with the sector-specific rules. Each of these is a common practice. Each carries legal risk that is frequently neither recognised nor managed.
This guide does not assume legal wrongdoing. It assumes that a developer who understands the legal risk is better positioned to address it than one who discovers it in a cease-and-desist letter, a regulatory inquiry, or a due diligence process that stalls a funding round. The goal is to make the legal risks visible before they become legal problems.
The public data myth
Public accessibility does not transfer copyright, waive licence conditions, or create a lawful basis for processing personal data. The fact that content can be scraped is legally irrelevant to whether scraping it for training is permitted.
The open-source shield myth
Open-source licences are legal instruments, not permissions without conditions. Most open LLM licences include restrictions on commercial use, derivative works, and downstream deployment that developers routinely overlook or misread.
The builder's exemption myth
Liability for AI outputs is not automatically insulated by the developer's intent. Depending on how the model is designed, deployed, and marketed, developers can face direct liability for outputs that cause harm — independent of the user who prompted them.
Training data risks
Copyright in scraped data, consent and lawful basis for personal data, database rights, and the legal effect of robots.txt and terms of service on training data collection.
Output liability risks
Defamation in AI-generated content, copyright infringement in outputs, liability for harmful, discriminatory, or professionally regulated advice outputs.
Upstream licence compliance risks
Copyleft contamination from incorporated open-source models, RAIL clause pass-through failures, CLA gaps, and commercial threshold breaches.
Regulatory deployment risks
EU AI Act provider obligations, sector-specific rules (healthcare, finance, employment), GDPR obligations at inference, and non-compliance penalties.
Unlike a conventional software product, an AI model accumulates legal risk at every stage of its existence: during data collection (copyright, consent, terms of service); during training (GDPR, database rights, trade secret misappropriation); at release (licence obligations, open-source compliance); at deployment (sector regulation, EU AI Act, liability for outputs); and in use (ongoing output liability, downstream licence pass-through, data processing obligations). Each layer is independent — compliance at one layer does not resolve exposure at another.
The sections that follow address each layer in turn, identifying the specific risks that developers most commonly fail to anticipate, the legal basis for those risks, and the practical steps that reduce or manage the exposure.
The legal landscape for AI is still developing. Courts in the US, EU, and UK are actively working through the questions of copyright infringement by training, developer liability for AI outputs, and the boundaries of open-source compliance obligations. The fact that many of these questions are unresolved does not mean the risk is low — it means the risk is uncertain, which is frequently worse from a commercial and investor perspective than a known liability.
1. Training data risks: copyright, scraping, and GDPR
The training dataset is the foundation of every AI model — and for many developers, it is also the primary source of unmanaged legal exposure. Three independent legal frameworks apply to training data simultaneously: copyright law, data protection law, and contract law (in the form of website terms of service and data licence conditions). Compliance with one does not ensure compliance with the others.
Copyright in training data
Risk level: High — active litigation worldwide

Copyright protects original expression — text, images, code, music, and other creative works — automatically upon creation, without registration. The owner of copyright in a work has the exclusive right to reproduce it. Training an AI model on copyrighted text involves making copies of that text: downloading it, processing it, and using it to adjust the model's parameters. Whether this constitutes copyright infringement is currently the subject of active litigation in the US (Andersen v. Stability AI, The New York Times v. OpenAI, and others), the UK (Getty Images v. Stability AI), and Europe.
The legal defences available to model developers vary by jurisdiction. The US "fair use" doctrine may protect transformative training uses — but fair use is a case-by-case analysis, not a blanket permission. The EU text and data mining (TDM) exception under the Digital Single Market Directive permits TDM for research purposes and for any purpose where the rightsholder has not expressly opted out. The opt-out mechanism is novel and its scope is disputed. UK law has a research TDM exception but its application to commercial AI training is contested.
- Core risk: a model trained on copyrighted text can regenerate or closely paraphrase that content, creating an infringement claim against both the model developer and downstream users.
- Who is exposed: all developers who scraped or used large-scale web datasets (Common Crawl, C4, LAION) without verifying rights clearance for commercial AI training.
- Jurisdictional position: US: fair use defence active but uncertain. EU: TDM exception with opt-out risk. UK: commercial use not clearly covered.
- Severity: high — active litigation, potential injunctions against model distribution, statutory damages in the US.

Mitigation: Document training data provenance for all datasets used. For commercial models, verify that datasets were assembled under licences that permit commercial AI training (not just research). Monitor for dataset-specific opt-out registrations (e.g., the EU TDM opt-out) and exclude opted-out sources before training or fine-tuning runs.
GDPR and personal data in training datasets
Risk level: High — enforcement increasing across the EU

Personal data — any information that relates to an identifiable individual — is pervasive in large-scale web datasets. Names, email addresses, social media posts, forum contributions, and professional profiles are all personal data. Processing personal data for AI training is a processing activity under GDPR, which means it requires a lawful basis. Legitimate interest is frequently claimed, but its application to large-scale commercial AI training is highly contested and has been challenged by multiple EU data protection authorities, including the Italian DPA (Garante), which temporarily suspended ChatGPT in Italy in 2023.
Even more complex is the ongoing obligation: once personal data has been incorporated into model weights through training, there is no fully satisfactory technical mechanism to "delete" that data in response to a data subject access or erasure request. The model cannot be queried to confirm whether an individual's data is "in" the weights, and retraining to remove a specific data subject is computationally prohibitive at scale. GDPR data subject rights under Articles 15–21 nonetheless apply — and regulators are beginning to enforce compliance at the training data layer.
- Lawful basis: legitimate interest for commercial AI training is contested. Consent is impractical at scale for web-scraped data. The research exception does not cover commercial use.
- Data subject rights: erasure and access requests cannot be fully honoured for personal data embedded in model weights — a fundamental GDPR compliance gap.
- Special category data: training data may contain special category personal data (health, political opinions, religion) without the developer's awareness — requiring explicit consent or a research exemption.
- Enforcement: GDPR fines reach up to 4% of global annual turnover. The Italian, French, and Irish DPAs have all opened investigations into major AI providers.
Mitigation: Conduct a DPIA (Data Protection Impact Assessment) before training on personal data. Document the lawful basis. Implement filtering to remove personal data from training datasets where possible. Establish a process for handling data subject requests that is honest about the limitations of erasure from trained weights. Consider privacy-preserving training techniques (differential privacy, federated learning) for high-risk deployments.
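As one illustration of the filtering step, the sketch below redacts two obvious categories of personal data from a plain-text corpus. The patterns and placeholder tokens are illustrative assumptions only: real PII filtering needs named-entity recognition and far broader coverage.

```python
import re

# Illustrative patterns only: production PII filtering needs NER models
# and much wider coverage (names, addresses, IDs, special category data).
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
PHONE = re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d")

def redact_pii(text: str) -> str:
    """Replace obvious email addresses and phone numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

def filter_corpus(docs):
    """Redact PII from each document before it enters the training set."""
    return [redact_pii(d) for d in docs]
```

Running the corpus through a filter like this before training does not by itself establish a lawful basis, but it reduces the volume of personal data embedded in the weights and should be documented in the DPIA.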
Scraping, robots.txt, and terms of service
Risk level: Medium — contractual and technical controls

Web scraping at scale — the standard method for assembling large text training datasets — interacts with two sets of legal controls that developers frequently overlook. First, website terms of service typically prohibit automated scraping, use of content for machine learning, and commercial exploitation of the site's data. These terms create contractual obligations that are enforceable independently of copyright. Second, robots.txt files signal which parts of a site the site operator does not want crawled — and while robots.txt is not legally binding by default, violating it in conjunction with ToS restrictions strengthens a claim of unauthorised access or computer fraud under laws like the US CFAA.
The EU TDM exception under Article 4 of the DSM Directive specifically states that it applies only where the rightsholder has not expressly reserved the right. A robots.txt disallow combined with a ToS prohibition is increasingly treated as an express reservation of rights for TDM purposes — meaning that scraping a site that has both in place may fall outside the TDM exception even for research purposes, and certainly for commercial purposes.
- Contract exposure: contract claims for breach of ToS by sites whose data was scraped for training. Remedies include injunctions and damages.
- TDM exposure: sites with both a robots.txt AI-disallow and a ToS prohibition may have effectively opted out of the EU TDM exception — making scraping potentially infringing.
Mitigation: Audit training datasets for sources with explicit AI-scraping prohibitions. For future training runs, implement a scraping policy that respects robots.txt AI directives (increasingly standard practice). Document the legal basis assessment for each major dataset component. Consider licensed datasets from providers who have cleared training rights for commercial AI use.
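The robots.txt part of that audit can be automated with the Python standard library. A sketch, assuming sites follow the common convention of addressing AI crawlers by user-agent tokens such as GPTBot or CCBot (the token list is an illustrative assumption, not an authoritative registry):

```python
from urllib.robotparser import RobotFileParser

# Illustrative list of user-agent tokens that AI crawlers commonly announce.
AI_CRAWLER_TOKENS = ["GPTBot", "CCBot", "Google-Extended", "anthropic-ai"]

def ai_scraping_disallowed(robots_txt: str, url_path: str = "/") -> bool:
    """Return True if the robots.txt text disallows any known AI crawler
    from fetching url_path — a signal the site may have reserved its rights."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return any(
        not parser.can_fetch(token, url_path) for token in AI_CRAWLER_TOKENS
    )
```

Feeding each source's fetched robots.txt body through a check like this flags candidates for exclusion; it does not replace the ToS review, since a contractual prohibition can exist without any robots.txt directive.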
- Dataset provenance record — documented source, collection method, and date for every dataset component used in training
- Copyright assessment — documented legal basis for training use of each dataset (licence, fair use/TDM analysis, consent)
- Personal data audit — identification of personal data in training datasets and DPIA completed where required
- ToS and robots.txt review — major scraped sources checked for AI-specific prohibitions; opted-out sources excluded
- Special category data check — assessment of whether training data contains health, political, religious, or other special category personal data requiring heightened protection
- Data subject rights process — documented procedure for handling access, erasure, and restriction requests relating to training data personal data
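A provenance record (the first item in the checklist above) can be kept as structured data from the start, so it is ready for audits and due diligence. A minimal sketch, with illustrative field names rather than any standard schema:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class DatasetProvenance:
    # Illustrative fields; extend to match your own audit checklist.
    name: str                    # internal dataset identifier
    source_url: str              # where the data was obtained
    collection_method: str       # "scraped", "licensed", "first-party", ...
    collected_on: str            # ISO date of collection
    licence_basis: str           # licence / fair-use-TDM analysis / consent
    contains_personal_data: bool
    dpia_completed: bool

def provenance_manifest(records) -> str:
    """Serialise the provenance records for audit and due diligence."""
    return json.dumps([asdict(r) for r in records], indent=2)
```

Keeping this as a machine-readable manifest, versioned alongside the training code, makes the copyright and GDPR assessments repeatable for every subsequent training run.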
| Data source type | Copyright risk | GDPR risk | ToS/robots risk | EU TDM coverage |
|---|---|---|---|---|
| Scraped public web (general) | High | High (personal data) | Medium–High | Partial (opt-out risk) |
| Licensed dataset (ML-specific) | Low | Medium | Low | N/A (licensed) |
| Social media scrape | High | High | High (ToS prohibit) | Likely blocked |
| Academic / research datasets | Medium | Medium | Low | Research TDM applies |
| Synthetic / generated data | Low | Low (if no PII) | N/A | N/A |
| Proprietary / first-party data | Low | Medium (if personal) | Low | N/A |
2. Output liability: defamation, IP infringement, and harmful content
When an AI model generates content that causes harm — a defamatory statement about a real person, a passage that reproduces copyrighted material, medically dangerous advice — the question of who is legally responsible is not resolved by pointing to the user who entered the prompt. Depending on the design, deployment, and marketing of the model, the developer may face direct or contributory liability for outputs they did not produce and could not predict.
Output liability is one of the most contested areas of AI law precisely because the traditional frameworks — defamation law, copyright law, product liability — were designed for human publishers and product manufacturers, not for probabilistic content generators. Courts and regulators are applying these frameworks to AI systems with results that are not yet fully settled, which means the risk profile is uncertain rather than clearly defined.
Defamation and false statements about real people
AI models — particularly large language models — can generate confident, plausible, and entirely false statements about named real individuals: fabricated quotes, invented professional histories, false allegations of criminal conduct. This is the hallucination problem as a legal liability. Defamation requires a false statement of fact about an identifiable person, published to a third party, causing damage to their reputation. AI-generated content can satisfy all these elements. The developer's liability depends on whether the developer is treated as a publisher (directly liable), a platform (potentially insulated under Section 230 in the US or the EU DSA in Europe), or a product manufacturer (liable for design defects). None of these frameworks fits AI cleanly, and litigation is in early stages.
Significant where: the model is marketed as a factual information source, the model does not clearly disclaim the possibility of inaccurate outputs, or the model is designed to generate statements about named individuals.
Copyright infringement in AI-generated outputs
Generative models trained on copyrighted text can reproduce passages from that training data in response to certain prompts — a phenomenon demonstrated in both research (Carlini et al.) and active litigation. The legal question is whether the output constitutes copyright infringement (reproduction of a substantial part of the original work), and if so, whether the developer is directly, contributorily, or vicariously liable. For code generation models, the risk includes reproducing GPL-licensed code without attribution in outputs delivered to commercial developers — creating a licence compliance obligation for users who may be unaware of the provenance of the generated code. GitHub Copilot is currently the subject of a class action on precisely this basis.
Contributory liability risk where the developer knows the model can reproduce copyrighted content and does not implement reasonable guardrails. Direct liability risk for code outputs that reproduce GPL or other licensed code without appropriate disclosure.
Professional advice outputs in regulated domains
AI models deployed in legal, medical, or financial contexts — or general-purpose models used by users seeking advice in these domains — can generate outputs that constitute professional advice without the safeguards required of human professionals. Medical advice that leads a user to take harmful action, legal advice that is incorrect and causes a user to miss a legal deadline, financial advice that results in material financial loss: each is a scenario where a developer may face claims of negligence, product liability, or breach of applicable sector regulation. The disclaimer "this is not professional advice" in the UI may reduce but does not eliminate this risk — particularly where the model is marketed as capable in these domains.
Heightened where: the model is specifically marketed for professional-domain use, the UI does not provide clear and prominent disclaimers, or the model is deployed without human oversight in a high-stakes context.
Discriminatory, harmful, and illegal content outputs
AI models can generate discriminatory content — stereotypical, offensive, or biased outputs that could constitute harassment or discrimination under applicable law in specific deployment contexts. In employment-related applications, discriminatory AI outputs can trigger liability under equal treatment legislation. More broadly, models that can be prompted to produce instructions for illegal activities, content that sexualises minors, or content that incites violence create exposure for both the developer and the deploying organisation. The EU AI Act explicitly prohibits certain categories of AI practice (including real-time remote biometric identification in public spaces and social scoring), and the EU DSA requires platform-level content moderation that extends to AI-generated content in some deployment contexts.
Regulatory risk under the EU AI Act for prohibited output categories. Civil liability for discriminatory outputs in employment and consumer contexts. Criminal risk in jurisdictions where generating certain content is an offence.
- Output filtering and content moderation — implement output-layer filtering for the highest-risk categories (personal data exfiltration, verbatim reproduction of known copyrighted text, medical/legal/financial advice without qualification, prohibited content categories under the EU AI Act)
- Clear and prominent disclaimers — ensure UI-level disclaimers are visible, specific (not just "may be inaccurate"), and calibrated to the deployment context; a general disclaimer on a model marketed for medical use is legally insufficient
- Copyright reproduction detection — for generative text and code models, implement detection for outputs that closely reproduce training data and add citation or disclosure where reproduction is detected
- Sector-specific deployment restrictions — for general-purpose models, consider restricting deployment in high-risk regulated sectors (healthcare, legal, financial services) without a specific compliance assessment and appropriate safeguards for each sector
- Document human oversight requirements — for high-stakes deployment contexts, specify in the product documentation and terms of service that human review is required before acting on AI outputs, particularly in professional and decision-making contexts
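The reproduction-detection step above can be approximated with n-gram overlap against an index of known source text. A toy sketch — the 8-token default window is an arbitrary illustrative choice, and production systems use far more robust near-duplicate detection (normalisation, hashing, fuzzy matching):

```python
def ngrams(text: str, n: int = 8):
    """Return the set of n-token shingles of the text."""
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_index(known_texts, n: int = 8):
    """Index n-grams of known copyrighted / licensed source texts."""
    index = set()
    for text in known_texts:
        index |= ngrams(text, n)
    return index

def reproduction_score(output: str, index, n: int = 8) -> float:
    """Fraction of the output's n-grams that appear verbatim in the index."""
    shingles = ngrams(output, n)
    if not shingles:
        return 0.0
    return len(shingles & index) / len(shingles)
```

A high score flags an output for attribution, disclosure, or suppression before it reaches the user; the threshold and response are policy decisions, not technical ones.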
The Section 230 / DSA platform defence does not clearly apply to AI developers
Section 230 of the US Communications Decency Act protects platforms from liability for third-party content published on them. The EU Digital Services Act provides analogous protections for hosting providers. It is not established that these protections extend to AI model developers whose models generate content in response to user inputs — as opposed to hosting third-party content. Courts in the US have divided on whether AI-generated content is "third-party content" for Section 230 purposes. Developers who assume they are insulated from output liability by these provisions may be significantly underestimating their exposure.
| Output type | Legal basis for claim | Developer liability risk | Key factors | Primary mitigation |
|---|---|---|---|---|
| False statements about real people | Defamation law | High (if marketed as factual) | Marketing, disclaimers, product design | Disclaimers; hallucination filtering |
| Reproduced copyrighted text/code | Copyright infringement | High — active litigation | Training data, output filters, knowledge | Reproduction detection; attribution |
| Medical advice outputs | Negligence; sector regulation | High in sector deployment | Deployment context; UI disclaimers | Sector restriction; human oversight |
| Legal / financial advice | Negligence; regulated advice rules | High in sector deployment | Marketing; prominent disclaimers | Sector restriction; qualification |
| Discriminatory outputs | Equality law; EU AI Act | Medium–High (employment use) | Deployment context; bias testing | Bias testing; deployment restrictions |
| Prohibited content (CSAM, incitement) | Criminal law; EU AI Act | High — potential criminal | Safeguards; reporting obligations | Content moderation; CSAM detection |
3. Licence compliance failures: the open-source traps developers miss
Open-source AI model licences are not the absence of conditions — they are a set of legal obligations that travel with the model weights whenever they are used, fine-tuned, merged, or deployed. The developer who assumes that "open source" means "no legal strings attached" is in the same position as someone who assumes a Creative Commons licence permits any use. The conditions are real, enforceable, and frequently violated by developers who never read them.
Unlike traditional software copyleft, AI model licence compliance is complicated by the fact that models can be combined in ways that are technically invisible: fine-tuning incorporates the upstream model's weights; distillation transfers knowledge from a licensed model; merging blends multiple models into one. Each of these techniques can create downstream licence obligations that the developer has neither identified nor satisfied.
Copyleft contamination
When a developer fine-tunes a model released under a strong copyleft licence — such as the GNU GPL or AGPL — the resulting fine-tuned model is almost certainly a derivative work subject to the same licence. This means the fine-tuned weights must be released under identical terms, any proprietary layers or adaptations become subject to copyleft, and distribution of the fine-tuned model (including as a hosted service in AGPL's case) triggers disclosure obligations.
The developer who builds a commercial product on a GPL base model and sells API access to it without releasing the fine-tuned weights has, in all likelihood, violated the licence. The fact that this is common practice in the market does not make it compliant.
Risk level: High — commercial viability threat

RAIL clause pass-through failures
The Responsible AI Licence (RAIL) framework — used in variants by Stability AI, BigScience, and others — combines permissive distribution rights with a list of prohibited use cases. These use restrictions are legally binding on all downstream users and must be passed through to any parties who access the model or derivatives of it.
Common failures include: omitting the prohibited use schedule from a fine-tuned model's licence; deploying the model in a use case that falls within the prohibited list (e.g., certain surveillance applications for OpenRAIL-M); and failing to include the required notices when redistributing. Because RAIL licences vary significantly across model families, developers must review the specific prohibited use schedule for each upstream model they incorporate.
Risk level: High — licence breach, injunctive relief exposure

Attribution and notice failures
Many AI model licences require attribution as a condition of use, not merely as a brand guideline. The Llama 3 Community Licence requires distributors to display "Built with Meta Llama 3" prominently and to include "Llama 3" at the beginning of derivative model names. Apache 2.0 requires reproduction of copyright notices and attribution notices. Failure to include required notices in model cards, documentation, and public-facing materials constitutes a licence breach independent of any other compliance failure.
Attribution obligations are frequently omitted in practice because developers treat them as optional courtesy requirements rather than conditions precedent to the licence grant. Courts have consistently enforced attribution requirements in open-source licences as legally binding conditions, not aspirational guidelines.
Risk level: Medium–high — licence termination possible

Commercial threshold breaches
Several major AI model licences impose commercial use restrictions that apply past a defined user or revenue threshold. The original Llama 2 licence required a separate commercial licence for deployments with more than 700 million monthly active users. Other models restrict commercial use entirely without a separate agreement, or require prior written consent for commercial deployment.
The practical risk is not just post-threshold exposure — it is the failure to identify that a commercial licence is needed before a product scales. A startup that builds its entire product on a non-commercial model without a path to a commercial licence may find itself unable to legally continue operating at commercial scale.
Risk level: High — operational continuity threat at scale

In traditional software, copyleft contamination occurs when GPL-licensed code is incorporated into a larger work. In AI development, contamination can occur through fine-tuning (the resulting weights incorporate the upstream model's learned representations), distillation (where the student model is trained to replicate the teacher model's outputs, which may constitute a derivative work under the teacher's licence), and merging (where the merged model combines weight matrices from multiple source models, each of which carries its own licence terms).
The legal analysis for each technique remains unsettled, but the prudent approach is to treat each technique as creating a derivative work subject to the upstream licence conditions — and to select upstream models accordingly. A model trained exclusively on Apache 2.0 or MIT-licensed base models carries materially less copyleft risk than one incorporating AGPL or GPL components.
RAIL (Responsible AI Licence) and its variants — OpenRAIL, OpenRAIL-M, CreativeML-OpenRAIL — represent a distinct category of licence that is neither standard open-source nor fully proprietary. Understanding their three core components is essential to compliance:
Permissive base
RAIL licences grant broad rights to access, use, copy, modify, merge, publish, and distribute the model — including commercially — subject to the conditions below. The permissive base is comparable to Apache 2.0 in terms of what it allows.
Prohibited use schedule
A legally binding list of prohibited use cases appended to the licence. For OpenRAIL-M (used on many Hugging Face models), this includes uses such as generating illegal content, discriminatory automated decisions, and certain law enforcement applications. These prohibitions must be passed through to all downstream users and recipients.
Pass-through obligation
Any distribution of the model or derivative models — including fine-tuned versions — must include the full prohibited use schedule. Failure to pass through the restrictions means downstream users are not bound, and the distributing developer may face primary liability for uses they could not contractually prevent.
- Contributor Licence Agreements (CLAs) are the mechanism by which open-source projects obtain the right to sublicence contributions under the project licence. A model trained on or incorporating community contributions without a valid CLA may have no valid right to relicence those contributions.
- Many AI model training datasets were assembled using community contributions — prompts, annotations, RLHF feedback — without a CLA being executed. Where that data was used in training, the resulting model's licence to that data may be defective.
- Acquiring a company or model that did not implement CLA processes inherits the defect. Buyers in M&A transactions should conduct specific due diligence on CLA coverage for all community-contributed training data and model code.
- Where a CLA has not been executed, the contributing party retains their original copyright and may be able to assert rights over the contribution — including asserting that it should not be incorporated in a commercially licensed product.
- Developers releasing models as open source should implement a CLA process before accepting any external contributions to training datasets, model code, or evaluation frameworks.
Beyond copyleft, RAIL clauses, and CLA gaps, the most practically significant licence compliance failures arise from simply not reading the licence. The licences for the most widely used base models — Llama 3, Mistral, Falcon, Gemma, Stable Diffusion — all contain materially different conditions. Treating them as interchangeable because they are "open source" is one of the most consistent compliance failures in the AI development market.
The comparison below provides a minimum baseline reference for licence compliance before deploying any model that incorporates upstream open-source weights:
| Licence / model family | Commercial use | Copyleft / derivative? | RAIL restrictions? | Attribution required? | Key condition |
|---|---|---|---|---|---|
| Apache 2.0 (e.g., many Falcon variants) | Permitted | No — permissive | No | Yes — copyright & NOTICE file | Reproduce notices; state changes to modified files |
| MIT (various smaller models) | Permitted | No — permissive | No | Yes — copyright notice | Include MIT licence text in all distributions |
| Llama 3 Community Licence (Meta) | Conditional | No — custom permissive | Prohibited uses listed | Yes — "Built with Meta Llama 3" on distribution | Separate commercial licence required above 700M MAU; licence and acceptable use policy pass-through required |
| Gemma Terms of Use (Google) | Conditional | No — custom permissive | Prohibited uses listed | Yes — include Gemma attribution | Prohibited uses include certain safety-critical applications; pass-through required |
| OpenRAIL-M (BigScience, Stability variants) | Permitted | No — but conditions pass through | Yes — binding schedule | Yes — include licence and use restrictions | Prohibited use schedule must be passed through to all downstream users |
| AGPL v3 (rare in frontier models; some components) | Restricted | Yes — strong copyleft | No | Yes — copyright notices | Network use triggers source disclosure; derivative works must be AGPL |
| Proprietary / research-only (various academic models) | Not permitted | N/A | Varies | Varies | Commercial deployment requires separate licence negotiation; fine-tuning may be prohibited |
Licence compliance in AI development is an ongoing obligation, not a one-time check at release. As upstream models release new versions with revised licence terms, as your deployment scales across commercial thresholds, and as your model is fine-tuned or incorporated into other products by downstream users, the compliance picture changes. The developers who manage this well maintain a licence inventory as a living document — not as a spreadsheet that was correct at launch and has not been reviewed since.
4. Deployment and Regulatory Risks: The Rules Developers Didn't Know Applied
Building an AI model and deploying an AI model are legally distinct activities. The developer who completes training and ships the model to production without engaging with the regulatory environment applicable to that deployment has moved from a development risk profile to an operational one — and the operational risks now carry material penalties in several jurisdictions. The EU AI Act, sector-specific financial and healthcare regulations, employment law obligations, and GDPR obligations at the point of inference are not hypothetical future risks. They are current law, and their application to AI developers and deployers is no longer deferred.
The EU AI Act: What Developers Must Understand
The EU AI Act entered into force in August 2024, with provisions applying on a phased basis through 2026 and beyond. It creates a risk-tiered regulatory framework that applies to developers placing AI systems on the EU market, deployers using AI systems within the EU, and providers of general-purpose AI (GPAI) models — regardless of where the developer is incorporated. A US or UK developer whose model is used in the EU is within scope.
Unacceptable risk — banned outright
AI systems that deploy subliminal manipulation, exploit vulnerabilities of specific groups, enable social scoring by public authorities, or conduct real-time biometric identification in public spaces (with narrow exceptions). Developers or deployers placing prohibited systems on the EU market face penalties of up to €35 million or 7% of global annual turnover.
High-risk AI systems — mandatory compliance obligations
AI systems used in critical infrastructure, education, employment, essential services, law enforcement, migration, and justice. High-risk AI systems require: conformity assessment before market placement; registration in the EU database; risk management and quality management systems; technical documentation; human oversight mechanisms; and post-market monitoring. Fines of up to €15 million or 3% of global turnover apply.
Limited-risk AI systems — transparency obligations
AI systems that interact with users (chatbots), generate synthetic content, or create deepfakes must meet transparency obligations — primarily the obligation to disclose to users that they are interacting with an AI system, and to label AI-generated content as such. Failure to comply carries penalties of up to €7.5 million or 1.5% of global turnover.
Minimal-risk AI — voluntary codes of practice
Most AI applications fall in the minimal-risk category — spam filters, AI in video games, inventory management. These are subject only to voluntary codes of practice. However, many developers misclassify higher-risk systems as minimal risk. Classification errors carry the same penalties as substantive non-compliance.
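The four tiers above can be summarised as a lookup. This is a minimal illustrative sketch: the use-case sets and the `classify` helper are assumptions for demonstration, and the Act's annexes, not any such table, determine actual classification. The penalty figures mirror those stated above.

```python
# Illustrative tier lookup; the EU AI Act's annexes are the authoritative source.
PROHIBITED = {"social_scoring", "subliminal_manipulation", "realtime_public_biometric_id"}
HIGH_RISK = {"credit_scoring", "recruitment", "education_assessment",
             "law_enforcement", "critical_infrastructure"}
LIMITED_RISK = {"chatbot", "synthetic_content", "deepfake"}

def classify(use_case: str) -> tuple[str, str]:
    """Map a use case to (tier, maximum exposure) per the tiers above."""
    if use_case in PROHIBITED:
        return "unacceptable", "€35M or 7% of global turnover"
    if use_case in HIGH_RISK:
        return "high", "€15M or 3% of global turnover"
    if use_case in LIMITED_RISK:
        return "limited", "€7.5M or 1.5% of global turnover"
    return "minimal", "voluntary codes of practice"
```

Note the failure mode the text warns about: anything absent from the lists defaults to "minimal", which is exactly how misclassification happens in practice. A real classification exercise works from the Act's annexes outward, not from a default of minimal risk.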
A distinct and critical category applies to general-purpose AI (GPAI) model providers — developers who release foundation models or large language models that can be used for a wide range of downstream applications. GPAI providers have separate obligations under the EU AI Act that apply regardless of how their model is classified by downstream deployers: maintaining technical documentation, providing information and documentation to downstream providers who build on the model, implementing a policy to comply with EU copyright law, and publishing a summary of the content used for training. GPAI models trained with more than 10²⁵ FLOPs of compute are presumed to present systemic risk and carry additional obligations, including model evaluation, adversarial testing, serious incident reporting, and cybersecurity protections.
Sector-Specific Regulatory Obligations
The EU AI Act creates a horizontal layer of AI-specific regulation. Sector-specific law — which predates the AI Act and will continue operating alongside it — creates additional, overlapping obligations for AI deployed in regulated sectors. Developers who build for these sectors without engaging with sector-specific rules face dual exposure: non-compliance with sector regulation and, increasingly, non-compliance with AI Act provisions that cross-reference sector requirements.
| Sector | Applicable framework | Key AI-specific obligations | Risk level | Regulator |
|---|---|---|---|---|
| Healthcare / medical devices | EU MDR / IVDR, FDA 510(k) / De Novo (US) | AI-assisted diagnostic tools may qualify as medical devices requiring conformity assessment, clinical evidence, post-market surveillance, and registration. Software-as-a-medical-device (SaMD) classification applies to many AI health tools. | Very high | EMA, national competent authorities; FDA (US) |
| Financial services | EU AI Act (high-risk), DORA, MiFID II, SR 11-7 (US) | AI used in credit scoring, AML, fraud detection, and algorithmic trading faces explainability, model validation, fairness testing, and governance requirements. SR 11-7 (US Federal Reserve guidance) requires independent model validation and documentation for models in regulated use. | High | ECB / EBA / national regulators; Federal Reserve, OCC (US) |
| Employment and HR | EU AI Act (high-risk), EEOC guidelines (US), local labour law | AI used in recruitment, performance evaluation, promotion, or dismissal decisions is high-risk under the EU AI Act. US EEOC has issued guidance on AI in hiring under Title VII. NYC Local Law 144 requires bias audits for automated employment decision tools used in NYC. | High | EU national regulators; EEOC, NYC DCWP (US) |
| Legal and professional advice | Legal services regulation, professional conduct rules | AI tools that provide legal, medical, financial, or accounting advice in regulated jurisdictions may constitute unauthorised practice of a profession. Developer liability may arise where the system is marketed in terms that create reliance without adequate disclaimers. | Medium–high | Professional regulatory bodies; courts |
| Education | EU AI Act (high-risk), FERPA (US), national education law | AI used in student assessment, admissions, or monitoring is classified as high-risk under the EU AI Act. FERPA (US) governs the use of student data. Personalisation engines that influence educational outcomes carry additional obligations. | Medium–high | EU national regulators; Dept of Education (US) |
GDPR obligations do not end at data collection. Every time a deployed model processes a prompt or input that contains or relates to an identifiable natural person, GDPR applies to that processing. This means: a lawful basis for processing is required at inference; data minimisation obligations apply to prompt handling and logging; retention limits apply to inference logs; data subject rights (access, erasure, correction) apply to outputs generated about identifiable individuals; and Article 22 automated decision-making obligations may apply where the model's output is the basis for a significant decision affecting a data subject. The developer who has GDPR-compliant training data but GDPR-non-compliant inference infrastructure has solved half the problem.
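Two of these inference-time obligations, data minimisation in logging and retention limits on inference logs, translate directly into code. The fragment below is a sketch under stated assumptions: the regex-based `minimise` and the 30-day window in `purge_expired` are hypothetical choices, and real PII detection requires considerably more than a single regex.

```python
import re
from datetime import datetime, timedelta

# Hypothetical pattern; production systems need proper PII detection, not one regex.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def minimise(prompt: str) -> str:
    """Data minimisation before logging: strip direct identifiers from the prompt."""
    return EMAIL.sub("[REDACTED_EMAIL]", prompt)

def purge_expired(logs: list[dict], now: datetime,
                  retention_days: int = 30) -> list[dict]:
    """Retention limit: drop inference log entries older than the retention window."""
    cutoff = now - timedelta(days=retention_days)
    return [entry for entry in logs if entry["ts"] >= cutoff]
```

The design point is that both functions sit in the serving path, not in a later clean-up job: minimisation before anything is written, and purging on a schedule, so compliance does not depend on someone remembering to delete logs.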
The regulatory environment for AI is not static. The EU AI Act's implementation guidance is still being developed; US federal AI regulation is evolving through executive order, agency rulemaking, and emerging state law; the UK is pursuing a sector-led rather than horizontal regulatory approach. The developer who builds regulatory compliance as a one-time exercise is likely to be non-compliant within eighteen months. The developers who manage this well treat regulatory tracking as a standing function, not a project.
Conclusion: A Six-Step Legal Risk Audit for AI Developers
The legal risks covered in this guide — training data liability, output liability, licence compliance failures, and regulatory exposure — each operate independently. Compliance at one layer does not resolve exposure at another. The developer who resolves their GDPR training data obligations but ignores their EU AI Act deployment obligations has not managed their legal risk: they have managed one quarter of it. The following six-step audit provides a practical framework for identifying and addressing exposure across all four layers.
Audit your training data
Document the source, licence, and lawful processing basis for every dataset used in training. Flag datasets acquired by scraping and assess copyright and GDPR exposure. Verify that robots.txt instructions and terms of service were complied with at the time of collection.
Map your output liability exposure
Identify the categories of output your model generates. Assess defamation risk (factual claims about identifiable individuals), copyright risk (verbatim or near-verbatim reproductions), and professional advice risk (outputs in regulated domains). Document the measures in place to detect and limit each category.
Conduct upstream licence due diligence
Identify all upstream model components. Read each licence. Check for copyleft obligations, RAIL prohibited use schedules, attribution requirements, and commercial thresholds. Document your compliance with each. Implement a CLA process for external contributions before the next release.
Classify your AI system under the EU AI Act
Determine whether you are a GPAI provider (general-purpose model), a high-risk AI system provider, or a lower-risk system provider. Identify your applicable obligations. If your model's training compute exceeds 10²⁵ FLOPs, assess systemic-risk obligations. Begin the technical documentation process regardless of tier.
Assess sector-specific regulation
For each deployment sector your model operates in or is likely to operate in: identify the applicable sector-specific framework. Determine whether your model or application falls within scope. Engage specialist legal advice for healthcare, financial services, and employment deployments before launch.
Build a legal compliance monitoring function
Assign ownership of AI legal compliance to a specific role. Establish a process for monitoring changes to the EU AI Act implementation guidance, US federal and state AI regulation, and sector-specific rulemaking. Schedule annual legal risk reviews timed to coincide with major model updates or new deployment markets.
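The six steps above lend themselves to being tracked as a simple checklist structure rather than an untracked document. A minimal sketch, with hypothetical step keys and an `audit_status` helper that are assumptions of this example; the descriptions compress the steps above.

```python
# Illustrative checklist of the six audit steps; keys and helper are hypothetical.
AUDIT_STEPS = [
    ("training_data", "Source, licence, and lawful basis documented for every dataset"),
    ("output_liability", "Defamation, copyright, and professional-advice risks mapped"),
    ("upstream_licences", "Copyleft, RAIL, attribution, and commercial thresholds checked"),
    ("ai_act_classification", "GPAI / high-risk / lower-risk tier determined, docs started"),
    ("sector_regulation", "Sector-specific frameworks identified per deployment market"),
    ("monitoring", "Compliance owner assigned; annual review scheduled"),
]

def audit_status(completed: set[str]) -> list[str]:
    """Return the steps still outstanding; exposure is unmanaged until this is empty."""
    return [desc for key, desc in AUDIT_STEPS if key not in completed]
```

Because the four risk layers operate independently, the helper deliberately returns every outstanding step rather than a pass/fail flag: completing five of six steps still leaves a whole layer of exposure open.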


