What is AI doing...and why does it work (for lawyers) - Part 2 (Hallucinations)

Ritwik Bhattacharya
This is Part 2 in a series explaining how AI works for lawyers. If you have not read Part 1 yet, you can find it here.
If you have used ChatGPT, Claude or Gemini, it is likely you have come across this situation - you receive an answer which you know is factually incorrect. You ask it to check again, and it doubles down on the answer. Then you point out the error, and it says "You're absolutely right!", apologises profusely and eventually gives you the correct answer. If that sounds familiar, you have experienced what is called a "hallucination".
For lawyers, AI hallucinations are not just embarrassing - they can be professionally dangerous. Across jurisdictions, the message is the same: if you use AI, you are still responsible for what eventually goes to the client or the court. Professional conduct rules require lawyers to verify AI-generated outputs, provide competent representation and not mislead the court (an indicative survey of the relevant rules and cases is in the Annex at the end of this post). A breach of these obligations can result in sanctions from the court, professional disciplinary proceedings, reputational harm and exposure to professional negligence claims. A database tracking instances of AI hallucinations in court submissions has identified 1314 such cases globally, as of 13 April 2026.
To use AI responsibly and in line with professional obligations, lawyers need to understand what hallucinations are, why they occur and how they can be mitigated. That is what this post will cover, and by the end, you will see that "hallucination" is an umbrella term for not one problem but a family of failure modes - each requiring a different mitigation strategy.
What are "hallucinations"?
"Hallucination" is a word drawn from psychology, where it means an unreal perception that feels real. When applied to LLMs, it has been defined as generated content that is nonsensical or unfaithful to the provided source content - whether inconsistent with the model's training data, its prompt or actual facts of the world.
Other terms have been used to describe this phenomenon - some prefer confabulation (producing plausible but false information without intent to deceive) or fabrication (emphasising that false information is being generated, not merely misperceived). These alternative terms might be more appropriate because the analogy to a psychological phenomenon can be misleading. LLMs are not "seeing" things that do not exist - rather, as we discussed in Part 1, the model is pattern-matching symbols with no awareness that the output is factually wrong. However, we will stick with "hallucination" in this post because the term has become so established that using another would feel jarring.
What causes hallucinations in AI?
What is more interesting is why hallucinations occur. And as we will soon see, not all hallucinations are caused by the same things. The graph below gives a snapshot view of the different causes of hallucinations. It is not exhaustive but it covers the most common and practically significant ones for lawyers.

To see these hallucinations in action, we will be using gpt-3.5-turbo as a learning aid (which can be accessed through the OpenAI API here) - the same model that was used in ChatGPT when it was launched on 30 November 2022. The reason for using an older model is that it hallucinates more readily, making it easier to trigger and demonstrate each failure mode in a predictable way.
It should be noted that hallucination rates have declined significantly in newer AI models, as shown in the chart below (where "Best Model" refers to whichever model achieved the lowest hallucination rate in each benchmark period), due to a combination of increased model size, better training techniques and so-called reasoning models (which we will look at in a future post). However, after a certain point, the hallucination rates tend to plateau rather than continue to fall, meaning this is a risk that lawyers will need to manage for the foreseeable future, not one that will simply disappear with the next model release.

The prompts which trigger hallucinations in gpt-3.5-turbo are unlikely to do so in the latest AI models. Yet the underlying mechanism (next-token prediction) remains the same. gpt-3.5-turbo therefore provides a useful mental model for why hallucinations occur - and for why, once a prompt becomes complex enough, even the latest models are likely to hallucinate in the same way.
This is a screenshot of what accessing the chat interface through the OpenAI API looks like - it is a bit like looking under the hood of a car to see how the engine works. You will see that you can select models going all the way from the earliest one (gpt-3.5-turbo) to gpt-5.4, the latest model at the time of writing. Let us try breaking the AI by triggering different hallucinations.
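If you prefer code to the playground, the same request can be made programmatically. Below is a minimal sketch, assuming the official OpenAI Python SDK (pip install openai) and an API key set in your environment - it asks a question we will return to in the next section, and the exact wording of the reply will vary:

```python
# Minimal sketch: querying gpt-3.5-turbo through the OpenAI API.
# Assumes the official OpenAI Python SDK and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",   # the older model we are using as a learning aid
    temperature=0,           # deterministic output - explained in the next section
    messages=[{"role": "user", "content": "The name of the U.S. Attorney General is"}],
)
print(response.choices[0].message.content)
```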

Model knowledge cut-off
In Part 1, we saw that LLMs are trained on vast amounts of text from the internet, and as such, are a snapshot of the internet. And like a snapshot, it is frozen - the LLM does not keep learning in real time. The only way the LLM itself will learn about later events is if it is specifically retrained - typically when a new generation of the model is released (like gpt-4). Therefore, all LLMs have a knowledge cut-off date. For gpt-3.5-turbo, the cut-off date is September 2021. When it is asked a question like "The name of the U.S. Attorney General is", it answers Merrick Garland, who held that office as of the cut-off date (as you can see from the Wikipedia page).


Key takeaway: An LLM will not by itself have access to any decisions, statutes, regulations or amendments that post-date its knowledge cut-off, unless it is specifically given access to that material. This can produce hallucinations if the answer relies on a position of law that has changed since the cut-off date.
Temperature settings
Even when the training data contains the correct information, there can be a hallucination due to something called temperature, which controls the randomness with which the LLM generates text. The term comes from thermodynamics, describing how particles in a physical system distribute themselves across different energy states - at high temperatures, the particles become more energetic and behave more randomly, and vice versa. In an LLM, a temperature of 0 means that the model deterministically selects the highest-probability token when generating text. At higher temperatures, the model may select a lower-probability token instead, introducing variety into the output. This is useful for creative tasks like brainstorming case strategies or generating alternative argument framings, where you want the model to explore less obvious possibilities. It is counterproductive for tasks that require precision, like legal research or document drafting, where the most probable answer is usually the correct one. If you have seen the "More Creative" or "More Precise" toggles in Copilot, the temperature is what is being tweaked.

Let us see how temperature makes a difference. In the screenshot below, we are using a split-screen with the same model, gpt-3.5-turbo - the only difference is that the model on the left (in orange) is set to temperature 0, while the model on the right (in purple) is set to temperature 1.

Let us ask it a question whose answer is not widely known, and therefore, is likely to occur rarely in the training data - "Who was the sixth woman to become a judge of the Supreme Court of India?". We can see from the Wikipedia page that it was R. Banumathi.

In response to this question, the model with temperature 0 consistently gives the correct answer, but the model with temperature 1 sometimes returns Gyan Sudha Mishra (the fourth female judge) or Indu Malhotra (the seventh female judge). Notice that the wrong answers are probabilistically close to the correct one - both are plausible, being the female judges immediately preceding and succeeding R. Banumathi. This can be perplexing if you think of LLMs as looking up information, but less so when you remember that the model is making a prediction about the next word, which in this case, is the name of the judge.
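To make this concrete, here is a sketch of the sampling mechanism itself. The logit values below are invented purely for illustration - the point is how temperature reshapes the probability distribution the model samples from:

```python
import numpy as np

# Hypothetical next-token logits for the judge's name - invented for illustration.
logits = {"R. Banumathi": 4.0, "Gyan Sudha Mishra": 3.3, "Indu Malhotra": 3.1}

def sampling_probabilities(temperature: float) -> dict:
    """Softmax with temperature: how raw scores become sampling probabilities."""
    names = list(logits)
    scaled = np.array([logits[n] for n in names]) / temperature
    exp = np.exp(scaled - scaled.max())   # subtract max for numerical stability
    return {n: round(float(p), 3) for n, p in zip(names, exp / exp.sum())}

print(sampling_probabilities(temperature=1.0))
# {'R. Banumathi': 0.525, 'Gyan Sudha Mishra': 0.261, 'Indu Malhotra': 0.214}
# - the wrong names get real probability mass, so they are sometimes sampled
print(sampling_probabilities(temperature=0.1))
# {'R. Banumathi': 0.999, ...} - effectively deterministic, like temperature 0
```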

Key takeaway: For any task where accuracy matters, use the lowest temperature setting available to ensure a more precise or deterministic answer. However, using zero temperature does not eliminate hallucinations, due to the other failure modes mentioned in this post.
Context limits
Another cause of hallucinations is when the model's context - its working memory - is exceeded. You will recall from Part 1 that there is a limit to how much text the model can process at once, measured in tokens (one token roughly corresponds to four characters of English text, as shown below). The latest AI models have massive context limits - the largest one from Meta goes up to 10 million tokens! To see what happens when the context limit is exceeded, we have set the limit in gpt-3.5-turbo to 2048 tokens, or roughly 1,500 words of ordinary English prose.
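You can check the characters-per-token heuristic yourself with tiktoken, OpenAI's open-source tokenizer library - a minimal sketch (the example sentence is my own, and exact counts depend on the tokenizer version):

```python
import tiktoken   # OpenAI's open-source tokenizer: pip install tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

sentence = ("The parties agree to submit to the exclusive jurisdiction "
            "of the courts of England and Wales.")
tokens = enc.encode(sentence)

print(len(sentence), "characters")
print(len(tokens), "tokens")
# The ratio below is the basis of the "roughly four characters per token" rule of thumb.
print(round(len(sentence) / len(tokens), 1), "characters per token")
```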

When we ask it to "Count from 1-1000 without abbreviating", we see that it stops at 677 because its context limit is reached - each number, with its separator, consumes roughly two to three tokens, so the 2048-token limit is exhausted around that point. It does not tell us in advance that it cannot complete this task given its context limit - it simply stops.

Even when the model's context window is large enough to hold the entire document, two problems can still cause hallucinations. The first is the "lost in the middle" problem: researchers have found that LLMs tend to pay more attention to information at the beginning and end of their context window, and skip or deprioritise information in the middle. The second is the "needle in the haystack" problem: when a single relevant fact is buried among large amounts of irrelevant text, the model may fail to locate it. Both mean that simply upgrading to a model with a larger context window - or paying for a "pro" version with a higher token limit - does not automatically eliminate the risk. The model may still miss what matters.
Key takeaway: When you paste a 200-page contract into a chatbot and ask about a clause on page 97, the LLM may miss it - even though the document technically fits within the context window. The same risk applies to long court transcripts, due diligence data rooms or voluminous discovery documents.
Tasks not suitable for LLMs
There are some tasks where a next-token prediction engine is simply not the right tool. A classic example, which has stumped many earlier models, is the question "how many Rs in strawberry?" - gpt-3.5-turbo answers two Rs, which even a child can tell is wrong.

To understand why, recall from Part 1 that LLMs process text as tokens, not as individual characters. The word "strawberry" might be split into tokens like "straw" and "berry", or even processed as a single token. The model never "sees" the discrete letters s-t-r-a-w-b-e-r-r-y the way you do when you scan across the word. It is not counting letters at all - it is predicting the most likely answer based on similar questions in its training data. A similar limitation applies to sorting lists alphabetically, which again requires the model to treat words as sequences of discrete letters.
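Again, tiktoken makes this visible - a minimal sketch (the precise split is tokenizer-dependent):

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

token_ids = enc.encode("strawberry")
print([enc.decode([t]) for t in token_ids])   # sub-word chunks, not individual letters

# By contrast, character-level code trivially gets the right answer:
print("strawberry".count("r"))   # 3
```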
For such tasks, even though LLM-based AI will often give the right answer, a better option is to use rules-based tools (like Excel formulae or purpose-built contract analysis software).
Key takeaway: When a task requires precise counting, sorting, or character-level analysis (e.g. counting cross-references or verifying that defined terms are used consistently throughout a contract), do not rely on an LLM. Use a rules-based tool instead.
Reasoning limits
Although LLM-based AI can appear to think and reason like we do, it is still doing next-token prediction. The starkest example of this can be seen when we ask gpt-3.5-turbo the following question - "Turn left, then right, then left again. Which direction are you facing?" The correct answer is "left", as shown below.

gpt-3.5-turbo states "you will be facing in the opposite direction from where you started", which is incorrect.

This hallucination occurs because while you and I maintain an internal state called "direction" which we update sequentially after each instruction, the model is just predicting the next token, without tracking such a state. Somewhere in its training data, it has seen that "opposite direction" is a statistically frequent answer to such questions, and so it produces that answer.
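The contrast with ordinary code is instructive. The sketch below (the degree-based encoding is my own) solves the puzzle reliably, precisely because it does what the LLM does not: maintain an explicit state and update it after each instruction.

```python
# Explicit state: "facing" in degrees, where 0 is the starting direction.
TURNS = {"left": -90, "right": 90}

def follow(instructions):
    facing = 0
    for turn in instructions:
        facing = (facing + TURNS[turn]) % 360   # update the state after each instruction
    return {0: "straight ahead", 90: "right", 180: "behind you", 270: "left"}[facing]

print(follow(["left", "right", "left"]))   # left - the correct answer
```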
Key takeaway: For complex legal analysis involving multi-step reasoning, treat AI output as a first draft that must be checked step by step against the underlying sources. Do not assume that a plausible-sounding conclusion followed a sound reasoning process.
False-premise acceptance
We ask gpt-3.5-turbo a question about Patel v. Mercer Construction Limited [2015] EWCA Civ 1024, and it answers, summarising its impact on the drafting of liquidated damages clauses in standard form construction contracts. There is only one problem - Patel v. Mercer Construction Limited is not a real case. The citation [2015] EWCA Civ 1024 actually relates to an entirely different case.


There are two reasons why this has likely happened, both linked to how the model has been trained:
First, the prompt embeds a false premise which the system tends to accept and elaborate on due to a phenomenon known as sycophancy. LLM-based AI tends to be sycophantic because during training, models are rewarded for responses that human annotators find helpful and agreeable - leading to a bias towards accepting the premise in the prompt.
Second, in situations where the model should acknowledge uncertainty or admit that it does not know the answer, it instead guesses, producing a plausible answer. A study by OpenAI found that this is due to training and evaluation mechanisms which reward guessing instead of acknowledging uncertainty - much like how it makes sense to attempt educated guesses in an examination without negative marking. A simple example illustrates this: if an LLM is asked to guess a person's birthday, a random guess has a 1/365 chance (in a non-leap year) of being correct, while saying "I don't know" has no chance of being scored as correct. When measured by benchmarks that only reward correct answers, the model develops an incentive to guess rather than express uncertainty. Therefore, when an LLM does not find a case or encounters a link to an online article behind a paywall that it cannot access, it does not admit uncertainty but instead produces a plausible-sounding answer.
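The incentive is easy to make explicit - a minimal sketch of the expected scores under accuracy-only grading:

```python
# Expected benchmark score: random birthday guess vs. admitting uncertainty,
# when graders award 1 for a correct answer and 0 for everything else.
p_correct = 1 / 365                       # non-leap year

expected_if_guessing = p_correct * 1 + (1 - p_correct) * 0
expected_if_abstaining = 0                # "I don't know" never scores

print(f"{expected_if_guessing:.4f}")      # 0.0027 - small, but strictly better...
print(f"{expected_if_abstaining:.4f}")    # 0.0000 - ...than abstaining, so models learn to guess
```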
Key takeaway: This is particularly dangerous in legal research. If the premise in the prompt is wrong, the model is unlikely to push back - it will build an elaborate, convincing answer on a false foundation. The tone of the output will be indistinguishable from a correct answer.
Limits in understanding law and jurisprudence
Finally, there are domain-specific attributes of law that make it difficult for LLMs trained on general text to properly reason through legal questions. An example of something that LLMs can struggle with is applying the rule of precedent. In response to a legal query about the position of law, it is not sufficient to just find the most factually similar case. Rather, it must be the right jurisdiction; it must not have been overruled, narrowed or distinguished; it must be applied as per the hierarchy of the court system; and it must be the ratio decidendi or holding of the case instead of an obiter or passing remark, which is often a matter of interpretation rather than factual retrieval.
When we ask gpt-3.5-turbo "What is the test for dishonesty under the Theft Act 1968?", it correctly identifies that it is a statute of England and Wales, and gives us the two-part test established in R v Ghosh [1982] EWCA Crim J0405-1.

There is a nuance here that has been missed. While R v. Ghosh did lay down this two-part test, a subsequent case (Ivey v Genting Casinos Ltd t/a Crockfords [2017] UKSC 67) revised this understanding by removing the second limb of the test. This becomes apparent when asking the same question to Westlaw Edge as shown below (using such AI-enabled legal research tools is one of the mitigation strategies for these kinds of hallucinations, as we will see shortly). This was a case from 2017 that happened before the knowledge cut-off for gpt-3.5-turbo, and yet, the model was unable to recognise its impact on how the jurisprudence on the test of dishonesty has evolved.

This is consistent with the findings of benchmarking exercises like the Allens AI Australian law benchmark and the LinksAI English law benchmark - LLMs struggle with questions that require nuanced application of legal principles rather than simple factual retrieval. Even AI-enabled legal research tools, which perform better than general LLM chatbots, are not entirely free from such errors. A Stanford University study put over 200 legal queries to such tools and found hallucinations caused by misunderstandings of legal reasoning and jurisprudence. These tools have since improved - something the study itself acknowledged - but they are not a substitute for a human lawyer.
Key takeaway: AI can surface relevant cases, but it cannot reliably navigate the doctrinal hierarchy that determines whether those cases are good law. Moreover, these tools are no substitute for lawyer experience, judgement and awareness - the kind of understanding that is not easily captured in a dataset, but something more diffuse that only comes with the practice of law.
How can hallucinations be mitigated?
We have covered a lot of ground - looking at seven different causes of hallucinations. Understanding those causes matters because they determine the appropriate mitigation strategy. Approaching a cause of hallucination with an inappropriate mitigation strategy will not help: if the problem is that the relevant fact occurred after the model's cut-off date, using a "bigger" or "better" model will not fix it.
Before we go through each mitigation strategy, it is helpful to see how the causes and mitigations map to each other. The table below provides this mapping, along with concrete takeaways for individual lawyers and examples of how these strategies have been implemented in AI tools. This table is best used as a reference to return to after reading the explanations that follow.

Let us now look at each strategy in turn and how it connects to the causes discussed above.
Retrieval Augmented Generation (RAG)
Addresses: knowledge cut-off and context limits
The most intuitive fix for the knowledge cut-off problem is to give the LLM access to current information at the point of answering. This is the idea behind Retrieval Augmented Generation, or RAG - instead of relying solely on what the LLM learned during training, the system first retrieves relevant information from an up-to-date source (like a legal database, a set of documents, or even the internet), and then feeds that information to the LLM alongside your question. The LLM generates its answer based on the retrieved content rather than its frozen training data. ChatGPT search works this way: when you ask a question that requires current information, it makes search queries based on your prompt and uses the results to generate an up-to-date answer. RAG does not make the LLM smarter. It makes it better informed.
RAG also helps address context limits. Rather than pasting an entire document into the LLM, RAG systems break long documents into smaller chunks, use semantic search to identify which chunks are most relevant to your question, and then feed only those chunks to the LLM. Harvey, for instance, makes use of RAG to "enable efficient searching across massive datasets". For lawyers working with due diligence data rooms, lengthy contracts, or voluminous court transcripts, this approach means the LLM does not need to read everything at once - it just needs to find and read the right parts. A complementary strategy is to break a complex task over a long document into discrete sub-tasks, so that each sub-task stays within the LLM's effective memory and attention span.
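Under the hood, the retrieve-then-generate loop is conceptually simple. The sketch below shows the three steps - chunk, retrieve, generate. It assumes the official OpenAI Python SDK, and the chunking and similarity search are deliberately naive; production tools use far more sophisticated pipelines:

```python
import numpy as np
from openai import OpenAI   # assumes the official SDK and an OPENAI_API_KEY

client = OpenAI()

def embed(texts):
    """Turn text into vectors, so semantic similarity becomes a dot product."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def answer_with_rag(question, document, chunk_words=300, top_k=3):
    # 1. Chunk: break the long document into pieces small enough for the context window.
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]
    # 2. Retrieve: rank chunks by cosine similarity to the question, keep the best few.
    chunk_vecs, q_vec = embed(chunks), embed([question])[0]
    scores = chunk_vecs @ q_vec / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q_vec))
    best = [chunks[i] for i in np.argsort(scores)[-top_k:]]
    # 3. Generate: the model answers from the retrieved text, not its frozen training data.
    prompt = ("Answer using ONLY the excerpts below. If they do not contain the answer, say so.\n\n"
              + "\n---\n".join(best) + f"\n\nQuestion: {question}")
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo", temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```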
Temperature settings
Addresses: temperature-induced randomness
As we saw, higher temperatures introduce randomness that can turn a correct answer into a plausible but wrong one. The fix is straightforward: use low temperature settings for tasks where accuracy matters, and reserve higher temperatures for tasks where variety is the point - for instance, when using AI as a brainstorming or sparring partner to develop case strategy or test legal arguments. Most legal AI tools are already set to extremely low or zero temperature to ensure more deterministic answers. If you are using a general-purpose chatbot for legal work, it is worth checking whether the tool gives you any control over this setting, or at least understanding which mode you are in.
Agentic AI and tool calling
Addresses: tasks not suitable for LLMs
Agentic AI and tool calling are strategies to address the problem that some tasks are simply not well suited to next-token prediction. These systems recognise that a question would be better answered by a different kind of tool, and call that tool on your behalf. Claude, for example, uses agentic abilities to identify whether your query will be better served by running Python code or an Excel formula, and executes that code instead of generating a text-based response. The LLM becomes an orchestrator rather than the sole performer. For lawyers, this means that if you ask an LLM-based AI to calculate interest on a damages claim or sort a list of creditors alphabetically, the best systems will delegate that work to a tool built for the job rather than attempting to predict the answer token by token.
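Here is a minimal sketch using the OpenAI tool-calling API (the simple_interest function and the figures are my own illustration). The model never computes the answer itself - it selects the tool and supplies the arguments, and ordinary deterministic code does the arithmetic:

```python
import json
from openai import OpenAI

client = OpenAI()

def simple_interest(principal: float, rate: float, years: float) -> float:
    """Rules-based arithmetic - no token prediction involved."""
    return principal * rate * years

tools = [{
    "type": "function",
    "function": {
        "name": "simple_interest",
        "description": "Compute simple interest on a damages claim.",
        "parameters": {
            "type": "object",
            "properties": {
                "principal": {"type": "number"},
                "rate": {"type": "number", "description": "annual rate, e.g. 0.08 for 8%"},
                "years": {"type": "number"},
            },
            "required": ["principal", "rate", "years"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o",   # an assumption - any tool-calling-capable model works here
    messages=[{"role": "user", "content": "What is the interest on £250,000 at 8% over 3 years?"}],
    tools=tools,
)

# The model responds with a structured request to call the tool, not with a number.
call = resp.choices[0].message.tool_calls[0]
print(simple_interest(**json.loads(call.function.arguments)))   # 60000.0 - computed, not predicted
```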
Reasoning models
Addresses: reasoning limits
Reasoning models are a newer class of LLMs that attempt to address reasoning limits. Models like OpenAI's o3, Anthropic's Opus 4.6, Gemini 3.1 Pro and DeepSeek R1 are designed to break complex problems into intermediate steps, effectively "thinking" before answering. The improvement on reasoning tasks is marked. It is worth noting, though, that these models are still predicting tokens - they have been trained to produce chains of reasoning that lead to better answers, but the underlying mechanism has not changed. Think of it as the difference between a law student who blurts out the first answer that comes to mind and one who has been trained to work through an Issue-Rule-Analysis-Conclusion (IRAC) structure before responding. The second student will get more answers right, not because they are fundamentally smarter, but because the process catches errors that the instinctive response would miss.
Better prompting and verification
Addresses: false-premise acceptance
As we saw, false-premise acceptance is partly a prompting problem and partly a verification problem.
Better prompting means avoiding false premises or explicitly asking the model to challenge assumptions. A good mental model is to follow the best practices from trial advocacy, where leading questions must be avoided during examinations-in-chief to not bias the answer. Asking "How did Patel v. Mercer Construction Ltd affect liquidated damages clauses?" presupposes the case exists. Asking "Does the case Patel v. Mercer Construction Ltd [2015] EWCA Civ 1024 exist, and if so, what did it decide?" gives the model room to push back.
On the verification side, legal research tools like Lexis+ AI and Westlaw Edge allow every cited case to be verified against an authoritative legal database. This treats the LLM's output as a draft to be checked, not a finished product to be trusted. For lawyers, the principle is familiar: you would not cite a case in a skeleton argument without pulling it up and reading it yourself. The same discipline applies when the citation comes from an AI.
Domain-specific skills and lawyer judgement
Addresses: limits in understanding legal reasoning and jurisprudence
The limitations around legal reasoning and jurisprudence are the hardest to mitigate with technology alone, because they require the kind of domain expertise that general-purpose LLMs were never trained to have. One partial solution is the use of skills or plugins that layer legal-specific knowledge on top of a general model. The Claude Cowork Legal Plugin contains domain-specific system prompts and workflow maps that provide structured pathways for processing legal requests. For example, the review-contract skill directs the LLM to a playbook that grades contractual positions as standard, fall-back or unacceptable, and to analyse contracts on that basis.
Ultimately, the mitigation for this category of hallucination is lawyer judgement. The model can surface information, draft text, and identify patterns. But the responsibility for applying legal reasoning, checking jurisdictional relevance, and distinguishing ratio from obiter remains with the lawyer. This is not a limitation to be frustrated by. It is, in many ways, the strongest argument for why lawyers remain essential in a world of increasingly capable AI.
There is a lot more to be said about these mitigation strategies - which will be covered in detail in future posts.
Conclusion
At this point, you might be wondering why we should be satisfied with mitigating hallucinations - why not eliminate them completely? It has been proven mathematically that this cannot be done, that hallucinations are a permanent, structural inevitability of how LLMs work.
We can arrive at the same conclusion through logic. In the real world, there are an infinite number of situations and questions that can arise. LLMs, even though they have been trained on a vast amount of data drawn from across the internet, are finite systems that have not seen all possible situations. No matter how much data the LLM has seen or how good it is at prediction, a finite system cannot encompass an infinite reality. This is further compounded by the tendency of LLMs to guess instead of acknowledging uncertainty. Therefore, even though the rate of hallucinations has fallen dramatically on most benchmarks, there is a non-zero chance of hallucinations occurring, and to quote from the TV show Suits, "the law is a precise endeavour" - even a small chance must be accounted for.

Consider a thought experiment (first shared with me by Alistair Wye): imagine we had an AI system that produced 100% accurate legal outputs, every single time. Would lawyers still need to review and verify those outputs before relying on them? The answer, perhaps surprisingly, is yes - because the need to verify is not solely a function of the AI being unreliable. It is a function of what it means to be a lawyer. Professional conduct rules require that legal advice and court submissions are the responsibility of a qualified, regulated individual. Professional indemnity insurance operates on the same premise: a qualified lawyer must have exercised their professional judgement in the advice given or the work produced. Therefore, the human review step is not a temporary workaround for an immature technology. It is possible that over time, regulatory and market practice will gradually accept the use of autonomous AI for certain tasks, which are sufficiently low risk and where AI can demonstrate near-perfect accuracy. But for the foreseeable future, the lawyer's role in reviewing and verifying AI output remains essential.
What this means in practice is that if a LegalTech vendor claims their solution is 100% hallucination-free, that claim does not withstand scrutiny because it is inconsistent with how LLMs work. It is more productive to evaluate LegalTech solutions against the mitigation strategies discussed above, specifically those that aid the review and verification by a human lawyer. Many improvements in legal AI tools focus on making this verification process easier - for instance, with clickable citations (as can be seen in the screenshot from Harvey below).

Another improvement is increasing transparency about the intermediate steps involved in reaching a conclusion (as can be seen in the screenshot of Legora's workflow below).

Lawyers additionally need the awareness and mental models to overcome some of the cognitive biases that amplify hallucination risks - for example, the illusion of explanatory depth when an LLM produces a plausible-sounding explanation or summary. That is why what is increasingly emphasised is not just compliance, but education. The American Bar Association (ABA) makes clear that "lawyers must have a reasonable understanding of the capabilities and limitations" of AI (Page 2, ABA Formal Opinion 512). The Ministry of Law, Singapore, calls on lawyers to develop AI literacy, specifically to understand "(i) how AI tools function and their limitations, (ii) when AI tools are likely to generate reliable output and when they are not, (iii) basic prompting techniques to reduce hallucination and bias, (iv) that AI competency varies across legal tasks, and (v) when and by whom additional scrutiny should be exercised when reviewing GenAI output" (Para 20, Guide for Using Generative AI in the Legal Sector). The Lady Chief Justice of England and Wales called for "more training and support for lawyers...to enable them to use AI circumspectly and usefully" (Mansion House speech by the Lady Chief Justice on 3 July 2025).
This series of posts is an attempt to build that awareness and understanding. If Part 1 gave you the conceptual framework to understand what AI is and how it works, Part 2 has tried to give you the framework to understand where it breaks down and why. Together, they are the foundation for using AI not with blind trust or blanket scepticism, but with the kind of informed, critical engagement that the legal profession demands. Future posts will go deeper into the mitigation strategies discussed above, including RAG, reasoning models, agentic AI, prompt engineering and skills.
Annex: Professional rules and instances re AI hallucinations
The professional obligations relating to AI hallucinations span multiple jurisdictions. Below is a survey of some key rules, cases and instances.
Duty not to mislead the Court:
USA: The duty not to mislead the Court "by relying on fake opinions" (Page 6, Whiting v. City of Athens, Tennessee, Nos. 24-5918/5919, 25-5424)
Australia: Ensuring AI-prepared documents are "not likely to mislead their client, the court, or another party" (Statement on the Use of Artificial Intelligence in Australian Legal Practice)
Duty of competence:
USA: The duty to "provide competent representation" (Page 4, ABA Formal Opinion 512)
Canada: To exercise "(c)ompetence in the selection and use of any technology tools" (Para 46, Zhang v Chen, 2024 BCSC 285 (CanLII))
Duty of verification:
Singapore: Materials put before the Court must be "independently verified, accurate, true, and appropriate" (Principle 3(2)(a), Supreme Court of the Republic of Singapore, Registrar's Circular No. 1 of 2024)
Supervisory responsibility:
USA: Lawyers have a supervisory responsibility for nonlawyer assistance, including AI outputs (Rule 5.3, ABA Model Rules of Professional Conduct)
England and Wales: The High Court has described this responsibility as being "no different from the responsibility of a lawyer who relies on the work of a trainee solicitor or a pupil barrister for example, or on information obtained from an internet search" (Para 8, R (on the application of Ayinde) v London Borough of Haringey, [2025] EWHC 1040 (Admin))
Hallucinations outside the litigation context:
While AI hallucinations are more likely to be detected in the litigation context, due to enhanced scrutiny by opposing lawyers in an adversarial system, there have also been instances in other contexts - the most high profile being Deloitte's report for the Australian Government, which contained multiple inaccuracies attributable to AI hallucinations. Apart from suffering severe reputational harm, Deloitte had to forfeit the final payment for the report.