Is My AI Really Accurate?
Associate 1: “I use an AI tool which claims an accuracy rate of over 95%!”
Associate 2: “I use the same one. It cited judgements that don’t exist!”
How does one reconcile these statements? Is one of them false, or are they rooted in different understandings of ‘accuracy’? When an AI platform boasts high accuracy, say 95%, it’s important to ask, first, what factors determine this percentage; and second, whether the missing 5% renders the AI useless.
How should we determine accuracy?
Accuracy – ironically enough – lacks an accurate definition. Let’s consider an example – a lawyer, Martha, is reviewing a set of facility agreements containing 50 lender consent requirements. Martha is comparing three AI platforms to extract these lender consent provisions. Model A catches 49 requirements, but provides a pointed response with citations. Model B catches all 50, but also catches several other irrelevant provisions, providing a painfully long response. Model C catches all 50 consent provisions, but also invents 3 more. How would Martha rank the accuracy of these models?
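Martha’s dilemma exists because the three models differ not along a single ‘accuracy’ number but along two standard information-retrieval dimensions: precision (how much of the output is correct) and recall (how much of the truth was found). A minimal sketch of this split, using the hypothetical figures above – the count of 20 irrelevant provisions for Model B is an assumption, since the example only says “several”:

```python
def precision_recall(found_true, found_false, total_true):
    """found_true: correct provisions returned; found_false: irrelevant or
    invented provisions returned; total_true: provisions that actually exist."""
    returned = found_true + found_false
    precision = found_true / returned   # share of the output that is right
    recall = found_true / total_true    # share of the truth that was found
    return precision, recall

# Model A: 49 of 50 provisions, nothing extra
# Model B: all 50, plus 20 irrelevant ones (assumed figure for "several")
# Model C: all 50, plus 3 invented ones
for name, tp, fp in [("A", 49, 0), ("B", 50, 20), ("C", 50, 3)]:
    p, r = precision_recall(tp, fp, 50)
    print(f"Model {name}: precision={p:.2f}, recall={r:.2f}")
```

On these assumed numbers, Model A trades a sliver of recall for perfect precision, Model B buys perfect recall with a flood of noise, and Model C’s modest-looking precision gap is the most dangerous of all, because its false positives are fabrications rather than mere over-inclusions.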
Most would immediately discard Model C. This is what we call a ‘hallucination’ – the AI sees what does not exist. Hallucinations erode the trust between the lawyer and the AI tool. If the lawyer has to question the veracity of every sentence, the AI platform in fact increases the lawyer’s workload. And this is where accuracy calculations get tricky. A claim of ‘90% accuracy’ could mean that the AI tool hallucinates in only 10 documents out of a sample of 100. But if the lawyer constantly fears hallucination and has to verify every response, that 90% accuracy is as good as nil.
Therefore, an AI platform should achieve a zero or near-zero level of hallucination before ‘accuracy’ is even calculated. Assuming Models A and B have achieved this, which one is more accurate? Model B technically achieves 100% accuracy, but is too verbose and over-inclusive to add any practical utility. Model A omits one clause but provides pointed insights and guides the lawyer in the right direction. Many would thus prefer Model A (98% accuracy) over Model B (100% accuracy). It is crucial to understand these nuances and evaluate accuracy alongside practical utility. This requires us to explore the concept of ‘utility’ in a little more depth.
Does utility begin at 100% accuracy?
Let’s begin by dissecting the status quo. Today, the industry relies on early-stage lawyers who are often fatigued and overworked. When a junior lawyer, at 2 AM, is asked to summarize 5 judgements or identify consent requirements across 50 facility agreements, that individual is unlikely to produce a 100% accurate output. Yet the industry functions at full force, through a combination of trust and supervision. So while faultless accuracy is desirable, it’s not indispensable for utility. Here, it’s useful to divide legal work into two broad categories – high stakes and low stakes. Take Pool A of tasks – things like sorting a data room, preparing internal trackers, drafting the first cut of a list of dates, and summarizing judgements for a research note. Now take Pool B – things like identifying change of control restrictions for an IPO, drafting a complex appeal, checking legal compliance of a contract, and identifying red flags in a takeover deal. While the categorization of tasks into ‘high stakes’ and ‘low stakes’ is subjective and organization-specific, most would consider Pool B to carry higher stakes. Yet Pool A takes up just as much of the lawyer’s time, and often remains unbilled to the client.
In this scenario, automating Pool A tasks offers an immediate ROI, freeing up bandwidth for higher-value work. A legally trained AI (in the form of Model A) is immediately capable of automating these tasks. Minor errors can be easily addressed during review, ensuring that lawyers don’t spend countless hours on tasks that don’t require deep legal expertise.
For Pool B tasks – those requiring strategic thought, nuanced legal understanding, and a focus on client outcomes – trust in AI becomes crucial. While a 95% accurate AI (like Model A) may not automate these tasks entirely, it can serve as a powerful ally. Imagine drafting a complex appeal: the AI can immediately highlight relevant precedents, spot inconsistencies, or flag missing issues. You may still need to apply these insights to the final draft, but it will amplify your efficiency, acting as a second pair of eyes and even spotting patterns or issues you might have missed. Even today, when you work with a new colleague or junior, you don’t trust them straight away – especially for high-stakes work. Your AI tool is no different. At first, it’s a bit of a “prove yourself” situation. But over time, you’ll learn how best to harness its power for the most critical legal work. Those who master that dance will deliver the most impactful legal services with speed, thoroughness, and precision.
The key takeaways? First, accuracy should only be judged in the context of utility; and second, while it’s important to test the AI against itself, it’s equally important to test the AI against the status quo. The legal world today is imperfect. The right AI tools can augment human intelligence and empower lawyers to deliver better quality output, efficiently.