Analysis of AI Models for Extracting Legal Agreement Data from the CUAD Dataset
Introduction
The integration of artificial intelligence into the legal industry is revolutionizing how legal documents are reviewed and managed. Rather than specially trained models, general-purpose large language models are increasingly being used for many different tasks, including extracting key information from contracts. This report compares the performance of several AI language models at extracting key information from agreements in the Contract Understanding Atticus Dataset (CUAD). CUAD is a collection of legal agreements and contracts from various industries, compiled from public EDGAR filings and annotated for over 40 types of legal clauses, such as governing law, expiration dates, liabilities, and parties involved. It includes complex documents such as amendments and multilateral agreements. We found that while these models show promise for legal data extraction, their performance leaves room for improvement, and each model exhibits distinct behaviors that should inform which model to use for a given application.
Methodology
The initial models tested included the following:
Claude 3.5 Sonnet (2024-06-20), by Anthropic
GPT-4o, by OpenAI
GPT-4o-mini, by OpenAI
Llama 405b Instruct, by Meta
We evaluated their ability to extract the following:
Parties involved
Document name
Governing law
Effective date
Expiration date
Dataset
We used the CUAD dataset for this analysis, which includes a diverse range of legal agreements. Claude 3.5 Sonnet and GPT-4o-mini were evaluated on 510 agreements, but because some models cannot process very long documents, GPT-4o was evaluated on 454 agreements and Llama 405b Instruct on 159 agreements. Llama's much smaller evaluation set likely consisted of shorter, simpler agreements, which may have worked in its favor.
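Because document length determined which agreements each model could be evaluated on, a length filter was needed before running any extractions. Below is a minimal sketch of how such a filter might look, assuming the plain-text contracts from the public CUAD v1 release are unpacked under CUAD_v1/full_contract_txt/; the characters-per-token ratio and the context-window sizes are rough placeholders rather than the exact limits that applied in our runs.

```python
from pathlib import Path

# Rough character-per-token ratio; a real run would use each model's tokenizer.
CHARS_PER_TOKEN = 4

# Approximate context windows (in tokens); placeholders, not the exact limits
# that applied at evaluation time, which also depend on the serving provider.
CONTEXT_WINDOWS = {
    "claude-3-5-sonnet-20240620": 200_000,
    "gpt-4o": 128_000,
    "gpt-4o-mini": 128_000,
    "llama-405b-instruct": 32_000,
}

def load_contracts(contracts_dir: str) -> dict[str, str]:
    """Read every plain-text contract in the CUAD release into memory."""
    return {p.name: p.read_text(errors="ignore") for p in Path(contracts_dir).glob("*.txt")}

def fits_in_context(text: str, model: str, reserve_tokens: int = 2_000) -> bool:
    """Crude length check that leaves room for the prompt and the model's reply."""
    approx_tokens = len(text) // CHARS_PER_TOKEN
    return approx_tokens + reserve_tokens <= CONTEXT_WINDOWS[model]

contracts = load_contracts("CUAD_v1/full_contract_txt")
evaluable = {
    model: [name for name, text in contracts.items() if fits_in_context(text, model)]
    for model in CONTEXT_WINDOWS
}
```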
Prompts
We used the same simple prompt for all models on each extraction task to maintain consistency. The prompt was intentionally basic to assess each model's baseline performance without advanced optimization.
Extraction Task | Prompt |
---|---|
Identify the parties to the following agreement. | Provide only the names of the parties. If you cannot identify the parties, return "None". Do not guess.{input} |
Determine the name of the following contract. | Provide only the name of the agreement. If the contract has no name or you cannot determine the name, return "None". Do not guess. {input} |
Determine the location of the governing law from the following agreement. | Provide only the location of the governing law. If there is no governing law, return "None". Do not guess.{input} |
Determine the effective date of the following contract. | Provide only the effective date in the format [MONTH]/[DATE]/[YEAR]. If the contract has no effective date, return "None". Do not guess.{input} |
Determine the expiration date of the following contract. | Provide only the expiration date in the format [MONTH]/[DATE]/[YEAR]. If you cannot determine it, return an empty string. Do not guess.{input} |
These prompts lacked detailed instructions or context, which likely affected the models' performance, especially on complex legal documents. Using the same basic prompt for every model lets us compare their inherent capabilities directly, and it serves as a practical reference point for developers who start with simple prompts in the early stages of a product. More carefully engineered prompts would likely improve performance on these tasks, and we intend to rerun these tests with refined prompts as a next step.
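To illustrate how these prompts were applied, here is a minimal sketch of a single extraction call using the OpenAI Python SDK (the Anthropic and Llama calls are analogous). The sketch concatenates the task description and prompt text from the table above; the temperature setting and the exact whitespace around {input} are assumptions rather than a reproduction of our harness.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Task description and prompt text taken from the table above; the exact
# prompt string used in our runs may have differed slightly.
GOVERNING_LAW_PROMPT = (
    "Determine the location of the governing law from the following agreement. "
    "Provide only the location of the governing law. "
    'If there is no governing law, return "None". Do not guess.\n\n{input}'
)

def extract_governing_law(contract_text: str, model: str = "gpt-4o-mini") -> str:
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # assumption: deterministic-leaning decoding for extraction
        messages=[{"role": "user", "content": GOVERNING_LAW_PROMPT.format(input=contract_text)}],
    )
    return response.choices[0].message.content.strip()
```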
Results
Below is a summary of each model's performance in extracting the specified data points.
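Correctness can be graded in several ways; the sketch below assumes a normalized string match against CUAD's gold annotations, with an empty gold list meaning the clause is absent and "None" being the expected answer. Our actual grading may differ, so treat this only as an illustration of how accuracy figures of this kind can be tallied.

```python
import re

def normalize(answer: str) -> str:
    """Lowercase, strip punctuation (keeping slashes in dates) and extra whitespace."""
    answer = answer.lower().strip()
    answer = re.sub(r"[^\w\s/]", "", answer)
    return re.sub(r"\s+", " ", answer)

def is_correct(model_answer: str, gold_answers: list[str]) -> bool:
    """Count an extraction as correct if it matches any gold annotation."""
    if not gold_answers:  # clause absent in the contract
        return normalize(model_answer) in {"none", ""}
    return normalize(model_answer) in {normalize(g) for g in gold_answers}

def accuracy(results: list[tuple[str, list[str]]]) -> float:
    """results: (model answer, gold annotations) pairs for one extraction task."""
    correct = sum(is_correct(pred, gold) for pred, gold in results)
    return correct / len(results)
```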
1. Parties Involved
Model | Correct Extractions | Accuracy |
---|---|---|
Claude 3.5 Sonnet | 490/510 | 94.23% |
GPT-4o | 433/454 | 95.37% |
GPT-4o-mini | 495/510 | 95.19% |
Llama 405b Instruct | 152/159 | 95.60% |
Observations: All models struggled with agreements involving multiple parties, which may be because the prompt did not specify how to handle such cases. The models also tended to treat named entities such as people and companies as parties even when they appeared in the agreement for another reason and were not actually parties to it.
2. Document Name
Model | Correct Extractions | Accuracy |
---|---|---|
Claude 3.5 Sonnet | 487/510 | 93.65% |
GPT-4o | 431/454 | 94.93% |
GPT-4o-mini | 502/510 | 96.54% |
Llama 405b Instruct | 143/159 | 89.94% |
Observations: The models performed similarly well on this task, though all of them had trouble identifying the names of amendments. We note that this dataset contained many agreements that were confusingly formatted or named, or that were attached as exhibits to SEC filings; the models would likely have performed better on a cleaner dataset.
3. Governing Law
Model | Correct Extractions | Accuracy |
---|---|---|
Claude 3.5 Sonnet | 376/510 | 72.31% |
GPT-4o | 444/454 | 97.80% |
GPT-4o-mini | 446/510 | 85.77% |
Llama 405b Instruct | 151/159 | 94.97% |
Observations: GPT-4o performed admirably on this task, solidly beating the other models. Claude 3.5 Sonnet and GPT-4o-mini sometimes hallucinated a governing law, particularly when the agreement referenced another location, such as a venue. Llama showed high accuracy but likely benefited from the simpler evaluation set imposed by its smaller context window.
4. Effective Date
Model | Correct Extractions | Accuracy |
---|---|---|
Claude 3.5 Sonnet | 385/510 | 74.04% |
GPT-4o | 376/454 | 82.82% |
GPT-4o-mini | 355/510 | 68.27% |
Llama 405b Instruct | 104/159 | 65.41% |
Observations: GPT-4o-mini often confused effective dates with agreement dates and sometimes returned incorrect dates. GPT-4o also tended to hallucinate effective dates where there were none, though it was the most accurate model on its evaluation set. Claude 3.5 Sonnet was better at declining to guess and at interpreting dates referenced elsewhere in the document.
5. Expiration Date
Model | Correct Extractions | Accuracy |
---|---|---|
Claude 3.5 Sonnet | 399/510 | 76.73% |
GPT-4o | 340/454 | 69.7% |
GPT-4o-mini | 258/510 | 49.62% |
Llama 405b Instruct | 141/159 | 88.68% |
Observations: GPT-4o-mini tended to guess expiration dates when none were provided or when the term was perpetual. GPT-4o and Claude 3.5 Sonnet were less likely to guess in such situations. Once again, Llama's high accuracy was likely due in part to its smaller, simpler evaluation set.
Summary
Out of the box and with simple, direct prompts, Claude 3.5 Sonnet, GPT-4o, and GPT-4o-mini all show comparable performance on these extraction tasks, though there is substantial room for improvement before they reach the standards of production applications. In general, Claude 3.5 Sonnet was more cautious, avoiding guesses where information was missing or unclear. GPT-4o and GPT-4o-mini performed well but were more prone to making assumptions and hallucinating answers. Llama 405b Instruct could not be directly compared to the other three models because of its smaller context window, but it showed promise on the data points it was able to predict.
Future Work
More Models: Run the analysis on more models, including OpenAI’s newest reasoning model, o1.
Refine Prompts: Test with more sophisticated prompts to see whether performance improves (an illustrative prompt sketch follows this list).
Additional Datasets: Evaluate on additional data that reflects more real-world use cases.
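To make the prompt-refinement step concrete, below is one illustrative direction a refined expiration-date prompt could take. The specific rules are assumptions based on the failure modes noted above, and we have not yet evaluated this prompt.

```python
# Illustrative only: one direction a refined expiration-date prompt could take,
# addressing failure modes observed above (guessed dates, perpetual terms,
# renewal clauses). This is not a prompt we have evaluated yet.
EXPIRATION_DATE_PROMPT_V2 = """\
You are reviewing a legal agreement. Determine the expiration date of the initial term.

Rules:
- Answer with a single date in MM/DD/YYYY format and nothing else.
- If the term is perpetual, or no expiration date is stated, answer "None".
- If the expiration is defined relative to the effective date (e.g. "two years
  from the Effective Date"), compute it only if the effective date is stated
  explicitly; otherwise answer "None".
- Ignore renewal or extension periods; report only the initial term's end date.
- Do not guess.

Agreement:
{input}
"""
```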
Community Engagement
By sharing our findings, we aim to help the AI and legal communities determine which model best fits their applications. We also invite feedback and suggestions to improve this report, including recommendations for additional metrics, prompt strategies, or analysis methods. Please feel free to contact us at jack@ownlayer.com with any comments, questions, or suggestions.