Analysis of AI Models for Extracting Legal Agreement Data from the CUAD Dataset

Introduction

The integration of artificial intelligence into the legal industry is revolutionizing how legal documents are reviewed and managed. Rather than relying on specially trained models, practitioners are increasingly using foundation large language models for tasks such as extracting key information from contracts. This report compares the performance of several AI language models at extracting key information from agreements in the Contract Understanding Atticus Dataset (CUAD). CUAD is a collection of legal agreements and contracts from various industries, compiled from public EDGAR filings and annotated for over 40 types of legal clauses, such as governing law, expiration dates, liabilities, and the parties involved. It includes complex documents such as amendments and multilateral agreements. We found that while these models show promise in legal data extraction, their performance leaves room for improvement, and each model exhibits distinct behaviors that should inform how it is best used in specific applications.

Methodology

The initial models tested included the following:

  • Claude 3.5 Sonnet (2024-06-20), by Anthropic

  • GPT-4o, by OpenAI

  • GPT-4o-mini, by OpenAI

  • Llama 405b Instruct, by Meta

We evaluated their ability to extract the following fields (a minimal record structure for these fields is sketched after this list):

  • Parties involved

  • Document name

  • Governing law

  • Effective date

  • Expiration date
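
To make the evaluation concrete, each agreement can be reduced to a single record holding these five fields. The sketch below is our own illustrative structure, not part of CUAD or of any model's output format; the field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExtractedFields:
    """One record per agreement; None means the model answered "None" or gave no usable value."""
    parties: Optional[str]          # names of the contracting parties
    document_name: Optional[str]    # title of the agreement
    governing_law: Optional[str]    # jurisdiction whose law governs the agreement
    effective_date: Optional[str]   # MM/DD/YYYY string, as requested in the prompts
    expiration_date: Optional[str]  # MM/DD/YYYY string, as requested in the prompts
```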

Dataset

We used the CUAD dataset for this analysis, which includes a diverse range of legal agreements. Claude 3.5 Sonnet and GPT-4o-mini were evaluated on 510 agreements, but due to limitations in some models' ability to process long documents, GPT-4o was evaluated on 454 agreements and Llama 405b Instruct on 159 agreements. Because the longest documents were excluded, Llama's significantly smaller evaluation set likely consisted of shorter, simpler agreements, which may have inflated its results.
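
For readers who want to reproduce the setup, the sketch below shows one way to load the agreement texts. It assumes a local copy of the CUAD v1 release with one plain-text file per contract (the directory name is our assumption and may differ in your copy); it is an illustration, not the exact pipeline used for this report.

```python
from pathlib import Path

# Assumed layout: a local CUAD v1 download with one .txt file per agreement.
# The folder name below is an assumption; adjust it to match your copy.
CUAD_TXT_DIR = Path("CUAD_v1/full_contract_txt")

def load_agreements(txt_dir: Path = CUAD_TXT_DIR) -> dict[str, str]:
    """Return a mapping of file name -> full agreement text."""
    agreements = {}
    for path in sorted(txt_dir.glob("*.txt")):
        agreements[path.name] = path.read_text(encoding="utf-8", errors="ignore")
    return agreements

if __name__ == "__main__":
    docs = load_agreements()
    print(f"Loaded {len(docs)} agreements")
```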

Prompts

We used the same simple prompt for all models on each extraction task to maintain consistency. The prompt was intentionally basic to assess each model's baseline performance without advanced optimization.

Extraction Task: Identify the parties to the following agreement.
Prompt: Provide only the names of the parties. If you cannot identify the parties, return "None". Do not guess. {input}

Extraction Task: Determine the name of the following contract.
Prompt: Provide only the name of the agreement. If the contract has no name or you cannot determine the name, return "None". Do not guess. {input}

Extraction Task: Determine the location of the governing law from the following agreement.
Prompt: Provide only the location of the governing law. If there is no governing law, return "None". Do not guess. {input}

Extraction Task: Determine the effective date of the following contract.
Prompt: Provide only the effective date in the format [MONTH]/[DATE]/[YEAR]. If the contract has no effective date, return "None". Do not guess. {input}

Extraction Task: Determine the expiration date of the following contract.
Prompt: Provide only the expiration date in the format [MONTH]/[DATE]/[YEAR]. If you cannot determine it, return an empty string. Do not guess. {input}

These prompts lacked detailed instructions or context, which likely limited the models' performance, especially on complex legal documents. Using the same basic prompts for all models lets us compare their inherent capabilities directly and provides a practical point of reference for developers, who typically start with simple prompts in the early stages of a product. More carefully engineered prompts would likely improve performance on these tasks, and we intend to rerun these tests with refined prompts as a next step.
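
For concreteness, the sketch below shows how the parties prompt might be filled in and sent to one model, combining the task instruction and prompt text into a single user message. It uses the OpenAI Python client purely as an example; the other providers have different client libraries, and this is not the exact evaluation harness used for this report.

```python
from openai import OpenAI  # assumes the openai Python package (v1.x)

PARTIES_PROMPT = (
    "Identify the parties to the following agreement. "
    "Provide only the names of the parties. "
    'If you cannot identify the parties, return "None". Do not guess.\n\n'
    "{input}"
)

def extract_parties(agreement_text: str, model: str = "gpt-4o") -> str:
    """Fill the template with the agreement text and return the model's raw answer."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PARTIES_PROMPT.format(input=agreement_text)}],
        temperature=0,  # keep outputs as deterministic as possible for evaluation
    )
    return response.choices[0].message.content.strip()
```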

Results

Below is a summary of each model's performance in extracting the specified data points.
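
The report does not prescribe a single scoring rule, so as a point of reference the sketch below shows one simple way such accuracy figures can be computed: an extraction counts as correct when it matches the annotated value after light normalization of case and whitespace, and accuracy is correct extractions divided by agreements evaluated. Stricter or more lenient matching would shift the numbers.

```python
def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are not counted as errors."""
    return " ".join(value.lower().split())

def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of predictions that match the annotation after normalization."""
    correct = sum(normalize(p) == normalize(g) for p, g in zip(predictions, gold))
    return correct / len(gold)

# Example: accuracy(["State of Delaware"], ["state of  Delaware"]) == 1.0
```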

1. Parties Involved

Model                 Correct Extractions   Accuracy
Claude 3.5 Sonnet     490/510               94.23%
GPT-4o                433/454               95.37%
GPT-4o-mini           495/510               95.19%
Llama 405b Instruct   152/159               95.60%

Observations: All models struggled with agreements involving multiple parties, possibly because the prompt did not specify how to handle such cases. The models also tended to identify named entities, such as person and company names, as parties even when those entities appeared in the agreement for another reason and were not parties to it.

2. Document Name

Model                 Correct Extractions   Accuracy
Claude 3.5 Sonnet     487/510               93.65%
GPT-4o                431/454               94.93%
GPT-4o-mini           502/510               96.54%
Llama 405b Instruct   143/159               89.94%

Observations: The models performed similarly well on this task, though all of the models had trouble identifying names of amendments. We note that this dataset contained many agreements that were confusingly formatted or named, or were attached as exhibits to an SEC filing. The models would likely have performed better on a cleaner dataset.

3. Governing Law

Model                 Correct Extractions   Accuracy
Claude 3.5 Sonnet     376/510               72.31%
GPT-4o                444/454               97.80%
GPT-4o-mini           446/510               85.77%
Llama 405b Instruct   151/159               94.97%

Observations: GPT-4o performed admirably on this task, solidly beating each of the other models. Claude 3.5 Sonnet and GPT-4o-mini sometimes hallucinated governing laws, particularly when the agreement referenced another location for a different purpose, such as a venue clause. Llama showed high accuracy but likely benefited from a simpler set of agreements due to its smaller context window.

4. Effective Date

Model                 Correct Extractions   Accuracy
Claude 3.5 Sonnet     385/510               74.04%
GPT-4o                376/454               82.82%
GPT-4o-mini           355/510               68.27%
Llama 405b Instruct   104/159               65.41%

Observations: GPT-4o-mini often confused effective dates with agreement dates and sometimes returned incorrect dates. GPT-4o also tended to hallucinate effective dates where there were none, though it was the most accurate on its dataset. Claude 3.5 Sonnet was better at declining to guess and at interpreting effective dates referenced elsewhere in the document.

5. Expiration Date

Model                 Correct Extractions   Accuracy
Claude 3.5 Sonnet     399/510               76.73%
GPT-4o                340/454               69.7%
GPT-4o-mini           258/510               49.62%
Llama 405b Instruct   141/159               88.68%

Observations: GPT-4o-mini tended to guess expiration dates when none were provided or when the term was perpetual. GPT-4o and Claude 3.5 Sonnet were less likely to guess in those situations. Once again, Llama showed high accuracy, likely because it was evaluated on shorter agreements.
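
Because both date tasks request answers in a fixed [MONTH]/[DATE]/[YEAR] format, comparing outputs fairly requires treating, say, "1/5/2020" and "01/05/2020" as the same answer and handling "None" or empty responses. The sketch below is one illustrative way to do this; it reflects our own convention rather than the report's exact scoring rule.

```python
from datetime import date, datetime
from typing import Optional

def parse_model_date(answer: str) -> Optional[date]:
    """Parse a MM/DD/YYYY answer; return None for "None", empty strings, or unparseable output."""
    cleaned = answer.strip().strip('"')
    if not cleaned or cleaned.lower() == "none":
        return None
    try:
        return datetime.strptime(cleaned, "%m/%d/%Y").date()
    except ValueError:
        return None

def dates_match(predicted: str, gold: str) -> bool:
    """True when both sides parse to the same calendar date, or both indicate no date."""
    return parse_model_date(predicted) == parse_model_date(gold)
```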

Summary

Out of the box and with simple, direct prompts, Claude 3.5 Sonnet, GPT-4o, and GPT-4o-mini all show comparable performance on these extraction tasks, though there is considerable room for improvement before that performance meets the standards of production applications. In general, Claude 3.5 Sonnet showed more caution, avoiding guesses where information was missing or unclear. GPT-4o and GPT-4o-mini performed well, though they were more prone to making assumptions and hallucinating answers. Llama 405b Instruct could not be directly compared to the other three models given its smaller context window, but it showed promise on the agreements it was able to process.

Future Work

  • More Models: Run the analysis on more models, including OpenAI’s newest reasoning model, o1.

  • Refine Prompts: Test with more sophisticated prompts to see if performance improves.

  • Additional Datasets: Evaluate on additional datasets that better reflect real-world use cases.

Community Engagement

By sharing our findings, we aim to help the AI and legal communities determine which model best fits their applications. We also invite feedback and suggestions to improve this report, including recommendations for additional metrics, prompt strategies, or analysis methods. Please feel free to contact us at jack@ownlayer.com with any comments, questions, or suggestions.



From Data to Decision, We Help You Deploy AI with Confidence

© Ownlayer, Inc 2024. All rights reserved.
