Analysis of AI Models for Extracting Legal Agreement Data from the CUAD Dataset

Introduction

The integration of artificial intelligence into the legal industry is revolutionizing how legal documents are reviewed and managed. Rather than relying on specially trained models, practitioners are increasingly using foundation large language models for tasks such as extracting key information from contracts. This report compares the performance of several AI language models at extracting key information from agreements in the Contract Understanding Atticus Dataset (CUAD). CUAD is a collection of legal agreements and contracts from various industries, compiled from public EDGAR filings and annotated for over 40 types of legal clauses, such as governing law, expiration dates, liabilities, and the parties involved. It includes complex documents such as amendments and multilateral agreements. We found that while these models show promise in legal data extraction, their performance leaves room for improvement, and each model exhibits distinct behaviors that should inform how it is best used in specific applications.

Methodology

The initial models tested included the following:

  • Claude 3.5 Sonnet (2024-06-20), by Anthropic

  • GPT-4o, by OpenAI

  • GPT-4o-mini, by OpenAI

  • Llama 405b Instruct, by Meta

We evaluated their ability to extract the following fields (a minimal record structure for these fields is sketched after this list):

  • Parties involved

  • Document name

  • Governing law

  • Effective date

  • Expiration date
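
To make the evaluation concrete, each agreement can be reduced to a single record holding these five fields. The sketch below is our own illustrative structure, not part of CUAD or of any model's output format; the field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExtractedFields:
    """One record per agreement; None means the model answered "None" or gave no usable value."""
    parties: Optional[str]          # names of the contracting parties
    document_name: Optional[str]    # title of the agreement
    governing_law: Optional[str]    # jurisdiction whose law governs the agreement
    effective_date: Optional[str]   # MM/DD/YYYY string, as requested in the prompts
    expiration_date: Optional[str]  # MM/DD/YYYY string, as requested in the prompts
```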

Dataset

We used the CUAD dataset for this analysis, which includes a diverse range of legal agreements. Claude 3.5 Sonnet and GPT-4o-mini were evaluated on 510 agreements, but due to limitations in some models' ability to process long documents, GPT-4o was evaluated on 454 agreements and Llama 405b Instruct on 159 agreements. Because the longest documents were excluded, Llama's significantly smaller evaluation set likely consisted of shorter, simpler agreements, which may have inflated its results.
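
For readers who want to reproduce the setup, the sketch below shows one way to load the agreement texts. It assumes a local copy of the CUAD v1 release with one plain-text file per contract (the directory name is our assumption and may differ in your copy); it is an illustration, not the exact pipeline used for this report.

```python
from pathlib import Path

# Assumed layout: a local CUAD v1 download with one .txt file per agreement.
# The folder name below is an assumption; adjust it to match your copy.
CUAD_TXT_DIR = Path("CUAD_v1/full_contract_txt")

def load_agreements(txt_dir: Path = CUAD_TXT_DIR) -> dict[str, str]:
    """Return a mapping of file name -> full agreement text."""
    agreements = {}
    for path in sorted(txt_dir.glob("*.txt")):
        agreements[path.name] = path.read_text(encoding="utf-8", errors="ignore")
    return agreements

if __name__ == "__main__":
    docs = load_agreements()
    print(f"Loaded {len(docs)} agreements")
```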

Prompts

We used the same simple prompt for all models on each extraction task to maintain consistency. The prompt was intentionally basic to assess each model's baseline performance without advanced optimization.

Extraction Task: Identify the parties to the following agreement.
Prompt: Provide only the names of the parties. If you cannot identify the parties, return "None". Do not guess. {input}

Extraction Task: Determine the name of the following contract.
Prompt: Provide only the name of the agreement. If the contract has no name or you cannot determine the name, return "None". Do not guess. {input}

Extraction Task: Determine the location of the governing law from the following agreement.
Prompt: Provide only the location of the governing law. If there is no governing law, return "None". Do not guess. {input}

Extraction Task: Determine the effective date of the following contract.
Prompt: Provide only the effective date in the format [MONTH]/[DATE]/[YEAR]. If the contract has no effective date, return "None". Do not guess. {input}

Extraction Task: Determine the expiration date of the following contract.
Prompt: Provide only the expiration date in the format [MONTH]/[DATE]/[YEAR]. If you cannot determine it, return an empty string. Do not guess. {input}

These prompts lacked detailed instructions or context, which likely limited the models' performance, especially on complex legal documents. Using the same basic prompts for all models lets us compare their inherent capabilities directly and provides a practical point of reference for developers, who typically start with simple prompts in the early stages of a product. More carefully engineered prompts would likely improve performance on these tasks, and we intend to rerun these tests with refined prompts as a next step.
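
For concreteness, the sketch below shows how the parties prompt might be filled in and sent to one model, combining the task instruction and prompt text into a single user message. It uses the OpenAI Python client purely as an example; the other providers have different client libraries, and this is not the exact evaluation harness used for this report.

```python
from openai import OpenAI  # assumes the openai Python package (v1.x)

PARTIES_PROMPT = (
    "Identify the parties to the following agreement. "
    "Provide only the names of the parties. "
    'If you cannot identify the parties, return "None". Do not guess.\n\n'
    "{input}"
)

def extract_parties(agreement_text: str, model: str = "gpt-4o") -> str:
    """Fill the template with the agreement text and return the model's raw answer."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PARTIES_PROMPT.format(input=agreement_text)}],
        temperature=0,  # keep outputs as deterministic as possible for evaluation
    )
    return response.choices[0].message.content.strip()
```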

Results

Below is a summary of each model's performance in extracting the specified data points.
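
The report does not prescribe a single scoring rule, so as a point of reference the sketch below shows one simple way such accuracy figures can be computed: an extraction counts as correct when it matches the annotated value after light normalization of case and whitespace, and accuracy is correct extractions divided by agreements evaluated. Stricter or more lenient matching would shift the numbers.

```python
def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are not counted as errors."""
    return " ".join(value.lower().split())

def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of predictions that match the annotation after normalization."""
    correct = sum(normalize(p) == normalize(g) for p, g in zip(predictions, gold))
    return correct / len(gold)

# Example: accuracy(["State of Delaware"], ["state of  Delaware"]) == 1.0
```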

1. Parties Involved

Model                 Correct Extractions   Accuracy
Claude 3.5 Sonnet     490/510               94.23%
GPT-4o                433/454               95.37%
GPT-4o-mini           495/510               95.19%
Llama 405b Instruct   152/159               95.60%

Observations: All models struggled with agreements involving multiple parties, possibly because the prompt did not specify how to handle such cases. The models also tended to identify named entities, such as person and company names, as parties even when those entities appeared in the agreement for another reason and were not parties to it.

2. Document Name

Model                 Correct Extractions   Accuracy
Claude 3.5 Sonnet     487/510               93.65%
GPT-4o                431/454               94.93%
GPT-4o-mini           502/510               96.54%
Llama 405b Instruct   143/159               89.94%

Observations: The models performed similarly well on this task, though all of the models had trouble identifying names of amendments. We note that this dataset contained many agreements that were confusingly formatted or named, or were attached as exhibits to an SEC filing. The models would likely have performed better on a cleaner dataset.

3. Governing Law

Model                 Correct Extractions   Accuracy
Claude 3.5 Sonnet     376/510               72.31%
GPT-4o                444/454               97.80%
GPT-4o-mini           446/510               85.77%
Llama 405b Instruct   151/159               94.97%

Observations: GPT-4o performed admirably on this task, solidly beating each of the other models. Claude 3.5 Sonnet and GPT-4o-mini sometimes hallucinated governing laws, particularly when the agreement referenced another location for a different purpose, such as a venue clause. Llama showed high accuracy but likely benefited from a simpler set of agreements due to its smaller context window.

4. Effective Date

Model                 Correct Extractions   Accuracy
Claude 3.5 Sonnet     385/510               74.04%
GPT-4o                376/454               82.82%
GPT-4o-mini           355/510               68.27%
Llama 405b Instruct   104/159               65.41%

Observations: GPT-4o-mini often confused effective dates with agreement dates and sometimes returned incorrect dates. GPT-4o also tended to hallucinate effective dates where there were none, though it was the most accurate on its dataset. Claude 3.5 Sonnet was better at declining to guess and at interpreting effective dates referenced elsewhere in the document.

5. Expiration Date

Model                 Correct Extractions   Accuracy
Claude 3.5 Sonnet     399/510               76.73%
GPT-4o                340/454               69.7%
GPT-4o-mini           258/510               49.62%
Llama 405b Instruct   141/159               88.68%

Observations: GPT-4o-mini tended to guess expiration dates when none were provided or when the term was perpetual. GPT-4o and Claude 3.5 Sonnet were less likely to guess in those situations. Once again, Llama showed high accuracy, likely because it was evaluated on shorter agreements.
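
Because both date tasks request answers in a fixed [MONTH]/[DATE]/[YEAR] format, comparing outputs fairly requires treating, say, "1/5/2020" and "01/05/2020" as the same answer and handling "None" or empty responses. The sketch below is one illustrative way to do this; it reflects our own convention rather than the report's exact scoring rule.

```python
from datetime import date, datetime
from typing import Optional

def parse_model_date(answer: str) -> Optional[date]:
    """Parse a MM/DD/YYYY answer; return None for "None", empty strings, or unparseable output."""
    cleaned = answer.strip().strip('"')
    if not cleaned or cleaned.lower() == "none":
        return None
    try:
        return datetime.strptime(cleaned, "%m/%d/%Y").date()
    except ValueError:
        return None

def dates_match(predicted: str, gold: str) -> bool:
    """True when both sides parse to the same calendar date, or both indicate no date."""
    return parse_model_date(predicted) == parse_model_date(gold)
```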

Summary

Out of the box and with simple, direct prompts, Claude 3.5 Sonnet, GPT-4o, and GPT-4o-mini all show comparable performance on these extraction tasks, though there is considerable room for improvement before that performance meets the standards of production applications. In general, Claude 3.5 Sonnet showed more caution, avoiding guesses where information was missing or unclear. GPT-4o and GPT-4o-mini performed well, though they were more prone to making assumptions and hallucinating answers. Llama 405b Instruct could not be directly compared to the other three models given its smaller context window, but it showed promise on the agreements it was able to process.

Future Work

  • More Models: Run the analysis on more models, including OpenAI’s newest reasoning model, o1.

  • Refine Prompts: Test with more sophisticated prompts to see if performance improves.

  • Additional Datasets: Evaluate on additional datasets that better reflect real-world use cases.

Community Engagement

By sharing our findings, we aim to help the AI and legal communities determine which model best fits their applications. We also invite feedback and suggestions to improve this report, including recommendations for additional metrics, prompt strategies, or analysis methods. Please feel free to contact us at jack@ownlayer.com with any comments, questions, or suggestions.



From Data to Decision, We Help You Deploy AI with Confidence

© Ownlayer, Inc 2024. All rights reserved.
