Case Study: Enhancing Legal Question Answering with Human Feedback

Anote · 8 min read · Jul 14, 2023

Introduction

Here, we delve into a scenario where a legal expert, Mia, seeks to extract deep insights from a large collection of legal case studies.

Mia had 500 documents, each 15 to 25 pages on average, and she wanted to answer the following questions for each document:

  • What is the date of the appellate case?
  • What is the title of the document?
  • What is the jurisdiction of the case?
  • Which court handled the appeal?
  • Which court handled the original trial and sentencing?
  • What was the date of the original sentencing?
  • Was the original judgment affirmed?
  • What is the gender of the judge handling the appeal?
  • What is the name of the judge handling the appeal?
  • What is the gender of the original judge?
  • What is the gender of the defendant?
  • What is the age of the defendant?
  • Does the defendant have a mental disorder? If yes, provide the first disorder.
  • Does the defendant have a second mental disorder? If yes, provide the second disorder.
  • Are there more than two disorders?
  • Does the defendant have a criminal history?
  • What is the most severe charge for which the defendant was found guilty?
  • What is the second charge or count for which the defendant was found guilty?
  • What is the third charge or count for which the defendant was found guilty?
  • Are there more than three charges?
  • Did the defendant plead guilty or no contest?
  • Was there a jury trial?
  • Is the death penalty involved in the case?
  • Will there be imprisonment for the defendant?
  • If there is imprisonment, what is the length in months?
  • Is there any institutionalization in a hospital or civil commitment?
  • Will there be post-release supervision?
  • If there is post-release supervision, what is the length in months?
  • Is probation given instead of imprisonment?
  • If there is probation, what is the length in months?
  • Is there a rehab or treatment order?
  • Is there a fine or restitution involved?

Initially, she went through 120 of these documents and typed the results into an Excel spreadsheet, a tedious process that took over three months of work. She wanted to see whether AI could help automate this process, so she looked into LLMs like ChatGPT to generate initial answers to these questions. However, due to its general-purpose nature, the model struggled to provide accurate responses in the specific legal domain. Here is an example of the output from ChatGPT and other general-purpose LLMs:

While it would be ideal to get both the model response and a citation (source document, source content) for each question we ask, the zero-shot ChatGPT LLM initially does not know many of the answers in the legal domain, so the model answers are not very helpful. As a result, Mia was still going through these documents manually, in a spreadsheet. Here is an example of what the first row of the spreadsheet looks like:

Mia was looking for a better way to use LLMs to help here, and came to Anote for assistance.

Problem Statement

The goal is to create an effective, AI-enabled solution that not only provides accurate answers to complex legal questions but also continually improves its performance with human feedback, thereby unlocking crucial data that is currently out of reach, conserving resources, and reducing costs. With high-quality model results, we can streamline the manual review and extraction of information from the 120 long legal documents, reducing the time required and making the process far less tedious.

Proposed Solution

Data Split and Initial Evaluation

To ensure the effectiveness of our legal question answering system, we divided our dataset of 120 legal documents into training and testing data: 100 documents for training and 20 reserved for testing. These 20 testing documents served as our evaluation set.
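
A minimal sketch of this split, assuming the documents are indexed in a pandas DataFrame (the file name and column names here are illustrative, not from the original pipeline):

import pandas as pd

# Hypothetical index file: one row per legal document, including a 'DocTitle' column.
docs_df = pd.read_csv("legal_documents.csv")

numTestDocs = 20
numTrainDocs = 100

# Shuffle once so the split is not biased by document order, then split.
shuffled = docs_df.sample(frac=1, random_state=42).reset_index(drop=True)
test_df = shuffled.iloc[:numTestDocs].reset_index(drop=True)
train_df = shuffled.iloc[numTestDocs:numTestDocs + numTrainDocs].reset_index(drop=True)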

Initial Predictions:

We initiated the process by using ChatGPT’s zero-shot capabilities to generate preliminary answers on the testing data for the following six questions (a sketch of this querying step appears after the list):

  • What was the gender of the appeal judge? Please answer ‘m’ for male and ‘f’ for female.
  • What was the gender of the defendant? Please answer ‘m’ for male and ‘f’ for female.
  • What was the first most severe charge for which the defendant was found guilty?
  • What was the jurisdiction? Please provide the initials of the jurisdiction.
  • Did the defendant have any history of criminal activity? Please answer ‘yes’ or ‘no’.
  • What was the second charge for which the defendant was found guilty?
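
As noted above, here is a minimal sketch of that zero-shot querying step, assuming the OpenAI chat completions API; the helper name and model choice are illustrative rather than the study’s exact code:

import openai

openai.api_key = API_KEY  # assumed to be defined elsewhere

def ask_zero_shot(doc_title, question):
    # Zero-shot: no feedback or examples, just the document title and the question.
    query = f"In the court case titled {doc_title}, {question}"
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": query}],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"].strip()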

Human Feedback Integration:

Expert legal practitioners meticulously reviewed the model’s initial answers on the first training document and provided feedback on any inaccuracies or deficiencies they identified. Each piece of feedback was recorded as a question, the model’s prediction, and the expert’s actual response, and was used to improve the model’s subsequent performance:

feedback_data = [
    {
        "question": "What was the Appeal Judge's gender? Please answer 'm' for male and 'f' for female.",
        "prediction": "m",
        "actual_response": "f"
    },
    {
        "question": "What was the gender of the defendant? Please answer 'm' for male and 'f' for female.",
        "prediction": "f",
        "actual_response": "m"
    },
    {
        "question": "What was the first most severe charge for which the defendant was found guilty?",
        "prediction": "Assault",
        "actual_response": "Robbery"
    },
    {
        "question": "What was the Jurisdiction? Please provide the initials of the Jurisdiction.",
        "prediction": "NY",
        "actual_response": "CA"
    },
    {
        "question": "Did the defendant have any history of criminal activity? Please answer 'yes' or 'no'.",
        "prediction": "yes",
        "actual_response": "no"
    },
    {
        "question": "What was the second charge for which the defendant was found guilty?",
        "prediction": "Burglary",
        "actual_response": "Drug possession"
    }
]

Fine-tuning:

With the feedback obtained on the first document, we incorporated it into the training data and fine-tuned the model using feedback-inclusive prompts. This fine-tuning process aimed to enhance the model’s understanding and accuracy on the specific legal questions within the first document.
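
One simple way to make prompts feedback-inclusive, consistent with how the loop below appends feedback_data to each query, is to prepend the recorded corrections as few-shot context. This is an illustrative sketch, not necessarily the exact prompt format used:

def build_feedback_prompt(feedback_data, doc_title, question):
    # Turn each recorded correction into a worked example the model can imitate.
    examples = "\n".join(
        f"Q: {fb['question']}\nCorrect answer: {fb['actual_response']}"
        for fb in feedback_data
    )
    return (
        f"Previously corrected answers:\n{examples}\n\n"
        f"In the court case titled {doc_title}, {question}"
    )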

Repetition for Subsequent Documents:

We repeated the above process for each subsequent document in the training dataset. This iterative approach allowed us to accumulate feedback and refine the model with each document review. By integrating human feedback and iteratively fine-tuning the model, we observed a progressive improvement in the accuracy and quality of the answers provided by the system.

feedback_data = []
# For each document (row) in the training data
for k in range(numTrainDocs):
    TRAIN_MODEL_PREDICTIONS = []
    questions_list = []
    TRAIN_ACTUAL_RESPONSES = []
    # Note: the original listing read training rows from test_df; train_df is assumed here.
    doc_title = train_df.loc[k, 'DocTitle']
    # For each question we want to answer
    for j in range(len(QUESTIONS)):
        # Ask the query to the model with the accumulated human feedback appended.
        # Initially feedback_data is empty, but over time we incorporate more and more feedback.
        query = f"In the court case titled {doc_title}, " + str(QUESTIONS[j]) + str(feedback_data)
        questions_list.append(query)
        # Get the model prediction for each query
        res = qa(query)
        model_prediction = res['result']
        TRAIN_MODEL_PREDICTIONS.append(model_prediction)
        # Get the expert's actual response for each query
        actual_response = train_df.loc[k, QUESTIONS_COLUMNS[j]]
        TRAIN_ACTUAL_RESPONSES.append(actual_response)

    # Incorporate feedback gathered on this training document
    feedback_data = incorporate_feedback(feedback_data, QUESTIONS, TRAIN_MODEL_PREDICTIONS, TRAIN_ACTUAL_RESPONSES)
    # Evaluate results on the testing data
    results_df = get_answers_and_responses_for_test_data(test_df, numTestDocs, qa, feedback_data)
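
The incorporate_feedback helper is not shown in the original listing. A plausible sketch, under the assumption that it simply appends a correction record for every prediction that disagrees with the expert’s answer, is:

def incorporate_feedback(feedback_data, questions, predictions, actual_responses):
    # Append a correction record for each prediction the expert marked as wrong.
    for question, prediction, actual in zip(questions, predictions, actual_responses):
        if str(prediction).strip().lower() != str(actual).strip().lower():
            feedback_data.append({
                "question": question,
                "prediction": prediction,
                "actual_response": actual,
            })
    return feedback_data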

Results

To evaluate our approach, we employed a pre-defined evaluation set comprising 25 groups, with approximately six questions in each group. For each training document for which we added feedback, we evaluated the updated model on the 25 testing dataset documents (a sketch of this accuracy check follows).
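
A minimal sketch of that accuracy check, assuming results_df pairs each model prediction with the expert’s recorded answer (the column names here are illustrative):

def evaluate_accuracy(results_df):
    # Fraction of test answers that exactly match the expert's response,
    # ignoring case and surrounding whitespace.
    correct = (
        results_df["model_prediction"].astype(str).str.strip().str.lower()
        == results_df["actual_response"].astype(str).str.strip().str.lower()
    )
    return correct.mean()

# Run after each feedback iteration to track progress:
# accuracy = evaluate_accuracy(results_df)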

Iteration 1:

Iteration 2:

Iteration 3:

Iteration 4:

Through this process, the model gained capabilities it initially lacked: the ability to accurately extract insights from legal documents, marking a significant evolution in its learning process.

Tutorial Video:

Here is a tutorial video explaining how the Anote product is designed to improve generative AI models with the input of subject matter experts.

https://www.loom.com/share/d6cb3dd6daa34f24ac8fa231fbeaa6be

Private vs. Public LLMs

We are able to obtain these results with both private, on-premise LLMs like Llama and GPT4All, and public LLMs like OpenAI’s. This is important when you have sensitive data that you need to keep secure but would still like to leverage LLMs and generative AI to improve your workflow.

# Assumes langchain's LLM wrappers; model_type, model_path, model_n_ctx,
# model_n_batch, and API_KEY are configured elsewhere.
from langchain.llms import LlamaCpp, GPT4All, OpenAI

def choose_llm(isPrivate):
    if isPrivate:
        # Handle the private (on-premise) option
        print("You have selected PrivateGPT.")
        # Prepare the local LLM
        if model_type == "LlamaCpp":
            llm = LlamaCpp(model_path=model_path,
                           n_ctx=model_n_ctx,
                           n_batch=model_n_batch,
                           verbose=False)
        elif model_type == "GPT4All":
            llm = GPT4All(model=model_path,
                          backend='gptj',
                          n_batch=model_n_batch,
                          verbose=False)
        else:
            raise Exception(f"Unsupported model_type: {model_type}")
    else:
        # Handle the semi-private (hosted) option
        print("You have selected Semi-private.")
        llm = OpenAI(temperature=0, openai_api_key=API_KEY)
    return llm
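
For example, assuming the configuration variables above are set (as in a privateGPT-style setup), selecting the backend looks like:

llm = choose_llm(isPrivate=True)     # local LlamaCpp or GPT4All model
# llm = choose_llm(isPrivate=False)  # hosted OpenAI model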

Impact

The impact of our solution was remarkable, significantly reducing the time and effort required for manual review and extraction of information from the 120 lengthy legal documents. The automated approach saved legal practitioners from the painstaking task of manually scrutinizing each document and entering answers into a spreadsheet. But the impact extends beyond time savings: it changes the way legal data is processed, providing access to crucial insights that were previously buried in mountains of paperwork and remained virtually untapped due to the prohibitive time and resource requirements of manual review. By leveraging human feedback to enhance the model’s understanding of the legal domain, we achieved higher accuracy in legal question answering, resulting in increased efficiency and productivity.
