Case Study — Enhancing Categorical Classification of Text Data using Generative AI and Human Feedback

Anote
4 min read · Oct 30, 2023


Introduction

This case study examines a prominent player in the competitor intelligence field that is facing a challenge categorizing the companies in its dataset. With a collection of 1100 rows of textual information about diverse companies, the company aims to efficiently categorize and tag each company, including subcategories, using predefined categories and tags. The study evaluates the effectiveness of human input and labeling in the categorization process, specifically comparing it against the limitations of traditional zero-shot approaches.

Dataset Overview

The text dataset comprised 1100 records containing information such as the URL of the company’s site, the site’s meta title, the site’s meta description, and manually labeled categories (Tier 1 and Tier 2) that each company’s product or service corresponded to.

These categories are explained in more detail in the next section. The goal of this study was to use few-shot learning techniques to accurately predict the Tier 2 categories of the set. The first 100 rows of the dataset were used as a test set to evaluate the zero-shot and few-shot models, and the next 50 rows were used as a ‘training set’ to feed to the few-shot models, as sketched below.
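
As a concrete illustration, the split might look like the following (a minimal sketch, assuming the dataset is loaded with pandas; the file and column names are hypothetical, not the study’s actual schema):

```python
import pandas as pd

# Load the 1100-row dataset (hypothetical file and column names).
df = pd.read_csv("companies.csv")  # columns: url, meta_title, meta_description, tier1, tier2

# First 100 rows: test set for evaluating the zero-shot and few-shot models.
test_df = df.iloc[:100]

# Next 50 rows: pool of examples to be labeled for the few-shot models.
train_df = df.iloc[100:150]
```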

Parent Categories and Tags

- Parent Categories (Tier 1): A list of predefined categories used for classification.

- Tags (Tier 2): A set of labels assigned to companies, including subcategories.

The parent categories and subcategories used to manually label the dataset were defined in a taxonomy file, with each row of the file consisting of a parent category column and its corresponding subcategories.
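
For illustration, the taxonomy can be represented as a mapping from each parent category to its subcategories (a minimal sketch; the category and subcategory names below are hypothetical placeholders, not the study’s actual taxonomy):

```python
# Hypothetical taxonomy: each Tier 1 parent category maps to its Tier 2 subcategories.
taxonomy = {
    "Marketing": ["SEO", "Email Marketing", "Social Media"],
    "Finance": ["Payments", "Accounting", "Lending"],
    "Human Resources": ["Recruiting", "Payroll", "Benefits"],
}

# Flatten the taxonomy into prompt-ready text, one parent category per line.
taxonomy_text = "\n".join(
    f"{parent}: {', '.join(subs)}" for parent, subs in taxonomy.items()
)
```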

Hypothesis

The hypothesis is that incorporating human input and manually labeling a subset of the dataset will yield better results than traditional zero-shot approaches. In other words, if we provide a few-shot model with a certain number of examples of input data with correctly labeled parent categories and subcategories (Tier 1 and Tier 2 labels), the model will be able to predict the Tier 1 and Tier 2 labels of unseen data more accurately.

Current Results from GPT-3 Zero-Shot Predictions:

- Tier 1 Category Prediction Accuracy: 85%

- Tier 2 Category Prediction Accuracy: 75%

Experimental Design

To test the hypothesis, the dataset of 1100 rows was split into training and test data. The test dataset consisted of 100 rows, which were manually labeled with categories and subcategories. The remaining data formed the training set. The following models were evaluated:

1. Zero-shot model using Claude:

In this setup, we used Anthropic’s Claude API along with prompt engineering techniques to predict the parent category and subcategory of a record containing a site’s meta title and description. The prompt to Claude included a description of the list of categories along with their subcategories, followed by a request for Claude to extract a category and a subcategory for each row of the test data from its meta title and description.
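
A minimal sketch of this zero-shot setup with the Anthropic Python SDK follows; the model name, prompt wording, and the taxonomy_text and test_df variables (from the sketches above) are assumptions, not the study’s exact prompt:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def zero_shot_prompt(meta_title: str, meta_description: str) -> str:
    # taxonomy_text is the flattened category/subcategory listing sketched earlier.
    return (
        "Here are the parent categories and their subcategories:\n"
        f"{taxonomy_text}\n\n"
        "Given the following company, reply with exactly one parent category and "
        "one subcategory from the list, formatted as 'Category | Subcategory'.\n"
        f"Meta title: {meta_title}\n"
        f"Meta description: {meta_description}"
    )

row = test_df.iloc[0]
response = client.messages.create(
    model="claude-3-haiku-20240307",  # assumed model name; the study used an earlier Claude version
    max_tokens=50,
    messages=[{"role": "user", "content": zero_shot_prompt(row["meta_title"], row["meta_description"])}],
)
prediction = response.content[0].text
```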

2. Few-shot model using Claude:

As with the zero-shot model, information about the different parent categories and subcategories was provided in the prompt to Claude. In addition, the prompt included examples of records with correctly labeled Tier 1 and Tier 2 categories. The entire prompt was structured as follows: information on the categories, then the labeled examples, then a request to categorize the given meta title and description.
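
A hedged sketch of how such a few-shot prompt could be assembled, reusing the pieces above (the example formatting and column names are assumptions; the study’s exact prompt was not published):

```python
def few_shot_prompt(labeled_rows, meta_title: str, meta_description: str) -> str:
    # Render each manually labeled training row as a worked example.
    examples = "\n\n".join(
        f"Meta title: {r['meta_title']}\n"
        f"Meta description: {r['meta_description']}\n"
        f"Answer: {r['tier1']} | {r['tier2']}"
        for _, r in labeled_rows.iterrows()
    )
    return (
        "Here are the parent categories and their subcategories:\n"
        f"{taxonomy_text}\n\n"
        "Here are correctly labeled examples:\n"
        f"{examples}\n\n"
        "Now categorize the following company as 'Category | Subcategory'.\n"
        f"Meta title: {meta_title}\n"
        f"Meta description: {meta_description}"
    )
```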

Training and Evaluation

1. The zero-shot models were evaluated first, without access to any labeled training examples.

2. The evaluation dataset, comprising 100 rows, was used to measure the accuracy of the zero-shot models.

3. Next, 25 rows from the training set (rows 101–125 of the original 1100-row dataset) were labeled and added as training data for the models. The accuracy of the models was re-evaluated using the evaluation dataset.

4. Subsequently, an additional 25 rows from the original 1100-row dataset were labeled (50 labeled rows in total) and included as training data to assess the models’ accuracy once more; a sketch of the accuracy computation follows this list.
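
A minimal sketch of that accuracy computation (predict_fn stands for any of the prompt-based predictors sketched above; the column names and the ‘Category | Subcategory’ output format are carried-over assumptions):

```python
def evaluate_tier2_accuracy(predict_fn, test_df) -> float:
    """Fraction of test rows whose predicted Tier 2 label matches the manual label."""
    correct = 0
    for _, row in test_df.iterrows():
        answer = predict_fn(row["meta_title"], row["meta_description"])
        # Expect answers like "Marketing | SEO"; keep the part after the pipe.
        predicted_tier2 = answer.split("|")[-1].strip()
        if predicted_tier2 == row["tier2"]:
            correct += 1
    return correct / len(test_df)
```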

Results and Analysis

The effects of incorporating human input and adding labeled training data on model accuracy were analyzed using the evaluation dataset. The accuracy of the models was compared before and after labeled training data was added to determine the impact. We noticed that the Tier 2 category predictions were significantly affected by human input:

Accuracy of Tier 2 Category Prediction

- Zero-shot model using Claude: 75%

- Few-shot model using Claude with 25 labeled examples total: 81%

- Few-shot model using Claude with 50 labeled examples total: 87%

From the results, it is apparent that the model’s performance on Tier 2 prediction improved substantially, from 75% to 87%. This also surpasses the 75% accuracy achieved by the GPT-3 zero-shot predictions noted in the hypothesis section of this study.

Next Steps

For future steps, we recommend incorporating human feedback into a SetFit model. Unlike general-purpose models such as Claude or GPT-3, SetFit is designed specifically for few-shot classification tasks. By using SetFit, we can achieve not only accurate single-label predictions but also expand the capabilities to multi-label outputs (multiple tags per parent category). This allows for a more comprehensive treatment of complex data, enabling the model to assign multiple labels within a single category. SetFit also does not depend on prompt engineering techniques; it relies on fine-tuning instead.

Furthermore, to create a more robust few-shot prediction model, we could use not just 50 labeled examples in total, but a certain number of labeled examples per parent category or subcategory in our training set. This would reduce the risk of an unbalanced dataset in which some categories have more examples than others.
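
As a rough illustration of the SetFit direction, here is a minimal sketch using the setfit library’s SetFitTrainer API (the base model, column names, and the train_df/test_df variables from the earlier sketches are assumptions):

```python
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer

# Build a small training set from the 50 labeled rows, with Tier 2 labels as class ids.
labels = sorted(train_df["tier2"].unique())
label_to_id = {label: i for i, label in enumerate(labels)}
train_ds = Dataset.from_dict({
    "text": (train_df["meta_title"] + " " + train_df["meta_description"]).tolist(),
    "label": [label_to_id[l] for l in train_df["tier2"]],
})

# Fine-tune a lightweight sentence-transformer backbone on the few labeled examples.
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
trainer = SetFitTrainer(model=model, train_dataset=train_ds)
trainer.train()

# Predict Tier 2 class ids for the test set.
preds = model.predict((test_df["meta_title"] + " " + test_df["meta_description"]).tolist())
```

For the multi-label case mentioned above, SetFitModel.from_pretrained also accepts a multi_target_strategy argument (for example "one-vs-rest") so a record can receive multiple tags.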

