Case Study: Enhancing Advertisements using Generative AI and Human Feedback

Anote
9 min read · Jul 9, 2023


Background

OpenAds, a leading company in generative AI research for advertisements, aims to revolutionize the way ads are generated. They possess a vast dataset comprising 80,000 rows of prompt data that needs to be accurately categorized into 640 hierarchical categories, including subcategories, sub-subcategories, and sub-sub-subcategories. The accuracy of these predictions is of utmost importance as it directly impacts the quality of ads delivered to OpenAds’ end users.

OpenAds currently utilizes GPT-4, a powerful generative AI model, for generating advertisement predictions. However, GPT-4 alone delivers relatively low accuracy on this classification task, prompting OpenAds to explore methods for improving the precision and relevance of their generated ads. They believe that incorporating human feedback into their AI system could be the key to achieving higher accuracy over time.

Objectives

The primary objective of this case study is to investigate the impact of human feedback on enhancing the performance of OpenAds’ generative AI system. The specific goals are as follows:

  • Determine whether integrating human feedback can lead to increased accuracy in the hierarchical classification of advertisement prompts.
  • Assess the effectiveness of adding more training data for each category in improving the overall accuracy of the system.
  • Evaluate the feasibility of leveraging human feedback to iteratively refine the generative AI model and enhance ad generation capabilities.

Methodology

To accomplish the defined objectives, OpenAds devised a robust experimental approach. The task involves hierarchical categories structured across three levels: given a textual description, the system must interpret it and assign the appropriate category at each level. Our approach revolves around predicting this hierarchical data directly from the text descriptions.

Snippet of Sub-Categories

TIER_1: ['Education', 'Hobbies & Interests', 'Shopping', 'Communication', 'Pop Culture', 'Science', 'War and Conflicts', 'Family and Relationships', 'Books and Literature', 'Disasters', 'Medical Health', 'Video Gaming', 'Pets', 'Home & Garden', 'Religion & Spirituality', 'Business and Finance', 'Sensitive Topics', 'Style & Fashion', 'Events', 'Genres', 'Personal Finance', 'Crime', 'Travel', 'Entertainment', 'Music', 'Careers', 'Productivity', 'Healthy Living', 'Attractions', 'Personal Celebrations & Life Events', 'Automotive', 'Technology & Computing', 'Food & Drink', 'Law', 'Fine Art', 'Sports', 'Politics', 'Maps & Navigation']
TIER_2: {'Education': ['Homeschooling', 'Online Education', 'Homework and Study', 'Adult Education', 'Early Childhood Education', 'Educational Assessment', 'Primary Education', 'Private School', 'Special Education', 'College Education', 'Language Learning', 'Secondary Education'], 'Hobbies & Interests': ['Sci-fi and Fantasy', 'Musical Instruments', 'Content Production', 'Genealogy and Ancestry', 'Magic and Illusion', 'Collecting', 'Workshops and Classes', 'Antiquing and Antiques', 'Games and Puzzles', 'Birdwatching', 'Beekeeping', 'Model Toys', 'Radio Control', 'Cigars', 'Paranormal Phenomena', 'Arts and Crafts']}
TIER_3: {'Auto Technology': ['Auto Navigation Systems', 'Auto Safety Technologies', 'Auto Infotainment Technologies'], 'Auto Shows': [], 'Auto Type': ['Classic Cars', 'Driverless Cars', 'Certified Pre-Owned Cars', 'Performance Cars', 'Concept Cars', 'Green Vehicles', 'Budget Cars', 'Luxury Cars'], 'Computing': ['Computer Networking', 'Computer Peripherals', 'Programming Languages', 'Desktops', 'Computer Software and Applications', 'Internet', 'Data Storage and Warehousing', 'Laptops', 'Information and Network Security'], 'Consumer Electronics': ['Smartphones', 'Wearable Technology', 'Home Entertainment Systems', 'Tablets and E-readers', 'Cameras and Camcorders']}

In simple terms, we traversed the three levels of the category tree down to the leaf categories at the bottom. We then created dictionaries connecting each leaf category to its parent category. This approach allows us to predict the leaf categories directly and, using the parent mappings, recover the higher-level categories as well.
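
The code below implements this traversal. For concreteness, a small slice of the tree above, gathered into the single dictionary the functions expect, might look like this (values taken from the snippet and mappings shown in this section):

category_tree = {
    'TIER_1': ['Education'],
    'TIER_2': {'Education': ['Educational Assessment', 'College Education']},
    'TIER_3': {'Educational Assessment': ['Standardized Testing'],
               'College Education': ['Postgraduate Education', 'College Planning',
                                     'Undergraduate Education']},
}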

def traverse_category_tree(category_tree):
    """Collect every leaf node and build a child -> parent mapping."""
    leaf_nodes = []
    parent_mapping = {}
    for tier_1_node in category_tree['TIER_1']:
        if tier_1_node in category_tree['TIER_2']:
            tier_2_nodes = category_tree['TIER_2'][tier_1_node]
            parent_mapping[tier_1_node] = None  # tier 1 nodes are roots
            for tier_2_node in tier_2_nodes:
                if tier_2_node in category_tree['TIER_3']:
                    leaf_nodes.extend(category_tree['TIER_3'][tier_2_node])
                    parent_mapping[tier_2_node] = tier_1_node
                    for leaf_node in category_tree['TIER_3'][tier_2_node]:
                        parent_mapping[leaf_node] = tier_2_node
    return leaf_nodes, parent_mapping


def extract_parent_nodes(leaf_node, parent_mapping):
    """Extract parent nodes given a leaf node."""
    parent_nodes = []
    current_node = leaf_node
    while current_node is not None:
        parent_nodes.append(current_node)
        current_node = parent_mapping.get(current_node)
    parent_nodes.reverse()  # order from the tier 1 root down to the leaf
    return parent_nodes

This code defines two functions that build and query the parent-child mapping for the category tree. Here’s a breakdown of what the code does:

  1. The traverse_category_tree function takes the full category tree and traverses it, extracting the leaf nodes and parent mappings.
  2. It initializes an empty list called leaf_nodes to store the leaf nodes, and an empty dictionary called parent_mapping to store the parent-child mappings.
  3. The function iterates through the tier 1 nodes in the category tree. If a tier 1 node has corresponding tier 2 nodes, it iterates through those tier 2 nodes. If a tier 2 node has corresponding leaf nodes, it adds those leaf nodes to the leaf_nodes list and records the parent-child mapping between the tier 1 node and the tier 2 node, and between the tier 2 node and each leaf node.
  4. After traversing the category tree, traverse_category_tree returns leaf_nodes and parent_mapping.
  5. The extract_parent_nodes function takes a leaf node and the parent mapping as parameters. It walks up the hierarchy using the parent mapping, collecting each node along the way in a list called parent_nodes.
  6. Finally, the list is reversed so that it runs from the tier 1 root down to the given leaf, and the function returns parent_nodes.
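
For illustration, a minimal usage sketch against the small category_tree slice defined earlier:

# Build the leaf list and the child -> parent mapping
leaf_nodes, parent_mapping = traverse_category_tree(category_tree)

# Walk from a leaf back up to its tier 1 root
print(extract_parent_nodes('Standardized Testing', parent_mapping))
# ['Education', 'Educational Assessment', 'Standardized Testing']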

Here are the resulting parent category mappings that we used as part of our analysis.

Leaf nodes: ['Standardized Testing', 'Postgraduate Education', 'College Planning', 'Undergraduate Education', 'Screenwriting', 'Freelance Writing', 'Audio Production', 'Video Production', 'Stamps and Coins', 'Comic Books', 'Card Games', 'Board Games and Puzzles', 'Roleplaying Games', 'Drawing and Sketching', … ]
Parent mapping: {'Education': None, 'Homeschooling': 'Education', 'Online Education': 'Education', 'Homework and Study': 'Education', 'Adult Education': 'Education', 'Early Childhood Education': 'Education', 'Educational Assessment': 'Education', 'Standardized Testing': 'Educational Assessment', 'Primary Education': 'Education', 'Private School': 'Education', 'Special Education': 'Education', 'College Education': 'Education', 'Postgraduate Education': 'College Education', 'College Planning': 'College Education', 'Undergraduate Education': 'College Education', … }

Dataset Creation

To facilitate the proof of concept, we selected a subsample of 6 leaf-node (tier 3) categories. We created a training dataset of labeled items for each category, from which we drew few-shot subsets of 2, 8, and 16 labeled items per category (see the sketch after the tables below). A separate test dataset was created, comprising labels across the same 6 tier 3 categories. These labels were assigned by human annotators and served as the benchmark for evaluating the system’s accuracy.

Train Dataset:

Category - Number of Samples
Buddhism - 22
Day Trips - 21
Mental Health - 27
Programming Languages - 25
Resume Writing and Advice - 15
Wedding - 15
Total - 125

Evaluation Dataset:

Category - Number of Samples
Buddhism - 12
Day Trips - 15
Mental Health - 20
Programming Languages - 18
Resume Writing and Advice - 9
Wedding - 2
Total - 76
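
For reference, here is a minimal sketch of how the few-shot subsets can be drawn from the full training set. It assumes the labeled data lives in a pandas DataFrame with “text” and “tier3_encoded” columns; the helper name is hypothetical.

import pandas as pd

def sample_per_category(df, n, label_col="tier3_encoded", seed=42):
    # Hypothetical helper: draw up to n labeled examples from each leaf category
    return (df.groupby(label_col, group_keys=False)
              .apply(lambda g: g.sample(min(n, len(g)), random_state=seed)))

train_2 = sample_per_category(train_df, 2)    # 2 labels per category
train_8 = sample_per_category(train_df, 8)    # 8 labels per category
train_16 = sample_per_category(train_df, 16)  # 16 labels per category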

Approach

After preparing the mappings between parent and child nodes and identifying the corresponding leaf nodes, we can focus our model predictions solely on the leaf nodes and recover the higher tiers from the parent mappings. We use the SetFit package to prepare the few-shot model, which can be defined as follows:

from datasets import Dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer

def few_shot_setfit_model(
    train_df: Dataset,
    test_df: Dataset,
    model="sentence-transformers/paraphrase-mpnet-base-v2",
    tier='tier3_encoded',
):
    # Load a pretrained SetFit model from the Hugging Face Hub
    model = SetFitModel.from_pretrained(model)
    # Configure the trainer for contrastive few-shot fine-tuning
    trainer = SetFitTrainer(
        model=model,
        train_dataset=train_df,
        eval_dataset=test_df,
        loss_class=CosineSimilarityLoss,
        metric="accuracy",
        batch_size=4,
        num_iterations=8,  # pair-generation iterations for contrastive learning
        num_epochs=1,
        column_mapping={"text": "text", tier: "label"},
    )
    return trainer

This code defines a function called few_shot_setfit_model that creates and configures a SetFit model for few-shot learning.

The function takes four parameters:

  • train_df: A dataset used for training the model.
  • test_df: A dataset used for evaluating the model.
  • model: The name or Hub path of the pretrained model to use. If not specified, it defaults to “sentence-transformers/paraphrase-mpnet-base-v2”.
  • tier: The name of the column in the datasets that represents the tier or level of categories. By default, it is set to ‘tier3_encoded’.

The function loads a SetFit model from the Hugging Face model hub using the provided model parameter. This model is pretrained on a large corpus of text and can be fine-tuned for few-shot learning. It initializes a SetFitTrainer object, which is responsible for training and evaluating the SetFit model. The trainer is configured with the following parameters:

  • model: The SetFit model object created in the previous step.
  • train_dataset: The training dataset (train_df) provided as a parameter.
  • eval_dataset: The evaluation dataset (test_df) provided as a parameter.
  • loss_class: The loss function to use during training. In this case, it uses the CosineSimilarityLoss, which measures the similarity between text embeddings.
  • metric: The evaluation metric to track during training (“accuracy”).
  • batch_size: The batch size used during training. It is set to 4.
  • num_iterations: Controls how many text pairs are generated for contrastive learning. Contrastive learning is a technique used to learn from limited labeled data; it generates pairs of similar and dissimilar samples for training. Here it is set to 8 iterations of pair generation.
  • num_epochs: The number of epochs to use for contrastive learning. Each epoch processes the entire training dataset once. In this case, it is set to 1.
  • column_mapping: A dictionary that maps the column names in the datasets to the expected column names by the trainer. It specifies that the “text” column in the datasets should be mapped to “text”, and the tier column should be mapped to “label”.

Finally, the function returns the configured SetFitTrainer object. This code sets up a SetFit model and trainer for few-shot learning, allowing the model to be trained on a small labeled dataset and evaluated on another dataset.
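
For completeness, here is a sketch of how the trainer might be used end to end. It assumes the resampled DataFrames from the earlier sketch and a human-labeled evaluation DataFrame (eval_df, a hypothetical name); Dataset.from_pandas is the standard datasets API.

from datasets import Dataset

# Wrap the labeled DataFrames (columns: "text", "tier3_encoded")
train_ds = Dataset.from_pandas(train_16)
test_ds = Dataset.from_pandas(eval_df)  # eval_df: the human-labeled test set

trainer = few_shot_setfit_model(train_ds, test_ds)
trainer.train()               # contrastive fine-tuning on the few-shot subset
metrics = trainer.evaluate()  # computes accuracy on the evaluation set
print(metrics)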

Human Feedback Integration

The accuracy assessment of our generative AI system was conducted by evaluating its performance on a test dataset that had been labeled by human annotators. This labeled dataset served as the benchmark against which we measured the accuracy of the AI model and assessed the impact of incorporating human feedback. The level of human feedback was determined by the number of labels per category that were utilized to retrain the fine-tuned SetFit model.

One of our main objectives was to demonstrate how our model’s performance evolves with increasing levels of human feedback. In the training dataset, we had a total of 125 samples representing 6 leaf nodes, all of which were labeled by human annotators. To explore the effect of human feedback, we resampled the training dataset into three subsets. Each subset covered the same 6 leaf nodes, but with a varying number of samples per category: 2, 8, and 16 labels.

For each subset, we ran the model that had been trained on the corresponding resampled dataset. The results of these evaluations are presented in the following table.

Results

Dataset - SetFit Accuracy
2 labels per category - 0.8947
8 labels per category - 0.9342
16 labels per category - 0.9474

The results of this case study demonstrate the significant impact of human feedback in improving the accuracy and relevance of OpenAds’ generative AI system. The following key aspects were analyzed to gain valuable insights:

  1. Effect of Human Feedback on Performance: The study examined how the incorporation of human feedback positively influenced the system’s accuracy, particularly in hierarchical classification. The feedback provided by human annotators, in the form of additional labeled examples per category, translated directly into measurable accuracy gains.
  2. Correlation between Training Data and Accuracy: The study investigated the relationship between the amount of training data available for each category and the resulting accuracy. By analyzing the impact of adding more labels per category, the study revealed a direct correlation between the quantity of training data and improved accuracy.
  3. Feasibility and Benefits of Iterative Human Feedback: The study explored the feasibility and potential benefits of integrating human feedback as an iterative process. By continuously refining the generative AI model based on human input, this study shows that it is feasible to enhance the system’s performance and deliver higher-quality ads.

Future Approaches

In this particular study, our model was trained using only six categories across three levels of labels. However, there are potential future approaches that can further enhance the model’s capabilities. One such approach involves expanding the training process to incorporate a more comprehensive set of categories. By including a broader range of categories, the model can gain a deeper understanding of various topics and improve its ability to classify and predict labels.

Furthermore, another avenue for improvement involves predicting all three levels of labels directly. While the current implementation predicts only the leaf categories and derives the higher levels from the parent mappings, there may be instances where modeling each level explicitly is necessary to accurately capture the complexity and nuances of the data. By extending the model to predict across all three levels of labels, we can refine the categorization process and provide more detailed insights into the relationships between different categories.

These future approaches aim to enhance the model’s performance and make it more robust and versatile in handling a wider range of categorization tasks. By incorporating more categories and expanding the label hierarchy, we can unlock the potential for improved accuracy and a more comprehensive understanding of the data.

Summary

In conclusion, this case study emphasizes the potential of leveraging human feedback to enhance the accuracy and quality of generative AI in advertisement generation. The findings provide actionable insights for OpenAds to refine their generative AI system, address the limitations of low accuracy in GPT-4, and ultimately deliver more precise and relevant ads to their users. By combining the power of generative AI with human expertise, OpenAds aims to usher in a new era of advertisement generation, where human feedback plays a vital role in continuously improving the accuracy and effectiveness of AI-driven ad creation.

Anote

General Purpose Artificial Intelligence. Like our product, our medium articles are written by novel generative AI models, with human feedback on the edge cases.