Case Study: Text Categorization for Social Media Data

Aug 23, 2023

Problem
A large social media company aiming to help consumers analyze content on platforms like Instagram and TikTok wanted to predict the categories of text extracted from the captions of Instagram, TikTok, and YouTube posts. By accurately categorizing this text, the company aims to identify patterns and trends within each category and to understand the corresponding engagement metrics on each platform, including the number of likes, comments, and shares and the frequency of posts per category. This information can inform content analysis and marketing strategies and help identify emerging trends on social media platforms.

Dataset
To tackle this problem, we loaded the social media text data from the various channels and cleaned it. To preprocess the data, we converted the text to lowercase, removed punctuation, replaced newlines and hashtags, and eliminated extra whitespace. We were trying to predict the following categories:
- Business
- Animals/Pets
- Causes & Charities
- Science & Technology
- Sports
- TV & Movies
- Lifestyle
- Music
- Home Improvement
- Health & Fitness
- Fashion
- Religion
- Travel
- Cosplay
- Mental Health
- Gaming
- Comedy
- Dance
- Food/Cooking
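
The cleaning steps described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name clean_caption is hypothetical, and treating "replaced hashtags" as "replace the # marker with a space" is an assumption about what the original preprocessing did.

```python
import re
import string


def clean_caption(text: str) -> str:
    """Lowercase a caption, replace newlines and hashtag markers,
    strip punctuation, and collapse extra whitespace."""
    text = text.lower()
    # Replace newlines with spaces so multi-line captions become one line.
    text = text.replace("\n", " ")
    # Assumption: "replacing hashtags" means turning "#travel" into "travel".
    text = text.replace("#", " ")
    # Remove ASCII punctuation (non-ASCII symbols such as emoji are kept).
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


print(clean_caption("Sunset vibes!\n#Travel  #Beach"))
```

Keeping the hashtag words rather than dropping them preserves topical signal such as "travel" or "beach", which is often the most informative part of a short caption.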

Approach
To address this problem, we employed the following techniques and approaches:

Zero-Shot Classification
We explored zero-shot classification using the spaCy and Hugging Face frameworks. Zero-shot classification allows a model to predict categories it was not explicitly trained on.
● spaCy: We used the spaCy library to build a zero-shot text classification model on top of the pretrained “en_core_web_md” pipeline, which ships with pretrained word vectors and linguistic annotations. We added a component called “text_categorizer” to the pipeline and configured it with the categories we wanted to predict. The zero_shot_spacy function in the code sets up the spaCy pipeline for zero-shot classification.

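import spacy
# Assumption: the "text_categorizer" factory is registered by a third-party
# extension such as classy-classification; the original does not name it.
import classy_classification

def zero_shot_spacy(
  classes,
  model,
):
  # Load the pretrained English pipeline with word vectors.
  nlp_zero = spacy.load("en_core_web_md")
  # Attach the classifier, configured with the target categories.
  nlp_zero.add_pipe(
    "text_categorizer",
    config={
      "data": classes,
      "model": model,
      "cat_type": "zero",
    }
  )
  return nlp_zero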

● Hugging Face: We also explored the Hugging Face library, which provides a wide range of pretrained models for NLP tasks. The model we used was “typeform/distilbert-base-uncased-mnli.” The zero_shot_hugging_face function in the code configures the spaCy pipeline with this pretrained Hugging Face model for zero-shot classification.

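def zero_shot_hugging_face(
  classes,
  model="typeform/distilbert-base-uncased-mnli",
):
  # Start from a blank English pipeline; the Hugging Face model
  # passed in `model` does the actual classification.
  nlp_zero = spacy.blank("en")
  nlp_zero.add_pipe(
    "text_categorizer",
    config={
      "data": classes,
      "model": model,
      "cat_type": "zero",
    }
  )
  return nlp_zero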
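
Once either pipeline has scored a caption, the scores still have to be reduced to a single predicted category. The sketch below shows one way to do that, assuming the component exposes per-category scores as a dict from label to score; how those scores are read off the processed document depends on the component and is not shown in the source. The function name top_category and the threshold parameter are hypothetical.

```python
from typing import Dict, Optional


def top_category(scores: Dict[str, float], threshold: float = 0.0) -> Optional[str]:
    """Return the highest-scoring label, or None if nothing clears the threshold."""
    if not scores:
        return None
    label, score = max(scores.items(), key=lambda item: item[1])
    # Returning None lets low-confidence captions be reviewed by hand.
    return label if score >= threshold else None


example_scores = {"Sports": 0.71, "Music": 0.18, "Travel": 0.11}
print(top_category(example_scores))
```

A non-zero threshold is optional, but it gives a simple way to route uncertain captions to manual labeling instead of forcing a guess.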

Synthetic Data Approach
To augment the labeled data for training the text categorization models, we utilized a synthetic data approach. Synthetic data generation creates additional examples for each category, providing a more diverse and balanced training dataset. Here are a few examples for some of the categories:

1. Business:
● “This text is about business.”
● “Here’s some information related to the business industry.”
● “Business strategies and trends are evolving.”

2. Animals/Pets:
● “Learn how to care for your pets and provide them with a loving environment.”
● “Discover interesting facts about different animal species.”
● “Pets bring joy and companionship to our lives.”

3. Causes & Charities:
● “Support important causes that align with your values.”
● “Charities work tirelessly to address various social issues.”
● “Join forces with charities to create a better world.”

4. Science & Technology:
● “Discover how technology is transforming various industries.”
● “Explore the fascinating world of robotics and artificial intelligence.”
● “Science and technology drive innovation and progress.”

5. Sports:
● “This text is about sports.”
● “Get the latest updates on your favorite sports teams and events.”
● “Athletes inspire us with their dedication and achievements.”

These synthetic examples were generated to expand the training dataset and ensure a more comprehensive representation of each category. By incorporating synthetic data along with the labeled data, we can improve the models’ ability to generalize and make accurate predictions for text categorization tasks.
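
One simple way to produce examples like these is to fill category names into sentence templates. The sketch below is only an illustration of that idea; the source does not say how its synthetic sentences were generated. The TEMPLATES list and the synthetic_examples function are hypothetical, with the first template modeled on the "This text is about business." example above.

```python
from typing import Dict, List

# Hypothetical templates; only the first mirrors an example from the article.
TEMPLATES = [
    "This text is about {topic}.",
    "Here's some information related to {topic}.",
    "A post discussing {topic}.",
]


def synthetic_examples(categories: List[str]) -> Dict[str, List[str]]:
    """Build a mapping from each category to template-generated sentences."""
    data = {}
    for category in categories:
        # Turn labels such as "Animals/Pets" into readable phrases.
        topic = category.lower().replace("/", " and ").replace("&", "and")
        data[category] = [template.format(topic=topic) for template in TEMPLATES]
    return data


for label, sentences in synthetic_examples(["Business", "Animals/Pets"]).items():
    print(label, sentences)
```

Template output is repetitive by construction, so in practice it would likely be mixed with hand-written or model-generated sentences like the ones listed above.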

Results
After implementing our text categorization solution using zero-shot classification techniques and synthetic data augmentation, we evaluated the performance of the models and conducted exploratory data analysis to gain insights into the categorized social media data.

Model Evaluation
We manually labeled 100 rows of data as “test data” and used them to compute evaluation metrics such as accuracy, precision, and recall, along with the support (the number of test rows) for each category. The rest of the data served as “train data” for the models. Evaluating every model on the same labeled test set let us compare their accuracy directly.
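
For reference, the metrics named above can be computed from the true and predicted labels as in the following sketch. It is a plain-Python illustration with hypothetical function names, not the evaluation code used in the project.

```python
from collections import Counter
from typing import Dict, List


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of test rows whose predicted label matches the manual label."""
    if not y_true:
        return 0.0
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_category_report(y_true: List[str], y_pred: List[str]) -> Dict[str, Dict[str, float]]:
    """Precision, recall, and support for every label seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this label, but it was wrong
            fn[truth] += 1  # missed a row that had this label
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        report[label] = {
            "precision": tp[label] / predicted if predicted else 0.0,
            "recall": tp[label] / actual if actual else 0.0,
            "support": actual,
        }
    return report


y_true = ["Sports", "Sports", "Music", "Travel"]
y_pred = ["Sports", "Music", "Music", "Travel"]
print(accuracy(y_true, y_pred))
print(per_category_report(y_true, y_pred))
```

With 19 categories and 100 test rows, many categories will have a support of only a handful of rows, so per-category precision and recall should be read with that in mind.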

Exploratory Data Analysis
To gain a better understanding of the categorized social media data, we conducted exploratory data analysis. The analysis focused on two aspects: the distribution of predicted categories by data source and the relationship between predicted categories and engagement metrics such as likes.

By examining the distribution of predicted categories by data source, we could see how prevalent each category was on TikTok, Instagram, and YouTube. This allowed us to identify variations in content and engagement patterns across platforms.
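
A distribution of predicted categories by data source can be computed along these lines. The row layout and the field names "source" and "predicted_category" are assumptions made for this sketch, not the project's actual schema.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable


def category_share_by_source(rows: Iterable[dict]) -> Dict[str, Dict[str, float]]:
    """For each source, the fraction of its posts falling in each predicted category."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["source"]][row["predicted_category"]] += 1
    shares = {}
    for source, counter in counts.items():
        total = sum(counter.values())
        shares[source] = {category: n / total for category, n in counter.items()}
    return shares


rows = [
    {"source": "tiktok", "predicted_category": "Dance"},
    {"source": "tiktok", "predicted_category": "Dance"},
    {"source": "tiktok", "predicted_category": "Comedy"},
    {"source": "instagram", "predicted_category": "Fashion"},
]
print(category_share_by_source(rows))
```

Using shares rather than raw counts keeps platforms comparable even when one contributes far more posts than another.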

The likes-by-category analysis revealed interesting trends and patterns. By visualizing the relationship between predicted categories and the number of likes, we were able to identify categories that garnered higher engagement and those that were less popular. Understanding the engagement metrics associated with each category can support content analysis, inform marketing strategies, and help identify trends on social media platforms.
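
The likes-by-category comparison can be summarized before plotting with a small aggregation such as the one below. As before, the field names "predicted_category" and "likes" are hypothetical, and reporting both mean and median is a design choice of this sketch rather than something stated in the source.

```python
from collections import defaultdict
from statistics import mean, median
from typing import Dict, Iterable


def likes_by_category(rows: Iterable[dict]) -> Dict[str, dict]:
    """Post count, mean likes, and median likes per category, sorted by mean likes."""
    likes = defaultdict(list)
    for row in rows:
        likes[row["predicted_category"]].append(row["likes"])
    summary = {
        category: {
            "posts": len(values),
            "mean_likes": mean(values),
            "median_likes": median(values),
        }
        for category, values in likes.items()
    }
    # Highest average engagement first.
    return dict(sorted(summary.items(), key=lambda item: item[1]["mean_likes"], reverse=True))


rows = [
    {"predicted_category": "Comedy", "likes": 50},
    {"predicted_category": "Dance", "likes": 100},
    {"predicted_category": "Dance", "likes": 300},
]
print(likes_by_category(rows))
```

Like counts on social platforms tend to be heavily skewed by a few viral posts, which is why the median is worth reporting next to the mean.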

Next Steps
Based on the results and analysis of our text categorization project, we have identified several next steps to further enhance the models’ performance and provide valuable insights:

Few-Shot Learning
Incorporating few-shot learning techniques can enhance the categorization models’ accuracy, especially for categories with limited labeled samples. By utilizing a few-shot learning approach, we can leverage a small number of labeled instances to improve category predictions. This will further refine the models’ ability to generalize and make accurate predictions for text categorization tasks.
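
To make the idea concrete, here is a toy few-shot classifier that learns from a handful of labeled captions per category. It is a deliberately simplified stand-in: it uses bag-of-words counts and nearest-centroid matching with cosine similarity, whereas a real few-shot setup would more likely use pretrained embeddings. All function names and example captions are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def bag_of_words(text: str) -> Counter:
    """Token counts for a lowercased, whitespace-split text."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fit_centroids(examples: Dict[str, List[str]]) -> Dict[str, Counter]:
    """Sum the word counts of each label's examples.

    Cosine similarity ignores vector length, so the sum points in the
    same direction as the mean and works as a centroid here.
    """
    centroids = {}
    for label, texts in examples.items():
        total = Counter()
        for text in texts:
            total.update(bag_of_words(text))
        centroids[label] = total
    return centroids


def predict(text: str, centroids: Dict[str, Counter]) -> str:
    """Return the label whose centroid is most similar to the text."""
    vector = bag_of_words(text)
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))


few_shot_examples = {
    "Sports": ["great game tonight", "the team won the match"],
    "Food/Cooking": ["easy pasta recipe", "baking bread at home"],
}
centroids = fit_centroids(few_shot_examples)
print(predict("what a game by the team", centroids))
```

The same structure carries over when the bag-of-words vectors are swapped for sentence embeddings: only bag_of_words and cosine would need to change.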

Human Feedback vs. Zero-Shot Models
To validate the superiority of human feedback over zero-shot classification models, we plan to conduct a comparative analysis. We will manually label additional rows of the “train data” and run the few-shot model with different amounts of labeled data. We expect this analysis to show how the accuracy of the few-shot model improves with increasing human feedback and at what point it surpasses the accuracy achieved by the zero-shot models. Such evidence would highlight the importance of incorporating human expertise in text categorization tasks when high accuracy is desired.
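
The planned comparison amounts to a learning curve: retrain with progressively more labeled rows and measure accuracy on the fixed test set each time. The sketch below outlines that loop under stated assumptions. The learning_curve function and its fit/predict callables are hypothetical, and the majority-label model is only a toy placeholder for the few-shot model so that the example runs on its own.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Example = Tuple[str, str]  # (text, label)


def learning_curve(
    labeled: Sequence[Example],
    test: Sequence[Example],
    sizes: Sequence[int],
    fit: Callable[[Sequence[Example]], object],
    predict: Callable[[object, str], str],
) -> Dict[int, float]:
    """Accuracy on the fixed test set after training on the first n labeled rows."""
    curve = {}
    for n in sizes:
        model = fit(labeled[:n])
        correct = sum(predict(model, text) == label for text, label in test)
        curve[n] = correct / len(test)
    return curve


# Toy placeholder model: always predicts the most common training label.
def majority_fit(data: Sequence[Example]) -> str:
    labels: List[str] = [label for _, label in data]
    return max(set(labels), key=labels.count)


def majority_predict(model: str, text: str) -> str:
    return model


labeled = [("a", "Music"), ("b", "Sports"), ("c", "Sports")]
test = [("x", "Sports"), ("y", "Sports")]
print(learning_curve(labeled, test, [1, 3], majority_fit, majority_predict))
```

Plotting the resulting accuracies against the zero-shot baseline on the same test set would show how many labeled rows are needed before the few-shot model pulls ahead, if it does.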

Written by Anote