Case Study: Preprocessing Structured Data with Generative AI to Enhance Longevity Research
Background
Harvard University is at the forefront of longevity research, which aims to extend the human lifespan. Aging remains the leading risk factor for chronic disease and mortality, so understanding it comprehensively requires measuring biological age accurately.
Several machine learning algorithms, termed aging clocks, have been developed to predict the age of biological samples from omics data. However, a systematic resource for profiling biological age has been lacking. Enter ClockBase.
ClockBase: A Comprehensive Aging Clock Platform
ClockBase is a comprehensive platform that integrates multiple aging clock models, including epigenetic clocks for humans and mice. It profiles the biological age of samples derived from diverse tissues, cell types, and even single cells.
ClockBase curates the 11 top-performing aging clock models and applies them to over 2,000 publicly available DNA methylation datasets from resources like the Gene Expression Omnibus (GEO). It comprises biological age data for roughly 200,000 samples in both mice and humans. Furthermore, researchers can upload their own data for biological age calculation. The platform offers an interactive analysis tool for statistical analyses and visualization of biological age data.
By leveraging ClockBase, researchers can explore biological age in different samples, discover new longevity interventions, and identify age-accelerating conditions. This significantly contributes to the scientific community’s understanding of aging.
Problem: Data Conversion
During the research, we encountered a dataset that needed conversion from one structured format to another. Carrying out this conversion manually was not feasible given the volume of data and the labor and time it would require.
Dataset
The original dataset encompassed 2,424 unique identifiers, each containing the following types of files (a loading sketch follows the list):
- Desc file: This file included a ‘description’ and a ‘summary’ of the unique identifier.
- Meta file: This was a CSV file containing specific medical data under different columns.
- Target file: This file specified the desired column structure for the output CSV file.
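To make the three file types concrete, here is a minimal loading sketch for one identifier. The directory layout, file names, and the pandas-based reading are our assumptions for illustration; the project's actual storage layout may differ.

import pandas as pd
from pathlib import Path

def load_dataset_files(gse_id, root="data"):
    """Load the three files for one identifier.
    NOTE: the directory layout and file names here are hypothetical."""
    base = Path(root) / gse_id
    desc_text = (base / "desc.txt").read_text()   # description + summary
    df_meta = pd.read_csv(base / "meta.csv")      # raw medical metadata, one row per sample
    df_target = pd.read_csv(base / "target.csv")  # desired output column structure
    return desc_text, df_meta, df_target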
Here are the target columns whose values we want to predict:
TARGET_COLUMNS = [
    "GSM_ID",
    "race",
    "sex",
    "age",
    "genetic_info",
    "disease",
    "tissue",
    "cell_line",
    "vivo_vitro",
    "case_control",
    "group_name",
    "treatment",
    "perturbation_category",
]
Two examples from the dataset, GSE2653 and GSE3280, illustrate this structure; each comes with its own Desc, Meta, and Target files (the example tables are omitted here).
In total, there are 205,591 rows of df_meta data, so each of the 2,424 datasets has on average about 85 rows. With 13 target columns per row, that amounts to 2,672,683 cells that would need manual labels. At this volume, labeling the data by hand in a spreadsheet would take months to years, cost a great deal of money, and would not scale.
Solution: Automated Data Conversion using Programmatic Labeling Functions, GPT-3 and GPT-4
To tackle the challenge of converting the dataset, we implemented an automated solution combining programmatic labeling functions with GPT-3 and GPT-4.
Programmatic Labeling Functions
We developed programmatic labeling functions to automate the extraction and categorization of information from the dataset. These functions populate specific columns using keyword and pattern heuristics. Here are some examples:
- get_vivo_vitro_from_text(text): This function checks if the given text contains keywords indicative of in vivo or in vitro studies.
def get_vivo_vitro_from_text(text):
    """
    Check whether the given text contains keywords indicative of
    in vivo or in vitro studies.
    """
    # normalize case so keyword matching is case-insensitive
    text = text.lower()
    # check for in vivo keywords
    vivo_keywords = ["vivo", "human", "in vivo", "live human", "live animal", "animal models", "live animals", "clinical trial", "patients", "human subjects"]
    for keyword in vivo_keywords:
        if keyword in text:
            return "in vivo"
    # check for in vitro keywords
    vitro_keywords = ["vitro", "in vitro", "in cell culture", "in cell line", "cell culture", "cell lines", "cultured cells", "primary cells", "isolated cells"]
    for keyword in vitro_keywords:
        if keyword in text:
            return "in vitro"
    # if no keywords are found, return N/A
    return "N/A"
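For example, with illustrative inputs:

get_vivo_vitro_from_text("Whole blood samples from patients with sepsis")  # -> "in vivo" (matches "patients")
get_vivo_vitro_from_text("Fibroblast cell lines treated with rapamycin")   # -> "in vitro" (matches "cell lines")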
- get_disease_from_text(text): This function checks if the given text contains a disease entity and returns the disease name.
import re

def get_disease_from_text(text):
    """
    Check whether the given text contains a disease entity and return
    the disease name.
    """
    # use a regular expression to match one of a fixed list of disease entities
    disease_regex = r'\b(AIDS|Alzheimers|asthma|cancer|celiac_disease|chickenpox|cholera|common_cold|dengue|diabetes|epilepsy|gastroenteritis|heart_disease|hepatitis|influenza|malaria|measles|meningitis|mumps|norovirus|pneumonia|polio|rabies|rubella|shingles|smallpox|T2D|tuberculosis|typhoid|typhus|whooping_cough)\b'
    match = re.search(disease_regex, text, re.IGNORECASE)
    if match:
        return match.group(0)
    else:
        return "N/A"
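As a sketch of how such labeling functions can be applied at scale, the snippet below maps them over every row of a metadata DataFrame. Concatenating the Desc text with each row's own fields is our assumption about how the pipeline assembled its input text.

def apply_programmatic_labels(df_meta, desc_text):
    """Apply the keyword-based labeling functions to every sample row."""
    df_out = df_meta.copy()
    # build one text blob per row: shared description plus the row's own fields
    row_texts = df_meta.astype(str).agg(" ".join, axis=1)
    combined = desc_text + " " + row_texts
    df_out["vivo_vitro"] = combined.map(get_vivo_vitro_from_text)
    df_out["disease"] = combined.map(get_disease_from_text)
    return df_out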
By leveraging these programmatic labels, we automated the process of extracting relevant information and populating the target columns.
GPT-3 Integration for Data Conversion
We integrated GPT-3 (via the gpt-3.5-turbo model) to generate predictions for specific columns in the dataset. By prompting the model with the relevant text, we obtained predictions for columns such as perturbation_category and tissue. Here are examples of these functions:
- get_perturbation_category_with_gpt_from_text(text): This function predicts the perturbation category based on the provided text using GPT-3.
import openai

def get_perturbation_category_with_gpt_from_text(text):
    perturbation_categories = [
        "disease model",
        "genetic manipulation",
        "small molecule",
        "none",
        "diet",
        "environmental",
        "others",
    ]
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "user",
                "content": f"We have the following text: {text}. We want to predict which of the following perturbation categories the text is classified as: {perturbation_categories}. For the prediction, don't include any other words besides the name of the category predicted. This output should be one of the categories in: {perturbation_categories} with no text before or after. If unsure, just include the text 'N/A' with no additional words."
            }
        ]
    )
    description_prediction = completion.choices[0].message.content
    return description_prediction
- get_tissue_with_gpt_in_text(text): This function predicts the tissue based on the provided text using GPT-3.
def get_tissue_with_gpt_in_text(text):
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "user",
                "content": f"We have the following text: {text}. We want to predict the tissue in the text. For the prediction, don't include any other words besides the tissue. This output should be a string with no text before or after. If unsure, just include the text 'N/A' with no additional words."
            }
        ]
    )
    description_prediction = completion.choices[0].message.content
    return description_prediction
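Calling an API once per row across roughly 200,000 samples makes transient failures and rate limits a practical concern. A minimal retry wrapper along these lines keeps a long batch run alive; the retry count, backoff schedule, and exception handling are our assumptions rather than details from the original pipeline.

import time
import openai

def call_with_retries(fn, *args, max_retries=5, **kwargs):
    """Retry an OpenAI call with exponential backoff (illustrative values)."""
    for attempt in range(max_retries):
        try:
            return fn(*args, **kwargs)
        except openai.error.OpenAIError:
            time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, ...
    return "N/A"  # give up after repeated failures

# usage: tissue = call_with_retries(get_tissue_with_gpt_in_text, text)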
These GPT-integrated functions played a crucial role in automating the conversion of columns whose values could not be captured by simple keyword heuristics.
GPT-4 Integration
We also leveraged GPT-4, prompting it with a dataset's title, summary, and metadata text to predict all target columns at once, which let us automate the full format conversion. Here is an example of how GPT-4 was used:
def run_gpt4_prompt(
    title,
    summary,
    df_meta_text,
):
    information = """
    For all the following columns, please use the standard universal scientific/medical term instead of abbreviations.
    - GSM_ID: The unique identifier for the sample.
    - race: The race of the individual (european, asian, african, other).
    - sex: The sex of the individual (M, F).
    - age: The age of the individual in years. If no detailed age information, use stage name (embryo, child, young adult, middle age adult, old adult).
    - genetic_info: The types of genetic modifications present.
    - disease: The health status of the individual (healthy or the standardized name of disease).
    - treatment: The name of the treatment administered.
    - tissue: The standardized name of tissue from which the sample was obtained (e.g., blood, skin, liver).
    - cell_line: The name of the cell type or cell line used.
    - vivo_vitro: The experimental context (in vivo for live humans or in vitro for cell lines).
    - case_control: The biological comparisons for the experiment using 'ctrl' or 'case'. If there are multiple different conditions or different combination of conditions, use 'case1', 'case2', etc. For samples not in the main comparison, use 'NA'.
    - group_name: A short name describing the conditions in the control and case groups, which will be used in scientific visualizations.
    - perturbation_category: The category of perturbation that differentiates the control and case groups. The categories are:
        - disease (any disease or medical conditions)
        - genetic (any genetic modification)
        - small molecule (drugs, metabolites, etc.)
        - lifestyle (diet, exercise, smoking, etc.)
        - environmental (microenvironment, pollution, etc.)
        - others
        - none
    """
    completion = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {
                "role": "user",
                "content": f"You are a team of expert biologists and clinicians responsible for curating a dataset from the Gene Expression Omnibus. The unstructured metadata is provided as follows: Title: {title}, Summary: {summary}, Data: {df_meta_text}. Your objective is to standardize this metadata and return a CSV file containing the following columns: {TARGET_COLUMNS}. Detailed information about each column can be found here: {information}. If additional information is required for case_control statistical comparisons, you may include columns starting with 'covariate_'. Infer as much information as possible and use 'NA' for unknown details. As part of an automated system, your response should only consist of the CSV file, without any text before or after."
            }
        ]
    )
    description_prediction = completion.choices[0].message.content
    return description_prediction
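Because the model's reply arrives as plain text, it still has to be parsed and validated before use. Here is a minimal sketch assuming the response is well-formed CSV; real responses occasionally are not, and a production run would need fallback handling.

import io
import pandas as pd

def parse_gpt4_response(csv_text):
    """Parse the model's CSV-only reply into a DataFrame and verify
    that the expected target columns are present."""
    df = pd.read_csv(io.StringIO(csv_text))
    missing = [col for col in TARGET_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"response is missing columns: {missing}")
    return df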
Through this automated method, converting the large dataset became feasible and efficient, driving the longevity research forward and freeing the scientists to focus on core research tasks.
Results
Through the combined use of programmatic labeling functions, GPT-3 integration, and GPT-4 integration, we automated the conversion of the large dataset, saving substantial time and cost while maintaining accuracy in the generated predictions.
The successful outcome allowed the team to use the resulting structured data to train clock-based models and explore the possibility of extending the human lifespan. This research has the potential to bring significant benefits to society.