Introduction
In this case study, we will explore the process of scraping data from 79 URLs associated with Hardy Riggings. Our objective is to extract valuable information such as product titles, descriptions, variants, prices, product manuals and certifications, and spec sheets. The extracted data will be formatted into a CSV file suitable for importing into WooCommerce, a popular WordPress plugin for e-commerce.
Approach
To achieve this, we proceeded in four steps:
1. Scraping Text from URLs
We utilized the Python library “tika” along with the “requests” module to retrieve the content from each URL. By using Apache Tika, we ensured proper handling of various document formats. The code snippet below demonstrates the process:
from tika import parser as p
import requests

def get_text_from_url(web_url):
    # Fetch the raw bytes and let Apache Tika extract plain text,
    # regardless of whether the URL serves HTML, PDF, etc.
    response = requests.get(web_url)
    response.raise_for_status()
    result = p.from_buffer(response.content)
    text = result["content"].strip()
    # Replace newlines and tabs with spaces so words are not glued together
    text = text.replace("\n", " ").replace("\t", " ")
    return text
2. Chunking Text and Generating Embeddings
The text obtained from each URL was split into chunks, and each chunk was converted into an embedding. Embedding the chunks lets us later retrieve only the passages relevant to a given question, which keeps the model's context focused and its answers more accurate.
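The create_chunks helper used in the snippet that follows is not defined in this write-up. A minimal sketch, assuming a tokenizer object that exposes encode() (such as a tiktoken encoding), could look like:

```python
def create_chunks(text, n=1000, tokenizer=None):
    # Encode the full text into tokens, then slice the token list
    # into consecutive windows of at most n tokens each.
    tokens = tokenizer.encode(text)
    return [tokens[i:i + n] for i in range(0, len(tokens), n)]
```

Each returned chunk is a list of tokens, which is why the original code decodes them back to text with tokenizer.decode before embedding.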
chunks = create_chunks(text, n=1000, tokenizer=tokenizer)
text_chunks = [tokenizer.decode(chunk) for chunk in chunks]

# Embed every chunk in a single request; each item in
# response['data'] corresponds to one input chunk
response = openai.Embedding.create(
    input=text_chunks,
    model="text-embedding-ada-002"
)
embeddings = [item['embedding'] for item in response['data']]
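With one embedding per chunk (and an embedding for each question, obtained the same way), the chunks most relevant to a question can be selected by cosine similarity before prompting. A minimal, dependency-free sketch; the function names here are illustrative, not part of the original pipeline:

```python
import math

def cosine_similarity(a, b):
    # Dot product of the two vectors divided by the product of their norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_chunks(question_emb, chunk_embs, chunks, k=3):
    # Rank chunks by similarity to the question embedding
    # and return the k best matches
    ranked = sorted(
        zip(chunk_embs, chunks),
        key=lambda pair: cosine_similarity(question_emb, pair[0]),
        reverse=True,
    )
    return [chunk for _, chunk in ranked[:k]]
```

The selected chunks are then what gets passed to the model in the next step, rather than the full page text.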
3. Prompting the Model for Information Extraction
We prompted an AI model with specific questions for each URL to extract the desired information. The questions we asked for each URL were:
- PRODUCT TITLE: What is the product title?
- PRODUCT DESCRIPTION: Provide a description of the product.
- VARIANTS: What are the variants of the product? (Sizes with respective SKUs and prices)
- PRICE: What is the price of the product?
- PRODUCT MANUAL: Does the product have a manual or certifications? If yes, describe key points in four sentences or less.
- SPEC SHEET: Is there a spec sheet available? If yes, describe the key specifications.
# Load the question answering chain ("stuff" packs all input
# documents into a single prompt)
qa_chain = load_qa_chain(llm, chain_type="stuff")

# Generate the answer; input_documents expects a list of Document
# objects built from the relevant text chunks, not a raw string
answer = qa_chain.run(input_documents=docs, question=question)
By interacting with the AI model in this manner, we were able to extract the required information efficiently.
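The per-URL extraction loop can be sketched as follows, where ask stands in for a call to the question answering chain over that URL's relevant chunks; the helper and its signature are assumptions for illustration:

```python
def collect_rows(urls, questions, ask):
    # For each URL, ask every question and store the answer under the
    # question's label; each dict becomes one row in the final CSV.
    rows = []
    for url in urls:
        row = {"URL": url}
        for label, question in questions.items():
            row[label] = ask(url, question)
        rows.append(row)
    return rows
```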
4. CSV Output Generation
The model’s responses to our prompts were compiled into a CSV file. Each row in the CSV file represents a URL, while each column corresponds to a specific question. The cells contain the AI model’s predicted answers for the respective URL and question.
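This compilation step can be sketched with the standard csv module; the shape of the row dictionaries (one per URL, keyed by question label) is an assumption for illustration:

```python
import csv

def write_rows_to_csv(rows, path):
    # Column order follows the keys of the first row:
    # URL first, then one column per question label.
    fieldnames = list(rows[0].keys())
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```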
The generated CSV file contains the information extracted from each of the Hardy Riggings URLs, formatted for direct import into WooCommerce.
Conclusion
By following the outlined steps, we successfully extracted data from the 79 URLs associated with Hardy Riggings. The resulting CSV file provides a comprehensive overview of each product, including essential details such as product title, description, variants, price, product manual and certifications, and spec sheet. The accuracy and completeness of the extracted information enhance the efficiency of WooCommerce integration, enabling streamlined product management within the WordPress ecosystem.