Case Study: Identifying Ship Names in Websites with Claude and GPT-3

Anote
7 min read · Jul 9, 2023


Background

Ship Index, based in Ithaca, NY, specializes in identifying vessels in text data. Their platform, ShipIndex.org, provides a database of maritime resources, including books, journals, magazines, newspapers, CD-ROMs, websites, and online databases. This platform makes vessel research easy by enhancing these resources with additional information like illustrations, crew and passenger lists, and resource availability.

Challenge

Ship Index wanted us to extract, ideally, every ship name from all sublinks (approximately 3,500 web pages) of the https://www.theshipslist.com site. Vessel names, however, are not mutually exclusive with people's names and place names: many ships are named after people, cities, states, or countries, so filtering out locations and people would also filter out legitimate ship names. Furthermore, vessel names are embedded in many different parts of a web page, such as tables, paragraphs, headings, and captions, so detecting them is not tied to any single format. The lack of a structured index hindered efficient ship research and made it difficult for individuals to find information about specific ships.

Solution

Ship Index adopted a solution comprising web crawling, ship name extraction with OpenAI's GPT-3 (and later Anthropic's Claude), post-processing and refinement, and establishing a URL-to-ship-name association.

Web Crawling and Text Extraction with Apache Tika:

The first part of the code involves gathering data from the website using web scraping techniques. The requests library is used to send HTTP requests, and BeautifulSoup is used to parse the HTML content of the pages. The tika parser is used to extract text from the HTML content.

  • get_text_from_url(web_url): This function fetches the webpage at the provided URL, parses the content using Apache Tika to extract the text from the webpage, and returns the cleaned text.
import requests
from tika import parser as p

def get_text_from_url(web_url):
    # Fetch the page and extract its plain text with Apache Tika
    response = requests.get(web_url)
    result = p.from_buffer(response.content)
    text = result["content"].strip()
    text = text.replace("\n", "").replace("\t", "")
    return text
  • get_links(initial_url): This function sends a GET request to initial_url, parses the HTML with BeautifulSoup, resolves relative links (those starting with "/") against "https://www.theshipslist.com", fetches the text of each linked page, and returns the list of links together with their text.
from bs4 import BeautifulSoup

def get_links(initial_url: str = "https://www.theshipslist.com/"):
    # Send a GET request to the website's URL
    response = requests.get(initial_url)
    # Parse the HTML code with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
    # Find all <a> tags and keep site-relative hrefs
    links = []
    links_text = []
    for link in soup.find_all('a'):
        href = link.get('href')
        if type(href) == str and href.startswith("/"):
            web_url = "https://www.theshipslist.com" + href
            web_text = get_text_from_url(web_url)
            if len(web_text) > 0:
                links.append(web_url)
                links_text.append(web_text)
    return links, links_text
  • get_sublinks(links): This function receives a list of URLs, visits each one, extracts all the hyperlinks on those pages, and returns the deduplicated list of sublinks along with the corresponding text of each page.
from typing import List

def get_sublinks(links: List[str]):
    sublinks = []
    sublinks_text = []
    for link in links:
        if type(link) != str:
            continue
        # Send a GET request to the page's URL
        response = requests.get(link)
        # Parse the HTML code with BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')
        # Find all <a> tags and keep site-relative hrefs
        for sublink in soup.find_all('a'):
            href = sublink.get('href')
            if type(href) == str and href.startswith("/"):
                web_url = "https://www.theshipslist.com" + href
                if web_url in sublinks:
                    continue  # skip duplicates so links and texts stay aligned
                web_text = get_text_from_url(web_url)
                if len(web_text) > 0:
                    sublinks.append(web_url)
                    sublinks_text.append(web_text)
    return sublinks, sublinks_text
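
The extraction step in the next section reads the crawl output from a CSV with a 'citation' column (the page URL) and a 'text' column (the scraped text). The original write-up does not show how that file is produced; a minimal sketch, assuming the two crawl functions above and an illustrative file name, might look like this:

import pandas as pd

# Crawl the top-level pages, then the pages they link to.
links, links_text = get_links()
sublinks, sublinks_text = get_sublinks(links)

# Save one row per page so the extraction loop can read 'citation' and 'text'.
# The file name "shipslist_pages.csv" is illustrative, not from the original code.
df = pd.DataFrame({
    "citation": links + sublinks,
    "text": links_text + sublinks_text,
})
df.to_csv("shipslist_pages.csv", index=False)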

GPT-3 Ship Name Extraction:

Next, GPT-3, an advanced language model developed by OpenAI, was used to analyze the text data from the ShipsList website. GPT-3 was tasked with identifying ship names within the text. Here is how it worked:

  1. Processing loop: The script loads the CSV of scraped pages and iterates over it one row (a URL and its corresponding text) at a time.
  2. Chunking: It splits the text content into smaller chunks (the code below uses windows of at most 2,500 characters as a rough proxy for the limit on how many tokens the OpenAI API can process in a single request).
  3. Call GPT-3 / Claude Model: For each chunk, the script sends a request to the OpenAI API, or the Claude API, instructing the AI model to “Find all the names of ships here. Output as a csv, and only the csv. Do not output peoples names or destinations: {selected_text}”. This is asking the model to identify ship names from the provided text (selected_text).
  4. Convert to DataFrame: A temporary DataFrame is created to hold these ship names along with the corresponding URL. This temporary DataFrame is then appended to a main DataFrame (shiplist_df) for accumulating all ship names across all text chunks within the URL.
  5. Output: After processing each URL and its corresponding text, the collected ship names are written out to a CSV file named with the current count. The CSV file is stored in the ‘output’ directory.

GPT-3 Model Code:

import math
import openai
import pandas as pd

# df holds one row per scraped page, with a 'citation' (URL) column and a 'text' column.
# openai.api_key is assumed to be configured elsewhere.
count = 0
while count < len(df):
    short_df = df[count:count+1]
    shiplist_df = pd.DataFrame()
    for i, row in short_df.iterrows():
        # Extract the web_URL and web_text columns
        url = row['citation']
        text = row['text']
        count_tokens = 0
        max_tokens = 2500
        max_count_of_text = int(math.ceil(len(text) / max_tokens))
        while count_tokens < max_count_of_text:
            # Slice out the next chunk of text for the model
            length = min(count_tokens*max_tokens + max_tokens, len(text))
            selected_text = text[count_tokens*max_tokens:length]
            completion = openai.ChatCompletion.create(
                model="gpt-3.5-turbo-0301",
                messages=[
                    {
                        "role": "user",
                        "content": f"Find all of the names of ships here: {selected_text}. Return each ship, with the name of each ship in a list separated by a comma. For the ship name, don't include any other words besides the name of the ship."
                    }
                ]
            )
            ships = completion.choices[0].message.content
            ship_list = ships.replace("\n", "").split(", ")
            # Pair each extracted ship name with the URL it came from
            temp_df = pd.DataFrame()
            temp_df["url"] = [url for _ in range(len(ship_list))]
            temp_df["ship_name"] = ship_list
            shiplist_df = pd.concat([shiplist_df, temp_df])
            count_tokens += 1
    # Write this URL's ship names to a numbered CSV file
    shiplist_df.to_csv(str(count) + "_temporary_shiplist.csv")
    count += 1

Claude Model Code:

import math
import anthropic
import pandas as pd

# claude_client is an Anthropic API client configured elsewhere; df again holds
# a 'citation' (URL) column and a 'text' column for each scraped page.
count = 0
while count < len(df):
    short_df = df[count:count+1]
    shiplist_df = pd.DataFrame()
    for i, row in short_df.iterrows():
        # Extract the web_URL and web_text columns
        url = row['citation']
        text = row['text']
        count_tokens = 0
        max_tokens = 1000
        max_count_of_text = int(math.ceil(len(text) / max_tokens))
        while count_tokens < max_count_of_text:
            length = min(count_tokens*max_tokens + max_tokens, len(text))
            selected_text = text[count_tokens*max_tokens:length]
            prompt = (
                f"{anthropic.HUMAN_PROMPT} Find all of the names of ships here in the following "
                f"unstructured text: {selected_text}. Return each ship, with the name of each ship "
                "in a list separated by a comma. For the ship name, don't include any other words "
                "besides the name of the ship. Please note that the name of the ship can be located "
                "at the beginning of the scraped text, where it precedes a description of its voyage. "
                "For example, the ship name in the following text is 'Caledonia': Caledonia - 26th "
                "trip up, Quebec to Montreal 29th October 1822. It can also be located in a table of "
                f"a webpage underneath the column 'Vessel' or 'Ship Name'. {anthropic.AI_PROMPT}"
            )
            # Use Claude to extract the ship names
            response = claude_client.completion(
                prompt=prompt,
                stop_sequences=[anthropic.HUMAN_PROMPT],
                model="claude-v1",
                max_tokens_to_sample=max_tokens,
                temperature=0.5
            )
            completion = response["completion"]
            ship_list = completion.replace("\n", "").split(", ")
            # Pair each extracted ship name with the URL it came from
            temp_df = pd.DataFrame()
            temp_df["url"] = [url for _ in range(len(ship_list))]
            temp_df["ship_name"] = ship_list
            shiplist_df = pd.concat([shiplist_df, temp_df])
            count_tokens += 1
    # Write this URL's ship names to a numbered CSV file
    shiplist_df.to_csv("./output_claude/" + str(count) + "_temporary_shiplist.csv")
    count += 1

Post-processing and Refinement:

After obtaining the initial extraction results, Ship Index applied post-processing techniques to refine them: filtering out false positives, handling variations in how ship names are written, and ensuring reliable ship name identification, which improved the accuracy of the results. A sketch of these steps appears after the list below.

  1. Dropping duplicates and unnecessary columns: Initially, the code loads the CSV file into a DataFrame and drops duplicate rows based on “url” and “ship_name” and then removes unwanted column names.
  2. Removing multiple entries for the same URL: If the same URL appears multiple times in the “passenger lists” category, all but the first appearance are dropped.
  3. Cleaning the ‘ship_name’ column: The script drops rows if the “ship_name” field is a certain type (float), contains certain substrings (e.g., “N/A”, “Mr.”, “Mrs.”, etc.), or has a certain length or number of words. This is to remove irrelevant or incorrect ship names.
  4. Replacing and formatting strings: The code then goes through and replaces certain characters and phrases in the “ship_name” column. It also removes any leading or trailing numbers or punctuation and removes any rows where the ship name is blank.
  5. Removing rows based on named entities: The code then uses the spaCy library to remove rows where the ship name is recognized as a person's name (based on spaCy's named entity recognition). It then does the same for locations.
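
The post-processing script itself is not included in the article. The sketch below shows one way the steps above could be implemented, assuming the combined extraction results sit in a DataFrame with 'url' and 'ship_name' columns and using spaCy's small English model. The substring list, word-count cut-off, and entity labels are illustrative assumptions, and step 2 (collapsing repeated passenger-list URLs) is omitted because it relies on category metadata not shown here.

import re

import pandas as pd
import spacy

# spaCy's small English model, used for the named-entity filters in step 5.
nlp = spacy.load("en_core_web_sm")

def clean_shiplist(df: pd.DataFrame) -> pd.DataFrame:
    # 1. Drop duplicate (url, ship_name) pairs and keep only the needed columns.
    df = df.drop_duplicates(subset=["url", "ship_name"])[["url", "ship_name"]]

    # 3. Drop non-string values and entries containing known bad substrings,
    #    or with too many words to be a plausible ship name (values illustrative).
    df = df[df["ship_name"].apply(lambda x: isinstance(x, str))].copy()
    bad_substrings = ["N/A", "Mr.", "Mrs."]
    pattern = "|".join(re.escape(s) for s in bad_substrings)
    df = df[~df["ship_name"].str.contains(pattern)]
    df = df[df["ship_name"].str.split().str.len() <= 5]

    # 4. Strip leading/trailing digits and punctuation, then drop blank names.
    df["ship_name"] = df["ship_name"].str.strip(" 0123456789.,;:-")
    df = df[df["ship_name"] != ""]

    # 5. Use spaCy named entity recognition to drop names recognised as
    #    people (PERSON) or locations (GPE/LOC).
    def is_person_or_place(name: str) -> bool:
        doc = nlp(name)
        return any(ent.label_ in ("PERSON", "GPE", "LOC") for ent in doc.ents)

    df = df[~df["ship_name"].apply(is_person_or_place)]
    return df.reset_index(drop=True)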

Modifications:

One of the first major changes we came up with for the above solution was modifying the prompt provided to ChatGPT to give a little more information on how to identify names of ships. For instance, in our prompt we noted that “the name of the ship can be located at the beginning of the scraped text, where it precedes a description of its voyage. For example, the ship name in the following text is ‘Caledonia’: Caledonia — 26th trip up, Quebec to Montreal 29th October 1822. It can also be located in a table of a webpage underneath the column ‘Vessel’ or ‘Ship Name’.” This was to clear up any ambiguities between ships and other categories.
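
To make the change concrete, here is a minimal sketch of how the revised instructions could be dropped into the openai.ChatCompletion call shown earlier. The exact wording ultimately sent to GPT-3.5 is not reproduced in the GPT-3 code above, so the prompt string below is assembled from the quoted modification and should be read as illustrative rather than the production prompt.

# Illustrative only: revised prompt text assembled from the Modifications section;
# selected_text is a text chunk from the chunking loop shown earlier.
revised_prompt = (
    f"Find all of the names of ships here in the following unstructured text: {selected_text}. "
    "Return each ship, with the name of each ship in a list separated by a comma. "
    "For the ship name, don't include any other words besides the name of the ship. "
    "Please note that the name of the ship can be located at the beginning of the scraped text, "
    "where it precedes a description of its voyage. For example, the ship name in the following "
    "text is 'Caledonia': Caledonia - 26th trip up, Quebec to Montreal 29th October 1822. "
    "It can also be located in a table of a webpage underneath the column 'Vessel' or 'Ship Name'."
)
completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0301",
    messages=[{"role": "user", "content": revised_prompt}],
)
ships = completion.choices[0].message.content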

We also experimented with the Bard API, using a token obtained from the browser's JavaScript console. However, after multiple attempts we were rate limited and kept receiving error responses to our requests.

After working with ChatGPT's and Bard's APIs, we turned to Claude, the LLM developed by Anthropic. Claude proved more useful because its much larger context window let us extract more information from the text scraped from each webpage in a single request.

GPT-3 Results:

Before Post-processing

  • 3,150 URLs processed
  • 181,182 citations extracted
  • Average of 57 citations per URL

After Post-processing

  • 2,842 URLs processed
  • 111,028 citations remaining after refinement
  • Average of 35 citations per URL

In total, we identified 57,733 unique ships on ShipsList that otherwise would have gone unnoticed.

Claude Results on non-passenger list URLs:

Before Post-processing (has duplicates)

  • 1,270 URLs processed
  • 85,742 citations extracted
  • Average of 67 citations per URL
  • 33,989 unique ships identified

After Post-processing

  • 1,243 URLs processed
  • 63,884 citations remaining after refinement
  • Average of 51 citations per URL
  • 32,449 unique ships identified

Conclusion

By leveraging Apache Tika for text extraction during web crawling, OpenAI's GPT-3 and Anthropic's Claude for ship name extraction, and post-processing techniques for refinement, Ship Index successfully created a structured index of ship names from the ShipsList website. This solution significantly enhanced ship research capabilities by giving researchers an organized, easily accessible resource for locating specific ship information.


Anote

General Purpose Artificial Intelligence. Like our product, our medium articles are written by novel generative AI models, with human feedback on the edge cases.