Pulling Data from AO3

hornyhermionebot
3 min read · Oct 27, 2021

Hello hello, I’m hornyhermionebot, and recently I pulled a bunch of data from AO3 to generate some statistics that people on twitter seem to be into.

A couple of quick notices: I work in data analysis and webdev in general, so there’s a fair bit of background knowledge that’s helpful for someone going into this, but it’s not absolutely necessary. There’s a massive number of tutorials online on how to do this, and all my code for this was so jank anyway that I’d be lying if I called it professional.

First things first, the ~techstack~.

All data was pulled using a micro AWS instance (if you’ve never used one before, they’re free!) because I didn’t want my computer to have to run this process. It’s relatively lengthy if you’re being kind to the AO3 servers (roughly 7 hours for ~35,000 fics), but if you’re proficient and only plan on doing this once it could be sped up.

All web scraping was done using Python’s requests and BeautifulSoup. To initialise, it’s first important to know the number of pages that need to be scraped. The easiest method I found was the following code:

import requests
from bs4 import BeautifulSoup

# Identify yourself to AO3; any descriptive User-Agent string will do.
headers = {"User-Agent": "fic-stats-scraper"}

def get_all_pages():
    base_url = "https://archiveofourown.org/tags/Hermione%20Granger/works"
    r = requests.get(base_url, headers=headers)
    soup = BeautifulSoup(r.content, "html.parser")
    # The pagination bar is an <ol class="pagination actions"> of page links.
    pagination = soup.find("ol", {"class": "pagination actions"})
    page_numbers = []
    for li in pagination.find_all("li"):
        for a in li.find_all("a"):
            page_numbers.append(a.contents[0])
    # The last link is "Next", so the second-to-last entry is the final page number.
    last_page = page_numbers[-2]
    return {
        "base_url": base_url,
        "total_pages": int(last_page)
    }

This first loads the page for all Hermione works, looks for the pagination section, and finds the final number listed. This is the total number of pages that need to be scraped. It works for any fandom or pairing; just change the base URL.

import time

# Blank array that every scraped fic gets appended to.
relationships_array = []

def find_fics(base_url):
    # Retry the request until AO3 stops rate-limiting us (usually a 429).
    while True:
        r = requests.get(base_url, headers=headers)
        if r.status_code == 200:
            break
        print(f"TIMING OUT - STATUS: {r.status_code}")
        print(r.headers)
        time.sleep(300)
        print("REATTEMPTING")
    soup = BeautifulSoup(r.content, "html.parser")
    # Every work on a listing page sits inside <ol class="work index group">.
    all_works_souped = soup.find("ol", {"class": "work index group"})
    try:
        all_fics_on_page = all_works_souped.findChildren("li", recursive=False)
        for chosen_fic in all_fics_on_page:
            if not tag_check(chosen_fic):
                continue
            tag_list = chosen_fic.find("ul", {"class": "tags commas"})
            relationships = tag_list.find("li", {"class": "relationships"}).text.replace("\n", "")
            rating = chosen_fic.find("span", {"class": "rating"}).text.replace("\n", "")
            datetime = chosen_fic.find("p", {"class": "datetime"}).text.replace("\n", "")
            language = chosen_fic.find("dd", {"class": "language"}).text.replace("\n", "")
            word_count = int(chosen_fic.find("dd", {"class": "words"}).text.replace("\n", "").replace(",", ""))
            hits = int(chosen_fic.find("dd", {"class": "hits"}).text.replace("\n", "").replace(",", ""))
            fic = {
                "relationship": relationships,
                "rating": rating,
                "datetime": datetime,
                "language": language,
                "words": word_count,
                "hits": hits
            }
            relationships_array.append(fic)
    except Exception:
        # Some pages (deleted works, odd layouts) won't parse cleanly; skip them.
        pass

This absolute mess of code has the following flow:

  • Attempt to load the page; on failure (usually a 429 Too Many Requests error), wait five minutes before retrying.
  • Once the page loads, generate an array of each individual fic on the page.
  • Check each individual fic for the keyword Hermione in the primary relationship (a sketch of this check follows the list).
  • If the keyword is found, pull the relationship, the rating (explicit etc.), publish date, language, word count and hits. This can be expanded to also grab comments, kudos, etc.
  • Add this to a blank array, and move on to the next one.
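The tag_check helper isn’t shown above; here’s a minimal sketch of what it could look like, assuming you simply want to confirm that the first listed relationship tag mentions Hermione (the matching logic here is my own reconstruction, not necessarily the exact check used for the stats):

def tag_check(chosen_fic):
    # Hypothetical reconstruction: grab the first relationship tag on the
    # work blurb and check whether Hermione appears in it.
    tag_list = chosen_fic.find("ul", {"class": "tags commas"})
    if tag_list is None:
        return False
    first_relationship = tag_list.find("li", {"class": "relationships"})
    if first_relationship is None:
        return False
    return "Hermione" in first_relationship.text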

After processing all this data, it converts each object ( {name: value} ) to a row in a pandas dataframe and saves it to the EC2 instance. A rough sketch of that final step follows.
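The driver loop and the save step aren’t shown above, but they look roughly like this (AO3 paginates listing pages with a ?page=N query parameter; the sleep interval and output filename are just examples, not the exact values I used):

import pandas as pd

pages = get_all_pages()
for page in range(1, pages["total_pages"] + 1):
    # Scrape one listing page at a time, building up relationships_array.
    find_fics(f"{pages['base_url']}?page={page}")
    # Stay kind to the AO3 servers between pages.
    time.sleep(5)

# Each dict in relationships_array becomes one row of the dataframe.
df = pd.DataFrame(relationships_array)
df.to_csv("hermione_fics.csv", index=False)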

All data analysis was done using Pandas and matplotlib from that point onwards.
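To give a flavour of that side of things, even a few lines like these (just an illustrative example, not one of the actual twitter charts) produce a usable plot from the saved CSV:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("hermione_fics.csv")
# Count how many works fall under each rating and plot it as a bar chart.
df["rating"].value_counts().plot(kind="bar")
plt.ylabel("Number of works")
plt.tight_layout()
plt.show()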

If I can make a brief comment, this was a fun little project I’m thinking of expanding, and if anyone is interested in replicating it themselves I can highly recommend it as a practical and rewarding introduction to coding and webscraping in general.

xoxo bothermione
