Accio Database

01 Nov 2024 · #api #python #data-engineering

Listen up, you witches and wizards. Today we're summoning a database.

Here's the setup:

  • We're sourcing data from the Potter DB API.
  • We'll use python to grab everything we can from the API and store the data as CSV files.
  • Then we'll push the CSVs into a SQLite database using SQLAlchemy (SQLWizardry?) models.

By the end of this, you'll have a neat little database you can use for your programming (wizarding?) projects.

Grab the portkey and let's get going!

Part 1: Scraping the API (Accio Data)

Potter DB is a gem on the Internet. It's a project that scrapes the Harry Potter wiki and makes its data available for programmatic use. The API has five endpoints: book, character, movie, potion, and spell. Our first script scrape_api.py will send requests to these endpoints and save the response in CSV files.

To avoid repeating ourselves for each of the endpoints, we'll use parameters as much as possible. The schemas.py file is the Maurader's Map of what endpoints we need to hit and what we need to look for when we get there. It's the file that essentially pumps the parameters with values.

# schemas.py
schemas = {
    #...
    "spell": {
        "api_endpoint": "v1/spells",
        "api_query": {"page[size]": 100},
        "attributes": [
            "slug",
            "name",
            "incantation",
            "category",
            "effect",
            "light",
            "hand",
            "creator",
            "image",
            "wiki",
        ],
    },
    #...
}

Like almost all REST APIs, Potter DB returns JSON objects, which can contain nested values. The Movie and Character objects have nested arrays to account for things like multiple producers or multiple love interests. Our database will feature relational tables, so we need to do a bit of transfiguration.

To create relational tables, the nested arrays will be normalized into separate tables. For example, each character's array of romances will be written to a separate table called character_romances. When querying later, joins can be created between character and character_romances as needed.

Data transfiguration

Lastly, there's no guard dog named Fluffy (i.e. authentication requirements) protecting the API. Instead the API limits usage by tracking your IP address. If we accidentally overuse the API and get 429 errors, the script uses exponential backoff to try again later.

Alright, let's cast the first spell!

Run scrape_api.py

Part 2: Loading the Database (Wingardium Leviosa)

Now that we have our data in CSV format, we can load it into a database. Our second script seed_db.py uses SQLAlchemy to initialize a SQLite database. The script then reads the CSV files and inserts records into the database.

The file models.py defines our tables as SQLAlchemy models, which describe our columns and foreign key relationships. Here's an example of the Book model:

# models.py
# ...
class Book(Base):
    __tablename__ = "book"

    id = mapped_column(String, primary_key=True)
    slug = mapped_column(String, unique=True, nullable=False)
    title = mapped_column(String, unique=True, nullable=False)
    summary = mapped_column(String)
    author = mapped_column(String)
    release_date = mapped_column(DateTime)
    dedication = mapped_column(String)
    pages = mapped_column(Integer)
    cover = mapped_column(String)
    wiki = mapped_column(String)

    chapters = relationship(
        "Chapter",
        lazy="joined",
        order_by="Chapter.order",
        cascade="all, delete-orphan",
    )

    def __repr__(self):
        return f"<Book(id='{self.id}', slug='{self.slug}')>"
# ...

Before we load data into our tables, we need to prepare for some stricter data modeling rules, like data types and uniqueness.

First, everything in a CSV file is a string. We need to make sure any datetimes in the CSV are interpreted as true datetime objects. The script performs this parsing and type conversion when creating the ORM object for each row.

Second, we need to handle any duplicate records returned by the API. As the script walks through the CSV file, it logs a hash of the row's ID or the entire row itself. If a duplicate record is found, it is not written to the database; instead the record is sent to a text file errors.txt for later review.

Step right up and cast the second spell! Watch the CSV file content levitate into the database.

Run seed_db.py

Log into the database at ./data/sqlite/potter.db to make sure everything looks good. Give it a whirl and test out a query:

SELECT *
FROM character
WHERE slug IN (
    'harry-potter',
    'hermione-granger',
    'ronald-weasley',
    'albus-dumbledore'
);
id                                    slug              name                                     born                                                                      died                                                                                           gender  species  height  weight                                    hair_color                   eye_color     skin_color  blood_status  marital_status  nationality  animagus  boggart                           house       patronus              image                                                                                            wiki                                                
------------------------------------  ----------------  ---------------------------------------  ------------------------------------------------------------------------  ---------------------------------------------------------------------------------------------  ------  -------  ------  ----------------------------------------  ---------------------------  ------------  ----------  ------------  --------------  -----------  --------  --------------------------------  ----------  --------------------  -----------------------------------------------------------------------------------------------  ----------------------------------------------------
a57de83d-2a44-40d4-8060-75895fa756f5  albus-dumbledore  Albus Percival Wulfric Brian Dumbledore  Between 16 and 31 August 1881, Mould-on-the-Wold, England, Great Britain  30 June 1997 (aged 115), Astronomy Tower, Hogwarts Castle, Highlands, Scotland, Great Britain  Male    Human    5'11'   175 lbs (formerly; 1932), 168 lbs (1997)  Auburn (originally), Silver  Blue          Fair        Half-blood    Single          English                The corpse of his sister, Ariana  Gryffindor  Phoenix               https://static.wikia.nocookie.net/harrypotter/images/7/75/Albus_Dumbledore_%28HBPF_promo%29.jpg  https://harrypotter.fandom.com/wiki/Albus_Dumbledore
bcae9def-6584-4300-ac63-ff007974bf3c  harry-potter      Harry James Potter                       31 July 1980, Godric's Hollow, West Country, England, Great Britain                                                                                                      Male    Human                                                      Jet-black                    Bright green  Light       Half-blood    Married         English                Dementor                          Gryffindor  Stag                  https://static.wikia.nocookie.net/harrypotter/images/c/ce/Harry_Potter_DHF1.jpg                  https://harrypotter.fandom.com/wiki/Harry_Potter    
9a992090-02b8-4c89-9e6a-bdaa32404c69  hermione-granger  Hermione Jean Granger                    19 September 1979, England, Great Britain                                                                                                                                Female  Human    5'5'    118 lbs                                   Brown                        Brown         Pale        Muggle-born   Married         English                Failure                           Gryffindor  Otter                 https://static.wikia.nocookie.net/harrypotter/images/3/34/Hermione_Granger.jpg                   https://harrypotter.fandom.com/wiki/Hermione_Granger
d67ddb5d-192b-4d75-8f18-c55a1b5ce442  ronald-weasley    Ronald Bilius Weasley                    1 March 1980, Ottery St Catchpole, Devon, England, Great Britain                                                                                                         Male    Human    5'9'    152 lbs                                   Red                          Blue          Pale        Pure-blood    Married         English                Aragog                            Gryffindor  Jack Russell terrier  https://static.wikia.nocookie.net/harrypotter/images/4/44/Ronald_Weasley_DHF1.jpg                https://harrypotter.fandom.com/wiki/Ronald_Weasley

Mischief managed. You just used magic (programming) to capture data from an API and bottle it away in your private database. Yer a wizard, Harry.

Want to go deeper as you study for your OWLs? Download the code and check out the README.

On that note, send me an owl when you need help with your next magical challenge.