In the world of data, databases play a crucial role in harnessing information. In this article, we’ll learn how to create a database using Python and SQLite3, a lightweight database ideally suited for prototyping minimum viable products (MVPs). It’s important to note that SQL stands for Structured Query Language, and SQLite is just one flavor of SQL database. Other flavors worth exploring include PostgreSQL, MySQL, and SQL Server (Microsoft’s relational database).
For this use case we are going to parse data from a CSV file and import the contents into a SQLite3 database, using a script written in Python. Let’s start by importing the packages and specifying the CSV file. For this I’m using the file that can be found here: https://github.com/adamtlee/pokemon-etl-project-blog/blob/main/output/output_2024-05-24_13-05-50.csv
import sqlite3
import csv
# Path to the CSV file
file_path = "output/output_2024-05-24_13-05-50.csv"
# Open the CSV file
with open(file_path, mode='r', encoding='utf-8-sig') as file:
    csv_reader = csv.reader(file)
    header = next(csv_reader)  # Skip the header row
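As an aside, if you prefer to access columns by name rather than by position, the same file could be read with csv.DictReader, which turns each row into a dictionary keyed by the header. A minimal sketch, assuming the same file path and header as above:
import csv

file_path = "output/output_2024-05-24_13-05-50.csv"

# DictReader uses the header row as keys, so each row becomes a dictionary
with open(file_path, mode='r', encoding='utf-8-sig') as file:
    dict_reader = csv.DictReader(file)
    for row in dict_reader:
        print(row["name"], row["id"])  # assumes the CSV has "name" and "id" columns
The rest of this article sticks with csv.reader, which returns plain lists.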
Next we can connect to the database called pokemon_etl_project.db, or let sqlite3 create the database for us if it doesn’t yet exist:
# Connect to SQLite3 database (or create it if it doesn't exist)
conn = sqlite3.connect('pokemon_etl_project.db')
cursor = conn.cursor()
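As a side note, sqlite3 connections can also be used as context managers, which commit the transaction if the block succeeds and roll it back if an exception is raised (the connection itself still needs to be closed). A rough sketch of that pattern, using a hypothetical demo table:
import sqlite3

conn = sqlite3.connect('pokemon_etl_project.db')
with conn:
    # Committed automatically on success, rolled back on an exception
    conn.execute("CREATE TABLE IF NOT EXISTS demo (id INTEGER PRIMARY KEY)")  # hypothetical table
conn.close()  # the context manager does not close the connection itself
In this article we’ll manage the cursor, commit, and close calls explicitly so each step is visible.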
Once the database connection is open, we can create a table within the database if it doesn’t yet exist. That table will be called pokemon_data and will have the following columns:
- name : TEXT
- id : INTEGER (also the primary key)
- height : INTEGER
- weight : INTEGER
- base_experience : INTEGER
- types : TEXT
# Create a table based on the CSV header
create_table_query = """
CREATE TABLE IF NOT EXISTS pokemon_data (
    name TEXT,
    id INTEGER PRIMARY KEY,
    height INTEGER,
    weight INTEGER,
    base_experience INTEGER,
    types TEXT
)
"""
cursor.execute(create_table_query)
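If you want to confirm that the table really was created, SQLite keeps a catalog of schema objects in sqlite_master that you can query directly. A quick sketch, continuing with the same cursor:
# Look the table up in SQLite's schema catalog to confirm it exists
cursor.execute("SELECT name FROM sqlite_master WHERE type='table' AND name='pokemon_data'")
print(cursor.fetchone())  # should print ('pokemon_data',) once the table exists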
Once the table has been created within the SQLite database, we can persist the data from our CSV file to populate it. Note that the loop below iterates over csv_reader, so in the full script it runs inside the with block that opened the CSV file, while the file is still open. After persisting the data, we’ll commit the transaction, close the connection to the database, and print a statement to the console to confirm that the data has been inserted correctly.
# Insert data into the table
for row in csv_reader:
    try:
        # Clean the data
        row = [x.strip() for x in row]
        cursor.execute("""
            INSERT INTO pokemon_data (name, id, height, weight, base_experience, types)
            VALUES (?, ?, ?, ?, ?, ?)
        """, (row[0], int(row[1]), int(row[2]), int(row[3]), int(row[4]), row[5]))
    except ValueError as e:
        print(f"Skipping row due to error: {e}")
        continue
# Commit the transaction and close the connection
conn.commit()
conn.close()
print("Database created and data inserted successfully.")
After this script has been executed and our SQLite database is created and populated, we can verify that it contains the correct data from the CSV file with the script below:
import sqlite3
# Connect to the SQLite3 database
conn = sqlite3.connect('pokemon_etl_project.db')
cursor = conn.cursor()
# Query to select all data from the pokemon_data table
query = "SELECT * FROM pokemon_data"
# Execute the query
cursor.execute(query)
# Fetch all rows from the executed query
rows = cursor.fetchall()
# Print the column headers
column_names = [description[0] for description in cursor.description]
print(column_names)
# Print each row
for row in rows:
    print(row)
# Close the connection
conn.close()
The output should display something similar to the listing below:
['name', 'id', 'height', 'weight', 'base_experience', 'types']
('bulbasaur', 1, 7, 69, 64, 'grass, poison')
('ivysaur', 2, 10, 130, 142, 'grass, poison')
('venusaur', 3, 20, 1000, 263, 'grass, poison')
('charmander', 4, 6, 85, 62, 'fire')
('charmeleon', 5, 11, 190, 142, 'fire')
('charizard', 6, 17, 905, 267, 'fire, flying')
('squirtle', 7, 5, 90, 63, 'water')
('wartortle', 8, 10, 225, 142, 'water')
('blastoise', 9, 16, 855, 265, 'water')
('caterpie', 10, 3, 29, 39, 'bug')
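Beyond a full table dump, a parameterized SELECT lets us pull back just the rows we care about. A small sketch, assuming the database and table created above, that looks up a single Pokémon by name:
import sqlite3

conn = sqlite3.connect('pokemon_etl_project.db')
cursor = conn.cursor()

# The ? placeholder keeps user-supplied values out of the SQL string itself
cursor.execute("SELECT name, height, weight, types FROM pokemon_data WHERE name = ?", ("charizard",))
print(cursor.fetchone())  # e.g. ('charizard', 17, 905, 'fire, flying')

conn.close()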
To conclude, this is a common example of how to create and populate a database using existing data from a CSV file. In a real-world scenario, a business could have an application that doesn’t expose an Application Programming Interface (API) through which data could be extracted and ingested via an API integration.
In that case, the only options are manual data extraction or an export feature that writes the application’s current data to CSV format. From there we can use that CSV to populate whatever databases we need to support internal business requirements and documentation (one quick way to do that is sketched below).
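For what it’s worth, when a CSV maps cleanly onto a single table, pandas can handle the whole load in a couple of lines. A rough sketch, assuming pandas is installed and that the CSV column names match the table’s columns:
import sqlite3
import pandas as pd

conn = sqlite3.connect('pokemon_etl_project.db')

# Read the CSV into a DataFrame and append its rows to the pokemon_data table
df = pd.read_csv("output/output_2024-05-24_13-05-50.csv", encoding='utf-8-sig')
df.to_sql('pokemon_data', conn, if_exists='append', index=False)

conn.close()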
Sources:
- Find the CSV Data here: https://github.com/adamtlee/pokemon-etl-project-blog/blob/main/output/output_2024-05-24_13-05-50.csv
- The full script to create and populate the database can be found here: https://github.com/adamtlee/pokemon-etl-project-blog/blob/main/project_two/main.py
- The query for the database can be found here: https://github.com/adamtlee/pokemon-etl-project-blog/blob/main/project_two/query.py