Using Databases
This exercise will introduce relational and unstructured databases. We will use a dataset kindly shared by Jan Aerts, the leader of the Data Visualization Lab at KU Leuven, Belgium. He introduces Mongo in this exercise however we will start with relational databases.
To complete this tutorial, we will use a few different small datasets available alongside this notebook. We will use the following environment.
# environment for elements of scientific computing (esc) database tutorial
name: esc03
dependencies:
- python=3.7
- conda-forge::mongo-tools
- conda-forge::sqlite
- pymongo
- sqlalchemy
- nb_conda_kernels
- jupyter
- jupyterlab
- pip
You can build this environment from these instructions. On Blue Crab this
environment can be accessed with ml anaconda
and conda activate databases-
examples
.
Relational databases
If you’ve used a spreadsheet, then you understand how a relational database
works. A relation is basically a sheet or table in a spreadsheet which defines
a set of tuples (rows) that all have the same attributes (columns). A relational
database is composed of many such tables along with unique ID numbers for each
row. These IDs allow for relations between tables known as a JOIN
clause.
An introduction to sqlite
In the following exercise we will build a simple database. Our data comes from a
file called beers.txt
(courtesy of Jan Aerts). Most of the following exercise will
be completed entirely in BASH and recapped in the steps below. Note that a $
indicates the BASH prompt, however most of this exercise occurs inside the
sqlite3
program, which has its own prompt.
Source data
The text in beers.txt
includes the beer name, ABV, and brewery, in the
following format:
$ head beers.txt
3 Schténg|Brasserie Grain d'Orge|6.0
400|'t Hofbrouwerijke voor Brouwerij Montaigu|5.6
IV Saison|Brasserie de Jandrain-Jandrenouille|6.5
V Cense|Brasserie de Jandrain-Jandrenouille|7.5
VI Wheat|Brasserie de Jandrain-Jandrenouille|6.0
Aardmonnik|De Struise Brouwers|8.0
Aarschotse Bruine|Stadsbrouwerij Aarschot|6.0
Abbay d'Aulne Blonde des Pères 6|Brasserie Val de Sambre|6.0
Abbay d'Aulne Brune des Pères 6|Brasserie Val de Sambre|6.0
Abbay d'Aulne Super Noël 9|Brasserie Val de Sambre|9.0
We have used a vertical bar |
to separate the records in case a comma exists
in any of the names.
Building the database
- Make a new database. This command opens the
SQLite
terminal.
$ sqlite3 beers.sqlite3
SQLite version 3.30.1 2019-10-10 20:19:45
Enter ".help" for usage hints.
sqlite>
- Create a table for the breweries. You can extend your output over multiple lines since they are terminated with a semicolon.
sqlite> CREATE TABLE brewery (
name VARCHAR(128)
);
sqlite> .tables
brewery
sqlite> .schema brewery
The command above can repeat the schema back to you.
- Drop a table.
sqlite> DROP TABLE brewery;
sqlite> .tables
- Make the table again and include a second table to define the beer and its
alcohol content as a
REAL
type, that is, a floating point number. Note that we are also including a brewery ID and a “foreign key” which we will explain in class.
sqlite> CREATE TABLE brewery (
name VARCHAR(128)
);
sqlite> CREATE TABLE beer (
name VARCHAR(128),
abv REAL,
brewery_id INTEGER,
FOREIGN KEY(brewery_id) REFERENCES brewery(rowid)
);
.schema
- Now we will insert some data and read it back. The
SELECT
command creates a database “query”.
sqlite> INSERT INTO brewery VALUES('Dogfishead');
sqlite> INSERT INTO brewery VALUES('Tired Hands');
sqlite> SELECT rowid, name from brewery;
- Now we can insert a beer into the database and associate it with a brewery.
sqlite> INSERT INTO beer VALUES ('90 Minute IPA', 9.0, 1);
sqlite> INSERT INTO beer VALUES ('60 Minute IPA', 6.0, 1);
sqlite> INSERT INTO beer VALUES ('HopHands', 5.5, 2);
sqlite> INSERT INTO beer (name) VALUES ('SaisonHands');
- We can easily search the database for beers from a particular brewery.
sqlite> SELECT rowid, name FROM beer WHERE brewery_id=1;
- We can query multiple tables at once using the
INNER JOIN
syntax. In the following example, we will collect all pairs of brewery names and ABV values by joining the beer’s brewery ID number with the row ID on the brewery table.
sqlite> SELECT brewery.name,beer.abv FROM beer INNER JOIN brewery ON
brewery.rowid=beer.brewery_id;
Dogfishead|9.0
Tired Hands|5.5
Dogfishead|6.0
If we had created a more complex database, we could use a sequence of INNER
JOIN
to pull data from multiple tables.
- We can modify the tables in-place.
sqlite> ALTER TABLE beer ADD COLUMN style;
sqlite> UPDATE beer SET style="IPA" WHERE name="90 Minute IPA";
sqlite> select * from beer;
The exercise above covers the basics for interacting with sqlite3
directly in
a terminal. You are welcome to set up a BASH script to ingest and query the
data. For some use-cases, it may be easier to use an interface. In the next
exercise we will examine one type of interface.
Scripting your sqlite
workflows
In this example we will repeat some of the work above using sqlalchemy
from
this documentation.
In [1]:
The sqlalchemy
library is a front-end for many different types of databases.
The purpose of this library, and others written in different langauges, is to
abstract the programmer from the details of the database. You can use a database
ORM (object-relational mapping) to design a database in your preferred language.
After you work with an ORM for a while, you might choose a different database
driver or back-end to suit your performance needs.
In this exercise we will use sqlalchemy
to design our database directly in
Python and ingest some of the data described in the sqlite
exercise above.
Since Python makes it somewhat easy to interact with strings, it can reduce the
labor required to ingest the data.
First, we create an “engine” with the following commands. Note that you can use
other drivers in place of sqlite
below. The following command creates the file
beer_alchemy.sqlite
on disk, however you are welcome to use :memory:
instead.
In [2]:
Next we must make a “session” to interact with the ORM.
In [3]:
We design our database by sub-classing a delarative_base
. All of our
interactions with the database are abstract, and occur implicitly when we
interact with the library objects (namely Base
).
In [4]:
Now that we have set the stage, we are ready to design our tables, which are represented as Python classes. One major downside to using an ORM is that this particular library does not allow for database migrations, hence you must regenerate the database if you wish to modify it. This imposes an important type of discipline on your workflow, however all databases must eventually be migrated, one way or the other.
In [5]:
In [6]:
Defining the class is not enough to actually build the table. For that, we
must communicate with the engine. Since we asked the engine to echo our results,
we can see how sqlalchemy
is directly manipulating the database.
In [7]:
2019-11-19 20:25:38,928 INFO sqlalchemy.engine.base.Engine SELECT CAST('test plain returns' AS VARCHAR(60)) AS anon_1
2019-11-19 20:25:38,929 INFO sqlalchemy.engine.base.Engine ()
2019-11-19 20:25:38,930 INFO sqlalchemy.engine.base.Engine SELECT CAST('test unicode returns' AS VARCHAR(60)) AS anon_1
2019-11-19 20:25:38,930 INFO sqlalchemy.engine.base.Engine ()
2019-11-19 20:25:38,931 INFO sqlalchemy.engine.base.Engine PRAGMA main.table_info("brewery")
2019-11-19 20:25:38,932 INFO sqlalchemy.engine.base.Engine ()
2019-11-19 20:25:38,933 INFO sqlalchemy.engine.base.Engine PRAGMA temp.table_info("brewery")
2019-11-19 20:25:38,934 INFO sqlalchemy.engine.base.Engine ()
2019-11-19 20:25:38,935 INFO sqlalchemy.engine.base.Engine PRAGMA main.table_info("beer")
2019-11-19 20:25:38,936 INFO sqlalchemy.engine.base.Engine ()
2019-11-19 20:25:38,937 INFO sqlalchemy.engine.base.Engine PRAGMA temp.table_info("beer")
2019-11-19 20:25:38,937 INFO sqlalchemy.engine.base.Engine ()
2019-11-19 20:25:38,938 INFO sqlalchemy.engine.base.Engine
CREATE TABLE brewery (
id INTEGER NOT NULL,
name VARCHAR,
PRIMARY KEY (id)
)
2019-11-19 20:25:38,939 INFO sqlalchemy.engine.base.Engine ()
2019-11-19 20:25:38,941 INFO sqlalchemy.engine.base.Engine COMMIT
2019-11-19 20:25:38,942 INFO sqlalchemy.engine.base.Engine
CREATE TABLE beer (
id INTEGER NOT NULL,
name VARCHAR,
abv FLOAT,
brewery_id INTEGER,
PRIMARY KEY (id),
FOREIGN KEY(brewery_id) REFERENCES brewery (id)
)
2019-11-19 20:25:38,942 INFO sqlalchemy.engine.base.Engine ()
2019-11-19 20:25:38,944 INFO sqlalchemy.engine.base.Engine COMMIT
We are finally ready to enter some data using the somewhat elegant Python syntax.
In [8]:
None
The beer has no ID yet because all of our Python objects are not permanently added to the database until we commit the results. The following command will finalize this transaction.
In [9]:
2019-11-19 20:25:38,956 INFO sqlalchemy.engine.base.Engine BEGIN (implicit)
2019-11-19 20:25:38,957 INFO sqlalchemy.engine.base.Engine INSERT INTO beer (name, abv, brewery_id) VALUES (?, ?, ?)
2019-11-19 20:25:38,958 INFO sqlalchemy.engine.base.Engine ('Rt. 113 IPA', 7.0, None)
2019-11-19 20:25:38,959 INFO sqlalchemy.engine.base.Engine COMMIT
In [10]:
Now that this object is entered, we can query the table to return it.
In [11]:
[<Beer name=Rt. 113 IPA>]
Ingesting data
One reason to move from the sqlite
terminal or BASH to a higher-level language
like Python is to make it easy to ingest some text. The following example comes
from the top of our beers.txt
file, which has itself been extracted for this
exercise from a source database. The data below include the beer name, brewery,
and alcohol by volume (ABV).
In [12]:
Let us ingest the data directly from this text file. We will discuss the
following code block in class. In short, we interpret the text file (which has
three columns) and then query the Brewery
and Beer
tables, represented by
Python classes. If a particular brewer or beer is missing, we add the objects
and commit. After this step is complete, we can read out all of the rows from
this limited data set.
In [13]:
[<Beer name=Rt. 113 IPA>,
<Beer name=3 Schténg>,
<Beer name=400>,
<Beer name=IV Saison>,
<Beer name=V Cense>,
<Beer name=VI Wheat>,
<Beer name=Aardmonnik>,
<Beer name=Aarschotse Bruine>,
<Beer name=Abbay d'Aulne Blonde des Pères 6>,
<Beer name=Abbay d'Aulne Brune des Pères 6>,
<Beer name=Abbay d'Aulne Super Noël 9>]
Having proved that this method works, we can ingest all of the data from the source. It’s best to package these operations into separate functions.
In [14]:
In [15]:
we have 1691 records
In [16]:
In [17]:
In [18]:
[====================] 100%
In [19]:
found 256658 records
In [20]:
(<Brewery name='t Hofbrouwerijke voor Brouwerij Montaigu>, <Beer name=V Cense>)
The number of records returned from our search above seems absurdly high. This is an example of an accidental outer join which was the result of clumsy code on my part! Let’s try again, and limit our “join” to the desired target: beer.
In [21]:
found 71 records
This is much more reasonable, particularly since our number of results does not
exceed the number of records. To find the best breweries (i.e. those with the
highest-octane beers), we can query both the Beer
and Brewery
objects
while joining only on Beer.
In [22]:
found 150 records
Next we can sort these to find the booziest beers, paired with their brewers.
In [23]:
9.20 <Beer name=Keyte-Dobbel-Tripel> <Brewery name=Brouwerij Strubbe>
9.20 <Beer name=Rochefort 8> <Brewery name=Abdij Notre-Dame de Saint-Rémy>
9.20 <Beer name=Strandjuttersbier Mong De Vos> <Brewery name=Brouwerij Strubbe voor bierfirma Den Haene>
9.40 <Beer name=Bière du Corsaire Cuvée Spéciale> <Brewery name=Brouwerij Huyghe>
9.50 <Beer name=Abbaye des Rocs Grand Cru> <Brewery name=Brasserie de l'Abbaye des Rocs>
9.50 <Beer name=Achel Blond Extra> <Brewery name=Achelse Kluis>
9.50 <Beer name=Achel Bruin Extra> <Brewery name=Achelse Kluis>
9.50 <Beer name=Authentic Triple> <Brewery name=Authentique Brasserie>
9.50 <Beer name=Bersalis Tripel> <Brewery name=Brouwerij Huyghe voor Brouwerij Oud Beersel>
9.50 <Beer name=Boerinneken> <Brewery name=De Proefbrouwerij voor Den Ouden Advokaat>
Note that you are welcome to unpack and manipulate this data in Python, however databases are strictly intended to do the heavy lifting for you. Most databases are useful not only because they allow you to develop complex data structures, but because their performance far exceeds the in-memory performance of bog standard Python.
Unstructured databases: Mongo
For the remainder of this exercise, we will use MongoDB. This database is a “noSQL” or unstructured or non-relational database. We offer this exercise to compare its usefulness to the relational databases.
Raw data
The raw data for the exercise come from the same source as the file above using these commands on a machine with Docker.
docker run -d -p 27017:27017 jandot/mongo-i0u19a
mongoexport --db=i0u19a --collection=beers --out=beers.json
mongoexport --db=i0u19a --collection=breweries --out=brewers.json
docker stop de69c66b8d91 # get the correct name from docker ps
The brewers.json
is provided with this repository and is thanks to Jan
Aerts.
Starting mongo
In class we will review the use of a screen
to start Mongo using the following
commands.
screen -S mongo
mkdir -p data_mongo
mongod --dbpath=data_mongo
# use the following on Blue Crab to avoid port collisions with other students
mongod --dbpath=data_mongo --port=$(shuf -i8000-9999 -n1)
# note the default port (27017) or a random one via shuffle is required later on
If you are using Blue Crab you should perform this exercise on an interactive
session using interact -p express -c 6 -t 120
. Do not forget to load a conda
environment with the dependencies listed at the beginning of this tutorial.
At the end of the exercise we will use the Mongo shell, but for now, we will use
a Python interface. In contrast to sqlalchemy
, the pymongo
database is very
similar to the mongo interface itself.
Ingest the data
Once you have started a mongo daemon in the background, we are readyt o start
using the database. Note that if you use docker or Singularity directly, the
docker run
command above will make the data available automatically.
In [24]:
To complete the exercise we must create a database and a collection.
In [25]:
First, we should take a look at the raw data which we exported for this exercise.
In [26]:
{"_id":{"$oid":"5dd300d16881a20dc8b96777"},"beer":"3 Schténg","brewery":"Brasserie Grain d'Orge","type":["hoge gisting"],"alcoholpercentage":6.0}
{"_id":{"$oid":"5dd300d16881a20dc8b96778"},"beer":"IV Saison","brewery":"Brasserie de Jandrain-Jandrenouille","type":["saison"],"alcoholpercentage":6.5}
{"_id":{"$oid":"5dd300d16881a20dc8b96779"},"beer":"V Cense","brewery":"Brasserie de Jandrain-Jandrenouille","type":["hoge gisting","special belge"],"alcoholpercentage":7.5}
We can see that each JSON entry includes a beer and “type” along with the ABV. The JSON syntax is a useful way to represented a nested dictionary or tree of values.
In [27]:
In [28]:
Next we will unpack each line with JSON and use insert_one
to directly add
them to the database. Since this database is unstructured, we do not have to
define the schema ahead of time.
In [29]:
In [30]:
1691
The syntax is slightly different, but as with the relational database, queries are relatively easy to write.
In [31]:
there are 168 breweries with boozy beers
We can sort the results of our query directly in Python.
In [32]:
[('Black Damnation V (Double Black)', 26.0),
("Cuvée d'Erpigny", 15.0),
('Black Albert', 13.0),
('Black Damnation I', 13.0),
('Black Damnation III (Black Mes)', 13.0),
('Black Damnation IV (Coffée Club)', 13.0),
('Bush de Noël Premium', 13.0),
('Bush de Nuits', 13.0),
('Bush Prestige', 13.0),
('Cuvée Delphine', 13.0)]
Alternately, we can chain together a series of queries to filter the data. In the following example we find all beers above 8% ABV, group them by brewery, take the average, and then collect a sample of five.
In [33]:
09.20 Brasserie d'Ecaussinnes
09.25 Brasserie de Silly
08.40 Brouwerij Het Anker (vroeger in Brouwerij Riva en Brouwerij Liefmans)
09.00 Huisbrouwerij de 3 vaten
09.75 Brouwerij Val-Dieu
The syntax above, including the use of $group
and $match
is reviewed in the
list of accumulators
here.
Review the Mongo shell
Before continuing, it is useful to check on the database directly with mongo
.
The commands above can also be completed directly inside Mongo using some of the
following commands. Make sure your daemon is running first.
mongo --host localhost
show dbs
use beer_survey
db.beers.findOne()
db.beers.aggregate([
{$match: {abv: {$gt: 8}}},
{$group: {_id: "$brewery", avg: {$avg: "$abv"}}},
{$sample: {size: 5}}
])
When designing your calculations, you should choose the combination of database and ORM or interface that makes it easiest to manipulate your data.
Advanced filtering
In our final example, we will use map reduce to count the number of beers per
brewery. This requires some JavaScript implemented with the bson
library,
since this is Mongo’s native language.
In [34]:
In [35]:
In [36]:
In [37]:
[{'_id': 'Brouwerij Huyghe', 'value': 43.0},
{'_id': 'Brouwerij Van Honsebrouck', 'value': 36.0},
{'_id': 'Brouwerij Van Steenberge', 'value': 32.0},
{'_id': 'Brouwerij De Regenboog', 'value': 31.0},
{'_id': 'Brouwerij Alvinne', 'value': 30.0}]
The map-reduce framework is a general one which can be used in many other contexts.
That’s all for today! This exercise only scratched the surface. Databases are extremely useful tools that help to extend your programs to accomodate larger and more complex data structures.