A practical approach to set up a simple database architecture using SQLite and Python

The aim of this technical report is to describe the use of SQL databases and provide a simple entry point for people with little to no experience to store data using Python and SQL databases. This report is aimed at employees of Aalborg University, Department of the Built Environment (https://www.en.build.aau.dk/).


Foreword
The aim of this technical report is to describe the use of SQL databases and provide a simple entry point for people with little to no experience to store data using Python and SQL databases. This report is aimed at employees of Aalborg University, Department of the Built Environment (https://www.en.build.aau.dk/).

Summary
A step-by-step description of how to set up a database using SQLite and Python is given in this report, along with how to download the necessary software. The report is written such that people without programming knowledge can set up and retrieve data from an SQL database using Python.
SQL stands for Structured Query Language. It can be used to store, retrieve and manipulate data efficiently, such that data is not stored in local files on different computers, but can instead be stored on a common platform. This also allows users to not retrieve all the data and load it into memory, but can instead retrieve only the data that is necessary, to alleviate the data handling stage.
An example dataset is given in "Appendix A: Data for sample database" and serve as an example for the database creation, and when the data is committed to the database and retrieved from it.
The report also describes the necessary things to create the database, but only on a surface level. For complex databases, refer to the documentation and links provided in the relevant sections.

Introduction and motivations
In this technical report, the general process of making a database from scratch will be introduced, which is a useful data storage tool to centralize data from different datasets. A database makes it possible to have easy access for people with authorization, such that local copies of datasets can be avoided and, instead, have a single centralized database. It also allows for data to be retrieved without loading the entire data into memory, which is particularly useful for large datasets.
This is particularly useful for research teams that perform experiments or people working with data provided by the industry, as a lot of data is usually generated for different purposes. As large volumes of data are usually generated, a standardizes method of data storage is preferred to local copies of datasets in popular spreadsheet software like Excel.
This technical report provides the steps needed to make a database from scratch, with little to no programming experience in Python. This technical report will describe the steps to make a local database in SQLite, and to store and retrieve data from a database. For storage on a server, only a minimal amount of change is needed to commit the data for a global centralized database. For a server database, talk with the IT manager in charge of databases.
A secondary purpose is to establish a database for district heating data and making a script to commit and retrieve data from the database. This is presented in Appendix C.1.
For more detailed and complex databases, refer to the relevant documentation in the different sections.

Building a database using SQLite
This section will describe the steps needed to build the actual database.

Required programs
A few different programs are needed to construct the database, send data to the database and retrieve it.

Python
The main language is Python, which is needed throughout the tutorial. Python 3.10 or later versions is used in this tutorial, which can be downloaded here.
This will lead to the following link, where Python is downloaded by clicking the button shown in Figure 1. Once the installer file has been downloaded, open it. Once opened, it is important to click 'Add Python.exe to PATH' (see Figure 2).

Figure 2. Window to install Python and recommended settings.
Follow the remaining installation steps until Python is fully installed.

PyCharm Community Edition
Besides Python, an Integrated Design Environment (IDE) is needed. For this, PyCharm Community can be used, which can be downloaded here. PyCharm is where the actual code is written, while Python is the language used. Click on the community version (see Figure 3).

Required packages in Python
Besides the actual programming language, which is Python, certain packages are needed. For creating the database, the Python package 'SQLAlchemy' is used. The documentation can be found here. Knowledge of how to install packages in Python is assumed.
To read a .csv file, the Python package 'csv' is used. The documentation can be found here.

Creating a Python project and script
Before coding, a Python project and script has to be created. For this, open PyCharm Community Edition. If this is the first time PyCharm is opened, it will have a default window showing how to create a project. If not, go to the upper left corner and click on 'File' followed by 'New project' (see Figure 4). Click on file -> New Project, which opens the page shown below. Choose a name and location for the project and click 'Create'. After this, right-click on the project folder to the left, click 'New' and then 'Python file'. After clicking on the 'Python File' button, a window will appear. Toggle the 'Python File', give the file a name and press 'Enter'. A new python file has been created, where code can be written inside.

SQLite
Finally, a database is needed. This can either be a local database or a database on a server. The overall structure of the code is not changing, only which database a connection should be established to. Therefore, in this technical report, a local database is used to illustrate the principle of the database. An example of a local database is SQLite, which can be downloaded here.

The 'Standard Installer' should be used, with the correct bit-version of Windows (64-bit is most common)
. This is highlighted in Figure 8.

Setting up a database in SQLite
In this technical report, a local database is used. It is stored locally on the computer of the user. As this database is used in this technical report, the prerequisites of setting up a database is shown, along with a table in that database. After installation, open 'DB Browser (SQLite)', and click on 'New Database' as shown in Figure 9. After the database has been saved in the project folder, a window named 'Edit table definition' appears. This is initializing a table in the database. Give the database a name ('tutorial.db' is used in this example). For each parameter in your data, a new field is created. Click 'Add' to make a new field (the columns in the table).  In this case, four parameters are shown, along with some attributes for each parameter, which are described in Table 1.

Attribute Explanation Name
The name of the specific column in the table  Type The data type of the data in the specific column. Has the following options (in Indicates that the column will not allow storing NULL values.

Primary Key (PK)
A primary key is a field in a table which uniquely identifies each row in the table. Auto increment (AI) Used to automatically increment a value in a field.

Unique (U)
Ensures that all values in the column are distinct from one another.
After initializing the table in the database, you are ready to begin building the database and querying data to it from Python.

Programming the database
This section will describe the step-by-step construction of the database. This is done using snips of code in Python. For more information, see the 'ORM Quick Start' guide in the SQLAlchemy documentation.
This description will be based on a sample script, which can be found in appendix A. The purpose is to create a table in a database, and to name it 'employee_data' with employee information from a file named 'demo.txt'.
Firstly, import the necessary packages. When using functions from the 'sqlalchemy' package, use the name 'db' followed by a dot notation.
import sqlalchemy as db from sqlalchemy.orm import Session, declarative_base import csv Create the engine. When using SQLite, use the following notation. If using something different than SQLite, refer to the documentation on how to create the engine. In this case, a database named 'tutorial.db' has been initialized, see section 4.4. engine = db.create_engine("sqlite:///tutorial.db") Establish a connection to the database, such that it is possible to communicate with it. Firstly, the table name is defined, which is the name that will show up in the database. A meaningful name should be given to the columns, and without using special characters. After this, the SQLAlchemy package is called (as db), and a column is created for each parameter in the data. Each column is also given a datatype, such as Integer, Float, Text or DateTime. Other attributes that can be given to the parameters include the ones shown in Table 1.
The class is now defined with the relevant parameters and corresponding attributes, and the actual data should now be queried to the database. To do this, first generate the schema (table) in the target database.

Base.metadata.create_all(engine)
To send the data to the database, the data should be read into PyCharm as a variable. For a .txt file (a typical file extension), one approach is shown below. It is important to note the delimiter used in the data file, as wrongly assigning a delimiter in the above code will make the code read the file incorrectly. Different kinds of delimiters and the corresponding code after the keyword 'delimiter' are listed in Table 2.  Table 2, but these are the most common ones.

OBS: It is possible to have a different delimiter than the one shown in
Now, send all the data to the target database. This can be done in several different ways, one of which is using the Session object from SQLAlchemy. This will add data, line-by-line to the query (session.add(query_data)) and then commit (upload the data) to the database.

OBS: Data can be committed in chunks for large amounts of data, to significantly speed up the query time.
Commit the data after some amount of data (e.g., 1000 or less/more) using the modulus operator in Python and commit again in the end to catch all the remaining data.
with Session(engine) as session: for item in range(len(data)):  Figure 11. Figure 11. Overview of the demo database in SQLite.
With these steps, it is possible to make a database and upload data to a database. The overall structure of the code is similar if it were to be done on an online server, the connection string would just be slightly different, as mentioned at that specific step in this example. The structure of the data can be found in "Appendix A: Data for sample database" while the code in its entirety can be found in "Appendix B: Code for sample database".

Accessing data using SQL
This section will describe how to retrieve data from the database using Structured Query Language (SQL). With SQL, it is possible to access a database and retrieve only the data that fulfils the information given in the SQL-command. It is thus possible to filter data, such as retrieving data between some time interval in a specific column in a table of a specific database. While it is possible to use SQL to insert and modify a database, this section will be focused only on how to retrieve data. This section will not go in-depth with the entirety of the SQL language but will instead provide the basics for understanding SQL and show simple commands to retrieve data. For more information on SQL queries and how to construct them, go to W3Schools.

Basics of SQL statements
SQL statements are commands that are sent to a database, or performed directly in the database, to either modify the database or retrieve data. SQL statements are NOT case sensitive but will be written in uppercase in this report for clarity.
Some databases require a semicolon after each SQL statement, which is the standard way to separate each new SQL statement, if the database allows for more statements to be executed in a single call to the database.
The most important SQL statement to retrieve data, is the SELECT command, which lets the user select some data from a table in a database. This could be from a specific column in the table, all columns in the table, distinct values, etc.
An SQL query consists of different components. It needs information about which action to perform, such as retrieving data with SELECT, which columns should be considered, which table in the database that contains these columns, and a filtering criteria.
The following example shows how to retrieve name and title from the data shown in "Appendix A: Data for sample database". The data will represent a database, henceforth referred to as "employee_data".

SELECT NAME, TITLE FROM employee_data
Here, NAME and TITLE are the column names, and employee_data is the database name. If all columns in the database are of interest, the asterisk is used, as shown in the following SQL statement.

SELECT * FROM employee_data
Note that using the asterisk is usually not desired, as SQL is a useful tool for big datasets, such that all data does not need to be loaded into memory. Therefore, limit the query to the data that is actually needed.
The SELECT statement is the simplest and most used statement. From this statement, it can be further elaborated depending on the complexity of the desired query. This could be due to a filtering criterion, such as a specific time frame, only wanting values within a certain interval or looking for extreme values above some threshold. Multiple criteria can be combined using the AND or the OR operators. An example is shown below.
SELECT NAME FROM employee_data WHERE NAME = 'Martin V' OR NAME = 'Simon' Similarly, a NOT operator can be used to exclude certain rows, see below.
SELECT NAME FROM employee_data WHERE NOT NAME = 'Martin F' Finally, if data between two dates is needed, the BETWEEN command can be used. This command works for numeric, text and dates. Below is an example for dates.
SELECT NAME FROM employee_data WHERE HIRE_DATE BETWEEN '2020-01-01' AND '2023-01-01' Now, simple SQL commands have been shown and how to use them on the simple dataset in Appendix A. A lot more can be done using SQL, such as modifying the dataset and creating complex filtering queries.

How to use SQL statements with Python
This section presents how to perform programmatically the SQL commands using Python. The Python package used for this is 'Pandas', which is a commonly used library for data treatment and data management.
Firstly, the relevant Python packages should be installed and imported to your Python code.
import pandas as pd import sqlalchemy as db The next thing to do is to establish a connection to the database that has the data to be retrieved. engine = db.create_engine("sqlite:///tutorial.db") connection = engine.connect() Next, a SQL statement is constructed. The statement depends on the purpose, which filtering criteria is applied, etc. The basics are shown in section 5.1. An example statement is used to retrieve the names of employees hired between two dates using Python (see below). stmt = "SELECT NAME FROM employee_data WHERE HIRE_DATE BETWEEN '2020-01-01' AND '2023-01-01'" The statement can be combined using a dynamic string in Python (also called an f-string), such that variables can be defined earlier in the script and inserted into the string. Creating and handling f-string is widely documented on the Internet. The next step is to send the SQL statement to the database using the Pandas library in Python. The following command will use the Pandas library to send the statement (the previously named stmt variable) to the database. The data is stored into a dataframe (Pandas data type) called data. data = pd.read_sql_query(stmt, connection) Using these few lines of code, data can be retrieved from the database using Python.

Appendix A: Data for sample database
This appendix includes sample data that can be used when following this technical report. It includes four columns: 'ID', 'NAME', 'TITLE' and 'HIRE_DATE'. The 'ID' column has integers, while the remaining are consisting of strings.
To use this data, copy the data into 'Notepad' and save the file in the Python project folder. The name of the file used in this technical report in the example, is named 'demo.txt'.