
Best practice for SQL statements in Python

Thanks to a mandatory interface for database connectors, the "Python Database API Specification v2.0" (PEP 249), all current connectors are built so that database connections, data-retrieval queries and data transactions can be initiated with the same commands. Results also come back in more or less the same format everywhere – although it is precisely here that the deviations from the required standardisation seem to be most severe.
But this should not scare anyone off from using Python scripts as a flexible way of automating database operations.
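To illustrate the shared interface, here is a minimal sketch using the built-in sqlite3 module; per PEP 249, connect(), cursor(), execute() and the fetch methods look much the same with the other connectors:

import sqlite3

# the PEP 249 pattern: connect -> cursor -> execute -> fetch
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute("SELECT 1 AS answer")
print(cur.fetchall())  # [(1,)]
conn.close()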

All connectors provide an execute function which accepts an SQL statement as a string parameter and runs it against the database. However, using Python only really pays off once the SQL is generated dynamically and driven by data.

At this point I would like to step in and demonstrate various alternatives – beginning with the simplest, but also least intelligent, methods – and end with a best practice for how SQL strings should be transmitted.

To begin with, I have to make clear that every alternative except the final one constitutes a potentially dangerous security hole. If no other precautions have been taken, sensitive data may be retrieved or even deleted.

The most naive and most dangerous approach: string concatenation

First of all, we create a test database in SQLite. I am using SQLite for the demonstration because it ships with Python and because it can create databases directly in memory for the runtime of the script, so every example can be replicated as-is. However, I can only guarantee error-free execution of the examples from Python 3.x onwards.

 

import sqlite3

db = sqlite3.connect(':memory:')
db.execute("CREATE TABLE staff (person_id int, lastname CHAR);")
db.execute("INSERT INTO staff (person_id, lastname) VALUES (1, 'Pavlov')")
db.execute("INSERT INTO staff (person_id, lastname) VALUES (2, 'Skinner')")

 

In the sample code, after importing the module, a nameless database is initialised in memory by passing ':memory:' as the storage location to the connect command.
Then an employee table named "staff" is created and filled with the first rows of data.

So far, so good. But we don't want to write out a complete insert command for each and every employee we add to the table.
If the employees' names are already available as a list, this practically begs for a loop.

 

First attempt:

 db.execute("CREATE TABLE staff (person_id int, lastname CHAR); ") 
for person_id, lastname in enumerate(staff_names): 
    db.execute("INSERT INTO staff (person_id, lastname) VALUES (" + person_id + ", '" + lastname + "') ")

 

Despite all the good intentions, it failed. The error message "TypeError: Can't convert 'int' object to str implicitly" tells us that we forgot to convert the person_id from int to str. Although Python is flexible in almost everything, it is still a strongly typed language, and strings cannot be concatenated with integers without an explicit conversion.
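That strictness is easy to reproduce in isolation – a tiny sketch:

# strings and integers cannot be concatenated implicitly
try:
    "person_id = " + 1
except TypeError as exc:
    print(exc)  # exact wording varies by Python version

print("person_id = " + str(1))  # explicit conversion works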

The interpreter will have to get up earlier than that to stop us. Next attempt:

 

 db.execute("CREATE TABLE staff (person_id int, lastname CHAR);") 
for person_id, lastname in enumerate(staff_names): 
    db.execute("INSERT INTO staff (person_id, lastname) VALUES (" + str(person_id) + ", '" + lastname + "') ")

 

Well, it runs, but it doesn't look nice at all. Especially when many of these concatenations pile up in the code, the statement becomes very fragmented. What's more, I always have to take care of the type conversion myself.

The old approach: string templates with %s

Leafing through Python literature, even fairly recent titles, and reading forum entries on Stack Overflow and elsewhere, you still come across techniques that, in our example, would look like this:

 

 db.execute("CREATE TABLE staff (person_id int, lastname CHAR); ") 
for person_id, lastname in enumerate(staff_names): 
    db.execute("INSERT INTO staff (person_id, lastname) VALUES (%d, '%s') " % (person_id, lastname)) 

 

Works like a dream. "%d" is the placeholder for an integer and "%s" is the placeholder for a string.

If, however, a value is needed multiple times, this notation becomes rather jumbled again. Imagine an example in which we check several conditions in a query.

 

 sql = """SELECT lastname , CASE WHEN %d > 10 THEN 'greater' WHEN %d = 10 THEN 'equal' WHEN %d < 10 THEN 'lesser' END vergleich FROM staff WHERE lastname <> '%s' and %d > 0 """ % (person_id, person_id, person_id, lastname, person_id) 

 

As soon as we add another placeholder here, the risk of error grows, and we have to count off the positions in the code every time. It is better to use named placeholders here.

 

 sql = """SELECT lastname , CASE WHEN %(person_id)d > 10 THEN 'greater' WHEN %(person_id)d = 10 THEN 'equal' WHEN %(person_id)d < 10 THEN 'lesser' END vergleich FROM staff WHERE lastname <> '%(lastname)d' and %(person_id)d > 0 """ % {'person_id': person_id, 'lastname': lastname) 

 

Wonderful! The code is now much easier to read, because we can see at a glance what is inserted where.

This notation poses only one small problem: it is considered obsolete and, at least in Python 3, has been superseded by a better one. About three or four years ago, I repeatedly read in forums that it was even considered deprecated, meaning it should not be used any more because its survival in later Python versions was not guaranteed. So far, however, it has not been removed – presumably because it is still very widespread in existing modules.

The new approach: string templates with {}

The new official notation uses curly brackets. Not only does it look different, it also offers far more formatting options. And if the new notation is the more capable one anyway, why not use it consistently?

Let us look at the simple version first:

 

 db.execute("CREATE TABLE staff (person_id int, lastname CHAR); ") 
for person_id, lastname in enumerate(staff_names): 
    db.execute("INSERT INTO staff (person_id, lastname) VALUES ({}, '{}') ".format(person_id, lastname)) 

It is worth noting here that, as long as appearance plays no role – decimal places, leading zeros and the like – the placeholders no longer need to distinguish between strings and numeric values. It is even possible to pass tuples, for example:

 db.execute("SELECT * FROM staff WHERE person_id in {}".format((1,3,4))) 

 

Strictly speaking, the string's format function calls each object's __format__ method, which by default falls back to __str__ – so the result corresponds to str(object).
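This is quickly verified with the tuple from the IN example above:

values = (1, 3, 4)
print("IN {}".format(values))  # IN (1, 3, 4)
print("IN " + str(values))     # IN (1, 3, 4) – identical result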

Named placeholders exist here too, but instead of passing a dictionary, the assignments are written as keyword arguments.

 db.execute("CREATE TABLE staff (person_id int, lastname CHAR); ") 
for person_id, lastname in enumerate(staff_names): 
    db.execute("INSERT INTO staff (person_id, lastname) VALUES ({person_id}, '{lastname}') ".format(person_id=person_id, lastname=lastname)) 

 

Anyone who wants to save some typing can also make use of tuple packing and tuple unpacking.

 db.execute("CREATE TABLE staff (person_id int, lastname CHAR); ") 
for row in enumerate(staff_names): 
    db.execute("INSERT INTO staff (person_id, lastname) VALUES ({}, '{}') ".format(*row)) 

 

row is a tuple in every iteration of the loop, because enumerate() yields two values as a tuple; they are packed into the variable row. The notation ".format(*row)" unpacks the tuple again and passes the values in the right order.
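Seen in isolation, the packing and unpacking looks like this:

staff_names = ['Pavlov', 'Skinner']
for row in enumerate(staff_names):
    print(row)                       # (0, 'Pavlov'), then (1, 'Skinner')
    print("{} -> {}".format(*row))   # 0 -> Pavlov, then 1 -> Skinner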

The same works with dictionaries.

 data = {'lastname': 'Mustermann', 'person_id': 12345} 
db.execute("INSERT INTO staff (person_id, lastname) VALUES ({person_id}, '{lastname}') ".format(**data))

 

The most extreme example of typing avoidance is this theoretical insert function.

 

def insert_staff(person_id, lastname, other_parameter1, other_parameter2):
    # locals() contains the function parameters at this point
    db.execute("INSERT INTO staff (person_id, lastname) VALUES ({person_id}, '{lastname}')".format(**locals()))

 

Here, the format command simply pulls the data from the variables defined in the namespace of the function call – in this case, the function parameters. I have used this example myself in Python 2.7. In Python 3.x, however, it no longer works for me, and I believe that is a good thing.

Now, let’s imagine that we get a new employee, the fiftieth, named O'Reilly. The name is added quickly:

 db.execute("INSERT INTO staff (person_id, lastname) VALUES ({person_id}, '{lastname}') ".format(person_id=50, lastname="O'Reilly")) 

 

SQLite complains: "sqlite3.OperationalError: near "Reilly": syntax error". What happened? The name was inserted into the SQL statement, but the apostrophe within it is interpreted as the end of the string literal – the statement ends in ", 'O'Reilly')", so SQLite reads the string 'O' and then trips over Reilly.
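Printing the formatted string instead of executing it makes the problem visible:

sql = "INSERT INTO staff (person_id, lastname) VALUES ({person_id}, '{lastname}')".format(
    person_id=50, lastname="O'Reilly")
print(sql)
# INSERT INTO staff (person_id, lastname) VALUES (50, 'O'Reilly')
# the string literal ends after 'O'; the rest is a syntax error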

In the middle of a coding frenzy, relief can be found by escaping the apostrophe. How this is done varies from database to database; in SQLite, the apostrophe must be doubled.

 

 db.execute("INSERT INTO staff (person_id, lastname) VALUES ({person_id}, '{lastname}') ".format(person_id=50, lastname="O'Reilly".replace("'", "''"))) 

Well, that works for now, but it's just a cheap workaround.

So now we get another new employee, the fifty-first, named Mr "');DROP TABLE staff;". A weird name, but if that's how the user entered it, it must be right.

If we hadn't just been escaping the apostrophe because of Mr O'Reilly, and if we weren't working with SQLite, whose execute() does not allow two statements in one string, the query string would look as follows.

 

 INSERT INTO staff (person_id, lastname) VALUES 
(51, '');DROP TABLE staff; 

 

The entire table would be gone, thanks to unsanitised user input. Such SQL injections are a real danger to database security. Just a few years ago, while experimenting, I stumbled across major online retailers that didn't even sanitise the apostrophe in their product search.
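With SQLite, at least, the attack fails loudly rather than silently. A sketch, reusing the connection db from the examples above (the exact exception type varies by Python version):

lastname = "');DROP TABLE staff;"
sql = "INSERT INTO staff (person_id, lastname) VALUES (51, '{}')".format(lastname)
print(sql)  # two statements smuggled into one string
try:
    db.execute(sql)
except Exception as exc:
    print(exc)  # sqlite3 refuses to execute more than one statement at a time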

Best practice: parametrised queries

All three alternatives demonstrated above work, and for quick jobs I myself still use one or another of these notations.

The very clear best practice is, however, the use of "parametrised queries". Regardless of the programming language, every database connector should support this type of query transmission. In Python, at least, I know from practical experience that it works for Oracle, MySQL, SQLite and PostgreSQL.

The idea is that the SQL string is not assembled completely by the caller and then handed to the connector; instead, a template and the parameters for that template are transmitted separately.

There are several advantages. One is that the database driver takes over all type conversions and the special treatment of characters such as the apostrophe, so there is no need to worry about the conventions of the respective database.

Another is that certain database platforms offer performance advantages when the same query skeleton is executed very frequently with different parameters: the SQL parser then does not have to re-parse the query and re-plan its execution every time, but falls back on earlier executions and merely substitutes the values at the placeholders.
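For the frequent special case of running one statement skeleton with many parameter sets, the DB-API also defines executemany() – a sketch, again reusing the SQLite connection db from above:

staff_data = [(1, 'Pavlov'), (2, 'Skinner'), (3, "O'Reilly")]
db.executemany("INSERT INTO staff (person_id, lastname) VALUES (?, ?)", staff_data)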

Unfortunately, the style of parametrisation is anything but consistent across database platforms. Every Python database driver therefore exposes an attribute named paramstyle which specifies which placeholder technique is to be used.

 

print(sqlite3.paramstyle)
# output is 'qmark'
# so:
db.execute("INSERT INTO staff (person_id, lastname) VALUES (?, ?)", (51, "Mc'Donald"))

 

In the SQLite example, we first check which paramstyle applies: it is 'qmark', i.e. question marks. The insert statement demonstrates the usage: a question mark stands at each position, and the values are passed, in matching order, as a tuple in the second argument of the call. Type conversions happen automatically, even for date objects.
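Incidentally, sqlite3 also understands named placeholders of the form :name, which removes the counting problem – a sketch:

db.execute(
    "INSERT INTO staff (person_id, lastname) VALUES (:person_id, :lastname)",
    {'person_id': 52, 'lastname': "O'Reilly"})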

The PostgreSQL driver psycopg2, for example, uses a different style ('pyformat') which also allows named parameters – note that it expects %s for every value, regardless of type.

 

cursor.execute('SELECT * FROM staff WHERE person_id = %s', (99,))
# or
cursor.execute('SELECT * FROM staff WHERE person_id = %(person_id)s', {'person_id': 99})

Best practice = Only practice?

The title "Best Practice" is actually misleading. It should really be "Only Practice" instead. At first glance, parametrised queries seem to be cumbersome. Especially when the SQL statement is constructed dynamically using string concatenations, it is easy for parameters to become muddled, especially when the placeholders are only marked with question marks as is the case in SQLite. However, not having to worry about type conversions and symbols is, in comparison, a major benefit.

Normally, I would now be obliged to say that the security aspect – namely avoiding SQL injections – should be the most important reason to use this best practice and indeed is sufficient in and of itself.

In the day-to-day world of data science – where many security hurdles have to be cleared before the data can be accessed at all, and where the application is not necessarily exposed to potentially malicious users – the security aspect is more of a voluntary exercise. Only very rarely in my years in this line of work have I been in a situation where potentially dangerous Python code was accessible to anyone not directly involved in the project. And in those cases, everyone would have had direct access to the database anyway.

Anyone who wants to max out Python for SQL generation and build really powerful dynamic SQL queries will at some point be forced to depart from the best practice: only values can be parametrised. Table names and column names can only be inserted dynamically with one of the techniques described above.
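A workable compromise is to insert identifiers through a string template, but only after validating them against a whitelist, while every value stays parametrised. A sketch – the whitelist and the function name are my own illustration:

ALLOWED_COLUMNS = {'person_id', 'lastname'}  # hypothetical whitelist

def find_staff(column, value):
    if column not in ALLOWED_COLUMNS:
        raise ValueError('column not allowed: ' + column)
    sql = "SELECT * FROM staff WHERE {} = ?".format(column)  # identifier via template
    return db.execute(sql, (value,)).fetchall()              # value via parameter

print(find_staff('lastname', "O'Reilly"))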

In effect, I have to admit that I myself always use a mix of parametrised queries and string templates with {}. It is then all the more important to ensure that only authorised users get access to the script and the data. A first step in that direction is to give everyone who runs a Python database script their own database user for the connection instead of a generic one.