This article is intended for people who design interfaces between dynamic languages (a.k.a. "scripting languages") and SQL database engines. I want to describe three design ideas for such interfaces that can potentially make them much easier to use.

The genesis of these three design ideas is the interface between the TCL language and the SQLite database engine. The three ideas have been implemented there for some time and have been thoroughly tested and evaluated. These ideas have proven very helpful in improving the interface between TCL and SQLite.

The rest of this article will use the TCL/SQLite interface as an example to illustrate the ideas. But you should not read this article as advocacy of TCL or SQLite. (I often advocate both, just not here.) TCL and SQLite are used here merely to illustrate the ideas. I leave it to you, gentle reader, to translate these ideas into forms more suitable to whatever languages or database engines you prefer.

TCL Background

In order to understand the interface ideas described here, it is necessary to have a cursory understanding of TCL. You do not need to be able to write TCL code, but it is important that you be able to read it and get a reasonable idea of what it does.

A TCL program consists of zero or more commands. Each command is one or more tokens terminated by a newline. The concept is similar to a unix shell or a DOS command prompt. The first token of each command is the verb. Subsequent tokens are arguments.

Tokens are normally separated by white space. But if you want to include some whitespace as part of a token, you can put the text of your token in double quotes "...", or within curly braces {...}.

Variable references beginning with $ are expanded after the line of text has been tokenized, but before the tokens are passed to the command. The exception is that variables within curly braces are not expanded. So, for example, if the variable "msg" contains the text "Hello World" then:

    puts "The message is: $msg"

would output "The message is: Hello World" because the variable is expanded since it is inside double-quotes, but in the command:

    puts {The message is: $msg}

the output would be "The message is: $msg" because inside curly braces, no variable expansion happens.

Those are the basic syntax rules of TCL. There are actually twelve rules that fully and completely define the entire syntax of TCL. The outline given above captures the main ideas and, though incomplete, is sufficient for this article.

SQLite Background

SQLite is a small-footprint, ACID, standalone, serverless SQL database engine. The core of SQLite is written in ANSI C and it is widely used in applications written in dozens of programming languages.

The TCL interface for SQLite is object based. Each database connection is a separate object with a unique name. The interface consists of methods on that object.

A new SQLite database object is created using the following line of code:

    sqlite3 db filename.xyz

The first token, "sqlite3", is the verb. The second token is the name of the new database object. The object name can be anything, as long as it is unique, but in this article I will always use "db". The third token is the name of the SQLite database file. In all subsequent commands, the name of the connection object ("db" in this example) will be the first token in the statement and the second token will be the method name. Additional tokens are arguments to the method.
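
For example (a small illustrative sketch; the filename is arbitrary), creating a connection object and invoking a method on it looks like this:

    sqlite3 db /tmp/example.db    ;# "sqlite3" creates a connection object named "db"
    db close                      ;# "close" is a method invoked on that object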

Feature 1: Row By Row Result Processing Using A Lambda Procedure

Queries against the database are issued using the "eval" method. The eval method takes two arguments. (This is a simplification - the eval method has additional modes of operation that require different numbers of arguments, but those additional modes are beyond the scope of this article.) The first method argument (the third token in the statement) is the SQL to be evaluated, and the second method argument is a lambda procedure that is evaluated once for each row in the result set. For example:

   db eval {SELECT a, b, c FROM table1} {
     puts "a=$a"
     puts "b=$b"
     puts "c=$c"
   }

Remember that text between curly braces forms a single token in TCL. This is true even of text that spans multiple lines. So the code above consists of four tokens comprising a single statement. The first token is database connection object "db", the second token is the "eval" method name, the third token is the text of the SQL statement to be evaluated, and the fourth token is the source text of a lambda procedure that is evaluated once for each row in the result set of the query.

So the statement above begins evaluating the SQL statement. As each row of the result set is retrieved from the database, the values of each column are stored in TCL variables with the same name as the column, and the lambda procedure is run. This is repeated for each row until every row has been processed.

The interface automatically takes care of handling all the messy details of generating a prepared statement, binding arguments, running the prepared statement, and cleaning up afterwards. The interface keeps a cache of recently used SQL statements and their instantiations so that if the same SQL statement is run multiple times, it only has to be prepared once. In this way, users get the speed advantage of using prepared statements without the programming mess.
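
For instance (a small illustrative sketch), a query that is issued repeatedly pays the preparation cost only once:

   proc print_all {} {
     db eval {SELECT a, b, c FROM table1} {
       puts "a=$a b=$b c=$c"
     }
   }

   # Every call after the first reuses the cached prepared statement.
   print_all
   print_all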

(In case you are wondering: TCL byte-code compiles and caches the lambda procedure the first time through the loop and reuses the same byte code on subsequent iterations.)

An important point is that the interface does not load the entire result set into memory. Each row is loaded separately, processed and discarded. The difference is negligible for most small queries, but if the result set contains many millions of rows, or if each row contains large CLOBs or BLOBs, the distinction can result in a huge memory savings. Processing row by row can make the difference between an application that runs smoothly and quickly with minimal memory and one that thrashes.

The lambda procedure can contain instances of the "break" or "continue" command that work as you would expect. "Break" causes the query to terminate. "Continue" terminates the current invocation of the lambda but the query continues normally with the next row.
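
For example (an illustrative sketch using the same table1 as above):

   set count 0
   db eval {SELECT a FROM table1} {
     if {$a eq ""} continue     ;# skip blank rows and move on to the next one
     puts "a=$a"
     incr count
     if {$count >= 10} break    ;# stop the query entirely after ten rows
   }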

The use of a lambda that honors the "break" and "continue" commands not only saves memory, but also makes the query interface seem like a native part of the language instead of an extension. This may seem like a minor point. But experience suggests that an interface with native syntax is much easier to work with and less error prone.

Feature 2: Automatic Binding

SQL statements often need to be customized according to the current state of the program. For example:

   db eval "INSERT INTO table1(a) VALUES($x)"

As this statement is evaluated, the $x variable in the third token is expanded, resulting in a new SQL statement that inserts the value of variable x. This works as long as variable x contains a numeric value, but it is dangerous and potentially inefficient. If a malicious user can somehow control the value of variable x, then x might contain a string like:

   "1); DELETE FROM table1; SELECT random("

After $x is expanded in the third token, the resulting SQL is:

   INSERT INTO table1(a) VALUES(1); DELETE FROM table1; SELECT random();

This is perfectly legal SQL, but it does not do what you intended. This is the classic "SQL injection" attack. The way most SQL interfaces deal with it is to provide procedures that automatically quote values. For example:

   db eval "INSERT INTO table1(a) VALUES([sql_quote $x])"

In this example, the phrase "[sql_quote $x]" is a subcommand which is evaluated and whose result is substituted in place of the subcommand itself. Presumably the sql_quote command makes its argument safe to use as an SQL token - by surrounding it in single quotes and doubling any interior single quotes. (A sketch of such a procedure appears after the list below.) This works, but it is troublesome in many ways:

  1. The programmer has to remember to invoke the quoting procedure for each parameter that might be influenced by user input. Safe quoting is something that is easily and often forgotten.
  2. The quoting procedure might contain a bug.
  3. The quoting procedure takes time and memory. It might take a lot of time and memory if the variable being inserted is a large (multi-megabyte) CLOB.
  4. After the variable is quoted, the SQL parser inside the SQL engine has to turn right around and parse the results, restoring the value to its original form. This seems wasteful.
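
For concreteness, such a quoting procedure might look roughly like the following sketch. (This is hypothetical, shown only to illustrate the approach; it is not part of any actual interface.)

   proc sql_quote {value} {
     # Surround with single quotes and double any embedded single quotes.
     return "'[string map {' ''} $value]'"
   }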

The way to avoid this pain is to use named parameters. A named parameter is like a variable in the SQL statement. Most interfaces work by requiring the programmer to "prepare" the SQL statement, then issue various calls to "bind" values to the parameters. This all works, but it is burdensome. The TCL interface has a better way.

In TCL, one merely specifies TCL variable names inside the SQL statement, and the values of those variables are automatically inserted as bound parameters. For example:

   db eval {INSERT INTO table1(a) VALUES($x)}

Notice here that the SQL statement is enclosed in curly braces instead of double quotes. This means that TCL will not attempt to expand the variable x prior to sending the argument to the eval method. SQLite understands $x as a parameter. The TCL interface automatically prepares the statement, asks the statement what parameter names it contains, looks up the value of TCL variables with the same name, and binds those values to the parameters. Automatically. Transparently.
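
So, for example, a value that would otherwise need careful quoting can be used directly (a small illustrative sketch):

   set x {Don't try to quote me}
   db eval {INSERT INTO table1(a) VALUES($x)}
   # The whole string, embedded quote and all, is bound as a single value.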

SQLite supports named parameters in several formats (for compatibility with various databases): "$AAAA", "@AAAA", and ":AAAA", where AAAA is the name. The $AAAA form is specific to SQLite and is designed to support exactly this mechanism. The @AAAA form comes from MS SQL Server, I think. I'm not sure where the :AAAA form comes from. The example above used the $AAAA form since that makes the SQL look more like the underlying scripting language. But you could just as well have written:

   db eval {INSERT INTO table1(a) VALUES(@x)}

Or

   db eval {INSERT INTO table1(a) VALUES(:x)}

To make this happen, the database engine has to allow the interface to query the names of a prepared statement's parameters, and the host scripting language has to allow the interface to look up the values of variables. (Both are modest requirements.) The interface then locates the names of all parameters, translates the names into scripting-language variable names if necessary, looks up the variable values, and binds those values to the named parameters.
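
On the scripting-language side, the variable lookup step can be sketched in pure TCL roughly as follows. (This is a hypothetical helper for illustration only; the real SQLite interface does this in C.)

   # Given a parameter name such as "$x", "@x", or ":x", strip the prefix
   # and read the variable of the same name from the requested stack level.
   proc lookup_parameter {level paramName} {
     set varName [string range $paramName 1 end]
     return [uplevel $level [list set $varName]]
   }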

In the TCL implementation, only a pointer to the variable value is bound into the SQLite statement. In this way, no extra copies of the variable values are made, which can be a big performance win for large BLOBs and CLOBs.

Feature 3: Automatic ROLLBACK Following An Exception

You can, of course, create a transaction using the SQL "BEGIN" and "COMMIT" statements. Like this:

   db eval BEGIN
   # lots of code here involving multiple
   # SQL statements as well as other actions.
   db eval COMMIT

The problem here is that if an exception occurs anywhere in the intervening code, the COMMIT is skipped and your connection is left with an open transaction. You could, of course, create a lot of code to catch and deal with the exception:

   db eval BEGIN
   if {[catch {
     # Transaction code here
   }]} {
     db eval ROLLBACK
   } else {
     db eval COMMIT
   }

But doing that clutters the code and is error-prone. The TCL interface provides a shorthand:

   db transaction {
     # Transaction code here
   }

The "transaction" method starts a new transaction. The argument to the transaction method is a lambda procedure that is executed exactly once. At the conclusion of the lambda, the transaction is automatically committed. But if the lambda throws an exception, the transaction is rolled back and the exception is propagated outward.

You cannot have nested transactions in SQLite. The transaction method takes this into account. If a transaction is already pending when the transaction method is run, then the transaction method just runs the lambda without trying to start a new transaction and without trying to commit or rollback afterwards. So even though transactions do not nest in SQLite, they do if you use the transaction method of the TCL interface.
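
For example (an illustrative sketch), a procedure that wraps its work in the transaction method can be called either on its own or from inside a larger transaction:

   proc add_pair {x y} {
     db transaction {
       db eval {INSERT INTO table1(a) VALUES($x)}
       db eval {INSERT INTO table1(a) VALUES($y)}
     }
   }

   # Only the outermost transaction issues a real BEGIN and COMMIT; the
   # inner "db transaction" calls inside add_pair simply run their lambdas.
   db transaction {
     add_pair 1 2
     add_pair 3 4
   }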

Summary

The best interfaces are the ones the user never notices. They are "transparent". A good interface stays out of the user's way, freeing up brain cycles from thinking about syntax so they can be applied to the user's end problem. The three interface ideas above help to make the TCL interface to SQLite much more transparent. If you build interfaces between dynamic languages and SQL engines, then I think you will do well to make use of these ideas in your designs.