Published by Ostap Cherkashin on 2010-12-27.
The first cut of the Bandicoot database (v0) provides enough functionality to start building some applications. As with any new system, though, several important questions arise: stability, performance, …? This blog post is about the performance of the system. Regarding stability, I can only say that even though a fair number of structured tests are available, it is still not recommended for production use.
With Bandicoot, application developers control the execution path. This is quite different from other databases, where this responsibility is split between a human (the programmer) and another "living organism" (the query optimizer). Sometimes the two misunderstand each other, and as a result we get two beliefs:
I will try to address the first one here. As for the second one (and with the v0 tag comment saying "brute force preview"), I will try to avoid the discussion for a little while ;-)
In relational algebra there is only one data structure (the relation [1]) and several operators (select, join, etc. [2]). Relations represent your data; operators take relations as input and produce new relations as output. From a performance perspective, one of the most important properties of a relation is the number of tuples (rows) in its body. The execution time of an operator heavily depends on this number (it is also used to express the algorithmic complexities below). To improve a function's performance, try to reduce the data sets passed from one operator to another as early as possible. This is a fairly intuitive idea, and it greatly impacts performance. To help with this, and to facilitate computation reuse, Bandicoot introduces temporary variables [3]. Apart from improving performance, they also make the code more readable.
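To make the early-reduction idea concrete, here is a minimal Python sketch (not Bandicoot syntax; the relations and attribute names are hypothetical): storing the result of a select in a temporary variable before joining means the join sees far fewer tuples.

```python
# Relations modeled as lists of dicts -- a hypothetical illustration,
# not Bandicoot's actual storage model.
orders = [{"id": i, "amount": i % 100} for i in range(1000)]
items = [{"order_id": i % 1000, "sku": "sku-%d" % i} for i in range(5000)]

# Temporary variable: select first, so the join sees fewer tuples.
big_orders = [o for o in orders if o["amount"] > 95]  # select(amount > 95)

# Nested-loop join on the reduced relation -- O(#big_orders * #items)
# instead of O(#orders * #items).
joined = [{**o, **it}
          for o in big_orders
          for it in items
          if o["id"] == it["order_id"]]

print(len(big_orders), len(joined))  # 40 200
```

Filtering `orders` from 1000 tuples down to 40 before the join cuts the number of tuple comparisons by a factor of 25, without changing the result.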
The table below lists the complexities of the relational operators in Bandicoot v0. I use #n for the number of tuples in a relation variable n.
| operator | expression | complexity |
|---|---|---|
| join | m * n | O(#m * #n) |
| minus (semidiff) | m - n | O(#m * #n) |
| union | m + n | O(#m + #m*#n) |
| select | n select(x < 0) | O(#n) |
| extend | n extend(x = 1) | O(#n) |
| project | n project(x, y) | O(#n^2) |
| summary | (m, n) summary(x = max(y, 0)) | O(#m * #n) |
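The quadratic entries correspond to nested-loop algorithms. A hedged Python sketch of a natural join (hypothetical relations, not Bandicoot internals) shows where the #m * #n factor comes from: every tuple of m is compared against every tuple of n.

```python
# Nested-loop natural join over the shared attributes.
# The two nested loops are the source of the O(#m * #n) complexity.
def natural_join(m, n):
    common = set(m[0]) & set(n[0]) if m and n else set()
    out = []
    for tm in m:                      # #m iterations
        for tn in n:                  # #n iterations for each tuple of m
            if all(tm[a] == tn[a] for a in common):
                out.append({**tm, **tn})
    return out

m = [{"x": i, "a": i * 2} for i in range(3)]
n = [{"x": i % 2, "b": i * 10} for i in range(4)]
print(len(natural_join(m, n)))  # 4
```

A hash- or sort-based join would bring this down to roughly linear in the input sizes, which is presumably the kind of improvement a later version could make.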
As you can see, most of the algorithms are quadratic in nature, so the limits of v0 are quite low in terms of the number of tuples. I am working on a patch to improve this situation; meanwhile, you can master temporary variables, which is good practice in any case.
To turn the formulas above into actual numbers, you can use the performance tests shipped with the system source code. One of the interesting ones is perf/relation: it shows the execution times of various workloads on different relational operators. Here is the output from a run on my laptop (Core i7 @ 2GHz, compiled with gcc -Os). Note that all test tuples have two attributes (a string and an integer) and are pre-generated before each test execution.
```
$ cd bandicoot
$ ctl pack
$ bin/test/perf/relation
write 100000 tuples in 18ms
read 100000 tuples in 21ms
write 1000000 tuples in 135ms
read 1000000 tuples in 175ms
join 1000x1000 tuples in 20ms
join 10000x10000 tuples in 1786ms
semidiff 1000x1000 tuples (res=500 tuples) in 15ms
semidiff 10000x10000 tuples (res=5000 tuples) in 1533ms
project 1000 tuples from 1000 in 9ms
project 10000 tuples from 10000 in 876ms
select 1 tuples from 100000 in 11ms
select 1 tuples from 1000000 in 97ms
union 1000x1000 tuples (res=1500 tuples) in 20ms
union 10000x10000 tuples (res=15000 tuples) in 1752ms
extend 100000 tuples in 37ms
extend 1000000 tuples in 372ms
sum 1000x1000 tuples in 20ms
sum 10000x10000 tuples in 1817ms
```
The first rows show the raw read/write throughput (I have an SSD drive, and there are no fsync/fdatasync calls). The rest are the execution times of the relational operators.
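Notice the pattern in the join, semidiff, union, and summary rows: 10x more input tuples costs roughly 100x more time, exactly what the quadratic complexities predict. A small Python sketch counting tuple comparisons (hypothetical, not the actual benchmark) makes the scaling visible without depending on wall-clock noise:

```python
# Count the pairwise comparisons a nested-loop join performs, to show
# the 10x-input / ~100x-work growth behind the benchmark numbers.
def join_comparisons(size):
    m = list(range(size))
    n = list(range(size))
    comparisons = 0
    for a in m:
        for b in n:
            comparisons += 1  # one comparison per (m, n) tuple pair
    return comparisons

small = join_comparisons(100)
large = join_comparisons(1000)
print(small, large, large // small)  # 10000 1000000 100
```

The same reasoning explains why reducing inputs early (via select and temporary variables) pays off so much: shrinking both sides of a quadratic operator by 10x shrinks its work by about 100x.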