Some background
This story is related to a Django project that started a few years back, where the focus was to store groups of numbers. These numbers persist as number_start and number_end in a database table, using integer fields. Performance was not an issue in the beginning, as the team concentrated right from the start on minimizing the communication with the database. Django has some handy mechanisms such as bulk creates, bulk updates and prefetches that can solve most performance related issues.
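As a quick illustration of those mechanisms, here is a minimal sketch; the Group and Item models and the related name items are made up for this example:

from django.db import models

class Group(models.Model):
    name = models.CharField(max_length=50)

class Item(models.Model):
    group = models.ForeignKey(
        Group, related_name="items", on_delete=models.CASCADE
    )
    value = models.IntegerField()

# One INSERT for many rows instead of one query per object:
group = Group.objects.create(name="example")
Item.objects.bulk_create(
    [Item(group=group, value=i) for i in range(1000)]
)

# One UPDATE for many rows (Django 2.2+):
items = list(Item.objects.all())
for item in items:
    item.value += 1
Item.objects.bulk_update(items, ["value"])

# Two queries in total instead of one extra query per group:
for group in Group.objects.prefetch_related("items"):
    print(group.name, len(group.items.all()))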
Once we got closer to production, it was time to run a test with large amounts of dummy data. This had a dramatic impact on the performance. One very specific query implied a longer execution time, which meant higher production costs. The operation itself was not that complicated: it queried for ranges that overlapped with a given range.
Suppose we have the following model:
from django.db import models

class MyNumber(models.Model):
    number_start = models.IntegerField()
    number_end = models.IntegerField()
and the following fixture:
[
  {
    "model": "my_app.mynumber",
    "pk": 1,
    "fields": {
      "number_start": 10,
      "number_end": 20
    }
  },
  {
    "model": "my_app.mynumber",
    "pk": 2,
    "fields": {
      "number_start": 21,
      "number_end": 30
    }
  },
  {
    "model": "my_app.mynumber",
    "pk": 3,
    "fields": {
      "number_start": 31,
      "number_end": 50
    }
  },
  {
    "model": "my_app.mynumber",
    "pk": 4,
    "fields": {
      "number_start": 51,
      "number_end": 75
    }
  }
]
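Assuming the fixture is saved as my_numbers.json in a directory Django scans for fixtures (the file name here is just an example), it can be loaded with python manage.py loaddata my_numbers.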
To get all ranges that overlap with the interval 5 to 25, we can use the following filter:
from django.db.models import Q

MyNumber.objects.filter(
    Q(number_start__lte=25) & Q(number_end__gte=5),
)
The result would be [10, 20] and [21, 30].
Initially this operation was not a big deal, but on 1 million rows it took about 5 seconds, and it was only getting slower as the data grew.
First Optimization
It is always a good idea to make sure indexes are defined, so the first step was to add a B-Tree index on each field.
class MyNumber(models.Model):
    number_start = models.IntegerField(db_index=True)
    number_end = models.IntegerField(db_index=True)
It worked quite well for 1 million rows. Increasing the test data to 4 million rows caused the query to run in about 1 second, and on 8 million rows it took up to 6 seconds.
Something that seems to cover this use case pretty well are Range Types. Luckily, Django has support for these fields, which are available only on PostgreSQL.
Second Optimization
A new field has been introduced to the model. This way, compatibility does not break with other modules that use the existing schema.
from django.contrib.postgres.fields import IntegerRangeField
from django.db import models

class MyNumber(models.Model):
    number_start = models.IntegerField(db_index=True)
    number_end = models.IntegerField(db_index=True)
    number_range = IntegerRangeField(null=True)
The field number_range represents the same thing as the two initial fields, but with a dedicated type that is supported by the database and brings some extra functionality. In order to keep this field updated, we need to extend the save method to store the new range after creation or modification. The bulk_create in the model's manager also needs to be extended to achieve the same result for bulk operations:
from psycopg2.extras import NumericRange
from django.contrib.postgres.fields import IntegerRangeField
from django.db import models

class NumberManager(models.Manager):
    def bulk_create(self, objs, *args, **kwargs):
        # bulk_create() does not call save(), so the range
        # has to be populated on each object explicitly.
        for item in objs:
            item.update_range()
        return super().bulk_create(objs, *args, **kwargs)

class MyNumber(models.Model):
    number_start = models.IntegerField(db_index=True)
    number_end = models.IntegerField(db_index=True)
    number_range = IntegerRangeField(null=True)

    objects = NumberManager()

    def update_range(self):
        # "[]" makes both bounds inclusive.
        self.number_range = NumericRange(
            self.number_start,
            self.number_end,
            "[]",
        )

    def save(self, *args, **kwargs):
        self.update_range()
        super().save(*args, **kwargs)
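A quick sanity check of the new behavior, as a minimal sketch assuming the model above is migrated:

# save() fills the range automatically:
number = MyNumber(number_start=10, number_end=20)
number.save()
print(number.number_range)  # NumericRange(10, 20, '[]')

# bulk_create goes through the extended manager,
# so the range is populated here as well:
MyNumber.objects.bulk_create([
    MyNumber(number_start=21, number_end=30),
    MyNumber(number_start=31, number_end=50),
])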
A migration is also required to fill in the data for all existing entries. Making the update on each item would take a lot of time, and I would recommend this only on small databases, which is not the case here. Another idea is to use a RunPython migration with an update statement such as MyNumber.objects.all().update(number_range=NumericRange(F('number_start'), F('number_end'), "[]")), but NumericRange is not compatible with Django's F expression, so the only solution left is to use a RunSQL migration:
from __future__ import unicode_literals

from django.db import migrations

class Migration(migrations.Migration):

    dependencies = [
        ('my_app', '0002_add_number_range'),
    ]

    operations = [
        migrations.RunSQL(
            # int4range(lower, upper, '[]') builds an inclusive
            # integer range directly in PostgreSQL.
            "UPDATE my_app_mynumber "
            "SET number_range = int4range(number_start, number_end, '[]')",
            migrations.RunSQL.noop,
        ),
    ]
The queryset can now make use of the special operators for ranges and changes to MyNumber.objects.filter(number_range__overlap=NumericRange(5, 25, '[]')). The "[]" bounds are required, as by default PostgreSQL uses "[)", which means the lower end is included and the upper end is excluded. Make sure to get familiar with the range syntax and the way the operators work.
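To illustrate why the bounds matter, here is a small sketch with made-up values; with the default '[)' a range that only touches the query interval at its upper end would be missed:

from psycopg2.extras import NumericRange

# Default bounds '[)': 25 itself is excluded from the query range.
MyNumber.objects.filter(number_range__overlap=NumericRange(5, 25))

# Inclusive bounds '[]': both 5 and 25 belong to the query range.
MyNumber.objects.filter(number_range__overlap=NumericRange(5, 25, "[]"))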
Adding an index is the final missing piece. The recommendation is to use GiST or SP-GiST:
from django.contrib.postgres.indexes import GistIndex

class MyNumber(models.Model):
    ...

    class Meta:
        indexes = [GistIndex(fields=['number_range'])]
With this in place, the query now takes about 50 ms and is not influenced that much by the amount of data.
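To verify that the GiST index is actually picked up, QuerySet.explain() (available since Django 2.1) prints the PostgreSQL query plan; the exact output depends on the data and the planner:

print(
    MyNumber.objects.filter(
        number_range__overlap=NumericRange(5, 25, "[]")
    ).explain()
)
# A plan containing an "Index Scan" or "Bitmap Index Scan" on the
# GiST index confirms that the index is being used.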
Conclusion
This feature is powerful, and it would be quite handy to implement it right from the start. However, adding it later to the code base is also possible, as seen above. Having good test coverage is a big plus when playing around with new concepts.
Advantages
- Increases query performance
- Querysets are easier to read
Challenges
- Migrations need to be implemented via RunSQL
- Multiple parts of the model need to be extended
- Requires some time to get used to syntax and range operators