dbt — There is data at the end of the tunnel
dbt (data build tool) is a command-line tool that handles the T in ELT (Extract, Load, Transform) processes. It is a powerful technology that helps engineers and data scientists transform data by simply writing select statements and SQL validations in a structured fashion. In my previous job at Deliveroo, I worked with dbt, and I’d love to share some learnings and explain why we should treat SQL as software, just like any other microservice or Lambda function implemented in a programming language.
dbt
The diagram below shows how dbt fits into a tech stack. Essentially, it takes data from multiple sources such as databases, services and Lambda functions, loads it into a standard format, and then transforms that data into shapes that best suit particular use cases within a company.
There are two main concepts we need to introduce before going further: Model and Materialisation.
Model
A model is a single .sql file containing a single select statement that produces a single table (or other materialisation). A model can either transform raw data directly into its final form, for example a table for analytics or model training, or, more often, act as an intermediate step in such a transformation. We can create a hierarchy of models and join them up however we want. Our job as engineers is to create models and connect them, then leave the work of creating the actual datasets in the database to dbt. You can find further details about models in dbt — Models.
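As a minimal sketch (the table and column names here are made up), a model is just a select statement saved as a .sql file, and the file name becomes the name of the resulting table or view:

```sql
-- models/orders_summary.sql
-- dbt materialises the result of this select as a relation named orders_summary.
select
    customer_id,
    count(*)    as orders_count,
    sum(amount) as total_amount
from raw.orders   -- hypothetical raw table loaded by the EL part of the pipeline
group by customer_id
```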
Materialisation
Materialisations are the methods used to realise models in the database. Models are materialised as views, tables and other structures depending on the underlying database. There are a number of other possible refinements, including incrementally-loaded tables, where new data is merged into the existing data and the table grows each time dbt runs. You can find more details on this and the other possibilities in dbt — Materialisation.
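As an illustration (again with made-up names), the materialisation is declared in the model’s config block. The sketch below uses the incremental materialisation so that each run only processes rows newer than what the table already contains:

```sql
-- models/events_incremental.sql
{{ config(materialized='incremental', unique_key='event_id') }}

select
    event_id,
    user_id,
    event_type,
    occurred_at
from raw.events   -- hypothetical source table
{% if is_incremental() %}
  -- on incremental runs, only pick up rows newer than the current maximum
  where occurred_at > (select max(occurred_at) from {{ this }})
{% endif %}
```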
SQL is Software
In Building a Mature Analytics Workflow, Tristan Handy (the CEO of Fishtown Analytics — the company behind dbt) outlines how the same techniques software engineering teams use to collaborate on the rapid creation of quality applications can be applied to analytics code. In this section, I am going to summarise how we can adopt some of these techniques and embrace a software engineering mindset when working with SQL.
Version Control
All dbt models should live in a repository. In my experience, having all SQL code hosted in a shared repository helps improve collaboration between engineers and data scientists. Essentially, any change to a dataset follows the same workflow as any other service maintained by engineers in production. That is, if you want to add a new column or apply a new transformation to a dataset, you have to go through the whole process of creating a development branch, opening a PR and getting eyes from your colleagues on it before merging anything to production.
Quality Assurance
dbt provides simple mechanisms to validate the quality of datasets. After all, models (SQL behind the scenes) must receive the same treatment that we give to code implemented in a programming language such as Ruby, Python or Go. That is, we should aim for good test coverage to ensure transformations work as expected in production.
We can go from simple schema validations, such as checking whether a column contains null values, through referential integrity checks that validate the accuracy and consistency of data across a relationship, to more elaborate validations built as assertions written in SQL. Last but not least, you can find plenty of out-of-the-box validations in dbt-utils, such as equality, which asserts that two models match, and expression_is_true, which validates an expression over a given model.
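To make this concrete, here is a sketch of a schema file (model and column names are hypothetical, and the dbt-utils package is assumed to be installed) combining built-in tests with a dbt-utils assertion; dbt test then runs every one of them against the warehouse:

```yaml
# models/schema.yml
version: 2

models:
  - name: orders_summary
    columns:
      - name: customer_id
        tests:
          - not_null                  # simple schema validation
          - unique
          - relationships:            # referential integrity
              to: ref('customers')
              field: id
    tests:
      - dbt_utils.expression_is_true: # assertion over the whole model
          expression: "total_amount >= 0"
```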
Documentation
When we work in a cross-functional team, it is really important to ensure that everyone is on the same page, and nothing does that better than adding descriptions to your models and columns. If you properly document your models, there is gold at the end of the rainbow: dbt ships a built-in documentation website for your project where you can navigate through all the models and see a graph of the dependencies between them. You can find further details on dbt — Documentation.
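A minimal sketch, reusing the hypothetical model from earlier: descriptions live in the same YAML file as the tests, and dbt docs generate followed by dbt docs serve builds and hosts the documentation site, dependency graph included.

```yaml
# models/schema.yml
version: 2

models:
  - name: orders_summary
    description: "One row per customer with aggregated order metrics."
    columns:
      - name: customer_id
        description: "Foreign key to the customers model."
      - name: total_amount
        description: "Lifetime order value, summed from the raw orders table."
```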
Modularity
We write Ruby, Go and Python services with modularity in mind; some of the benefits are:
- Less code to be written.
- A single piece of code can be developed for reuse, eliminating the need to retype the code many times.
- It makes the code shorter, simpler and easier to maintain.
- It facilitates debugging errors, as they are localised to a specific area in the code.
Models, or transformations, are also made to be reused; we just need to learn a little bit about ref — the most important function in dbt according to its creators. Basically, we can reference one model within another in order to reuse them to build different datasets. dbt also makes it easier to keep code DRY (Don’t Repeat Yourself) through macros, SQL snippets that can be invoked like functions from models.
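Here is a small sketch with hypothetical names: the macro wraps a reusable snippet of SQL, and the model that follows builds on orders_summary through ref and invokes the macro like a function.

```sql
-- macros/pence_to_pounds.sql
{% macro pence_to_pounds(column_name) %}
    round({{ column_name }} / 100.0, 2)
{% endmacro %}

-- models/customer_value.sql
select
    customer_id,
    {{ pence_to_pounds('total_amount') }} as total_amount_pounds
from {{ ref('orders_summary') }}   -- dbt resolves the name and records the dependency
```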
Environments
Out of the blue, all of the company’s Periscope and Looker dashboards are broken, oh no! Someone was just trying to rename a column on that beautiful table. That’s the kind of situation that can happen when you do not have multiple environments. As friendly engineers, we should have multiple environments to develop and test things properly before we roll out any piece of code to production (yep, SQL is software after all). To facilitate this, dbt provides a simple way to set up different environments through a YAML file (profiles.yml).
I suggest having at least three environments: development, staging and production. Engineers should have their own schema to host models in the development environment. Once they are happy with the changes and pull requests have been approved by their peers, engineers should merge the changes to staging. The very last step is to promote the changes from staging to production.
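As a sketch (project name, hostnames, credentials and the warehouse type are all placeholders), profiles.yml defines one output per environment plus a default target, and dbt run --target staging or --target prod picks the environment to run against:

```yaml
# ~/.dbt/profiles.yml
analytics:
  target: dev                      # default environment
  outputs:
    dev:
      type: postgres               # placeholder warehouse type
      host: localhost
      user: analyst
      password: "{{ env_var('DBT_PASSWORD') }}"
      port: 5432
      dbname: warehouse
      schema: dbt_jsmith           # per-engineer development schema
      threads: 4
    staging:
      type: postgres
      host: staging-warehouse.internal
      user: dbt
      password: "{{ env_var('DBT_PASSWORD') }}"
      port: 5432
      dbname: warehouse
      schema: analytics_staging
      threads: 4
    prod:
      type: postgres
      host: warehouse.internal
      user: dbt
      password: "{{ env_var('DBT_PASSWORD') }}"
      port: 5432
      dbname: warehouse
      schema: analytics
      threads: 4
```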
Conclusion
I have talked a bit about dbt and how we can apply engineering techniques to SQL code in order to improve team collaboration and ensure we deliver reliable datasets for ML training and analytics. The most important takeaway is that we should look after our analytics code as we do any other application implemented in a programming language.
What about you? How have you been treating your analytics code? I’d love to hear experiences from other engineers working with SQL and dbt :)