Recently I needed to create a new layer in my personal data warehouse (DW). This DW runs on PostgreSQL and gets data from different sources, like grocy (a personal grocery ERP; I talked about how I use grocy in this post), Firefly (finance data), and Home Assistant (home automation).
So, I’ve been using dbt to organize all this data into a single data warehouse. Here’s what I’ve learned so far:
FAQ
Is dbt an ETL tool like Pentaho Data Integration (PDI) or SQL Server Integration Services (SSIS)?
Short answer: No.
Long answer: it can deliver the same result, depending on where your data is and what you want to do with it.
Say you want to create a new schema in PostgreSQL for a DW, where dbt will read the data from the raw schema and create new tables with standardized names, in a different model than the operational (transactional) one. Yes, you can do that with dbt. In fact, there is much more you can do.
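As a sketch of what that looks like in practice: a dbt model is just a SQL file that selects from a source, and dbt turns it into a table or view with a standardized name. The schema and column names below (raw, products, product_id) are made up for illustration, not from my actual project.

```sql
-- models/staging/stg_products.sql (hypothetical names)
-- Reads from the raw schema, declared as a dbt source,
-- and renames columns to standardized names.
select
    id          as product_id,
    name        as product_name,
    row_created as created_at
from {{ source('raw', 'products') }}
```

The `source()` call resolves to a table you declare once in a sources yaml file, so the raw schema location lives in one place instead of being hardcoded in every query.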
The point here is, dbt won’t read or write data outside your database (something that PDI and SSIS do easily). That’s why you cannot say dbt is an ETL tool: it doesn’t extract and load data to a different logical place.
This is only one of the situations where dbt doesn’t replace an ETL tool. In contrast, if your data is already in the same place, you can easily transform it using SQL (which is easier to understand for data analysts and data engineers) and yaml/yml files.
Is dbt fast? How is the performance? What are the requirements to run dbt compared to other tools?
Short answer: it really depends, but dbt tends to run about the same or slower.
Long answer: in a situation where you can use dbt to replace an ETL tool like PDI or SSIS, dbt will perform about the same, because it doesn’t really use a complex system of its own to transform data. It relies mostly on the engine where the data already lives.
Also, a big difference is that an ETL tool requires separate infrastructure (with a good amount of RAM and CPU) to process the data. dbt doesn’t need this infrastructure, as it depends mostly on the database itself. If you’ve got a great server to run an ETL tool on, dbt will probably perform slower than the ETL tool.
For example, if you materialize the data as a view (so dbt won’t copy the data to a new table, but will create a new view with the instructions you give it), then, depending on the complexity of the query, dbt won’t perform any better than an ETL tool.
Maybe you skip the step of writing the data out, and that can be a big chunk of the work, but it really depends on many things: what technology you are using to store the data, what tool you are using to write it, and the write method. On the other hand, with dbt you don’t have to worry about additional storage, since dbt will create a view.
If you choose to materialize the data as a table, the transformation time will be about the same.
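The choice between view and table is a one-line config at the top of the model. A minimal sketch, with a made-up model name and a made-up upstream model in the `ref()` call:

```sql
-- models/marts/fct_orders.sql (hypothetical names)
-- materialized='view' stores only the query definition;
-- switch to 'table' and dbt writes the results out on each run.
{{ config(materialized='view') }}

select
    order_id,
    sum(amount) as total_amount
from {{ ref('stg_order_items') }}
group by order_id
```

Changing `'view'` to `'table'` is all it takes to trade query-time cost for storage and run-time cost, which makes it cheap to experiment with both.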
How is the productivity compared with other ETL tools?
With dbt you can track changes more easily, because you’ve got git. The project files are yaml/yml and SQL, so it’s really easy to make and track changes, and easier to work with CI/CD. It’s also easier to create tests, since you can set up default and custom tests while you are setting up the configuration for a specific table. Say you want to guarantee a column has no null values: you just add this test to your model, under the table name (in this case orders) and the column order_id.
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - not_null
You can do pretty much the same thing with unique, accepted_values, and relationships, or even create custom generic tests, like testing whether a number is odd.
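A custom generic test is just a SQL file with a test block in your project’s tests/generic directory. A minimal sketch of the odd-number example (the `is_odd` name and file path are my own choices):

```sql
-- tests/generic/is_odd.sql (hypothetical test name)
-- A dbt generic test: the test fails if this query
-- returns any rows, i.e. any row with an even value.
{% test is_odd(model, column_name) %}

select *
from {{ model }}
where mod({{ column_name }}, 2) = 0

{% endtest %}
```

Once defined, you reference it in the model’s yaml exactly like not_null, by listing `- is_odd` under the column’s tests.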
Conclusion
This is what I’ve learned about dbt so far. Some of this information was not easy to find on the internet, so I thought it would be a good idea to share it with you.
Let me know in the comments about your experience with dbt and how you use it.