Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

DuckDB is a relational OLAP store. If you want to do transformations on relational data using SQL then I think nowadays you would look at the modern data stack and do it with DBT.

If you have genuinely big and unstructured data then of course you need a cluster and would reach for Spark.

If you have smallish data then maybe DuckDB has a role because working with SQL is nicer than Pandas. But a lot of time you actually need the complexity of Pandas to do the transformation you need.

DuckDB is neat but I still can’t quite convince myself of a killer use case.



I am not sure I understand the first comment very well. Are you saying that instead of duckdb use modern data stack? Because DBT and DuckDB don't seem to contradict, but can work together. FWIW, I think the only important breakthrough in the "modern data stack" is really dbt. The rest, nothing modern about it


It’s more a comment on where the market is at rather than a recommendation. I’m not saying it is bad tech, but I don’t see it’s niche.

If you have a few billion records and you want to filter, join, aggregate them using SQL then DBT against an OLAP server solves that issue so well that it doesn’t leave much white space for DuckDB.

I mentioned modern data stack because when you have SaaS, low code, consumption based billing, open source etc then it treads even more on the DuckDB value prop. DuckDB would have been great if Oracle was my only choice, but when I have Snowflake and Clickhouse in the toolbox it is a tougher market for them to carve out a niche.


Also fwiw its always lowercase - dbt.


> working with SQL is nicer than Pandas

Really? I prefer working with dataframe apis. You get a nice sql-like paradigm plus all the control structures of the runtime.


Databases are just much, much faster than Pandas, and that's before you start factoring the extraction and loading of data. I treat Pandas as a last resort when I can't do something in SQL, generally this is something like integrating with external services or running recordlinkage.


If you're curious, I've written a FOSS record linkage library that executes everything as SQL. It supports multiple SQL backends including DuckDB and Spark for scale, and runs faster than most competitors because it's able to leverage the speed of these backends: https://github.com/moj-analytical-services/splink


Oh hot tip! Thank you! Love the blog btw


You might be interested in checking out Ibis (https://ibis-project.org/). It provides a dataframe-like API, abstracting over many common execution engines (duckdb, postgres, bigquery, spark, ...). Ibis wrapping duckdb has pretty much replaced pandas as my tool of choice for local data analysis. All the performance of duckdb with all the ergonomics of a dataframe API. (disclaimer: I contribute to Ibis for work).


Interesting ! One newbie question, how does ibis differ from sqlalchemy ?


sql alchemy is an orm, where ibis looks to be a dataframe api that is sort of a dsl over sql. It doesn't try to map relational domains to an object oriented paradigm like sql alchemy does


SQL is much nicer for anything non-trivial. Pandas methods get unwieldy for complex aggregations.

Also Pandas methods are imperative so cannot be optimized. SQL is declarative so it can be optimized to the hilt and DuckDB is faster than Pandas in almost all cases, even on Pandas data frames themselves! (partly due to vectorization).




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: