DuckDB is a relational OLAP store. If you want to do transformations on relation...

neeleshs · on Feb 10, 2023

I am not sure I understand the first comment very well. Are you saying that instead of duckdb use modern data stack? Because DBT and DuckDB don't seem to contradict, but can work together. FWIW, I think the only important breakthrough in the "modern data stack" is really dbt. The rest, nothing modern about it

benjaminwootton · on Feb 11, 2023

It’s more a comment on where the market is at rather than a recommendation. I’m not saying it is bad tech, but I don’t see it’s niche.

If you have a few billion records and you want to filter, join, aggregate them using SQL then DBT against an OLAP server solves that issue so well that it doesn’t leave much white space for DuckDB.

I mentioned modern data stack because when you have SaaS, low code, consumption based billing, open source etc then it treads even more on the DuckDB value prop. DuckDB would have been great if Oracle was my only choice, but when I have Snowflake and Clickhouse in the toolbox it is a tougher market for them to carve out a niche.

hobs · on Feb 11, 2023

Also fwiw its always lowercase - dbt.

89vision · on Feb 10, 2023

> working with SQL is nicer than Pandas

Really? I prefer working with dataframe apis. You get a nice sql-like paradigm plus all the control structures of the runtime.

camgunz · on Feb 11, 2023

Databases are just much, much faster than Pandas, and that's before you start factoring the extraction and loading of data. I treat Pandas as a last resort when I can't do something in SQL, generally this is something like integrating with external services or running recordlinkage.

RobinL · on Feb 11, 2023

If you're curious, I've written a FOSS record linkage library that executes everything as SQL. It supports multiple SQL backends including DuckDB and Spark for scale, and runs faster than most competitors because it's able to leverage the speed of these backends: https://github.com/moj-analytical-services/splink

camgunz · on Feb 11, 2023

Oh hot tip! Thank you! Love the blog btw

jammycrisp · on Feb 11, 2023

You might be interested in checking out Ibis (https://ibis-project.org/). It provides a dataframe-like API, abstracting over many common execution engines (duckdb, postgres, bigquery, spark, ...). Ibis wrapping duckdb has pretty much replaced pandas as my tool of choice for local data analysis. All the performance of duckdb with all the ergonomics of a dataframe API. (disclaimer: I contribute to Ibis for work).

je42 · on Feb 11, 2023

Interesting ! One newbie question, how does ibis differ from sqlalchemy ?

89vision · on Feb 13, 2023

sql alchemy is an orm, where ibis looks to be a dataframe api that is sort of a dsl over sql. It doesn't try to map relational domains to an object oriented paradigm like sql alchemy does

wenc · on Feb 10, 2023

SQL is much nicer for anything non-trivial. Pandas methods get unwieldy for complex aggregations.

Also Pandas methods are imperative so cannot be optimized. SQL is declarative so it can be optimized to the hilt and DuckDB is faster than Pandas in almost all cases, even on Pandas data frames themselves! (partly due to vectorization).