DuckDB: Querying JSON files as if they were tables (duckdb.org)
294 points by cmdlineluser on March 3, 2023 | hide | past | favorite | 80 comments


I think I write this under every article about DuckDB, but it's become an indispensable tool for me. I used to abuse Excel because going from Excel to a script to process some data was too much friction, but with DuckDB the friction is gone: loading CSV and Parquet (and now JSON) files is a snap, you can create and persist any tables you want, the SQL dialect has lots of useful sugar.
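For anyone who hasn't tried it, the low-friction part looks roughly like this in the DuckDB shell (file names here are made up):

```sql
-- Query a CSV file in place, no import step
SELECT * FROM 'sales.csv' LIMIT 10;

-- Persist it as a table when you want to keep working on it
CREATE TABLE sales AS SELECT * FROM 'sales.csv';

-- Parquet (and, with the json extension, JSON) works the same way
SELECT * FROM 'events.parquet' WHERE amount > 100;
```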


I've just added DuckDB support to the free SQL tool I make: https://www.timestored.com/qstudio/help/duckdb-sql-editor It supports click-to-open, table browsing, etc. If you have some time, I would really appreciate feedback from a real user other than myself.


Tried to open a DuckDB file with one table with 890K rows and about 30 columns. The process stalled and I had to kill it (macOS). The DuckDB CLI opens it in a snap.


:( Thanks for trying. The only thing I can think of is perhaps a version conflict (requires 0.70), but I would have expected it to say so rather than freeze. I will try to recreate the problem. Thanks.


Yes, I think I have 0.60, good point. I'll update the version and try again later.


The benefit of “in process” isn’t really clicking for me.

We had Pandas or similar if we wanted to load and transform some data in memory. SQL is nicer than Pandas APIs but not sure that’s a killer feature?

If we have a lot of data and multiple people working with it then it makes sense to centralise it in a database or warehouse where it’s then easy to access via SQL anyway.

We can query files on S3 with it and have the processing locally, but then we have network latency because compute and storage are further apart. There's a cost benefit here, which could be significant, because we don't need to run a server.

It’s nicely implemented but I’m not seeing a big gap that it fills?


Just to agree with @orthozerox: it's not about a 'killer feature' but about lowering the friction.

Lots of users: (a) mentally align better with SQL than with pandas APIs, regardless of whether they know both; (b) want decent performance on their analyses, which they aren't getting from pandas, and won't get from many OLTP databases they're using over the wire; (c) want easy access to Parquet and CSV files, locally and remotely, with minimal development overhead. It's super simple with duckdb.

Nonetheless, some of the other things you pointed out are real tradeoffs. We're building a serverless cloud capability at motherduck on top of duckdb in order to address some of these and optimize compute and storage based on data locality, bandwidth and the need for collaboration amongst multiple people.


Focusing on "in process" for the moment, I believe that where duckdb can really shine is in what I like to call last-mile-analytics.

Sure, the complete and most up to date version of your data can be stored in a warehouse/database but when you're potentially slicing/dicing/filtering/sorting/exploring/munging data, it can get quite expensive to have your warehouse/database servicing these requests. Even if it's a cloud/modern warehouse like Snowflake. This would be especially true if you're doing analytics on a non-OLAP database.

With duckdb you could pull down the subset of data you're working on from your warehouse/db/datalake, and then perform the last-mile analytics "in-process". That might be in your notebook, browser (WASM), etc. As a result you can expect some pretty amazing query performance, since it's all happening locally on the subset of the entire data that you've selected.

Then of course if you still wanted to use pandas you can point it at duckdb and let pandas fill in the deficiencies of SQL (while still potentially pushing down some SQL to duckdb). You can then take those dataframes and push them right back into duckdb instead of writing back out to disk.
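A sketch of that pull-down-then-query-locally flow (bucket, file, and column names here are hypothetical; remote reads need the httpfs extension):

```sql
INSTALL httpfs; LOAD httpfs;

-- Pull just the slice you need from the data lake into a local table...
CREATE TABLE recent AS
SELECT * FROM read_parquet('s3://my-bucket/events/*.parquet')
WHERE event_date >= DATE '2023-01-01';

-- ...then iterate on it in-process, as fast as local hardware allows
SELECT user_id, count(*) AS n FROM recent GROUP BY user_id ORDER BY n DESC;
```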

> We can query files on S3 with it and have the processing locally, but then we have network latency because compute and storage are further apart.

If your files on S3 are in a sensible form (hive, iceberg, etc), then you can also use duckdb to pull only the data you need from your bucket and work on it locally.
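With hive-style partitioning the partition values become columns, so a filter on them prunes which objects get fetched at all (paths and column names here are illustrative):

```sql
SELECT *
FROM read_parquet('s3://my-bucket/logs/*/*/*.parquet', hive_partitioning = 1)
WHERE year = 2023 AND month = 2;  -- only matching partitions are downloaded
```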


Well, I know SQL much better than I know either Scala+Spark or Python+Pandas. DuckDB is also much easier to set up, being a self-contained JAR. I also like how having a DB file means I don't have to worry about losing in-memory dataframes or leaving random temp files all over the filesystem.

In short, it's not about having a single killer feature, but about being a low-friction solution to my problems.


To me it does fill a big gap.

It’s a local columnar engine that I can use inside a Jupyter notebook. This lowers my cost of iteration tremendously.

Yes I can query data from Postgres and munge with Pandas.

But what if I need to iterate on a large set of parquet files (mine is 200gb on my local machine, Hive partitioned, over a billion records) and munge them with complex SQL with a high perf engine? And seamlessly join with other smaller local datasets (there are always smaller datasets that contain metadata) in CSV, Pandas and JSON format in the same SQL statement?
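That multi-format join really is a single statement in DuckDB; something like this (all file and column names invented for illustration):

```sql
SELECT m.category, count(*) AS rows, avg(f.value) AS avg_value
FROM read_parquet('data/*/*.parquet', hive_partitioning = 1) AS f
JOIN 'metadata.csv' AS m USING (id)   -- CSV joined directly
JOIN 'labels.json'  AS l USING (id)   -- JSON too
GROUP BY m.category;
```

And when running from Python, a pandas DataFrame in scope can be referenced by name in the same query via DuckDB's replacement scans.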

This is a surprisingly common use case in a lot of data science work and prior to DuckDB you could not do it easily, ergonomically or quickly with a single tool. The authors of DuckDB talked to lots of data scientists to learn their pain points and the final product shows that they really listened well.


To me this is a lot simpler than Pandas, as it's just SQL; it's like SQLite with a columnar engine, etc.


You can also do this with AlaSQL.js, which can run SQL against JSON and also works with CSV and XLSX; just include it using a script tag or import it as a node module.


Any chance you’ve tried clickhouse local? I was thinking it might be a good fit but haven’t used duckdb at all so I might be missing out on big differences.


No, I haven't, but it looks like it's a standalone console application, while DuckDB is in-process, like SQLite, and lives inside its JDBC driver. This means I can use it inside any compatible GUI and get stuff like schema browsing and IntelliSense out of the box. Since I have DBeaver open all day anyway, DuckDB is always a tab away.


Clickhouse local is very fast and good for many purposes but DuckDB is equally fast (or faster in some cases but this is a moving target) and supports a wider range of SQL that people like me often use.

DuckDB is great for data scientists and people who need to run complex analytic queries in their Jupyter notebooks and their Python prod code on local or S3 hosted data.


What’s the workflow with JDBC? Say you connect Tableau to it. How would you populate it with data?


My workflow is that I have a connection to c:\temp\scratchpad.db in DBeaver that I populate via

  create table some_data as select * from 'c:\temp\whatever.csv'
or

  create table other_data as select * from read_parquet('c:\temp\000000_0')

which I then can transform using SQL and export the result into CSV, Parquet or SQLite when needed.
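For completeness, the export side of that workflow is just COPY (paths and table names illustrative):

```sql
-- Write a query result, or a whole table, back out
COPY (SELECT * FROM some_data WHERE region = 'EU')
TO 'c:\temp\result.csv' (HEADER);

COPY other_data TO 'c:\temp\result.parquet' (FORMAT PARQUET);
```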


I'm a recent convert too. I used to work with SQLite for querying datasets, but for my use case DuckDB is much faster, plus the CLI is nicer to work with.


Welcome to the gang! :)

https://github.com/multiprocessio/dsq#comparisons

Realistically though aside from the variety of input formats that DuckDB doesn't (yet) support, I think most people should probably use DuckDB or ClickHouse-local. Tools like dsq can provide broader support or a slightly simpler UX in some cases (and even that is obviously debatable). But I think the future is more the DuckDB or ClickHouse-local way.

dsq may end up being a frontend over DuckDB some day.


I made a docker image with a number of extensions already installed and enabled so you can start using DuckDB with the lowest friction.

`alias dckr='docker run --rm -it -v $(pwd):/data -w /data duckerlabs/ducker'`

then `dckr` gives you a DuckDB shell with PRQL, httpfs, json, parquet, postgres, sqlite, and substrait enabled.

For example, to get the first 5 lines of a csv file named "albums.csv", you could run it with PRQL

```dckr -c 'from `albums.csv` | take 5;'```

https://github.com/duckerlabs/ducker


This is really cool!

With their Postgres scanner[0] you can now easily query multiple datasources using SQL and join between them (i.e. Postgres table with JSON file). Something I previously strived to build with OctoSQL[1]. There's even predicate push-down to the underlying databases (for Postgres)!
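A cross-source join with the Postgres scanner looks roughly like this (connection string, schema, table, and file names are placeholders):

```sql
INSTALL postgres_scanner; LOAD postgres_scanner;

SELECT u.name, e.payload
FROM postgres_scan('host=localhost dbname=app', 'public', 'users') AS u
JOIN 'events.json' AS e ON u.id = e.user_id;  -- JSON file joined to a live Postgres table
```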

It's amazing to see how quickly DuckDB is adding new features.

Not a huge fan of C++, which is right now used for authoring extensions, it'd be really cool if somebody implemented a Rust extension SDK, or even something like Steampipe[2] does for Postgres FDWs which would provide a shim for quickly implementing non-performance-sensitive extensions for various things.

Godspeed!

[0]: https://duckdb.org/2022/09/30/postgres-scanner.html

[1]: https://github.com/cube2222/octosql

[2]: https://steampipe.io


To answer myself, I've found a project which enables extension development for DuckDB using Rust[0].

[0]: https://github.com/Mause/duckdb-extension-framework


Consider polars for rust. Much, much faster with fewer resources than Duckdb or datafusion in my experience.


Polars is a dataframe library, no? That's a quite different use-case.


oh no.. it does lazy query optimization, out of core... most if not all of the good stuff.


It's not a CLI SQL engine though, is it?


oh no.. for that you'd want datafusion


After having tried their PostgreSQL plugin I feel like it's a bit too early to use in production.

The pushdown filters are few and unclear, which is one of the issues; another is that certain data types aren't handled, so the table can't be scanned at all (even if the column in question isn't used).

I think that DuckDB is also missing a PostgreSQL logical replication driver to continuously replicate a subset of tables you want to run stats on.

Syncing the full table every time is too slow.


It's a very exciting time to be working in this space! Going beyond structured databases and file formats like JSON/CSV, there are also systems that can query APIs, source code, ML models, etc.

My own Trustfall query engine is one of them: https://github.com/obi1kenobi/trustfall

For example, you can query the HackerNews APIs from your browser: "Which Twitter/GitHub users comment on stories about OpenAI?" https://play.predr.ag/hackernews#?f=1&q=IyBDcm9zcyBBUEkgcXVl...

One of its real-world use cases is at the core a Rust semver linter: https://predr.ag/blog/speeding-up-rust-semver-checking-by-ov...


It's definitely cool to be able to query data in place instead of inserting it into a table. You can use clickhouse-local to do the same thing with JSON files (and with dozens of other data formats): https://clickhouse.com/blog/worlds-fastest-json-querying-too...


What's funny, is I've wanted something similar as a feature for "Azure Data Studio" that can open/use a CSV file and query it as a sqlite table. Basically an auto-import to a temp or in-memory db/table that you can then query against.

Would just be a nice gui feature to have.


That is exactly https://superintendent.app (disclaimer: I'm the creator)


Ah, I was looking for exactly this the other day. I'm trying to build a git-based interface to our BI tool so that we can keep the config for our reports in source control instead of in a database.

Was looking for something to read, via SQL, the JSON files which will house the config, i.e. a human-readable db as an alternative to what the BI tool is using for its config persistence.

Will give DuckDB a go, thanks for posting!


I’m working on an open source BI alternative called Evidence which works nicely with version control and CI/CD.

If that’s of interest, you can read our launch HN here: https://news.ycombinator.com/item?id=28304781

One of our community members has built a pretty cool Duck DB + dbt + evidence data stack that you can run entirely in GitHub codespaces. He’s calling it modern data stack in a box.

You can see that repo here: https://github.com/matsonj/nba-monte-carlo


Are you talking about metabase by any chance?


Haha indeed! I've started with a very basic Ruby api client that can read and create dashboards.

My plan is to poc a tool that allows you to edit metabase config as files and secondly something that can replicate cloud instances to other environments like local docker image or staging instance


Mind you this isn’t appropriate for most cases. But I love the idea of “you start with text file. You end with text file. All the database stuff, indexes, etc. are just a detail.”

Often I find that the database wants to be the authority and that makes working with different formats a bit uncomfortable.


We’re currently building real-time apis backed by terabytes of compressed parquet… hundreds of billions of ‘rows’… in exactly this fashion using polars. It amazes us at every turn.

Join us and help!


What project?

Do you mean polars reading Parquet into DuckDB to process that amount of data?


Internal. We're using Polars as the query engine to effectively query that data statically at rest (more accurately, mmap'd on disk in arrow ipc format)


What does this look like in practice? Using the filesystem as a database?


GNU Recutils https://www.gnu.org/software/recutils/

is a good example of an actual database that uses plaintext files in your filesystem.

I can see the argument that doing this with JSON is better (or worse), but regardless, Recutils is an interesting idea that i wish more people knew about. I can imagine a lot of cool things emerging if people would iterate on the idea.


Recutils is great, but it needs a rewrite, I think.


Anything that stores data on a computer is essentially a database. It's all about representation and what kinds of operations you prioritize for performance.


Apache Spark / Databricks is an example of this. Parquet files are stored in folders. A folder is assumed to hold one dataset split into multiple files based on specified partition criteria. The VMs read the necessary files into memory and then operate on it.


Isn't linux a good example of this? Everything is a file.


> If your JSON file is newline-delimited, DuckDB can parallelize reading.

I'd like to understand more about what that means. Does it use multiple threads each reading from a different position in the file?


DuckDB will use multiple threads for reading the same file. Each thread will read different parts of the file, but the output will be in the order that the file came in due to DuckDB’s order preserving parallelism.
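In SQL terms, that's the newline-delimited reader (file name hypothetical):

```sql
-- One JSON object per line: the file can be split into byte ranges
-- and parsed by multiple threads
SELECT * FROM read_ndjson_auto('events.ndjson');

PRAGMA threads = 8;  -- adjust the degree of parallelism if needed
```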


I'm impressed to all heck how fast this is, and how intuitive!

I tried it against a large-ish (hundreds of MB) log file; the query was simple to write and it spat out the result very quickly.

Impressed!


Awesome stuff!

I was about to comment about how this is all fantastic stuff, but I've really found reading through the duckdb docs quite challenging. For these JSON table functions, though, the documentation looks much better.

https://duckdb.org/docs/extensions/json

Need to spend some more time digging in, but this json functionality combined with some kind of file partitioning (Hive or hive-like) looks promising for some of my use cases.

Incidentally, the documentation for the hive/parquet stuff is a good example of what I'm talking about above. For the `parquet_scan` function, where can I see all of the possible function parameters? Where can I get more information about the specifics of `FILENAME`, `HIVE_PARTITIONING`, etc.?
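For what it's worth, the named parameters mentioned are passed like this (whether this is the full set is exactly the question the docs should answer; paths invented):

```sql
SELECT *
FROM parquet_scan('data/*/*.parquet', FILENAME = 1, HIVE_PARTITIONING = 1);
-- FILENAME=1 adds a column with each row's source file;
-- HIVE_PARTITIONING=1 turns directory names like year=2023 into columns
```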


This is cool, but would like to give some higher level context about querying JSON files.

JSON is a row based file format. It doesn't allow query engines to skip rows or skip columns when running queries, so all data needs to get read into memory. That's really inefficient.

Column based file formats allow query engines to skip entire columns of data (e.g. Parquet). Parquet also stores metadata on row groups and allows query engines to skip rows when reading data. These performance enhancements can speed up queries anywhere from not at all to 100x or more (depending on how much data is skipped).

Data Lakehouse storage systems abstract the file metadata to a separate layer, which is even better than storing it in the file footer like Parquet does.

This DuckDB functionality is cool, but I think it's best to use it to convert JSON files to Parquet / a Lakehouse storage system, and then query them. JSON is a really inefficient file format for running queries.
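The convert-once-then-query pattern is a couple of statements (file names are made up):

```sql
-- Read the JSON once, write a columnar copy
COPY (SELECT * FROM read_json_auto('events.json'))
TO 'events.parquet' (FORMAT PARQUET);

-- Later queries can skip columns and row groups
SELECT user_id, count(*) FROM 'events.parquet' GROUP BY user_id;
```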


Could anyone who uses these tools regularly tell me if this is better than jq for querying?


It really depends. Using the relational operators to query deeply nested JSON objects is pretty painful (multiple layers of unnests) but fairly simple in jq. On the other hand, joining a couple of "flat" JSON files is simple in DuckDB but not readily supported in jq. And if you already know SQL, that's a win, of course: I know how to group and aggregate using DuckDB since I know SQL, but currently have no idea how to do that in jq. And once I find a solution in jq, that is not knowledge I can transfer to other tools.

jq's syntax is deliberately terse which works really really well for "one-liners", while sql queries tend to be more verbose.
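As a concrete comparison, a group-and-count over a flat JSON file (field names invented):

```sql
-- Roughly the jq: group_by(.status) | map({status: .[0].status, n: length})
SELECT status, count(*) AS n
FROM 'requests.json'
GROUP BY status
ORDER BY n DESC;
```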


I'm currently operating a very small (10s of millions of rows, ~20GB of total data) low-write MySQL DB with a couple different tables. I'm new to RDBs in general and am using MySQL because my thought was any "real" DB would be better than our previous "pipeline", which was just doing all our data filtering/merging with CSVs and Pandas in Python (extremely slowly, and frustrating).

I like the simplicity of DuckDB's proposal, but haven't seen much info about how fast to expect it to be in comparison with traditional RDBs, for smaller, mostly-read-only applications.


Scanning through a CSV can be quite close to querying a SQL database in performance when the SQL database doesn't have any indices. The primary benefits of using a SQL database for querying are (1) indices and (2) a declarative query language. Using DuckDB or SQLite's CSV/JSON support gets you the best of both worlds (minus indices), where you get the declarative query language and query planner but your data's still just CSV/JSON files.

For a dataset that size, I'd probably use SQLite to avoid having to manage a persistent MySQL process, especially when it's being used as an alternative to CSV files. That is, unless there's a MySQL/Postgres server already running I can just create a new database on.


> Using DuckDB or SQLite's CSV/JSON support gets you the best of both worlds (minus indices)

DuckDB automatically creates indexes for all general-purpose columns. However, they're not persisted.

https://duckdb.org/docs/sql/indexes.html


DuckDB is "column oriented" vs "row oriented". I have found it 10x* faster for queries on data your size compared to SQLite or MySQL or Postgres. The added advantage of it being a single file is very nice as well.

*I use the HoneySQL (clojure) library to programmatically build up queries and execute them via the JDBC driver.


Perhaps have a look at this article [1]

[1] https://www.vantage.sh/blog/querying-aws-cost-data-duckdb


I used to do this with Apache Drill a few years ago. There is something beautiful about downloading 1 binary and being able to query your files (json/csv/parquet etc) right away


Running into a couple issues right out of the gate:

1) Needed to increase maximum_object_size

2) Unexpected yyjson tag in ValTypeToString

Couldn't find a reference anywhere to that error. Loads into Snowflake without a hitch - which is where I normally query large JSON files.


Tried "select * from 'data.json' limit 10" on a 6.3 MB file (which feels relatively tiny…) and got the same `unexpected end of data. Try increasing "maximum_object_size"` error. (This is my very first attempt to use duckdb, so with respect I'm not invested enough to open an issue).


Thanks for trying it out!

Could you perhaps open an issue [1] or share the file with us so we could investigate the problem?

[1] https://github.com/duckdb/duckdb/issues


I tried to do "select * from ... limit 1" from a 1.7GB JSON file (array of objects), and I had to increase maximum_object_size to 1GB to make it not throw an error. But DuckDB then consumed 8GB of RAM and sat there consuming 100% CPU (1 core) for ever — I killed it after about 10 minutes.
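For reference, the knob being raised here is a named parameter of the JSON reader functions; something like (value illustrative):

```sql
SELECT *
FROM read_json_auto('big.json', maximum_object_size = 1073741824)  -- 1 GiB
LIMIT 1;
```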

Meanwhile, doing the same with Jq ("jq '.[0]'") completed in 11 seconds and consumed about 2.8GB RAM.

I love DuckDB, but it does seem like something isn't right here.


If you ever need to join two large dataframes, but are OOMing on the join, write them to disk as parquet files then use DuckDB to do the join. It's amazing what you can do on one machine thanks to DuckDB.
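The spill-to-Parquet-then-join trick is only a few lines (file names hypothetical; from pandas you'd first do df.to_parquet(...)):

```sql
-- Stream the join result straight back to disk instead of
-- materializing the whole thing in memory
COPY (
  SELECT *
  FROM 'left.parquet'  AS l
  JOIN 'right.parquet' AS r USING (id)
) TO 'joined.parquet' (FORMAT PARQUET);
```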


This isn't unique to duckdb. Almost all databases allow for sorting and joins of large tables that don't fit into memory.


Yes but if you're in a Jupyter notebook, you may not be directly connected to a DB. If you're using pandas, this unlocks some scalability before needing dask and a cluster.


Very nice! Does anyone know if we can query duckdb with a pandas dialect?



Hey, I maintain a tool called siuba that converts pandas methods to SQL, including for duckdb!

* duckdb example: https://siuba.org/guide/workflows-backends.html#duckdb

* supported methods: https://siuba.org/guide/ops-support-table.html


Not Pandas, but very similar and (in my very biased opinion) better, is PRQL (www.prql-lang.org), which as of yesterday you can now use in DuckDB!

See my comment above: https://news.ycombinator.com/item?id=35027712


One of the major reasons we bet big on DuckDb at MotherDuck is due to the extendable nature of the code base, as demonstrated by the frequency of major improvements and additions.


How does this compare with CouchDB, which also stores JSON and allows you to construct cached map/reduce views?

edit: Looks like DuckDB lets you use SQL-style queries


Really makes you wonder what's the difference between NoSQL and SQL databases if they can be rendered as each other.


This looks pretty cool. I was using the JSON datatype in MySQL at the beginning of my project and we ended up yanking it out because of the way you query data within the JSON. It just started getting kludgey and I felt like I was trying to turn MySQL into Mongo, but suffering because it's not.

Will follow DuckDB.


Besides XML, JSON is about the worst way to format tabular data, right?


If it's tabular, self-describing formats have way too much overhead. I ran a query with a tabular result in the neighborhood of 100 columns by 215k rows, and exported it in multiple formats:

  - CSV: 166 MB
  - JSON: 795 MB
That said, not all data is tabular.

DuckDB already supports Parquet, which supports structs and is a very good format for storing data for reporting workloads. But JSON is a standard interchange format, so a lot of people are going to want to do something with JSON payloads they receive from API calls.

I could definitely imagine a workload where you receive JSON from an API call, load it into DuckDB or similar to help with ETL, then store results in Parquet.


For me it depends a lot on the context. JSON is often very human readable (as long as it's not too deeply nested), fairly well defined (compared to CSVs), and most languages and software have easy out of the box support for parsing and manipulating it.

If I were building a system that had to deal with large amounts of tabular data that isn't directly consumed by humans, JSON wouldn't be my first choice nor my last.


It's interesting that JSON is still the format of choice for transmitting tabular data to SPAs and mobile apps. Granted, it's likely compressed. But still seems something more efficient like CSV would be better.


This is very true. DuckDB does not support JSON because it’s a good tabular format, but because JSON is ubiquitous, and there are many use cases where querying JSON dumps for analytics is useful.


My love for line-based data formats has increased over time: CSV, one JSON string per line, and so on. You can always append to the data and you can deserialize line by line.



