richraposa's comments | Hacker News

Nice. I did the same thing on ClickHouse - took 11 seconds:

    SELECT sum(size)
    FROM url('https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/0000{01..55}.parquet')
    
    Query id: 9d145763-0754-4aa2-bb7d-f6917690f704
    
    ┌───────sum(size)─┐
    │ 159344011148016 │
    └─────────────────┘
    
    1 row in set. Elapsed: 11.615 sec. Processed 54.08 million rows, 8.50 GB (4.66 million rows/s., 731.83 MB/s.)
    Peak memory usage: 458.88 KiB.


I love and use both clickhouse and duckdb all the time, but it's become almost comical how senior clickhouse leadership comment on any duckdb-related HN thread like clockwork. And only very rarely bring up their affiliation :)


You mean this username composed of a random assortment of Latin characters isn't an unaffiliated, independent commentator?!


...says the completely anonymous internet guy. I'm laughing as much as you are.

I wasn't trying to imply anything negative about DuckDB with my post - was just sharing how ClickHouse does the same thing. FWIW: the blog author added my query to his blog, so my non-combative comment was politely received.


but what the person said is true: it seems like clickhouse comments descend upon every recent duckdb post, as if it's some sort of competition or born out of an inferiority complex.

clickhouse is really cool tech. duckdb is really cool tech. i grow weary of the CH infiltration, though, to the point where it's working against the goal of making me love CH


Is it so annoying? I am happy that we have at least two projects that can query big datasets with very reasonable performance.



yes it is quite annoying. i would love to see a dedicated CH post extolling its virtues, not a piggyback “us too” comment on a duckdb post.


Hey man - not trying to get on the bad side of one of my favourite databases! Making an observation, one that others have made as well.


Do you know how much data it had to download to run that query? Did it pull all 8GB?


ClickHouse just reads the values from the one column being summed: https://www.markhneedham.com/blog/2023/11/15/clickhouse-summ...


The same query takes only 1.4 seconds on my server, so I assume that the query does not read all 8 GB.


I ran a network monitor while running that query; it pulled down ~290 MB.


Almost exactly the same as DuckDB then - I measured ~287MB.
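As a rough sanity check on those numbers (assuming the `size` column is stored as a 64-bit integer, which I haven't verified against the dataset's schema):

```sql
-- 54.08 million rows × 8 bytes per UInt64 value ≈ 433 MB uncompressed,
-- so ~290 MB on the wire is consistent with fetching just the (compressed)
-- `size` column rather than the full 8.5 GB of Parquet files.
SELECT round(54.08e6 * 8 / 1e6) AS raw_column_mb  -- ≈ 433
```

Parquet's columnar layout is what makes that projection possible over HTTP range requests.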


I ran the same queries and got similar results, but the bandwidth utilization I measured was significantly different. On the same fly.io instance with 1 vCPU/256 MB, both queries completed successfully, but ClickHouse/chdb peaked at 10 MB/s and accordingly completed the count faster, while DuckDB only peaked at around 2.5 MB/s.

This might be due to the tiny resources, but I like rock-bottom measurements. Did anyone else notice a similar bandwidth utilization gap?


Probably compression


If that's raw network transfer, it's probably just a difference in headers or MTU size. A larger MTU means fewer packets, so fewer headers. Or maybe a difference in network configuration that puts more or less data in each header.
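A quick back-of-envelope on how much header overhead each MTU implies, for the ~290 MB transfer measured above (assuming a plain 40-byte IPv4+TCP header per packet, no options):

```sql
-- packets = ceil(payload_bytes / (MTU - 40)); overhead = packets × 40 bytes
SELECT
    mtu,
    ceil((290 * 1048576) / (mtu - 40)) AS packets,
    round(packets * 40 / (290 * 1048576) * 100, 2) AS header_overhead_pct
FROM values('mtu UInt32', 1500, 9000)
```

That works out to roughly 2.7% overhead at MTU 1500 versus about 0.4% at MTU 9000, so MTU differences move the byte count by a few percent at most.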


CGW = Can't Go Wrong...love it


ClickHouse 10 years before DuckDB existed:

SELECT * FROM url('https://example.com/*.csv')


not as simple as from 'a.csv', and 10 years without recursive CTEs?


I suppose if you had data in a format that DuckDB doesn't work with, like Protobuf, Avro, ORC, Arrow, etc. ClickHouse reads and writes data in over 70 formats.


> ingestion of these events during peaks at 500k events per minute. You can't ingest them individually into Clickhouse or most other databases.

Turn on async_insert or use a Buffer table engine and you can easily insert them individually into ClickHouse.
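A minimal sketch of both options (the `events` table and its columns are hypothetical; `async_insert`, `wait_for_async_insert`, and the Buffer engine are real ClickHouse features):

```sql
-- Option 1: server-side batching — ClickHouse buffers many small
-- INSERTs and flushes them to storage as a single part.
INSERT INTO events
SETTINGS async_insert = 1, wait_for_async_insert = 0
VALUES (now(), 'page_view', 42);

-- Option 2: route writes through a Buffer table that flushes to
-- `events` once time/row/byte thresholds are hit.
CREATE TABLE events_buffer AS events
ENGINE = Buffer(currentDatabase(), events,
                16,                  -- num_layers
                10, 100,             -- min/max seconds
                10000, 1000000,      -- min/max rows
                10000000, 100000000  -- min/max bytes
);
```

With `wait_for_async_insert = 0` the INSERT acknowledges before the flush, so there is a small durability window to be aware of.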


That's interesting! I don't have much experience with ClickHouse, especially not in the last two years. I'll have to try this out myself. That's pretty incredible if it can handle batching internally.


Cloudflare famously uses ClickHouse for web analytics - inserting over 6M rows per second: https://blog.cloudflare.com/http-analytics-for-6m-requests-p...


A lot has changed (grown) in the 5+ years since that was published.


It's definitely cool to be able to query data in place instead of inserting it into a table. You can use clickhouse-local to do the same thing with JSON files (and with dozens of other data formats): https://clickhouse.com/blog/worlds-fastest-json-querying-too...
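For instance, something like this (`events.json` is a hypothetical newline-delimited JSON file; `file()` and the `JSONEachRow` format are standard clickhouse-local features):

```sql
-- Run with: clickhouse-local --query "<query below>"
SELECT event, count() AS c
FROM file('events.json', JSONEachRow)
GROUP BY event
ORDER BY c DESC
```

No server or table setup needed; clickhouse-local infers the schema and queries the file in place.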


You should try it on ClickHouse Cloud - the pricing is not based on bytes, so the resulting cost would be near zero compared to $1M on BigQuery.

And it would probably execute faster... :)

(Disclaimer: I work at ClickHouse)

