richraposa's comments | Hacker News

Nice. I did the same thing on ClickHouse - took 11 seconds:

    SELECT sum(size)
    FROM url('https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/0000{01..55}.parquet')
    
    Query id: 9d145763-0754-4aa2-bb7d-f6917690f704
    
    ┌───────sum(size)─┐
    │ 159344011148016 │
    └─────────────────┘
    
    1 row in set. Elapsed: 11.615 sec. Processed 54.08 million rows, 8.50 GB (4.66 million rows/s., 731.83 MB/s.)
    Peak memory usage: 458.88 KiB.


I love and use both clickhouse and duckdb all the time, but it's become almost comical how senior clickhouse leadership comment on any duckdb-related HN thread like clockwork. And only very rarely bring up their affiliation :)


You mean this username composed of a random assortment of Latin characters isn't an unaffiliated, independent commentator?!


...says the completely anonymous internet guy. I'm laughing as much as you are.

I wasn't trying to imply anything negative about DuckDB with my post - was just sharing how ClickHouse does the same thing. FWIW: the blog author added my query to his blog, so my non-combative comment was politely received.


but what the person said is true: it seems like clickhouse comments descend upon every recent duckdb post, as if it's some sort of competition or born out of an inferiority complex.

clickhouse is really cool tech. duckdb is really cool tech. i grow weary of the CH infiltration, though, to the point where it's working against the goal of making me love CH


Is it so annoying? I am happy that we have at least two projects that can query big datasets with very reasonable performance.



yes it is quite annoying. i would love to see a dedicated CH post extolling its virtues, not a piggyback “us too” comment on a duckdb post.


Hey man - not trying to get on the bad side of one of my favourite databases! Making an observation, one that others have made as well.


Do you know how much data it had to download to run that query? Did it pull all 8GB?


ClickHouse just reads the values from the one column being summed: https://www.markhneedham.com/blog/2023/11/15/clickhouse-summ...


The same query takes only 1.4 seconds on my server, so I assume that the query does not read all 8 GB.


I ran a network monitor while running that query; it pulled down ~290 MB.


Almost exactly the same as DuckDB then - I measured ~287MB.
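As a rough sanity check on those numbers (assuming the `size` column is stored as a 64-bit integer, which I haven't verified against the dataset's schema):

```sql
-- 54.08 million rows × 8 bytes per UInt64 value ≈ 433 MB uncompressed,
-- so ~290 MB on the wire is consistent with fetching just the (compressed)
-- `size` column rather than the full 8.5 GB of Parquet files.
SELECT round(54.08e6 * 8 / 1e6) AS raw_column_mb  -- ≈ 433
```

Parquet's columnar layout is what makes that projection possible over HTTP range requests.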


I ran the same queries and got similar results, but the bandwidth utilization I measured was significantly different. On the same fly.io instance with 1 vCPU/256 MB, both queries completed successfully, but ClickHouse/chdb peaked at 10 MB/s and accordingly completed the count faster, while DuckDB only peaked at around 2.5 MB/s.

This might be due to the tiny resources, but I like rock-bottom measurements. Did anyone else notice a similar bandwidth utilization gap?


Probably compression


If that's raw network transfer, it's probably just a difference in headers or MTU size. A larger MTU means fewer packets, so fewer headers. Or maybe a difference in network configuration that puts more or less data in each header.
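A quick back-of-envelope on how much header overhead each MTU implies, for the ~290 MB transfer measured above (assuming a plain 40-byte IPv4+TCP header per packet, no options):

```sql
-- packets = ceil(payload_bytes / (MTU - 40)); overhead = packets × 40 bytes
SELECT
    mtu,
    ceil((290 * 1048576) / (mtu - 40)) AS packets,
    round(packets * 40 / (290 * 1048576) * 100, 2) AS header_overhead_pct
FROM values('mtu UInt32', 1500, 9000)
```

That works out to roughly 2.7% overhead at MTU 1500 versus about 0.4% at MTU 9000, so MTU differences move the byte count by a few percent at most.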


CGW = Can't Go Wrong...love it


ClickHouse 10 years before DuckDB existed:

SELECT * FROM url('https://example.com/*.csv')


not as simple as from 'a.csv', and 10 years without recursive CTEs?


I suppose if you had data in a format that DuckDB doesn't work with, like Protobuf, Avro, ORC, Arrow, etc. ClickHouse reads and writes data in over 70 formats.


> ingestion of these events during peaks at 500k events per minute. You can't ingest them individually into Clickhouse or most other databases.

Turn on async_insert or use a Buffer table engine and you can easily insert them individually into ClickHouse.
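A minimal sketch of both options (the `events` table and its columns are hypothetical; `async_insert`, `wait_for_async_insert`, and the Buffer engine are real ClickHouse features):

```sql
-- Option 1: server-side batching — ClickHouse buffers many small
-- INSERTs and flushes them to storage as a single part.
INSERT INTO events
SETTINGS async_insert = 1, wait_for_async_insert = 0
VALUES (now(), 'page_view', 42);

-- Option 2: route writes through a Buffer table that flushes to
-- `events` once time/row/byte thresholds are hit.
CREATE TABLE events_buffer AS events
ENGINE = Buffer(currentDatabase(), events,
                16,                  -- num_layers
                10, 100,             -- min/max seconds
                10000, 1000000,      -- min/max rows
                10000000, 100000000  -- min/max bytes
);
```

With `wait_for_async_insert = 0` the INSERT acknowledges before the flush, so there is a small durability window to be aware of.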


That's interesting! I don't have much experience with ClickHouse, especially not in the last two years. I'll have to try this out myself. That's pretty incredible if it can handle batching internally.


Cloudflare famously uses ClickHouse for web analytics - inserting over 6M rows per second: https://blog.cloudflare.com/http-analytics-for-6m-requests-p...


A lot has changed (grown) in the 5+ years since that was published.


It's definitely cool to be able to query data in place instead of inserting it into a table. You can use clickhouse-local to do the same thing with JSON files (and with dozens of other data formats): https://clickhouse.com/blog/worlds-fastest-json-querying-too...
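For instance, something like this (`events.json` is a hypothetical newline-delimited JSON file; `file()` and the `JSONEachRow` format are standard clickhouse-local features):

```sql
-- Run with: clickhouse-local --query "<query below>"
SELECT event, count() AS c
FROM file('events.json', JSONEachRow)
GROUP BY event
ORDER BY c DESC
```

No server or table setup needed; clickhouse-local infers the schema and queries the file in place.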


You should try it on ClickHouse Cloud - the pricing is not based on bytes, so the resulting cost would be near zero compared to $1M on BigQuery.

And it would probably execute faster... :)

(Disclaimer: I work at ClickHouse)

