It's a throughput problem, not an activity problem. Normally, since the rows are a fixed size, the DBMS will lay them out sequentially on disk. So when the database reads from disk, unless you read every column in the row, the DBMS has to skip over data it doesn't need. This is a high-level picture, but hopefully it illustrates the point.
Nearly all of the tick (>4GB/day) databases I've used aren't laid out row-oriented.
Even in the absence of variable-width fields, the presence of nullable fields gives the majority of database tables variable-width rows. In any case, neither of these is the reason common databases do or don't lay rows out sequentially on disk (some do, some don't).
Even if the DB server selectively read columns of each row (none of the common open source SQL databases do), they do so via the OS, which works in terms of pages. Reading a single byte of a page will cause a minimum of 4kb of IO to be made to the disk.
Now, unless you're using a DB server that uses O_DIRECT or POSIX_FADV_RANDOM (I just checked and Postgres doesn't), Linux will aggressively readahead at least (it's tunable) 128kb for any random read by default, so even issuing a one byte read to the kernel, device IO will still only occur in a minimum of 128kb chunks, with the remainder living in the page cache until userspace requests it.
Database servers additionally are very likely to have their own larger-than-a-byte-sized buffers in order to avoid system call latency, so the requests they make are never going to be quite so small.
The logic being that in the days of spinning media, evicting 124kb of cold page cache to avoid a seek a few microseconds later was definitely worth it (a seek being a ~14ms stall on a rotating disk).
> Even if the DB server selectively read columns of each row (none of the common open source SQL databases do), they do so via the OS, which works in terms of pages. Reading a single byte of a page will cause a minimum of 4kb of IO to be made to the disk.
This is why I said it was high level, but hopefully illustrated the point. In addition to the disk page size, you also have all the various metadata associated with the file(s). So reading a byte from a page can imply reading even more data than the block size (currently 4KiB).
> Now, unless you're using a DB server that uses O_DIRECT or POSIX_FADV_RANDOM (I just checked and Postgres doesn't), Linux will aggressively readahead at least (it's tunable) 128kb for any random read by default, so even issuing a one byte read to the kernel, device IO will still only occur in a minimum of 128kb chunks, with the remainder living in the page cache until userspace requests it.
AFAIK, Linux only reads ahead if it detects a sequential pattern, or if you specify POSIX_FADV_SEQUENTIAL (which doubles the normal readahead window). But as far as the query is concerned, all of the data read that isn't necessary is effectively subtracted from the overall throughput.
I was trying to illustrate the importance of seek latency (~80us vs. ~9-14ms), but yes, there are myriad other concerns when you're trying to maximize disk throughput.
It doesn't have to skip over data if that would slow it down. I would expect your typical database to have some kind of index or bitmap that can tell it what to grab fast enough to saturate the disk while avoiding unused data, but if it has to fall back to vacuuming up 1GB at a time then so be it.