> Am confused. Normally you'd modify buffers after they've been acked at the application layer. Do samba clients not send acknowledgements?
If the file is being locked against changes while the network transmission is in progress (e.g. the "stable pages" idea mentioned in the article), then in normal circumstances (unless the networking subsystem is very overloaded) the network transmission will complete quickly, so the file will only be briefly locked against changes. By contrast, waiting for the client to acknowledge can take a lot longer, and will result in the file being locked for a lot longer, which is much more likely to have a negative impact on applications, other SMB clients, etc. Consider a database file (bad practice, I know, but some people will do it anyway) being read/written over SMB: blocking all writes until any outstanding reads are acknowledged could have a big negative performance impact.
Hence, they are okay with locking things in place until the network transmission is complete, but not until an acknowledgement happens. But, the splice() API doesn't provide any way to notify that the destination has finished processing all the writes the splice call sent to it.
The splice system call could simply not return until the acknowledgment is in. That doesn't solve the problem completely, though, because if a packet is dropped it has to be retransmitted, but the original data might have been modified in the meantime, which probably nobody expects.
And thus become useless, since people using splice() for performance likely use some form of non-blocking IO.
edit: it also works poorly for blocking IO: one doesn’t splice from a file to a socket — one splices from a file to a pipe and then to a socket. Which splice() call is supposed to block?
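For concreteness, the two-splice pattern being described looks roughly like this. A sketch only; the function name and error handling are mine, not from the thread:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Sketch: send `count` bytes from `file_fd` to `sock_fd` through a pipe,
   since every splice() call requires one end to be a pipe. */
ssize_t splice_file_to_socket(int file_fd, int sock_fd, size_t count) {
    int p[2];
    ssize_t total = 0;
    if (pipe(p) < 0)
        return -1;
    while ((size_t)total < count) {
        /* file -> pipe: the kernel takes references to page-cache pages,
           it does not copy the file data */
        ssize_t in = splice(file_fd, NULL, p[1], NULL,
                            count - (size_t)total, SPLICE_F_MOVE);
        if (in <= 0)
            break;
        /* pipe -> socket: hand those same pages to the network stack */
        ssize_t done = 0;
        while (done < in) {
            ssize_t n = splice(p[0], NULL, sock_fd, NULL,
                               (size_t)(in - done),
                               SPLICE_F_MOVE | SPLICE_F_MORE);
            if (n <= 0) { total = -1; goto out; }
            done += n;
        }
        total += done;
    }
out:
    close(p[0]);
    close(p[1]);
    return total;
}
```

Neither splice() call here has any way to learn when the remote peer has acknowledged the data, which is the gap the whole thread is about.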
The call is only supposed to return when all the data has been sent, so both. This whole discussion is just about retransmits.
This does not make the call useless: the performance gain is supposed to come from avoiding the copy of data from the kernel to userspace, and that copy is still avoided.
> The call is only supposed to return when all the data has been sent, so both.
Good luck. Splice from a file to a socket doesn't send any data — it just references it. So if you mmap a file writably and then splice from it to a pipe, either splice needs to copy the data, or write-protect the mapping to enable copy-on-write, or take a reference that will potentially propagate future changes from the mapping to the pipe. Write-protecting the mapping is slow, so pretty much all the options don't work. Blocking until the reference is gone makes no sense either, because the reference won't go away until the caller splices from the pipe to the socket, which can never happen while the first splice is blocked. Deadlock.
Depending on how you interpret the man page, it’s either vague or wrong.
But that’s not what I’m trying to convince you of. I’m trying to convince you that the semantics you seem to want cannot be implemented efficiently. The kernel isn’t a magic box that can do anything you want. It’s a computer program that happens to run at a high privilege level and gets to use some fun hardware features. There is no fun hardware feature that magically and efficiently snapshots memory without copying it. There is a hardware feature that write-protects a memory mapping and notifies the kernel when someone tries to write to it, but file data in the page cache may have an arbitrarily large number of mappings, write-protecting a mapping is slow (especially on x86, which is, in many respects, showing its age as an architecture), and the kind of fancy software bookkeeping needed to maintain the fiction of copy-on-write is neither straightforward nor particularly efficient.
If you rent x86 servers and instruct them to fiddle with the MMU for every few kB of data sent, AWS will happily take all your money while sending a remarkably small data rate per core. You will not get Netflix-style line rate performance like this.
Sendfile() works fine, is pretty fast, and avoids this whole problem, so you’re not going to convince me that it’s impossible, or that having the kernel read a file into a buffer and then send it somewhere is fast while marking pages that are already in memory as copy-on-write is slow.
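For reference, typical sendfile() usage is tiny compared to the splice pair. A sketch; the wrapper name is mine:

```c
#include <sys/sendfile.h>
#include <sys/types.h>

/* Sketch: push `count` bytes of `file_fd` to `out_fd` with sendfile().
   The kernel moves the data without a round trip through userspace;
   `off` is advanced by the kernel as data is consumed. */
ssize_t send_whole_file(int out_fd, int file_fd, size_t count) {
    off_t off = 0;
    while ((size_t)off < count) {
        ssize_t n = sendfile(out_fd, file_fd, &off, count - (size_t)off);
        if (n <= 0)
            return -1;  /* error or unexpected EOF */
    }
    return (ssize_t)off;
}
```

On Linux, out_fd is usually a socket, though since 2.6.33 it can be any fd that supports write.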
It’s just that Linus doesn’t want to change splice() to do what the Samba people want, which is fine but doesn’t mean it’d be impossible.