> Am confused. Normally you'd modify buffers after they've been acked at the application layer. Do samba clients not send acknowledgements?
If the file is being locked against changes while the network transmission is in progress (e.g. the "stable pages" idea mentioned in the article), then in normal circumstances (unless the networking subsystem is very overloaded) the network transmission will complete quickly, so the file will only be briefly locked against changes. By contrast, waiting for the client to acknowledge can take a lot longer, and will result in the file being locked for a lot longer, which is much more likely to have a negative impact on applications, other SMB clients, etc. Consider a database file (bad practice, I know, but some people will do it anyway) being read/written over SMB: blocking all writes until any outstanding reads are acknowledged could have a big negative performance impact.
Hence, they are okay with locking things in place until the network transmission is complete, but not until an acknowledgement happens. But, the splice() API doesn't provide any way to notify that the destination has finished processing all the writes the splice call sent to it.
The splice system call could simply not return until the acknowledgment is in. That doesn't solve the problem completely, though, because if a packet is dropped it has to be retransmitted, but the original data might have been modified in the meantime, which probably nobody expects.
And thus become useless, since people using splice() for performance likely use some form of non-blocking IO.
edit: it also works poorly for blocking IO: one doesn’t splice from a file to a socket — one splices from a file to a pipe and then to a socket. Which splice() call is supposed to block?
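For concreteness, the two-splice pattern being described looks roughly like this. A sketch only; the function name and error handling are mine, not from the thread:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Sketch: send `count` bytes from `file_fd` to `sock_fd` through a pipe,
   since every splice() call requires one end to be a pipe. */
ssize_t splice_file_to_socket(int file_fd, int sock_fd, size_t count) {
    int p[2];
    ssize_t total = 0;
    if (pipe(p) < 0)
        return -1;
    while ((size_t)total < count) {
        /* file -> pipe: the kernel takes references to page-cache pages,
           it does not copy the file data */
        ssize_t in = splice(file_fd, NULL, p[1], NULL,
                            count - (size_t)total, SPLICE_F_MOVE);
        if (in <= 0)
            break;
        /* pipe -> socket: hand those same pages to the network stack */
        ssize_t done = 0;
        while (done < in) {
            ssize_t n = splice(p[0], NULL, sock_fd, NULL,
                               (size_t)(in - done),
                               SPLICE_F_MOVE | SPLICE_F_MORE);
            if (n <= 0) { total = -1; goto out; }
            done += n;
        }
        total += done;
    }
out:
    close(p[0]);
    close(p[1]);
    return total;
}
```

Neither splice() call here has any way to learn when the remote peer has acknowledged the data, which is the gap the whole thread is about.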
The call is only supposed to return when all the data has been sent, so both. This whole discussion is just about retransmits.
This does not make the call useless: the performance gain is supposed to come from avoiding the copy of data from the kernel to userspace, and that copy is still avoided.
> The call is only supposed to return when all the data has been sent, so both.
Good luck. Splice from a file to a socket doesn't send any data — it just references it. So if you mmap a file writably and then splice from it to a pipe, either splice needs to copy the data, or write-protect the mapping to enable copy-on-write, or take a reference that will potentially propagate future changes from the mapping to the pipe. Write-protecting the mapping is slow, so pretty much all the options don't work. Blocking until the reference is gone makes no sense either, because the reference won't go away until the caller splices from the pipe to the socket, which can never happen while the first splice is blocked. Deadlock.
Depending on how you interpret the man page, it’s either vague or wrong.
But that’s not what I’m trying to convince you of. I’m trying to convince you that the semantics you seem to want cannot be implemented efficiently. The kernel isn’t a magic box that can do anything you want. It’s a computer program that happens to run at a high privilege level and gets to use some fun hardware features. There is no fun hardware feature that magically and efficiently snapshots memory without copying it. There is a hardware feature that write-protects a memory mapping and notifies the kernel when someone tries to write to it, but file data in the page cache may have an arbitrarily large number of mappings, write-protecting a mapping is slow (especially on x86, which is, in many respects, showing its age as an architecture), and the kind of fancy software bookkeeping needed to maintain the fiction of copy-on-write is neither straightforward nor particularly efficient.
If you rent x86 servers and instruct them to fiddle with the MMU for every few kB of data sent, AWS will happily take all your money while sending a remarkably small data rate per core. You will not get Netflix-style line rate performance like this.
Sendfile() works fine, is pretty fast, and avoids this whole problem, so you’re not going to convince me that it’s impossible, or that having the kernel read a file into a buffer and then send it somewhere is fast while marking pages that are already in memory as copy-on-write is slow.
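For reference, typical sendfile() usage is tiny compared to the splice pair. A sketch; the wrapper name is mine:

```c
#include <sys/sendfile.h>
#include <sys/types.h>

/* Sketch: push `count` bytes of `file_fd` to `out_fd` with sendfile().
   The kernel moves the data without a round trip through userspace;
   `off` is advanced by the kernel as data is consumed. */
ssize_t send_whole_file(int out_fd, int file_fd, size_t count) {
    off_t off = 0;
    while ((size_t)off < count) {
        ssize_t n = sendfile(out_fd, file_fd, &off, count - (size_t)off);
        if (n <= 0)
            return -1;  /* error or unexpected EOF */
    }
    return (ssize_t)off;
}
```

On Linux, out_fd is usually a socket, though since 2.6.33 it can be any fd that supports write.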
It’s just that Linus doesn’t want to change splice() to do what the Samba people want, which is fine but doesn’t mean it’d be impossible.