
This is incorrect -- it is not possible to implement Protobuf in a way that achieves the notion of "zero-copy" that Cap'n Proto and FlatBuffers achieve.

This probably comes down to a disagreement on what "zero-copy" means.

Some people use the term "zero-copy" to mean only that when the message contains a string or byte array, the parsed representation of those specific fields will point back into the original message buffer, rather than having to allocate a copy of the bytes at parse time.

Cap'n Proto and FlatBuffers implement a much stronger form of zero-copy. With them, it's not just strings and byte buffers that are zero-copy, it's the entire data structure. With these systems, once you have the bytes of a message mapped into memory, you do not need to do any "parse" step at all before you start using the message.

For example, if you have a multi-gigabyte file formatted with one of these, you can mmap() it, and then you can traverse the message tree to read any bytes of the message except the chain of pointers (parent to child) leading to that one datum. Aside from the mmap() call, you can do all this without even allocating any memory at all.
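As a toy illustration of that point (not Cap'n Proto's actual API; file name, offset, and field are made up): map a file and read one value at a known offset, without loading or parsing anything else. Only the pages actually touched get faulted in.

```python
import mmap
import os
import struct
import tempfile

# Build a pretend "large file" with one interesting field buried in it.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'\x00' * (1 << 20))           # 1 MiB of other data
    f.seek(12345)
    f.write(struct.pack('<q', 987654321))  # the one field we care about

# Map the file and read that field directly: no parse step, no allocation
# beyond the mapping itself.
with open(path, 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        value = struct.unpack_from('<q', mm, 12345)[0]
os.remove(path)
print(value)  # 987654321
```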

That is absolutely not possible with Protobuf, because Protobuf encoding is a list of tag-value pairs each of which has variable width. In order to read any particular value, you must, at the very least, linearly scan through the tag-values until you find the one you want. But in practice, you usually want to read more than one value, at which point the only way to avoid O(n^2) time complexity while keeping things sane is to parse the entire message tree into a different set of in-memory data structures allocated on the heap.
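To make the linear-scan point concrete, here's a toy decoder for a protobuf-style wire format (illustrative only, not a real protobuf library): every tag/value pair is variable-width, so each lookup has to walk the buffer from the start.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, new_pos)."""
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result, pos
        shift += 7

def find_field(buf, want_field):
    """Linear scan for a varint (wire type 0) field; O(n) per lookup."""
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field, wire_type = key >> 3, key & 7
        if wire_type == 0:                 # varint value
            value, pos = read_varint(buf, pos)
            if field == want_field:
                return value
        elif wire_type == 2:               # length-delimited: skip payload
            length, pos = read_varint(buf, pos)
            pos += length
        else:
            raise NotImplementedError(wire_type)
    return None

# Hand-encoded message: field 1 = 150, field 2 = 300 (both varints).
msg = bytes([0x08, 0x96, 0x01, 0x10, 0xAC, 0x02])
print(find_field(msg, 2))  # must scan past field 1 first -> 300
```

Reading both fields this way means two scans; avoiding that is exactly why real implementations parse the whole message into heap objects up front.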

That is not "zero-copy" by Cap'n Proto's definition.

(Disclosure: I am the author of Cap'n Proto and Protobuf v2.)



Nothing in your comment was false, but there's no intrinsic value to your stronger definition of zero-copy. A direct mapping to memory is not always optimal for performance. Indeed, there are high-performance computation packages that compress data structures in L1-cache-sized blocks to save main memory bandwidth. So you've used the word "achieve" to decorate an outcome that might not be optimal.

By the way I worked on protobuf performance at Google for years and we could never get flatbuffers to go any faster.


> no intrinsic value to your stronger definition of zero-copy.

Whoa, that's a very strong statement. But then the rest of the paragraph gets a lot weaker.

> there are high-performance computation packages that compress data structures in L1-cache-sized blocks

This seems like a non sequitur. Of course hand-tuned data structures can achieve higher performance than any serialization framework, but what does that have to do with zero-copy vs. protobuf? Are you suggesting that protobuf encoding would be a good choice for these people?

> So, you've used the word "achieve" to decorate an outcome that might not be optimal.

I'm not sure why "achieve" would imply "optimal". Of course whether this is an advantage depends on the use case.

There are many cases where zero-copy doesn't provide any real advantages. If you're just sending messages over a standard network socket, then yeah, zero-copy probably isn't going to make things faster. There are already several copies inherent in network communication.

But if you have a huge protobuf file on disk and you want to read one field in the middle of it, that's just not something you can do in any sort of efficient way. With zero-copy, you can do this trivially with mmap().

Or if you're communicating over shared memory between processes on the same machine, then the entire serialize/parse round trip required with protobuf is 100% waste. Zero-copy would let you build and consume the structure from the same memory pages.

These seem "intrinsically valuable"?

> we could never get flatbuffers to go any faster.

What use case were you testing? Did you test any zero-copy serializations other than flatbuffers?

I've heard from lots of people that say Cap'n Proto beat Protobuf in their tests... but it definitely depends on the use case.


> Or if you're communicating over shared memory between processes on the same machine, then the entire serialize/parse round trip required with protobuf is 100% waste.

This is the part I don’t agree with. What I’m saying is that there is value in using encoded structures not only between servers, and not only over shared memory, but even within a single process. Yes, you discard the ability to jump to any random field, but that is not always important. Often it can be better to spend some compute cycles and L1 accesses to save main memory accesses. If you have to read all of the data anyway, then packing it makes a ton of sense. Consider any kind of delta-encoded column of values ... you can’t seek within it, but if the deltas are smaller than the absolutes, this can save massive amounts of main memory bandwidth. This is why I argue that representing something as a C struct in main memory is not obviously advantageous outside certain workloads.
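A toy version of the delta-encoding trade-off described above (the column values are made up): the packed form can't be seeked into, but when neighbors are close, the deltas are much smaller than the absolutes.

```python
def delta_encode(values):
    """Store each value as the difference from its predecessor."""
    prev = 0
    out = []
    for v in values:
        out.append(v - prev)
        prev = v
    return out

def delta_decode(deltas):
    """Rebuild the column with a running sum; inherently sequential."""
    total = 0
    out = []
    for d in deltas:
        total += d
        out.append(total)
    return out

column = [1000, 1003, 1004, 1010, 1011]
deltas = delta_encode(column)
print(deltas)  # [1000, 3, 1, 6, 1] -- small numbers pack much tighter
# Reading column[3] requires summing deltas[0..3]; there is no seeking.
assert delta_decode(deltas) == column
```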

As for flatbuffers at Google, I’m sure you’re aware that the only way to get the kind of mindshare you’d need to ship it would be to make websearch measurably faster.


OK, so we're talking about a use case where you're compressing data in main memory and trying to decompress it only within L1 cache. I guess there must be a lot of data sitting around in RAM that isn't accessed very often. Search index leaf nodes I suppose?

It doesn't seem to me like Protobuf is ideal for this use case, but sure, I see how the light compression afforded by Protobuf encoding could lead to a win vs. bare C structs.

I think a better answer here, though, would be to use an actual compression algorithm that has been tuned for this purpose.

Of course, then the uncompressed data needs to be position-independent, so no (native) pointers. You could use something hand-rolled here... but also, this is exactly the problem zero-copy serializations solve, so they might be a good fit here. Hmm!

I'd be pretty interested to compare layering a zero-copy serialization on top of compression vs. protobuf encoding in these high-performance computing scenarios. Is that something you tried doing?


If this is your thing, then this is your book:

https://www.cambridge.org/core/books/compact-data-structures...


> Yes, you discard the ability to just jump to any random field, but that is not always important.

I don't think this is the criticism being raised.

The main criticism being raised against non-zero-copy serialization is that this often requires maintaining different memory representations for the same value - the copies are the consequence of transforming from one representation to another one.


We do that all the time in high-performance computing. You keep a packed representation in memory and unpack it in small pieces to operate on it. Sparse matrices, compressed columns, etc. This is not evil, it’s an adaptation to the way the machine works. Saying that Kenton’s definition of zero-copy is unconditionally better is an aesthetic argument and I don’t buy it.


You are still not getting it.

They are not talking about "unpacking on the fly for processing", but rather about "unpacking in memory to be able to call an opaque API outside your control that expects the unpacked representation". That requires copying in memory to interface with that API.

Your approach only works if you are willing to "re-implement the world" to interface with whatever packed format suits your application.

With zero-copy serialization you don't have to do that.


We can both advocate for different perspectives on this issue without either of us “not getting it”.


> Saying that Kenton’s definition of zero-copy is unconditionally better

It feels like you're conflating things here.

It's not always a better algorithm, but it's a better definition.


There is a big difference to me. I haven't used any of those systems but I have written plenty of console games where the artists and designers want to fill memory. That means I don't have memory for 2 representations, unparsed and parsed. I also want fast loading so loading say 4k at a time into some temp buffer and parsing into memory is also out. I load the file directly into memory, fix up the pointers, and use it in place.


Not faster on what platform? In Borg, in Google3 code, deployed on a fast machine with a nice fast wide memory bus and a large cache?

What about in embedded code, or in a game? A place where memory bandwidth is scarce, or where we're trying desperately to reduce the number of syscalls and jumps back and forth between kernel and user space?

Having the entire payload memory mapped, and copies avoided, makes an absolutely huge difference once these kinds of concerns are real. Having something mmap'd in theory means it's nice and hot and fresh in the kernel's mind, and potentially kept in cache.

Some people in my team had gRPC foisted on them to run on an ARM Cortex class device running a real time OS. It boggled my mind they were even able to get it to ship. Using something like flatbuffers would have made a lot more sense.

At Google protobuf is the veritable "once I have a hammer everything looks like a nail", to the point where I've seen protobufs in JSON payloads, or vice versa, or even deeper... protobuf in JSON inside a Chromium Mojo payload... because, well, how good could it be without protobuf?


> Some people in my team had gRPC foisted on them to run on an ARM Cortex class device running a real time OS. It boggled my mind they were even able to get it to ship.

I shipped an embedded project using an RTOS and < 256kB RAM using protocol buffers (even nested ones) and zero-copy deserialization for byte arrays some time ago. We used nanopb, and it worked just fine if you understand how it works - although it certainly is less pleasant to work with than a fuller implementation which would just have copied out the internal bytes into new arrays and not have let us deal with lots of internal pointers into byte arrays.

Overall using Protocol Buffers was a great success in that project, since we could share schemas between IoT devices and the backend, and were able to generate a lot of code which would otherwise have been hand-written.

> Using something like flatbuffers would have made a lot more sense.

It might be able to solve the same problem. But it also needs to answer the questions: Is a suitable library available for all consumers of the data-structure? I don't think this was the case for our use-case, so it wouldn't have made more sense to use it back then.


> Not-Faster on what platform? In Borg, in Google3 code, deployed on a fast machine with a nice fast wide memory bus and a large cache?

Weird rebuttal, considering that my argument is that the copying nature of protobuf can save memory bandwidth.


> my argument is that the copying nature of protobuf can save memory bandwidth.

Huh? An extra pass over the data to parse it obviously uses more memory bandwidth.

I might understand your argument if parsing converted the data into a memory-bandwidth-optimized format, but the protobuf parsed form is certainly not that, unless things have changed very drastically since I worked on it.


Right but the parsed form is exactly what I meant when I said it’s an aspect of the implementation. The generated C++ code you get from Google’s protoc is a sparse thing, to be sure. But that is an artifact of the implementation. You can do anything you want with a protobuf and there are numerous independent implementations in the wild, in many languages.


I believe the point is that the actual wire format of protobuf is not amenable to memory mapping and direct manipulation or access. So to use it this way a copy would have to be made. The appeal of cap'n'proto and flatbuffers is the ability to map and work with the serialized format with minimal overhead.

At Google scale protobuf works perfectly fine. It's our lingua franca and a lot of work (apparently by yourself included) has gone into making it performant. But it comes with normative lifestyle assumptions.

Sure you could change the wire format and implementation to be mappable; but then it wouldn't be compatible with the mainstream implementation.


Couldn't find your ldap internally.. would be happy to chat on what you tried to make flatbuffers go faster. Mine you can find on the go link page for flatbuffers.

FlatBuffers can be accessed instantly without deserialization or allocation, so clearly in some cases a huge speedup is possible. If in your case there was no speedup, there must be other bottlenecks.


I don’t work there any more. If you want to talk to someone who knows a lot more about it than me, try haberman (who also hangs out here).


> there's no intrinsic value to your stronger definition of zero-copy

> A direct mapping to memory is not always optimal for performance.

Your comment seems to imply that the first statement follows from the second, but it does not. You're right, a direct memory mapping might not be optimal in some situations, but then again in some others it might be optimal. So this feature isn't always useful to everyone, but it doesn't follow that it's never useful to anyone.

Even if you worked on this on a range of projects at Google, isn't it possible that there are people working on systems that have rather different performance characteristics than Google's systems?


> Aside from the mmap() call, you can do all this without even allocating any memory at all.

I think this is the important point when it comes to discussing zero-copy. I've written a custom protobuf implementation for java which can do exactly that.

It's a bit tricky since protobuf supports recursive messages and java's Unsafe is not as powerful as what you can have in C++. My trade-off was to require the caller to pre-allocate messages needed before parsing the data. This works great when working with multi-gigabyte files where you want to process a large number of (possibly nested) messages, but is not as ergonomic as normal protobuf code.

It obviously doesn't come for free, as you need to do a linear scan to find those tag-values, but there are ways to speed that up too, so it becomes very fast in practice.

I'm sure Cap'n Proto and FlatBuffers are faster for some use-cases (I haven't tested), but a very important point for me is to be wire-compatible with protobuf3 and its ecosystem... and still be zero-copy/zero-alloc.


Author of protobluff [1] here - a zero-copy, mostly stack-based implementation for Protocol Buffers in C. Yes, you must scan through the message to find the corresponding tag/value pair you're interested in, but in many cases, it can be fast enough if you're only interested in a few values of a large message. Sure, using a vtable is much faster, as it provides O(1) lookup semantics, but sometimes you may not be able to change to another format without touching the entire stack.

[1]: https://github.com/squidfunk/protobluff


Sounds like there's a point to be made for referential integrity too. That is, if a struct contains string A twice, when you read it back in you'd want both those pointers to be identical. You'd get this for free with Cap'n Proto, but it would require extra care with "one-copy" or looser definitions of "zero-copy."


Heh, well, you could get it for free in Cap'n Proto if Cap'n Proto allowed pointer aliasing. It doesn't, though, because if it did, then messages would not be trees, they'd be graphs, which ruins a lot of stuff. For example, a very common thing to do with a message is copy one branch of the tree into a different message. Deep-copying a branch of a tree is easy. Deep-copying a branch of a graph, though -- what does that even mean?


Copying a branch of a DAG has about the same meaning as copying a branch of a tree, right?


For a DAG, maybe, but it requires a lot more bookkeeping. Now you have to remember all the pointers you've seen before in order to detect dupes. To do that you probably need a hash map and some dynamic memory allocation, ugh.

And what happens if you copy two different branches of one message into another, and they happen to share some children? Do you have to keep your seen-pointer map around across multiple copies?

For a cyclic graph, things get more confusing. Copying one branch of a fully-connected cyclic graph always means copying the entire graph. Apps can easily get into trouble here. Imagine an app that implements its own tree structure where nodes have "parent" pointers. If they try to copy one branch into another message, they accidentally copy the entire tree (via the parent pointers) and might not even realize it.

The one way that I think pointer aliasing could be made to work is if pointers that are allowed to alias are specially-marked, and are non-owning. So each object still has exactly one parent object, but might have some other pointers pointing to it from elsewhere. A copy would not recurse into these pointers; it would only update them if the target object happened to be part of the copy, otherwise they would have to become null.

But I haven't yet had any reason to try implementing this approach. And apps can get by reasonably well without it, by using integer indexes into a table.
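The integer-index workaround mentioned above can be sketched like this (hypothetical names; not Cap'n Proto code): shared objects live once in a side table, and nodes refer to them by index, so the message stays a tree and branch copies stay cheap.

```python
# Side table, stored once in the message; the tree never aliases pointers.
shared_strings = ["red", "green", "blue"]

# Two nodes "pointing at" the same string by index rather than by pointer.
node_a = {"color": 2}
node_b = {"color": 2}

def resolve(node, table):
    """Follow an integer index into the side table."""
    return table[node["color"]]

# Copying a branch just copies the small index, not the shared subtree.
print(resolve(node_a, shared_strings))  # blue
```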


This problem (persistent graph structures) has been solved since the 90s: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.478...


This isn't a question of finding a magical solution, it's a question of trade-offs in performance, complexity, and usability that any solution necessarily imposes, and whether those trade-offs are worth it to support a feature that 99% of use cases don't need.

"Skyscrapers have been solved since the 20's" doesn't answer whether I should use steel beams when building a house.


Maybe it should be called zero parse and zero copy? Seems like two different terms...


Worth noting that by the weak definition of zero-copy JSON can also have zero-copy deserialization; many JSON embedded libraries either write a null terminator in the original buffer or return a pointer and a length.
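The weak form can be sketched in a couple of lines (illustrative only; embedded C parsers do the equivalent by writing '\0' into the buffer in place): the "parsed" string is just a view into the original bytes, never a copy.

```python
buf = bytearray(b'{"name":"alice"}')
view = memoryview(buf)

# "Parse" the string value as a zero-copy slice into the buffer.
name = view[9:14]
print(bytes(name))           # b'alice'

# Because it's a view, mutating the buffer is visible through it.
buf[9] = ord('A')
print(bytes(name))           # b'Alice'
```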


Could you point to documentation on how Cap'n Proto achieves this? Does it keep a header with byte offsets for individual fields? What happens when a variable-sized field is edited?


https://capnproto.org/encoding.html

Records in Cap'n Proto are laid out like C structs. All fields are at fixed offsets from the start of the structure. For variable-width values, the struct contains a pointer to data elsewhere in the message.

Each new object is added to the end of the message, so that the message stays contiguous. This does imply that if you resize a variable-width object, then it may have to be moved to the end of the message, and the old space it occupied becomes a hole full of zeros that can't really be reused. This is definitely a down-side of this approach: Cap'n Proto does not work great for data structures that are modified over time. It's best for write-once messages. FlatBuffers has similar limitations, IIRC.
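A rough sketch of the layout idea (a hypothetical record format, not the actual Cap'n Proto encoding): fixed-width fields sit at fixed offsets, and variable-width data is reached through an offset stored in the record, so every accessor is just offset arithmetic on the raw buffer.

```python
import struct

# Hypothetical layout: int32 id at offset 0, float64 score at offset 8,
# then an (offset, length) pair pointing at string data later in the
# same buffer -- new objects are appended at the end.
buf = bytearray(64)
struct.pack_into('<i', buf, 0, 42)    # id
struct.pack_into('<d', buf, 8, 3.5)   # score
struct.pack_into('<I', buf, 16, 24)   # offset of string data
struct.pack_into('<I', buf, 20, 5)    # string length
buf[24:29] = b'hello'

def get_id(b):    return struct.unpack_from('<i', b, 0)[0]
def get_score(b): return struct.unpack_from('<d', b, 8)[0]
def get_name(b):
    off = struct.unpack_from('<I', b, 16)[0]
    n = struct.unpack_from('<I', b, 20)[0]
    return bytes(b[off:off + n])      # points back into the buffer

print(get_id(buf), get_score(buf), get_name(buf))
```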


Thanks for this explanation!

Off-topic: I see a few links in your HN profile, is that the best place to keep up with the projects you're working on?


I think all my major projects are in my profile... Cloudflare Workers (and Cap'n Proto, which it uses) are my day job; I don't get much time to work on other projects these days unfortunately.



