
> nor it could ever be

Well, never say never... In a world of immutable objects, an object's identity (reference) can be derived from its content. Look, for example, at how objects are stored in Git.
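For illustration, here is a minimal Python sketch (standard library only) of how Git derives a blob's identity from its content: SHA-1 over a small "blob <size>" header followed by the raw bytes. Feeding the same bytes to `git hash-object --stdin` should print the same ID.

    import hashlib

    def git_blob_id(content: bytes) -> str:
        # Git's identity for a blob: SHA-1 over "blob <size>\0" + content,
        # so identical content always yields the identical ID.
        header = f"blob {len(content)}\x00".encode()
        return hashlib.sha1(header + content).hexdigest()

    # Should match `echo "hello" | git hash-object --stdin`.
    print(git_blob_id(b"hello\n"))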



And of course, by now everyone has seen https://stackoverflow.com/questions/3475648/sha1-collision-d... . It's an interesting problem space: trying to generate a unique ID for the contents of a thing, where the size of that unique ID is less than the size of the thing. If you consider that the thing itself (a deep representation of the object) is the most obvious way to uniquely identify it, but then figure you can do better, it would seem that it would always be technically possible, outside of lossless compression algorithms (like gzip), to generate something else with the same hash. I'm sure there is a ton of theory on this and that my terminology isn't correct here. Can someone steer me in the right direction regarding what this type of thing is called?

EDIT: important detail I'd forgotten: hash functions take arbitrary-length data in and spit out fixed-length hashes. I learned about https://en.wikipedia.org/wiki/Perfect_hash_function, and https://crypto.stackexchange.com/questions/29264/why-cant-we... is relevant. Quote: """ Mathematically speaking, there is no such thing as a collision-free hash. Practically speaking, there is. """
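As a toy illustration of why a perfect hash only exists for a fixed, known key set, here is a hypothetical Python sketch (the key list is made up) that brute-forces a table size m until sha256(key) mod m is collision-free for exactly those keys; add one more key and the guarantee evaporates:

    import hashlib

    # The set of keys is fixed up front; the function is "perfect" only for it.
    keys = ["apple", "banana", "cherry", "date"]

    def h(key: str, m: int) -> int:
        digest = hashlib.sha256(key.encode()).digest()
        return int.from_bytes(digest, "big") % m

    # Brute-force the smallest table size m with no collisions for `keys`.
    m = len(keys)
    while len({h(k, m) for k in keys}) < len(keys):
        m += 1

    print(m, {k: h(k, m) for k in keys})  # one distinct slot per key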


A hash function that takes arbitrary data and generates fixed-length hashes will always have collisions, because there are simply more possible inputs than possible outputs. This is known as the pigeonhole principle[0]: if you have 8 pigeons and 5 holes, and you want to put all the pigeons in the holes, then at least 3 of them will have to go into an already occupied one.
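A quick Python sketch of the pigeonhole principle applied to hashing: truncate SHA-256 to its first byte, leaving only 256 possible outputs, and a collision among 257 distinct inputs is guaranteed (in practice the birthday effect finds one much sooner):

    import hashlib

    def tiny_hash(data: bytes) -> int:
        # Keep only the first byte of the digest: 256 possible outputs.
        return hashlib.sha256(data).digest()[0]

    seen = {}
    for i in range(257):  # 257 pigeons, 256 holes
        h = tiny_hash(str(i).encode())
        if h in seen:
            print(f"inputs {seen[h]} and {i} both hash to {h}")
            break
        seen[h] = i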

The practical way of making a hash function "collision-free" is to use trickery and dark magic to push the probability of a collision down to somewhere below "Earth just got hit by an asteroid, we all have other problems". You want to go really, really low because hashes are mostly used to quickly verify data, and malicious actors will be happy to throw compute at generating collisions so they can alter the data.
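To put numbers on "really, really low": the standard birthday-bound approximation says the chance of any collision among n random b-bit hashes is about 1 - exp(-n(n-1)/2^(b+1)). A rough Python sketch (the workload figure is made up for illustration):

    import math

    def collision_probability(n: float, bits: int) -> float:
        # Birthday bound: P(any collision among n random `bits`-bit values)
        # is approximately 1 - exp(-n*(n-1) / 2**(bits+1)).
        return -math.expm1(-n * (n - 1) / 2 ** (bits + 1))

    # Hypothetical workload: a billion hashes per second for a century.
    n = 1e9 * 60 * 60 * 24 * 365 * 100
    print(collision_probability(n, 256))  # on the order of 1e-41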

As for how the trickery and dark magic is performed, I'm a wizard of too low a level to explain the details; all I recall is that you try to make the hash maximally sensitive to every bit of the input, so that a random bit flip anywhere has a huge probability of greatly changing the hash.
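That property is usually called the avalanche effect, and it is easy to observe: flip a single bit of the input and roughly half of SHA-256's output bits change. A small Python sketch:

    import hashlib

    def sha256_int(data: bytes) -> int:
        return int.from_bytes(hashlib.sha256(data).digest(), "big")

    msg = bytearray(b"hello world")
    before = sha256_int(bytes(msg))
    msg[0] ^= 0x01                      # flip a single input bit
    after = sha256_int(bytes(msg))

    changed = bin(before ^ after).count("1")
    print(f"{changed} of 256 output bits changed")  # typically ~128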

--

[0] - https://en.wikipedia.org/wiki/Pigeonhole_principle



