It is very important to note that this is the probability that any two hashes collide in a set of hashes. It is not the probability to find a second plaintext that hashes to the same hash as a given one. It's far harder to find a colliding value to a given one than to just grab a group of values and have two of them hash to the same hash.
Indeed, pretty much exactly that point is made in the third box on the right:
In case you are one of the many, many people who get this wrong
and are astonished, you may be thinking of a different question.
My birthday is a specific date. If we assume birthdays are
uniformly distributed, how many people must you ask before you
find someone who has the same birthday as me?
On average, about 182.
If you ask some 150 to 200 people, the odds are about 50% that
you'll find someone who shares my specific birthday.
But that's not the question I asked. They don't just need to share
my birthday, it can be that any birthday is shared among them, and
that changes the odds dramatically.
I don't see how that point relates to Xylakant's comment. The text you quoted basically points to the difference between these two cases:
1) that if you have a specific hash, how many more hashes would have to be generated to get the same hash.
2) how many hashes would you have to generate before a collision occurs between any of them.
So in the two cases above we're dealing with sets of hashes, and we don't care how they are generated.
Xylakant's comment was about a different type of collision. This collision happens because you are mapping an arbitrary large text into a fixed size hash, which means that different texts can map to the same hash, hence creating a collision.
> It is very important to note that this is the probability
> that any two hashes collide in a set of hashes. It is not
> the probability to find a second plaintext that hashes to
> the same hash as a given one.
ColinWright:
> Indeed, pretty much exactly that point is made ...
> > In case you are one of the many, many people who get this wrong
> > and are astonished, you may be thinking of a different question.
> > My birthday is a specific date. If we assume birthdays are
> > uniformly distributed, how many people must you ask before you
> > find someone who has the same birthday as me?
> > ...
> > But that's not the question I asked. They don't just need to share
> > my birthday, it can be that any birthday is shared among them, and
> > that changes the odds dramatically.
dsego:
> I don't see how that point relates to Xylakant's comment.
OK, I'll try to explain my thinking:
> The text you quoted basically points to the difference
> between these two cases:
> 1) that if you have a specific hash, how many more hashes
> would have to be generated to get the same hash.
> 2) how many hashes would you have to generate before a
> collision occurs between any of them.
Yes.
> So in the two cases above we're dealing with sets of hashes,
> and we don't care how they are generated.
That's true.
> Xylakant's comment was about a different type of collision.
> This collision happens because you are mapping an arbitrary
> large text into a fixed size hash, which means that different
> texts can map to the same hash, hence creating a collision.
If you're using cryptographic hashes then I believe these amount to the same thing.
The distinguishing characteristics of a cryptographic hash are that it has no internal structure, and it is completely unpredictable based on knowledge of similar source texts. In particular, techniques such as differential analysis don't work.
In that sense, finding a text to map to a hash to give a collision is exactly the same as picking a random number that happens to match. The act of starting with a text and using that to create a hash is identical to the act of choosing a random element of the hash space. If these two things were not identical, then the hash would have some sort of predictability or structure that related it to the source text.
The situation is different if you are using non-cryptographic hashes such as CRCs. Perhaps I should edit the original article to help make that distinction, but I'm not really sure it's worth it.
Actually I was trying to express that the birthday paradox does not apply when you to find a plaintext that evaluates to a known hash, as you'd need to do for password cracking or forging a signed email for example[1]. The birthday paradox only gives you the probability that of a random set of hashes, two are the same. It's "if I generate n random pieces of plaintext, how big is the chance that two of them generate the same hash." Actually it gives you the probability of picking the same element twice from a finite set of elements when you pick n times.
[1] When generating emails more restrictions apply: The colliding plaintexts have to be at least somewhat coherent and probably should express something the attacker wants to express.
Yes, and that's exactly the point that is addressed in the box I quoted.
You're exactly right, and I believe it's covered. I didn't go into detail about the problems involved in generating a plain text that hashes to a specific hash,such as you mention. I did simply mention that the problem I'm talking about is not that one.
So I don't understand the point dsego was making, because I think my reference is relevant.
At this point I'm no longer sure it really matter.