Question:
There is a base with a huge number of lines. All strings are unique. They contain English, Russian letters and numbers, may contain special characters. There are 1 to N words in a line.
Question: how can you make a unique hash for everyone?
The use of long
is possible, but it needs to fit within the scope of int
.
The algorithm will be described in c #, in theory, you can use asm to speed up, but we are thinking about it.
Answer:
Let's say we have a random hash function, M possible values, and N elements. Then the probability that the hashes of the elements will be different is 1 × (1 – 1/M) × (1 – 2/M) × … × (1 – (N-1)/M) = M! / ((MN)! × M N ) . With M = 2 32 and N = 8,000,000, it will be equal to 1.75 × 10 -3238 .
For comparison, the probability of a meteorite falling tomorrow is on your head more than the probability of no collisions in such conditions.
This value is more than small enough to make it impossible to brute force hash functions in search of a suitable one.
If the strings are unknown to you in advance, you can stop there. Any function you choose to use will almost certainly produce collisions on a previously unknown set of strings.
But if the strings are known in advance, then an ideal hashing algorithm can be built for them. It is done like this:
- Select the number of bits for the "highest" part of the hash. The number of bits h will give H=2 h different values.
- We take any hash function and divide the source strings into H baskets according to it. On average, we get N1 = N/H rows in the basket.
- For each basket, we select (automatically, of course) a separate ideal hash function for the lower bits of the hash. It will be M1 = M/H different values.
In order for step 3 to last an adequate time, the value of M1 2 must be comparable with N1 . We get – M 2 / H 2 ∽ N / H – that is, we must take H of order M 2 / N .
We get H = 16 384 (h = 14), M1 = 288, N1 = 262 144.
In total, the algorithm for obtaining a hash function will be as follows:
- We calculate the hash function 1 modulo 16 384, store it in the h1 variable.
- We go to the table, calculated in advance, by index h1. Get parameters for hash function 2.
- We consider the hash function 2 modulo 262 144, store it in the h2 variable.
- We consider
(h1 << 18) + h2
– this is the unique hash.
As a hash function 2, you need to choose something parametric so that there is a place to choose random parameters. That is, a polynomial hash is a good fit here.
As hash function 1, you can choose any good hash function.
The algorithm turns out to be quite fast (asm is not needed) – but it will require a constant table of 32 or 64 kilobytes.