Question:
I was faced with a task: invert the MD5 hash of 4 arbitrary bytes. In other words, I need to write a service with a single method, which receives 16 bytes of hash as input and returns the 4 source bytes as output.
Since, as far as I remember, reversing the hash itself is not feasible, I decided to go another way and generate hashes for all possible values. I defined an entity, let's call it a record, which consists of 16 bytes of hash (let's call that the key) and 4 bytes of value. Since there are ~4 billion possible 4-byte values, I get 4 billion records of 20 bytes each, i.e. 80 gigabytes of data.
As a first approximation, I generated the 80 gigabytes, split them into sorted chunks (each chunk a separate file) and merged the files merge-sort style. In the end I get one large 80 GB file with 4 billion records sorted by key. From this large file I plan to build a second one: an index file that easily fits into RAM. After that, looking up a record by key becomes a binary search in the large file with a little help from the index file. The search time should be logarithmic, at most 32 jumps.
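The generate-sort-merge-search pipeline above can be sketched at toy scale. This is not the author's code: it is a Python sketch with 1-byte values instead of 4-byte ones so the whole "80 GB" table fits in memory, and the big-endian encoding of the value is an assumption.

```python
import bisect
import hashlib
import heapq

# Toy-scale sketch: each record is (16-byte MD5 digest, value).
def generate_chunks(chunk_size=64):
    """Yield sorted chunks of (hash, value) records (separate files on disk)."""
    chunk = []
    for value in range(256):  # real task: range(2**32)
        digest = hashlib.md5(value.to_bytes(4, "big")).digest()
        chunk.append((digest, value))
        if len(chunk) == chunk_size:
            yield sorted(chunk)
            chunk = []
    if chunk:
        yield sorted(chunk)

# Merge the sorted chunks into one table sorted by key (the 80 GB file).
table = list(heapq.merge(*generate_chunks()))
keys = [k for k, _ in table]

def lookup(digest):
    """Binary search by hash key, as in the large sorted file."""
    i = bisect.bisect_left(keys, digest)
    if i < len(keys) and keys[i] == digest:
        return table[i][1]
    return None
```

On disk, `heapq.merge` corresponds to the k-way merge of chunk files; in memory it produces the same sorted sequence.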
The advantages of this solution:
- Ease of deployment: I just deploy the service, wait for initialization, and everything works.
Minuses:
- It takes a very long time to generate the file; time is measured in hours.
- The generated data takes up a lot of space: 80 GB.
Based on the cons, I came up with a few ideas:

The first, of course, is to store the data in a database, where it is indexed and compressed automatically. I tried MongoDB and SQL Server, but in both the insertion speed was so slow that I got tired of waiting; it was clearly slower than plain generation to a file. I didn't want to install a database for the sake of a single service anyway, and it turns out it wouldn't have helped me much.

The second is to figure out how best to compress the data. To be honest, compressing the hashes doesn't look at all easy to me, and compressing the values won't work, since the full range of values is used. There was an idea to split the large file into files arranged like a trie: each trie level would then save me about 4 gigabytes, plus leading zeros in values could be omitted (but this would greatly complicate the search).
It's quite difficult for me to formulate a specific question in my situation, but I'll try: are there more established ways to solve such a problem? I have a persistent feeling that I'm doing something wrong.
UPD: Let me summarize the current results.
Several approaches have been tried.
1. My initial version: generate all 16-byte hashes into a file together with the source ids, then search for the hash in the file.
Generation time: approximately 34 hours. Search speed: not tested. Storage: 80 GB.
2. The second approach is based on the fact that even if we keep only the first 4 bytes of the 16-byte hash, there will be few collisions. So it is possible to generate one large file of ids (sorted by hash) plus a small index file. The search looks like this: we take the incoming hash and its first 3 bytes; at this address in the index file (which can also be loaded into memory) lies the starting offset in the large file. At this offset we read about 400 ids; one of them is the one we need. We compute the ~400 hashes, find the matching id, and return the result.
Generation time: about 2 hours (I haven't done any optimizations yet). Index file: 70 MB, id file: 16 GB. Search speed: about 75-100 requests per second.
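Approach 2 can be illustrated at toy scale. This is a Python sketch, not the author's implementation: 256 ids instead of 2^32, a 1-byte prefix index instead of a 3-byte one, and the big-endian value encoding is an assumption.

```python
import hashlib

def md5_of(v):
    """MD5 of v encoded as 4 big-endian bytes (byte order is an assumption)."""
    return hashlib.md5(v.to_bytes(4, "big")).digest()

# The "large file": ids sorted by their full MD5 digest.
ids_sorted = sorted(range(256), key=md5_of)

# The "index file": first hash byte -> offset of the first id with that prefix.
index = {}
for pos, v in enumerate(ids_sorted):
    index.setdefault(md5_of(v)[0], pos)

def lookup(digest):
    start = index.get(digest[0])
    if start is None:
        return None
    # Recompute the hash of each candidate sharing the prefix
    # (the "compute ~400 hashes" step at full scale).
    for v in ids_sorted[start:]:
        d = md5_of(v)
        if d[0] != digest[0]:
            break
        if d == digest:
            return v
    return None
```

Because ids are sorted by digest, all candidates for a given prefix are contiguous, which is what makes the single offset per prefix sufficient.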
3. The third approach is to generate hash chains. The idea is that a hash chain can be represented by its first and last values only. For my task it looks like this: we take a 4-byte id and compute a hash H from it (keeping the first 4 bytes), getting something like id > H. The hash is 4 bytes, which means somewhere there is an id with that value, so I can take the hash of the hash and build a chain like id > H1 > H2 > ... > Hn. Only the beginning and the end of the chain are written to the file: the beginning is needed to rebuild the chain, the end is needed to find a value by its hash.
The search itself looks like this: the input is a hash h. We look for a chain ending in h; if we find one, we rebuild it, and it contains the required id. If there is no such chain, we take the hash of our hash, h1, and look for a chain ending in h1, and so on until a chain is found.
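The chain construction and search described above can be sketched in Python at toy scale. Assumptions not in the original: the truncated hash keeps only 1 byte (instead of 4), chains are 8 steps long, and the value encoding is big-endian.

```python
import hashlib

def H(v):
    """Truncated hash: the first byte of the MD5 of the 4-byte value
    (1-byte toy scale; the real task keeps the first 4 bytes)."""
    return hashlib.md5(v.to_bytes(4, "big")).digest()[0]

CHAIN_LEN = 8  # hash applications per chain

# Build: walk CHAIN_LEN steps from every start; store only end -> start.
chains = {}
for start in range(256):
    v = start
    for _ in range(CHAIN_LEN):
        v = H(v)
    chains.setdefault(v, start)

def lookup(h):
    """Find some id with H(id) == h, if a stored chain covers h."""
    v = h
    for _ in range(CHAIN_LEN):
        if v in chains:
            # Rebuild the chain from its start; look for h's predecessor.
            x = chains[v]
            for _ in range(CHAIN_LEN):
                if H(x) == h:
                    return x
                x = H(x)
        v = H(v)  # no luck: hash the hash and try again
    return None
```

Note that at this scale chains merge heavily, so some hashes are simply not covered; that is exactly the duplication/coverage problem discussed below.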
This approach seems to have an advantage: you can represent the data as a set of long chains, keeping only their beginnings and ends in memory. And the first problem is in the word "long". I will explain below.
So the cons:
- Obviously, if a chain contains an id that is already in another chain, the chains start duplicating information. It looks like this:
id1 > H1 > H2 > H3 > H4
id2 > H5 > H1 > H2 > H3
As you can see, information is duplicated if the chain length is kept constant (and checking for this would increase the file generation time).
- To get rid of duplication, I allocated 2^32 bits (~500 MB) in memory and marked the already-processed ids, so each id belonged to exactly one chain. And then the chain-length problem surfaced. The point is that for a chain to be really long and useful, all the hashes generated inside it must equal ids that have not yet been processed. When you start computing these chains (I extended a chain until I hit an already-processed id), they have an average length of about 200 nodes. But this number drops very quickly.
Let me give you an example: imagine that 10% of all ids have already been processed, and a new chain is being generated. Assuming the hash function is uniformly distributed, the chance that the first hash is an id that has not yet been processed is 0.9. The chance that the second hash is too is 0.9 * 0.9 = 0.81, and so on; the chance of getting at least 10 elements in the chain is about 0.35. And that is with only 10% processed. With 80% processed, the chance of getting even 2 elements in the chain (not counting the first id) is 0.04. With 99% processed there is no chance of a long chain at all, and a typical chain consists of the initial id and its hash. That is exactly what I observed.
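The probability argument above is a one-liner to check. A small Python sketch, assuming the truncated hash behaves like a uniform random function over the id space:

```python
# Chance that a fresh chain collects at least n not-yet-processed ids
# when a fraction `processed` of all ids is already covered.
def chain_survival(processed, n):
    return (1.0 - processed) ** n

# With 10% processed, a chain of at least 10 elements has probability ~0.35.
print(round(chain_survival(0.10, 10), 2))
```

The geometric decay is what makes the last few percent of coverage so expensive for both plain chains and the rainbow tables below.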
In total, the generation time is about an hour, the file size is 16GB. I didn't even try to test the search speed.
4. Fourth approach – rainbow tables (thanks to @AlexanderPetrov for the tip).
Rainbow tables are essentially a variation on hash chains. The difference with chains is that here the use of a hash function is interleaved with the use of different functions (reduction functions). Let me give you an example:
id > H1=H(id) > H2=R1(H1) > H3=H(H2) > H4=R2(H3) > H5=H(H4)
The example shows the hash function interleaved with different reduction functions. The advantage of this approach is that even if some value coincides in 2 chains, duplication occurs only if the value sits at the same position counted from the beginning of the chain, for example:
id1 > H1=H(id1) > H2=R1(H1) > H3=H(H2) > H4=R2(H3) > H5=H(H4)
id2 > H6=H(id2) > H7=R1(H6) > H3=H(H7) > H4=R2(H3) > H5=H(H4)
But if this value sits at different positions in the chains, it will produce different results, since it is processed by different reduction functions:
id1 > H1=H(id1) > H2=R1(H1) > H3=H(H2) > H4=R2(H3) > H5=H(H4)
id2 > H6=H(id2) > H7=R1(H6) > H2=H(H7) > H8=R2(H2) > H9=H(H8)
And everything seemed fine: you can generate chains of arbitrary length. That's what I thought when I started generating chains of 3000 elements, and I very quickly covered 30% of all values. In tolerable time I covered 50%. But here the same probability rules from the previous example come into play. Imagine I have covered 95% of all ids. A new chain of 3000 elements will then contain about 150 new elements and 2850 already-covered ones. So the chain generation speed drops dramatically, since in each new chain 95% of the elements are computed in vain. I waited about 30 hours, couldn't stand it any longer and stopped this madness; chains covering about 98% of all ids took about 240 megabytes. But 98% accuracy doesn't satisfy me, of course; I need 100%.
Because of the reduction functions, looking up an id by hash differs slightly from the previous version. There I simply took the hash of the hash repeatedly; here, a new chain has to be built from the incoming hash. Example: suppose there is a chain
id1 > H1=H(id1) > H2=R1(H1) > H3=H(H2) > H4=R2(H3) > H5=H(H4)
And we received the hash H1 as input. We have no chain that ends in H1, so we compute candidate endpoints, assuming H1 sits deeper and deeper inside a chain:
H(R2(H1)) - no match
H(R2(H(R1(H1)))) - at this step the result equals element H5 of the chain.
After the chain is found, we rebuild it and find the required id. In practice, I waited more than 10 minutes to recover even one id from a hash and gave up. But I suspect it's not the algorithm, just my clumsy implementation. I fully admit that if you only need to reverse the hash with some tolerated error rate, then in principle the option works.
In total: generation time is very long, 98% in 30 hours, with the 98th percent alone taking 5-6 hours; file size 240 MB. Search speed: N/A.
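At toy scale, the rainbow-table construction and the column-by-column search of approach 4 can be sketched in Python. This is not the author's C# code; assumptions: a 1-byte truncated hash, XOR with the column index as a stand-in family of reduction functions, chains of 4 hash steps.

```python
import hashlib

def H(v):
    """Truncated hash of a 4-byte value (1-byte toy scale)."""
    return hashlib.md5(v.to_bytes(4, "big")).digest()[0]

def R(i, h):
    """i-th reduction function: maps a hash back into the id space.
    XOR with the column index makes each column's reduction distinct."""
    return (h ^ i) & 0xFF

CHAIN_LEN = 4  # number of hash applications per chain

def build_chain(start):
    v = H(start)
    for i in range(1, CHAIN_LEN):
        v = H(R(i, v))
    return v  # endpoint; only (start, endpoint) is stored

table = {}
for start in range(256):
    table.setdefault(build_chain(start), start)

def lookup(h):
    """Try h at every possible column; on an endpoint match, rebuild."""
    for col in range(CHAIN_LEN - 1, -1, -1):
        v = h
        for i in range(col + 1, CHAIN_LEN):  # roll forward to the endpoint
            v = H(R(i, v))
        start = table.get(v)
        if start is None:
            continue
        # Rebuild the chain from its start; look for h's predecessor.
        x = start
        v2 = H(x)
        if v2 == h:
            return x
        for i in range(1, CHAIN_LEN):
            x = R(i, v2)
            v2 = H(x)
            if v2 == h:
                return x
    return None
```

A rebuild can fail to contain h (a false alarm from a chain merge), in which case the search simply continues with the next column guess.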
As a current result, I have settled for now on option 2: a 16 GB file plus an index. I suspect this option will be the fastest, but I'm not writing off the other options either.
UPD 2: I posted the source code of my experiments. The second approach is fully implemented: all generation takes 25 minutes, searching one key at a time gives about 180 requests per second, and searching a batch of incoming real data from a real program gives ~1000 requests per second (checked against the 75 thousand hashes available).
Thank you all for your help!
Answer:
Let's estimate the minimum size of a possible index. Whatever the indexing method, the key must lead us to the 4 bytes of the value. Storing the values themselves requires 16 GB. And since the values in the index are arranged in key order, they amount to random data that uses all possible bit patterns; no algorithm can compress such data losslessly.
An additional index is not needed at all! All we need is a 16 GB dictionary containing the values themselves, sorted in the order of their MD5 hashes. When searching, the first 4 bytes of the hash give an estimate of the file region in which the desired value must lie. After building the dictionary and collecting statistics, it turned out that reading a 128 KB block around that estimate is enough: the desired value is guaranteed to be in it. The exact coordinates of the block in the file are obtained as follows:
const int bufSize = 1024 * 128;
long offs = (((long)hash[0] << 24 | (long)hash[1] << 16 |
              (long)hash[2] << 8 | hash[3]) << 2) - bufSize / 2;
if (offs < 0) offs = 0;
if (offs >= 0x400000000 - bufSize) offs = 0x400000000 - bufSize;
After reading this block, we jump to its middle, compute the MD5 of the value found there and check it against the target. Then we hop in one direction or the other within the block, binary-search style, only not over stored keys but over MD5 sums computed on the fly. We reach the required value in at most 20 "jumps".
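The keyless search can be sketched at toy scale. A Python sketch, not the answer's C# code: 256 values instead of 2^32, so the 128 KB windowing step is omitted and the whole dictionary is binary-searched directly; the big-endian value encoding is an assumption.

```python
import hashlib

def md5_of(v):
    """MD5 of the 4-byte big-endian encoding of v (byte order is an assumption)."""
    return hashlib.md5(v.to_bytes(4, "big")).digest()

# Toy dictionary: values sorted by their MD5 (the real one is 16 GB).
dictionary = sorted(range(256), key=md5_of)

def lookup(digest):
    """Binary search over values, recomputing MD5 on the fly:
    no keys are stored and no index file is needed."""
    lo, hi = 0, len(dictionary) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        d = md5_of(dictionary[mid])
        if d == digest:
            return dictionary[mid]
        if d < digest:
            lo = mid + 1
        else:
            hi = mid - 1
    return None
```

The trick is that sorting by hash makes the hash order recoverable from the values alone, so the comparisons of a binary search can be recomputed instead of stored.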
Appendix: Algorithm for constructing a dictionary with insufficient memory
If you have less than 4 GB of memory available and cannot even fit the 4-byte keys in memory at once, the following is a fairly quick solution:
We divide the processing into 2 stages. In the first stage we open 256 files, named 00 to FF. We compute the MD5 of every possible number and append to the file whose number matches the first byte of the hash: the value itself (4 bytes) and 4 bytes of the hash starting from the second (the first is known from the file name). With 1 MB write buffers per file, the program needs about 280 MB of RAM at this stage. For me, implemented in C# without special optimization, the stage took about 2.5 hours on 1 core. The actual MD5 computation takes all the time, and there is no escaping it without using several cores or a GPU.
The second stage is sorting each file separately and merging the values from them into the resulting dictionary file. We go in order of file numbers, so the data comes out sorted by the first byte of the hash automatically. No comparisons between files are required: read a file, sort it, write it to the output, move on to the next one.

I don't know the name of the sorting method I used, but it works like this. We read the whole 128 MB file into memory. We create a 64 MB output buffer, a 64 MB buffer of output keys, a 16 MB byte array of key counters, a 64 MB array of offsets, and one more working array of current offsets in the block; in total the program needs about 370 MB of memory. We sort in 2 passes. First we count the number of occurrences of each 3-byte key (one 3-byte key repeats at most 11 times here, the first hash byte already being fixed by the file name). From these counts, by summation, we compute and write the beginning of each list of values into the offset array. Then we make a second pass over the input array, copying the values into the value array and the keys into a separate array. At points where a key has more than 1 value, we use insertion sort: find the first value greater than the given one and shift it and all subsequent ones back. Since one list holds at most 11 values (and there are only about 20 such situations per file of 16 million records), this is fast. If all 4 stored bytes of the key (plus the 5th from the file name) coincide, the algorithm recomputes the MD5s and compares the full hashes; there are about 32,000 such situations per 16 million records. If this spot becomes a bottleneck, you could store more of the key in the first stage (but the files would grow and require more memory and disk I/O). This stage took me 25 minutes.
Again, with no optimization at all.
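The two-stage build can be sketched at toy scale. A Python sketch under the same assumptions as before (256 values, big-endian encoding); buckets stand in for the 256 files 00..FF, and plain `sorted` stands in for the counting sort:

```python
import hashlib
from collections import defaultdict

def md5_of(v):
    return hashlib.md5(v.to_bytes(4, "big")).digest()

# Stage 1: distribute every value into one of 256 buckets by the first
# byte of its hash, storing the value and the next 4 hash bytes.
buckets = defaultdict(list)
for value in range(256):  # real task: range(2**32)
    digest = md5_of(value)
    buckets[digest[0]].append((digest[1:5], value))

# Stage 2: sort each bucket by the stored hash bytes and concatenate.
# Processing buckets in order makes the result sorted by the first
# hash byte automatically; no merging between buckets is needed.
dictionary = []
for first_byte in range(256):
    for _, value in sorted(buckets.get(first_byte, [])):
        dictionary.append(value)
```

The result is the same value-ordered-by-hash dictionary as before, built without ever holding more than one bucket's worth of records in sorted form.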
Final code. Please don't kick too much: I'm writing C# for the first time (I may need it soon, and this task was a convenient "Hello World"), so I may not know basic things. Comments, suggestions, etc. (especially of the form "this is much easier to do like this") are welcome 🙂
Appendix 2: Custom MD5 Calculation
I reworked the C source code of the MD5 block-computation procedure from this article, making a version tailored strictly to hashing 4 bytes. More precisely, the minimal 64-byte block is still processed, as the MD5 algorithm requires, but only the 4 initial bytes of the prepared block ever change. I got the following. The time for stage 1 of the algorithm, which computes 4 billion MD5 sums, dropped from 2 hours 40 minutes to 17 minutes (on one Core i7-2600 core). After introducing multithreading the application became disk-bound: stage 1 takes 10 minutes, stage 2 takes 12 minutes. In total, building a ready-to-search dictionary takes 22 minutes.
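The single-block idea can be shown in a straight-line Python rendition of MD5 per RFC 1321 (this is not the author's C code). The point of the specialization is visible in the setup: the 64-byte block is entirely fixed (padding byte 0x80, zeros, and a 32-bit length) except for its first 4 bytes, and the result is checked against `hashlib`.

```python
import hashlib
import math
import struct

# Shift amounts and sine-derived constants from RFC 1321.
S = [7, 12, 17, 22] * 4 + [5, 9, 14, 20] * 4 \
    + [4, 11, 16, 23] * 4 + [6, 10, 15, 21] * 4
K = [int(abs(math.sin(i + 1)) * 2**32) & 0xFFFFFFFF for i in range(64)]

def md5_4bytes(data4):
    """MD5 of exactly 4 bytes: a single 64-byte block in which only
    the first 4 bytes vary; padding and length are fixed."""
    assert len(data4) == 4
    block = bytearray(64)
    block[0:4] = data4
    block[4] = 0x80                        # mandatory padding marker
    struct.pack_into("<Q", block, 56, 32)  # message length in bits, LE
    M = struct.unpack("<16I", bytes(block))

    a, b, c, d = 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476
    for i in range(64):
        if i < 16:
            f, g = (b & c) | (~b & d), i
        elif i < 32:
            f, g = (d & b) | (~d & c), (5 * i + 1) % 16
        elif i < 48:
            f, g = b ^ c ^ d, (3 * i + 5) % 16
        else:
            f, g = c ^ (b | ~d), (7 * i) % 16
        f = (f + a + K[i] + M[g]) & 0xFFFFFFFF
        rot = ((f << S[i]) | (f >> (32 - S[i]))) & 0xFFFFFFFF
        a, b, c, d = d, (b + rot) & 0xFFFFFFFF, b, c

    return struct.pack("<4I",
                       (0x67452301 + a) & 0xFFFFFFFF,
                       (0xEFCDAB89 + b) & 0xFFFFFFFF,
                       (0x98BADCFE + c) & 0xFFFFFFFF,
                       (0x10325476 + d) & 0xFFFFFFFF)
```

In a specialized C implementation, everything after `block[0:4] = data4` is a compile-time constant, which is where the 2h40m-to-17m speedup comes from.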