The task: compare two sets of unique positive integers and find the values present in both. All values lie in the range 1 to 200 million; typically each set contains 0 to 5 million numbers.
Until now I have done it head-on: I load both sets into temporary MySQL tables, two one-column tables where the numbers are primary keys. The comparison is fast when the sets are small and fit in engine=MEMORY, and slow when the tables are large and have to be created on disk. When such comparisons need to be run in bulk and often, everything grinds to a halt.
What if I replicate MySQL's indexed columns in native code: keep one of the sets in memory and check each element of the second for presence in the first?
Instead of storing each number of the set explicitly (32 bits each, 2.5 million on average = 80 Mbit = 10 MB), work with a bitmask of all possible values. 200 million fits, with a margin, into 2^28 = 268,435,456 bits = 32 MB. A bit set to 1 means the number is in the set, 0 means it is not. Then compare the set bits.
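As a sketch of the bitmask idea (in Python for brevity; a native implementation would do the same bit arithmetic in C), one hypothetical way to build and probe the mask:

```python
def make_mask(values, max_value=200_000_000):
    """Build a bitmask: bit n is 1 iff n is in `values`."""
    mask = bytearray(max_value // 8 + 1)   # 25 MB for 200 million bits
    for n in values:
        mask[n >> 3] |= 1 << (n & 7)
    return mask

def in_mask(mask, n):
    """Test whether bit n is set."""
    return bool(mask[n >> 3] & (1 << (n & 7)))

# Intersect: mark the first set in the mask, then probe each element
# of the second set against it.
a = {17, 22, 199_999_999}
b = [19, 22, 100, 199_999_999]
mask = make_mask(a)
common = [n for n in b if in_mask(mask, n)]
print(common)  # [22, 199999999]
```

Note that an exact 200-million-bit mask is about 25 MB; the 32 MB figure above rounds up to the nearest power of two.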
Storing the full bitmap for each set is wasteful. Surely such data can be compressed well: most of the bits will be 0, so runs of zeros can be encoded by their length, for example.
Only one kind of query will ever be made against such a compressed array: is a given number in the set or not?
Let's simplify for an example. Suppose there are 32 possible values, 1..32, so our array consists of 32 zeros/ones. The set contains only two values, 17 and 22: 16 zeros, a one, 4 zeros, a one. Write the run lengths 16 and 4 in binary as 10000100: only 8 bits instead of 2 * 6 = 12, a saving of a third. But this is just my clumsy, off-the-cuff idea of a possible compression scheme, with no separators, no handling of runs of ones, and so on.
Now I want to know whether 19 is in the set. Walk through our 8 bits: the run of 16 zeros brings us to position 16, the one gives position 17, which is still less than 19; the next run of 4 zeros carries us past 19, so the answer is "not in the set".
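The walk just described can be sketched as follows. This is a hypothetical encoding (plain integer run lengths, values starting at 1), not the packed binary format above:

```python
def encode_runs(values):
    """Encode a sorted set as zero-run lengths: the count of absent
    numbers before each present one (values assumed to start at 1)."""
    runs = []
    prev = 0
    for v in sorted(values):
        runs.append(v - prev - 1)  # zeros between the previous member and v
        prev = v
    return runs

def contains(runs, target):
    """Walk the runs, tracking the current position, until we reach
    or overshoot the target."""
    pos = 0
    for zeros in runs:
        pos += zeros          # skip the run of absent numbers
        pos += 1              # the next position holds a member
        if pos == target:
            return True
        if pos > target:
            return False      # overshot: target fell inside a zero run
    return False              # target lies beyond the last member

runs = encode_runs({17, 22})
print(runs)                # [16, 4]
print(contains(runs, 19))  # False
print(contains(runs, 22))  # True
```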
In your opinion, is there any point in reinventing this wheel: can it actually beat MySQL at comparing two sets?
Upd. Let me formulate the question more simply. I am looking for a compression scheme for data whose parameters and constraints are known in advance: only natural numbers from .. to .., not consecutive, not sorted, no duplicates, order unimportant. It does not even need to be decompressible: it only has to answer the question "is such-and-such a number in the set, or not?"
1) A database is not the tool for problems like this
2) use either ordered lists or hash tables
3) I would use a sorted list or array, and then compare using the pairwise merge method:
- point both pointers at the first elements
- take the element from the first list
- compare it with the element from the second list
- if the values are equal, add the element to the result list, advance both pointers, and go back to step 1
- if the first element is less than the second, advance the first list's pointer
- otherwise advance the second's
In total, the algorithms used are: 2 * qsort and a pairwise merge.
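A minimal sketch of the sort-then-merge approach (Python for clarity; intersect_sorted is a made-up name):

```python
def intersect_sorted(a, b):
    """Intersect two sorted lists with the pairwise-merge walk above."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:          # equal: record and advance both pointers
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:         # first is smaller: advance its pointer
            i += 1
        else:                     # otherwise advance the second's
            j += 1
    return result

a = sorted([22, 5, 17, 100])      # the "2 * qsort" step
b = sorted([19, 22, 100, 3])
print(intersect_sorted(a, b))     # [22, 100]
```

Each pointer only moves forward, so the merge is a single O(n + m) pass over both lists.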
If the lists are very large … (20 million is quite tolerable in memory, but 200 million may already be too much), they can be split into chunks and compared one chunk at a time.
The algorithm is roughly this: list 1 consists of chunks A + B + C + D … and list 2 of chunks A + B + C + D …
- compare chunk 1A with chunk 2A
- if chunk 1A runs out first, compare the remainder of chunk 2A with chunk 1B
- otherwise compare the remainder of chunk 1A with chunk 2B
- whenever the current chunk of a list (1 or 2) runs out, the next chunk of that list becomes the current one
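One way to read the chunking idea: since both lists are sorted, the same merge can run over streams or iterators, so only the current chunk of each list needs to be in memory at once. A hypothetical sketch:

```python
def intersect_streams(it_a, it_b):
    """The same pairwise merge, but over iterators, so the full lists
    never need to be in memory (e.g. chunks read from disk)."""
    sentinel = object()
    x, y = next(it_a, sentinel), next(it_b, sentinel)
    while x is not sentinel and y is not sentinel:
        if x == y:                       # match: emit and advance both
            yield x
            x, y = next(it_a, sentinel), next(it_b, sentinel)
        elif x < y:                      # advance whichever is behind
            x = next(it_a, sentinel)
        else:
            y = next(it_b, sentinel)

print(list(intersect_streams(iter([5, 17, 22, 100]),
                             iter([3, 19, 22, 100]))))  # [22, 100]
```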