python – Fast search in 1 billion lines

Question:

There are several text files, ranging in size from 50 MB to 5 GB, with a total of roughly 700-1000 million lines. All the text files are in one folder, and their names are known in advance.

Example file:

c4ca4238a0b9233d5fcc509a6f75849b.z001   6b3u3v65vu26u3bv6u
3d5f728d9d4c2f636f067f89cc14862c.z001   4cv5b7nm86ompmp9n6
ecc3d5fe4b5ce2fe28308fd9f2a7baf3.z001   1p864nv6rwye653565
a87ff679a2f3e71d9181a67b73d5f22c.z001   vb4484nb4646v46bvu
e4da3b7f3d5f2345d7772b0674a318d5.z001   vb75in86o486onb7i3
1679091c5a880faf6fb5e6087eb1b2dc.z001   v66v4664u64uvu6464
8f14e45fceea167a5a36dedd4bea2543.z001   4g6v6464buv64ubv64
c9f0f895fb98ab9159f51fd0297e236d.z001   54vb567n7u35gf6b6h
45c48cce2e2d7fbdea1afc51c7c6ad26.z001   g64u34u73h5635g636

I need to find all lines that contain any of the key strings and write those lines out to a separate file. The list of keys is in a separate file, keys.txt. The whole line is not needed, only the part from the first character up to the first tab or space character.

Sample keys.txt file:

3d5f
0000
сссс

The number of keys can be up to 100.

That is, for my example, the correct output file should look like this:

c4ca4238a0b9233d5fcc509a6f75849b.z001
3d5f728d9d4c2f636f067f89cc14862c.z001
ecc3d5fe4b5ce2fe28308fd9f2a7baf3.z001
a87ff679a2f3e71d9181a67b73d5f22c.z001
e4da3b7f3d5f2345d7772b0674a318d5.z001

Such files need to be processed as quickly as possible. Ideally, the solution should work on both Windows and *nix systems.

Which algorithm should I use? Can anyone help with writing an example?

Answer:

As far as I understand, the search for each query is performed only once, so:

  1. Obviously, nothing can be faster than reading the files exactly once and searching as you read. That said, you can experiment with memory-mapped files – in theory this can speed up the reading (a sketch of that variant is given below).

  2. Within each line you search for the keys as plain substrings. You could build search trees if ordinary substring search turns out not to be efficient enough, but I see no sign of that need in the question. As long as there are not too many keys, they are not very long, and they are not specially crafted to defeat the naive search algorithm, ordinary substring search should be enough (see the baseline sketch after this list).

  3. You can split the program into 2 (or more, if needed) threads: one reads the data, the others process it (in parallel if necessary). If all the data sits on one physical disk, there is no point in reading from several threads; quite the opposite, the disk will only get slower. If the files are on different disks, you can parallelize across the physical disks.

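As a baseline for points 1 and 2, here is a minimal single-pass sketch. It treats everything as bytes so the data files never need to be decoded; the input file names and the output name found.txt are placeholders, since the real names are known in advance.

    def load_keys(path="keys.txt"):
        # Read the keys as raw bytes, one per line, skipping empty lines.
        with open(path, "rb") as f:
            return [line.strip() for line in f if line.strip()]

    def scan(paths, keys, out_path="found.txt"):
        with open(out_path, "wb") as out:
            for path in paths:
                with open(path, "rb") as f:
                    for line in f:
                        # Plain substring search for every key.
                        if any(key in line for key in keys):
                            # Keep only the part up to the first tab or space.
                            out.write(line.split(None, 1)[0] + b"\n")

    if __name__ == "__main__":
        # The file names are known in advance; these are just placeholders.
        scan(["data1.txt", "data2.txt"], load_keys())

Each file is read exactly once and matching lines are written as they are found; how fast this is on 700-1000 million lines is something only a test on the real data will show.
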
In general, that is all there is to it; the rest is implementing it and trying it out on the real data.
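
For the memory-mapped variant from point 1, here is one possible sketch; whether it actually beats ordinary buffered reading has to be measured, and on a 32-bit interpreter a 5 GB file may simply not fit into the address space. The keys and the output file object are the same as in the previous sketch.

    import mmap

    def scan_mmap(path, keys, out):
        # Map the file read-only and walk it line by line.
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                for line in iter(mm.readline, b""):
                    if any(key in line for key in keys):
                        out.write(line.split(None, 1)[0] + b"\n")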

Does it make sense to parallelize reading if the file is on an SSD?

Also no. The main thing is not to get the block size wrong when reading (for example, use 1048576-byte blocks), and the SSD will deliver its maximum speed with a single read thread (there are measurements in the comments).
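
To illustrate the block-size point, here is a sketch that reads with a single thread in 1048576-byte blocks and carries the incomplete last line over to the next block; the exact block size is an assumption to measure rather than a hard rule.

    BLOCK_SIZE = 1048576  # 1 MiB, as suggested above

    def scan_blocks(path, keys, out):
        tail = b""  # incomplete last line carried over from the previous block
        with open(path, "rb", buffering=0) as f:  # unbuffered: the block size is controlled here
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    break
                block = tail + block
                # Search only complete lines; keep the remainder for the next block.
                cut = block.rfind(b"\n") + 1
                tail = block[cut:]
                for line in block[:cut].splitlines():
                    if any(key in line for key in keys):
                        out.write(line.split(None, 1)[0] + b"\n")
        # A file that does not end with a newline leaves one last line in tail.
        if tail and any(key in tail for key in keys):
            out.write(tail.split(None, 1)[0] + b"\n")

In the scheme from point 3, this reading loop would run in its own thread and hand the blocks over a queue to the processing threads.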
