How to optimize tables/query in MySQL?

Question:

The table of genes of the human genome is small, only 60434 entries:

CREATE TABLE `genes-g38-201505` (
  `chr` varchar(2) NOT NULL,
  `left` bigint(20) NOT NULL,
  `right` int(11) NOT NULL,
  `Complement` int(11) NOT NULL,
  `Name` tinytext NOT NULL,
  `source` tinytext NOT NULL,
  `ENSEMBL` tinytext NOT NULL,
  `gene_version` tinytext NOT NULL,
  `gene_name` tinytext NOT NULL,
  `gene_source` tinytext NOT NULL,
  `gene_biotypeid` tinytext NOT NULL,
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  UNIQUE KEY `id_UNIQUE` (`id`)
) ENGINE=MyISAM;

The repeat table of the human genome is already worse – more than 5 and a half million entries:

CREATE TABLE `repeats-g38-201505` (
  `id` int(11) NOT NULL,
  `chr` varchar(2) DEFAULT NULL,
  `left` int(11) DEFAULT NULL,
  `right` int(11) DEFAULT NULL,
  `name` tinytext,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM;

These are two service tables. Of these, by and large, only chr is important to us – the name of the chromosome, left and right – the left and right coordinates of the whole gene / repeat or its part (there may be several parts, in this case several sets of {chr, left, right correspond to one name }) and name is the name of the gene/repeat.

All data for tables.

Now the data of experiments on the tissues of cancer patients. The table format is:

CREATE TABLE `51k-80-80-ignore-random-noreverse` (
  `chr` varchar(2) NOT NULL,
  `left` bigint(20) NOT NULL,
  `right` bigint(20) NOT NULL,
  `count` int(11) NOT NULL,
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  UNIQUE KEY `chr_left_right` (`chr`,`left`,`right`),
  `id` bigint(20) NOT NULL AUTO_INCREMENT
) ENGINE=MyISAM;

it is the same for every experiment. Each entry describes a DNA pattern belonging to the chr chromosome, with left and right coordinates, count. The number of records is different, from 4 to 7 and a half million per experiment. Each entry is a unique set of coordinates {chr, left, right}

And the final table in which you need to collect data from 4 experiments:

CREATE TABLE `pk47-pk51-gene-repeat` (
  `chr` varchar(2) NOT NULL,
  `left` bigint(20) NOT NULL,
  `right` bigint(20) NOT NULL,
  `count_k51` int(11) DEFAULT '0',
  `count_p51` int(11) DEFAULT '0',
  `count_p47` int(11) DEFAULT '0',
  `count_k47` int(11) DEFAULT '0',
  `name_left` varchar(29) NOT NULL,
  `name_right` varchar(17) NOT NULL,
  UNIQUE KEY `pos_name` (`chr`,`left`,`right`,`name_left`,`name_right`)
) ENGINE=MyISAM;

In fact, everything is simple: you need to find only those patterns that fall on the left side of the gene, and on the right side – on the repeat, count their number and display them in a pivot table. There seemed to be no problems with the query, I repeat such a query 4 times, changing only count_k51 to count_p51 and the source table itself:

INSERT INTO `pk47-pk51-gene-repeat` (
      `chr`,`left`, `right`,`count_k51`,`name_left`, `name_right`
)
SELECT 
     a.`chr`, a.`left`, a.`right`, a.`count` as `count_k51`,
     g.`name` as `name_left`, 
     r.`name` as `name_right` 
FROM `
     51k-80-80-ignore-random-noreverse` a, 
     `genes-g38-201505` g,
     `repeats-g38-201505` r 
WHERE
     a.`chr`=g.`chr` and a.`chr`=r.`chr` and 
     a.`left` < g.`right` and a.`left` > g.`left` and 
     a.`right` < r.`right` and a.`right` > r.`left` 
     on duplicate key 
     update 
       `pk47-pk51-gene-repeat`.`count_k51`=a.`count`;

On the first run, on duplicate key can be omitted. It is probably possible to make a single query instead of 4x, but it will be extremely cumbersome, and for now I decided to try this.

Of course, the request was not executed, fell off by a timeout, which I already increased so much. limit 0,1000; limit 1001,2000 and so on, as I understand it, it is useless to use, since each next stage the server will still go through the previous ones.
I decided to iterate requests by id , adding the restriction 20000*i< a.id <20000*(i+1) to the request, but the situation did not improve, apparently, id must be redefined, or the server should be forced to perform this check first.

As a result, we need ideas on how to optimize the query, rebuild tables or change the approach to solve this problem (not necessarily a pure SQL query, I can work with a database from programming languages). I would also like to thank you for the advice on the physical acceleration of the server: the memory on the machine is 32GB, the server uses little, maybe tweak some variables?

Update 1. Here are the EXPLAIN results for the query:

Initial state:

# id, select_type, table, type, possible_keys, key, key_len, ref, rows, Extra
'1', 'SIMPLE', 'g', 'ALL', NULL, NULL, NULL, NULL, '60433', NULL
'1', 'SIMPLE', 'a', 'ref', 'chr_left_right', 'chr_left_right', '4', 'dna_homo_pairs.g.chr', '47216', 'Using index condition'
'1', 'SIMPLE', 'r', 'ALL', NULL, NULL, NULL, NULL, '5317291', 'Using where; Using join buffer (Block Nested Loop)'

Added indexes (chr, left, right) as suggested by @Mike:

# id, select_type, table, type, possible_keys, key, key_len, ref, rows, Extra
'1', 'SIMPLE', 'a', 'ALL', 'chr_left_right', NULL, NULL, NULL, '4721638', NULL
'1', 'SIMPLE', 'g', 'ref', 'chr_left_right', 'chr_left_right', '4', 'methyl_base.a.chr', '604', 'Using index condition'
'1', 'SIMPLE', 'r', 'ref', 'chr_left_right', 'chr_left_right', '5', 'methyl_base.a.chr', '53172', 'Using index condition'

Update 2 . Make mysqld run in multiple threads.

Took a look at the CPU usage during the query. Since I'm currently working in exclusive mode, I'm the only one accessing the local server. Is it possible somehow to force mysqld to process one query in multiple threads? And then 8 cores / 16 threads at his disposal, but he uses only one.

By the way, spreading different tables into folders on different physical hard drives gives, albeit small, but acceleration of work.

Update 3 At the moment, I programmatically split all the source tables depending on which chromosome and the script (more precisely, a series of scripts that are also called programmatically) now looks like this (let's say the 7th chromosome is being processed):

INSERT INTO `pk47-pk51-gene-repeat` (
      `chr`,`left`, `right`,`count_k51`,`name_left`, `name_right`
)
SELECT 
     "7", a.`left`, a.`right`, a.`count` as `count_k51`,
     g.`name` as `name_left`, 
     r.`name` as `name_right` 
FROM `
     51k-80-80-ignore-random-noreverse-chr7` a, 
     `genes-g38-201505-chr7` g,
     `repeats-g38-201505-chr7` r 
WHERE
     a.`left` < g.`right` and a.`left` > g.`left` and 
     a.`right` < r.`right` and a.`right` > r.`left` 
     on duplicate key 
     update 
       `pk47-pk51-gene-repeat`.`count_k51`=a.`count`;

But I don't see much progress.

Answer:

The usual optimization techniques used by the DBMS usually take 1 record from the first table and search the entire second table for matching records. In the presence of an index, this happens quite quickly, in log2(m) , but still these are calls to a dozen pages of the index … Your query with a regular Join should be completed in O=n*(log2(m)+log2(k)) at best O=n*(log2(m)+log2(k)) (Where n,m,k is the number of records in 3 tables in the query) (although if you look at articles on MySQL optimization, it will generally come out O=n*log2(m)*log2(k) )

On the other hand, your data is quite specific (it's a pity the DBMS is not able to evaluate this). The genes and repeats tables contain non-intersecting intervals, therefore, when sorting by the left field, the records are sorted by right as well. In tables with patterns, the intervals intersect from time to time, as a result, when sorting by left , the right field sometimes "goes back" a little, but there are about 8% of such situations.

It is much easier to compare one sorted list with a sorted list of possible intervals by going through both lists simultaneously, moving the pointer forward in the list that has even fewer entries than the other. Moreover, it is possible to go through more than 2 lists in parallel at the same time. In this case, the complexity of the search turns out to be O=n+m+k .

Unfortunately, due to the fact that the list of patterns cannot be sorted by both left and right at the same time, we sometimes need to roll back the pointer in repeats a little back. In MySQL, when working with cursors, this is impossible, for this reason I decided to remember those records that fell out of the sequence of increasing right and at the same time could fall into the interval of the previous repeats record (there were very few of these 2k out of 438k). This can be avoided by remembering the last few repeats entries, but this would be cumbersome on MySQL. The saved records have to be processed in a second pass, going through repeats sorted by the right field.

Based on the above, I came up with the following stored procedure:

create table miss_rows(`left` int not null, `right` int not null,
                       `count` int not null,`name_left` varchar(100) not null);

drop procedure if exists genjoin;
delimiter $$
create procedure genjoin()
begin

 declare endOfData integer default 0;
 declare done int default 0;
 declare cnt int default 0;

 declare g_left int default 0;  -- Переменные для genes
 declare g_right int default 0;
 declare g_name varchar(100);
 declare gen_eof int default 0;  -- Признак конца genes

 declare r_left int default 0;   -- Переменные для repeats
 declare r_right int default 0;
 declare r_Oright int default 0; -- Значение правого конца предыдущей записи
 declare r_name varchar(100);
 declare rep_eof int default 0;  -- Признак конца repeats

 declare t_left int default 0;   -- Переменные рабочей таблицы
 declare t_right int default 0;
 declare t_Oright int default 0; -- Значение правого конца предыдущей записи
 declare t_count int default 0;

 declare gen_cur cursor for
  select `left`,`right`,`name` from genes order by 1,2;
 declare rep_cur cursor for
  select `left`,`right`,`name` from repeats order by 1,2;
 declare data_cur cursor for
  select `left`,`right`,`count` from t47k order by 1,2;

 declare miss_cur cursor for
  select `left`, `right`, `count`, `name_left` from miss_rows order by 2;
 declare rep2_cur cursor for
  select `left`,`right`,`name` from repeats order by 2;


 DECLARE CONTINUE HANDLER FOR SQLSTATE '02000' SET endOfData=1;

 open gen_cur;
 open rep_cur;
 open data_cur;
 truncate table miss_rows;

 row: while done=0 do
    if t_Oright<t_right then set t_Oright=t_right; end if; -- Сохраняем предыдущее значение
   fetch data_cur into t_left, t_right, t_count;
   if endOfData=1 then set done=1,endOfData=0; end if;

   while t_left >= g_right and gen_eof=0 do
     fetch gen_cur into g_left, g_right, g_name;
     if endOfData=1 then set gen_eof=1,endOfData=0; end if;
   end while;

   while t_right >= r_right and rep_eof=0 do
     if r_Oright<r_right then set r_Oright=r_right; end if; -- Сохраняем предыдущее значение
     fetch rep_cur into r_left, r_right, r_name;
     if endOfData=1 then set rep_eof=1,endOfData=0; end if;
   end while;

   if t_left <= g_left or t_left >= g_right then iterate row; end if;
   if t_Oright > t_right then  -- Концы не по порядку, пересечение интервалов !
     if t_right < r_Oright then -- Мы могли попасть в предыдущую запись repeats !
       -- но пропустили ее ...
       insert into miss_rows values(t_left, t_right, t_count,g_name);
     end if;
   end if;
   if t_right <= r_left or t_right>= r_right then iterate row; end if;

   insert into `pk47-pk51-gene-repeat`
    (`left`, `right`, `count_k47`,`name_left`,`name_right`)
    values(t_left, t_right, t_count, g_name, r_name)
    on duplicate key update `count_k47`=t_count;

 end while;
 close gen_cur;
 close rep_cur;
 close data_cur;
 open rep2_cur;
 open miss_cur;
 set r_left=0,r_right=0,rep_eof=0,done=0;
 while done=0 do
   fetch miss_cur into t_left, t_right, t_count, g_name;
   if endOfData=1 then set done=1,endOfData=0; end if;

   while t_right >= r_right and rep_eof=0 do
     fetch rep2_cur into r_left, r_right, r_name;
     if endOfData=1 then set rep_eof=1,endOfData=0; end if;
   end while;

   if t_right > r_left then
     insert into `pk47-pk51-gene-repeat`
      (`left`, `right`, `count_k47`,`name_left`,`name_right`)
      values(t_left, t_right, t_count, g_name, r_name)
      on duplicate key update `count_k47`=t_count;
   end if;
 end while;
end$$

The execution time of this procedure on a control example of 6k / 444k / 438k records was 45 seconds … True, in the control example there was no chr field, when it is, it must of course also participate in sorting lists and be compared in all conditions. Tables 47, 51, etc. should be processed in separate passes. If it were possible to go back, one could try to process everything in one pass. A similar perl algorithm , taking data from sorted csv files, takes 4 seconds…

Due to the fact that there can still be interval intersections in repeats, everything is a little more complicated, and to get all the options (which are about 200 records more than the original option (in 130k records)) it’s better to use arrays so that you can look a little forward. in perl it looks like this

Scroll to Top