c – find position of multiline text

Question:

There is a large text file, файл1. For example:

a
b
c
d
e

There is another text file, файл2, with several lines. For example:

b
c
d

How can I use POSIX utilities (or at least GNU utilities, or as a last resort C that is as platform-independent as possible) to find the line number in the first file at which the contents of the second file begin? For the given example the answer is 2: starting from its second line, the first file contains exactly the same lines as the second.


So far I have only found a way to check whether the second file is contained in the first:

$ grep -qzP "$(sed ':a;N;$!ba;s/\n/\\n/g' файл2)" файл1 && echo contained

But it does not tell me the line number where the match starts.

A note on the sed program: it replaces each newline in файл2 with the two characters \n (a backslash and n) to build a regular expression for grep. It is borrowed from: How can I replace a newline (\n) using sed?
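To make the effect of that sed program concrete, here is a minimal sketch that runs it on the sample файл2 from the question. It assumes GNU sed (the one-line `:a;N;$!ba` form is not accepted by every sed) and uses a scratch directory; the variable name `pattern` is only for illustration.

```shell
# Recreate the sample pattern file from the question in a scratch directory.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' b c d > файл2

# Join all lines into one, then replace each real newline with the two
# characters backslash and n, so the result can be used as a regex.
pattern=$(sed ':a;N;$!ba;s/\n/\\n/g' файл2)
printf '%s\n' "$pattern"    # prints: b\nc\nd
```

Note that the lines of файл2 are then interpreted as a regular expression, so lines containing regex metacharacters would presumably need extra escaping.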

Answer:

With patch

if line=$(diff -U0 файл2 /dev/null | patch -f --dry-run файл1 - | sed -rn 's/^Hunk #1 succeeded at ([0-9]+) .*/\1/p; /FAILED/ q1')
then echo line ${line:-1}
else echo fragment not found
fi

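We create a patch that deletes every line of the sample file (файл2) and try to apply it to файл1 in test mode (--dry-run). If it can be applied, patch reports the position it found, but only when the offset is non-zero, which is why ${line:-1} falls back to 1. If it cannot be applied, patch reports FAILED and sed exits with status 1.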
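As an illustration of the first half of that pipeline, the sketch below prints the deletion patch that diff generates for the sample files. It only assumes a diff that supports -U0; the scratch directory and the variable name `deletion_patch` are illustrative.

```shell
# Recreate the example files from the question in a scratch directory.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' a b c d e > файл1
printf '%s\n' b c d > файл2

# The patch that deletes every line of файл2. diff exits with status 1
# when its inputs differ, which is expected here, so only that status
# is tolerated.
deletion_patch=$(diff -U0 файл2 /dev/null) || [ $? -eq 1 ]
printf '%s\n' "$deletion_patch"
# After the two file-header lines, the single hunk reads:
#   @@ -1,3 +0,0 @@
#   -b
#   -c
#   -d
```

Feeding this patch to patch -f --dry-run файл1 - is what produces the "Hunk #1 succeeded at ..." line that the sed expression in the answer parses.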

With diff

This was my first attempt; I have left it here just in case. It seems to work, but I don't know how it will behave with very large files:

diff -U0 файл2 файл1 | sed -rn '1!{/^-/q1}; 3{s/@@ -0,0 \+(1,([0-9]+)|(1)).*/\2\3/p}; /^@@ -[1-9]/ {G; h; /@@\n@@/ q1 }'

Note that here lines are numbered from 0.

It may take unnecessarily long on a large file, since diff will output every part of файл1 that is not in файл2.

If there is no match, or only a partial match, it returns status 1, and any text it printed should be ignored. If the files match from the very beginning, nothing is printed and the status is 0.
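A minimal sketch of how the exit status and the zero-based output might be combined into a 1-based line number. The helper name `find_line_diff` and its argument order are hypothetical, and GNU sed is assumed (for -r and q1), as in the command above.

```shell
# find_line_diff HAYSTACK NEEDLE   (hypothetical helper)
# Prints the 1-based line of HAYSTACK where NEEDLE starts,
# or returns 1 if the sed program reports no match.
find_line_diff() {
    # The status of the assignment is the status of sed, the last command
    # in the pipeline; on failure any printed text is discarded.
    offset=$(diff -U0 "$2" "$1" |
        sed -rn '1!{/^-/q1}; 3{s/@@ -0,0 \+(1,([0-9]+)|(1)).*/\2\3/p}; /^@@ -[1-9]/ {G; h; /@@\n@@/ q1 }') || return 1
    # Empty output means the match starts at the very first line.
    echo $(( ${offset:-0} + 1 ))
}

# Try it on the example from the question, in a scratch directory.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' a b c d e > файл1
printf '%s\n' b c d > файл2
find_line_diff файл1 файл2    # prints: 2
```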

Explanation of the program

diff -U0 produces a patch in unified format without context lines.

The first group catches lines starting with a minus, other than the first line. If there are any, diff failed to find some line from файл2, and the program exits with status 1.

The second group catches the header of the fragment that diff considers to have been added before the lines from файл2. The number of lines in that fragment is taken from this header. The header should look like @@ -0,0 +1,n @@, where n is the number of lines in the fragment; when n is 1, it is omitted.

If файл1 contains all the lines from файл2 but with other lines in between them, diff will output more than two headers starting with @@ and no lines starting with a minus; the last group of commands detects this case.
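To see the headers that the explanation above refers to, this sketch prints only the @@ lines that diff produces for the example from the question. It assumes a diff that supports -U0; the scratch directory and the variable name `headers` are illustrative.

```shell
# Recreate the example files from the question in a scratch directory.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' a b c d e > файл1
printf '%s\n' b c d > файл2

# Keep only the hunk headers of the unified diff.
headers=$(diff -U0 файл2 файл1 | grep '^@@')
printf '%s\n' "$headers"
# @@ -0,0 +1 @@    one line (a) added before the lines of файл2, so n is omitted
# @@ -3,0 +5 @@    one line (e) added after them
```

The first header is the one the second sed group turns into the number 1 (one line precedes the match); the second is the single trailing header that the last group tolerates.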
