Question:
There is a large text file, файл1. For example:
a
b
c
d
e
There is another text file, файл2, with several lines. For example:
b
c
d
How can I use POSIX utilities (or at least GNU utilities, or, as a last resort, the most platform-independent C possible) to find the line number in the first file from which the two files match? For the given example the answer is 2 (starting from its second line, the first file contains exactly the same lines as the second).
So far I have only found a way to check whether the second file is contained in the first:
$ grep -qzP "$(sed ':a;N;$!ba;s/\n/\\n/g' файл2)" файл1 && echo included
But it does not tell me the line number where the match starts.
A note on the sed program: it replaces each newline in файл2 with the two characters \n (backslash and n) to build a regular expression for grep. Borrowed from: How can I replace a newline (\n) using sed?
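To make the effect of that sed program concrete, here is a small sketch. It assumes GNU sed, as in the command above, and feeds the lines of the example файл2 through printf instead of reading a real file.

```shell
# Join all input lines into one, with a literal "\n" between them.
#   :a          define label a
#   N           append the next input line to the pattern space
#   $!ba        if this is not the last line, jump back to a
#   s/\n/\\n/g  replace every real newline with backslash + n
joined=$(printf 'b\nc\nd\n' | sed ':a;N;$!ba;s/\n/\\n/g')
printf '%s\n' "$joined"   # prints: b\nc\nd
```

Because the result is used as a regular expression, any regex metacharacters in файл2 are interpreted by grep rather than matched literally.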
Answer:
With patch
if line=$(diff -U0 файл2 /dev/null | patch -f --dry-run файл1 - | sed -rn 's/^Hunk #1 succeeded at ([0-9]+) .*/\1/p; /FAILED/ q1')
then echo line ${line:-1}
else echo fragment not found
fi
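We create a patch that removes all lines of the sample file (файл2) and try to apply it to the other file (файл1) in test mode (--dry-run). If that succeeds, patch reports the position it found when the offset is non-zero; with a zero offset it prints no such line, which is why the script falls back to 1 via ${line:-1}. If it fails, patch reports an error.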
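To see what is actually fed to patch, the generated patch can be printed before the dry run. This is only a sketch on the example from the question: the scratch files haystack and needle are hypothetical stand-ins for файл1 and файл2, and GNU diff and GNU patch are assumed.

```shell
# Hypothetical scratch files standing in for файл1 and файл2.
dir=$(mktemp -d)
printf '%s\n' a b c d e > "$dir/haystack"
printf '%s\n' b c d     > "$dir/needle"

# The patch deletes every line of the needle; -U0 means no context lines.
# diff exits with status 1 here because the inputs differ, which is expected.
patch_text=$(diff -U0 "$dir/needle" /dev/null)
printf '%s\n' "$patch_text"

# Dry run: nothing is modified, patch only reports where the hunk would apply.
printf '%s\n' "$patch_text" | patch -f --dry-run "$dir/haystack" -

rm -r "$dir"
```

The printed patch has the single hunk header @@ -1,3 +0,0 @@ followed by -b, -c and -d. For this input the report from patch should contain a line of the form "Hunk #1 succeeded at 2 ...", which is the line the sed expression above extracts the number from.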
With diff.
This was my first attempt; I am leaving it here just in case. It seems to work, but I don't know how it will behave with very large files:
diff -U0 файл2 файл1 | sed -rn '1!{/^-/q1}; 3{s/@@ -0,0 \+(1,([0-9]+)|(1)).*/\2\3/p}; /^@@ -[1-9]/ {G; h; /@@\n@@/ q1 }'
Note that this variant numbers lines from 0.
It may take unnecessarily long on a large file, since diff will output every part of файл1 that is not in файл2.
If there is no match, or only a partial match, it returns status 1 and any text it printed should be ignored. If the files match from the very beginning, nothing is output and the status is 0.
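The status and output rules above can be folded into a small wrapper that converts the result to a 1-based line number. This is a sketch only: find_line is a hypothetical helper name, the scratch files are made up for the demonstration, GNU sed is assumed (for -r and q1), and the wrapper inherits whatever limitations the one-liner itself has.

```shell
# find_line NEEDLE HAYSTACK (hypothetical helper name)
# Prints the 1-based line of HAYSTACK at which NEEDLE starts, or returns 1.
find_line() {
    off=$(diff -U0 "$1" "$2" |
        sed -rn '1!{/^-/q1}; 3{s/@@ -0,0 \+(1,([0-9]+)|(1)).*/\2\3/p}; /^@@ -[1-9]/ {G; h; /@@\n@@/ q1 }') || return 1
    # Empty output means the files match from the very first line (offset 0).
    echo $(( ${off:-0} + 1 ))
}

# Hypothetical scratch files standing in for файл1 and файл2.
dir=$(mktemp -d)
printf '%s\n' a b c d e > "$dir/haystack"
printf '%s\n' b c d     > "$dir/needle"

if line=$(find_line "$dir/needle" "$dir/haystack"); then
    echo "line $line"
else
    echo "fragment not found"
fi

rm -r "$dir"
```

Checking the status before using the output matters here, because on a partial match sed may already have printed an offset before it exits with status 1.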
Explanation of the program.
diff -U0 produces a patch in unified format without context lines.
The first group catches lines starting with a minus, except for the first line. If such lines are present, diff failed to find some line of файл2, and the program exits with status 1.
The second group catches the start of the fragment that diff considers to have been added before the lines of файл2. The number of lines in this fragment is taken from its header, which should look like @@ -0,0 +1,n @@, where n is the number of lines in the fragment; when n equals 1 it is omitted.
If файл1 contains all the lines of файл2 but with extra lines in between, diff will return more than two lines starting with @@ and no lines starting with a minus; the last group of commands keeps track of this.
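For the example from the question, the hunks that the sed program has to deal with can be inspected directly. A minimal sketch, assuming GNU diff; the scratch files haystack and needle are hypothetical stand-ins for файл1 and файл2:

```shell
# Hypothetical scratch files standing in for файл1 and файл2.
dir=$(mktemp -d)
printf '%s\n' a b c d e > "$dir/haystack"
printf '%s\n' b c d     > "$dir/needle"

# Skip the two "---"/"+++" header lines, which carry paths and timestamps.
hunks=$(diff -U0 "$dir/needle" "$dir/haystack" | tail -n +3)
printf '%s\n' "$hunks"
# Prints:
# @@ -0,0 +1 @@
# +a
# @@ -3,0 +5 @@
# +e

rm -r "$dir"
```

The first header, @@ -0,0 +1 @@, is the one the second group reads the offset from (n is 1 here, so it is omitted), and @@ -3,0 +5 @@ is the single trailing hunk that the last group tolerates.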