Question:
I need to extract data from a text and am trying to do this with grep. But the way this command handles regular expressions is quite different from what I'm used to in Ruby or JavaScript, and I can't work out what I need to do. In the following text:
Judicial Record of the Regional Labor Court of the 1st Region
ELECTRONIC JOURNAL OF LABOR JUSTICE JUDICIAL POWER
Nº1697/2015
FEDERATIVE REPUBLIC OF BRAZIL
Release date: Wednesday, April 1, 2015.
Regional Labor Court of the 1st Region
I only need to get the number that can be seen on the third line. This number will later be used to make a request to a web service. I tried grep as follows:
pdftotext Diario_1697_1_1_4_2015.pdf -f 1 -l 1 - | grep -o /Nº(\d+\/\d+)/
I take the first page of a pdf file, convert it into txt and pass it to the grep command to extract the information. But that doesn't work at all. Does anyone know how to correctly do this with grep or some other bash command?
Answer:
First, grep is a shell command and its arguments are plain strings like any others. Instead of enclosing the regex in /, you should use single quotes (or double quotes, if you are careful with shell variable expansion). Inside quotes the shell passes a backslash sequence such as \d through unchanged, so you should not double it; you would only need \\ if the pattern were left unquoted.
Second, grep's default regex syntax (POSIX basic) is a little different and quite limited. For example, it doesn't treat + as a quantifier, only *, and it has no \d shorthand. You can switch to Perl syntax with the -P flag, where your grep supports it (GNU grep does):
grep -P -o 'Nº\d+/\d+'
or use the POSIX extended syntax with grep -E (or the older egrep, which recent GNU grep versions mark as obsolescent):
grep -E -o 'Nº[[:digit:]]+/[[:digit:]]+'
grep -E -o 'Nº[0-9]+/[0-9]+'
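Note that grep -o prints the whole match, including the "Nº" prefix. If only the number itself is needed for the web service request, one possible approach is to strip the prefix afterwards. The sketch below is just an illustration under a few assumptions: the function name extract_issue_number and the sample variable are made up for the example, and the sample text stands in for the real pdftotext output.

```shell
# Hypothetical helper: reads text on stdin and prints the first
# issue number it finds (e.g. 1697/2015), without the "Nº" prefix.
extract_issue_number() {
  grep -E -o 'Nº[0-9]+/[0-9]+' |   # keep only the matching part of each line
    head -n 1 |                    # first match only
    sed 's/^[^0-9]*//'             # drop everything before the first digit
}

# Sample input standing in for the pdftotext output.
sample='ELECTRONIC JOURNAL OF LABOR JUSTICE JUDICIAL POWER
Nº1697/2015
FEDERATIVE REPUBLIC OF BRAZIL'

number=$(printf '%s\n' "$sample" | extract_issue_number)
echo "$number"
```

With the real file, the same function would go at the end of the pipeline, for example: pdftotext -f 1 -l 1 Diario_1697_1_1_4_2015.pdf - | extract_issue_number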