Question:
I have the following string:
fo = "b---00b<do:YYYY>tftt_<fd>-<fd><ct><ct:MM>mmm.pdf"
And I just want to get mmm.pdf.
When I try:
match = re.search(r'(>.*?\.pdf)', fo)
for g in match.groups():
    print(g)
I get:
>tftt_<fd>-<fd><ct><ct:MM>mmm.pdf
I thought the symbol ? would make the search stop at >, but the pattern (>.*\.pdf) gives me the same result. What is the correct regular expression to get mmm.pdf?
mmm.pdf can be abcs.pdf, qwerty123.pdf, etc. And fo always has the format:
fo = "someOptionalstring<otherstring>anotherOptionalString<string>optionalstring<string>mmm.pdf"
The plain strings (which can be empty) and the <strings> (never empty) can alternate any number of times. I was able to find regular expressions to extract the strings between <>, but not the string I want at the end.
I could use an algorithm using endswith() and looking for the last > character, but I want to try using regular expressions for learning purposes.
Edit: For those who are learning: I forgot to mention that you have to import the re module.
Answer:
*? is simply a "non-greedy" quantifier. With the expression >.*?\.pdf what you are actually saying is something like this:
"Search for a substring beginning with the character >, followed by the fewest characters possible, and ending with .pdf"
If you delete the ?, the quantifier becomes greedy:
"Search for a substring beginning with the character >, followed by as many characters as possible, and ending with .pdf"
It might be clearer with an example:
>>> import re
>>> fo = ">aaa.pdf>mmm.pdf"
>>> re.search(r"(>.*?\.pdf)", fo).group()
'>aaa.pdf'
>>> re.search(r"(>.*\.pdf)", fo).group()
'>aaa.pdf>mmm.pdf'
In your case there is no difference because the only possible match is >tftt_<fd>-<fd><ct><ct:MM>mmm.pdf in both cases, from the first occurrence of > until it finds .pdf, since there is only one .pdf substring in the string.
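To check this on the string from the question, here is a minimal sketch that runs both patterns and compares the results. The variable names lazy and greedy are only illustrative:

```python
import re

fo = "b---00b<do:YYYY>tftt_<fd>-<fd><ct><ct:MM>mmm.pdf"

# Both searches start at the first '>' because re scans left to right.
lazy = re.search(r">.*?\.pdf", fo).group()
greedy = re.search(r">.*\.pdf", fo).group()

print(lazy)
print(greedy)
# With a single ".pdf" in the string, both quantifiers end at the same place.
print(lazy == greedy)
```

The quantifier only decides where the match ends. Where it starts is fixed by the left-to-right scan, which is why neither version skips ahead to the last >.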
One possibility is to simply use the expression [^>]+\.pdf$:
- [^>]+ -> One or more characters other than >
- \.pdf -> The literal text .pdf (the dot is escaped, since an unescaped . would match any character)
- $ -> Indicates the end of the string. This means that the match can only be at the end; for example, <mmm.pdf>foo would not be a match.
import re
# The dot is escaped so it only matches a literal "." before "pdf"
regex = re.compile(r"[^>]+\.pdf$")
fo = "b---00b<do:YYYY>tftt_<fd>-<fd><ct><ct:MM>mmm.pdf"
match = regex.search(fo)
print(match.group())
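As a rough sketch of how this pattern behaves on other inputs, the snippet below wraps it in a small helper and tries a few strings. The names TAIL_PDF and last_pdf_name and the sample strings are hypothetical, chosen only for illustration. The pattern is written with the dot escaped (\.pdf) so that it matches a literal dot. The helper returns None instead of raising when nothing matches:

```python
import re

# Hypothetical name: a run of non-'>' characters ending in ".pdf" at the end of the string.
TAIL_PDF = re.compile(r"[^>]+\.pdf$")

def last_pdf_name(text):
    """Return the trailing run of non-'>' characters ending in '.pdf', or None."""
    match = TAIL_PDF.search(text)
    return match.group() if match else None

samples = [
    "b---00b<do:YYYY>tftt_<fd>-<fd><ct><ct:MM>mmm.pdf",
    "a<b>qwerty123.pdf",
    "plain.pdf",        # no <...> parts at all
    "<mmm.pdf>foo",     # ".pdf" is not at the end
    "a<b>mmmxpdf",      # no literal dot before "pdf"
]

for s in samples:
    print(repr(s), "->", last_pdf_name(s))
```

Checking for None before calling group() avoids an AttributeError when a string does not follow the expected format.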
Note that the search is done from left to right, so the original pattern will always try to match starting from the first occurrence of >. The idea behind the original expression could work if the search ran backwards; for example, we can install the third-party regex package and use its reverse-search flag (?r):
>>> import regex
>>> regex_exp = regex.compile(r"(?r)>(.*?\.pdf)")
>>> fo = "b---00b<do:YYYY>tftt_<fd>-<fd><ct><ct:MM>mmm.pdf"
>>> regex_exp.search(fo).groups()
('mmm.pdf',)
The regex package's syntax is very similar to the one used by re, but it implements some features not present in it; for example, it can release the GIL while matching, so other Python threads can run concurrently.
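If installing a third-party package is not an option, a similar effect can be approximated with the standard re module by putting a greedy .* in front of the >. This is only a sketch, and it assumes the string contains at least one >. The helper name pdf_after_last_gt is hypothetical:

```python
import re

def pdf_after_last_gt(text):
    """Return what follows the last '>' if it ends in '.pdf', else None.

    The leading greedy .* consumes as much as it can, so the '>' that
    follows it in the pattern ends up being the last one in the string.
    """
    match = re.search(r".*>(.*\.pdf)$", text)
    return match.group(1) if match else None

print(pdf_after_last_gt("b---00b<do:YYYY>tftt_<fd>-<fd><ct><ct:MM>mmm.pdf"))
print(pdf_after_last_gt(">aaa.pdf>mmm.pdf"))
print(pdf_after_last_gt("plain.pdf"))  # no '>' in the string, so no match
```

Here the greediness that caused the original confusion does the useful work: it pushes the > as far right as possible, and the capture group keeps only what follows it.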