Matching a regular expression at the end of a string in Python 2.7.13


I have the following string:

fo = "b---00b<do:YYYY>tftt_<fd>-<fd><ct><ct:MM>mmm.pdf"

I just want to get mmm.pdf.

When I try:

match = re.search('(>.*?\.pdf)', fo)

for g in match.groups():
    print g

I get:

>tftt_<fd>-<fd><ct><ct:MM>mmm.pdf
I thought the ? symbol would make the search stop at >, but the pattern (>.*\.pdf) gives me the same result. What is the correct regular expression to get mmm.pdf?

mmm.pdf can be abcs.pdf, qwerty123.pdf, etc. And fo always has the format:

fo = "someOptionalstring<otherstring>anotherOptionalString<string>optionalstring<string>mmm.pdf"

The alternation between plain strings (which can be empty) and <strings> (which are never empty) can repeat any number of times. I was able to find regular expressions to extract the strings between <>, but not the string I want at the end.

I could use an algorithm using endswith() and looking for the last > character, but I want to try using regular expressions for learning purposes.

Edit: For those who are learning: I forgot to mention that you have to import the re module.


*? is simply a "non-greedy" quantifier. With the expression >.*?\.pdf, what you are actually saying is something like this:

"Search for a substring beginning with the character >, followed by the fewest characters possible, and ending with .pdf"

If you remove the ?, the quantifier becomes greedy:

"Search for a substring beginning with the character >, followed by as many characters as possible, and ending with .pdf"

It might be clearer with an example:

>>> import re
>>> fo = ">aaa.pdf>mmm.pdf"
>>> re.search(r"(>.*?\.pdf)", fo).group()
'>aaa.pdf'
>>> re.search(r"(>.*\.pdf)", fo).group()
'>aaa.pdf>mmm.pdf'

In your case there is no difference: with both patterns the only possible match is >tftt_<fd>-<fd><ct><ct:MM>mmm.pdf, which runs from the first occurrence of > to .pdf, because the string contains only one .pdf substring.
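To check this on the string from the question, a minimal sketch can run both patterns and compare the results (the variable names lazy and greedy are only illustrative):

```python
import re

fo = "b---00b<do:YYYY>tftt_<fd>-<fd><ct><ct:MM>mmm.pdf"

# Both searches start at the first ">" in the string, because re scans left to right.
lazy = re.search(r"(>.*?\.pdf)", fo).group()
greedy = re.search(r"(>.*\.pdf)", fo).group()

print(lazy)
print(greedy)
print(lazy == greedy)  # the string has a single ".pdf", so both matches end there
```

The quantifier only decides where the match ends, not where it starts, which is why neither version skips ahead to the last >.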

One possibility is to simply use the expression [^>]+\.pdf$:

  • [^>]+ -> One or more characters other than >
  • \.pdf -> The literal text .pdf (the dot is escaped, because an unescaped . matches any character)
  • $ -> Indicates the end of the string. This means that the match can only be at the end; for example, <mmm.pdf>foo would not be a match.

import re

regex = re.compile(r"[^>]+\.pdf$")

fo = "b---00b<do:YYYY>tftt_<fd>-<fd><ct><ct:MM>mmm.pdf"

match = regex.search(fo)
if match is not None:
    print(match.group())
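As a rough sketch of how this pattern behaves on other inputs, the search can be wrapped in a small helper. The name trailing_pdf_name and the extra sample strings are assumptions made for illustration, and the dot is escaped here so that only a literal .pdf suffix matches:

```python
import re

# Hypothetical helper; the pattern is the one discussed above, with the dot escaped.
TRAILING_PDF = re.compile(r"[^>]+\.pdf$")

def trailing_pdf_name(text):
    """Return the trailing run of non-'>' characters ending in '.pdf', or None."""
    match = TRAILING_PDF.search(text)
    return match.group() if match else None

print(trailing_pdf_name("b---00b<do:YYYY>tftt_<fd>-<fd><ct><ct:MM>mmm.pdf"))
print(trailing_pdf_name("<a>qwerty123.pdf"))
print(trailing_pdf_name("<mmm.pdf>foo"))  # .pdf is not at the end of the string
```

Returning None instead of calling group() directly avoids an AttributeError when the string does not end in .pdf.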

Note that the search runs from left to right, so it always tries to match the pattern starting from the first occurrence of >. The idea behind the original expression could work if the search ran backwards. For example, we can install the third-party regex package and use its reverse-search flag (?r):

>>> import regex
>>> regex_exp = regex.compile(r"(?r)>(.*?\.pdf)")
>>> fo = "b---00b<do:YYYY>tftt_<fd>-<fd><ct><ct:MM>mmm.pdf"
>>> regex_exp.search(fo).group(1)
'mmm.pdf'

Its syntax is very similar to that of re, but it implements some features that re lacks; for example, it can release the GIL while matching, which allows real multithreading.
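If installing an extra package is not an option, a similar effect can be approximated with the standard re module. This is a different technique from reverse search: a greedy .* prefix consumes as much as possible, so the > that follows it ends up being the last > in the string. The sketch below assumes the string contains at least one > and ends in .pdf:

```python
import re

fo = "b---00b<do:YYYY>tftt_<fd>-<fd><ct><ct:MM>mmm.pdf"

# The greedy ".*" backtracks from the end of the string, so the ">" after it
# is the last ">" in the string; the group then captures what follows it.
match = re.search(r".*>(.*\.pdf)$", fo)
if match is not None:
    print(match.group(1))
```

Unlike [^>]+\.pdf$, this version finds nothing when the string has no > at all, so which one fits depends on whether the leading <...> parts are guaranteed to be present.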
