How to write a regular expression that wraps the substrings matching a pattern in labels, in Python?


While working through regular expression exercises I got stuck on the following one: I have to write a function that, given a string and a second string (which acts as the pattern), wraps every matching substring in two "labels".


Parameter 1:
“alsikjuyZB8we4 aBBe8XAZ piarBq8 Bq84Z ”

Parameter 2:
“XYZAB”

Substrings formed only by characters contained in the second parameter:

“ZB”, “BB”, “XAZ”, "B", "B", “Z”

The result is:

“alsikjuy[target]ZB[endtarget]8we4 a[target]BB[endtarget]e8[target]XAZ[endtarget] piar[target]B[endtarget]q8 [target]B[endtarget]q84[target]Z[endtarget] ”

My program is the following:

import re

def tagger(texto, patron):
    pattern = r"[patron]+"
    texto2 = re.sub(pattern, "[target]"+pattern+"[endtarget]", texto)
    return texto2

I am aware that the function is wrong and I suppose it is because of the substitution, but I don't know how to take those small parts of the pattern and concatenate these "tags" to them. I would appreciate any kind of help or explanation.


You were very close, but there are two major misconceptions in your code:

  • When you write pattern = r"[patron]+" you are not inserting the value received in the patron parameter, but literally the letters of the word patron, so the regular expression matches one or more repetitions of those letters (p, a, t, r, o, n).

    To insert the parameter's value you can use an f-string, pattern = f"[{patron}]+", or the .format() method, or the string interpolation operator %.

  • In the substitution you want to put [target] / [endtarget] around the text that was found, but you are putting them around the string stored in pattern. That string is fixed: it is the search pattern, not the result of the search.

    To do what you want, the second argument of re.sub() must contain a special reference that stands for the text matched by the regular expression. That reference is \g<0>. Alternatively, you could add a capturing group to the pattern (wrap everything in parentheses) and use \1 to refer to what that group captured.
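The first point can be checked with a small sketch. The sample string here is made up for illustration; it only shows which characters each version of the character class ends up containing.

```python
import re

patron = "XYZAB"
texto = "patron aBBe8XAZ"  # illustrative sample, not from the exercise

# Without interpolation the class contains the letters p, a, t, r, o, n.
literal = r"[patron]+"
# With an f-string the class contains the characters stored in `patron`.
interpolated = f"[{patron}]+"

print(re.findall(literal, texto))       # ['patron', 'a']
print(re.findall(interpolated, texto))  # ['BB', 'XAZ']
```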

In short, with these changes your code would look like this:

import re
def tagger(texto, patron):
    pattern = f"[{patron}]+"
    texto2 = re.sub(pattern, r"[target]\g<0>[endtarget]", texto)
    return texto2


>>> tagger("alsikjuyZB8we4 aBBe8XAZ piarBq8 Bq84Z ", "XYZAB")
'alsikjuy[target]ZB[endtarget]8we4 a[target]BB[endtarget]e8[target]XAZ[endtarget] piar[target]B[endtarget]q8 [target]B[endtarget]q84[target]Z[endtarget] '
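One caveat that the exercise does not mention: if patron could contain characters that are special inside a character class (such as ], ^, - or \), interpolating it directly may change the meaning of the class or make the pattern invalid. A possible hardening, assuming that situation can arise, is to pass the value through re.escape() first. The function name tagger_safe is hypothetical and used only for this sketch.

```python
import re

def tagger_safe(texto, patron):
    # Hypothetical variant of tagger(); the name is illustrative only.
    if not patron:
        # An empty character class "[]" is not a valid regex, so do nothing.
        return texto
    # re.escape() backslash-escapes characters such as ], ^, - and \.
    pattern = f"[{re.escape(patron)}]+"
    return re.sub(pattern, r"[target]\g<0>[endtarget]", texto)

print(tagger_safe("a-b^c]d", "-^]"))
# a[target]-[endtarget]b[target]^[endtarget]c[target]][endtarget]d
```

For a plain alphabetic pattern such as "XYZAB" this behaves the same as the version without escaping.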


The other option is to use capturing groups: parentheses inside the pattern mark which part of the match you want to capture so that it can be reused later in the replacement string. In this particular example it adds little, because the capture is the entire match. It could still be done like this:

def tagger(texto, patron):
    pattern = f"([{patron}]+)"
    texto2 = re.sub(pattern, r"[target]\1[endtarget]", texto)
    return texto2

The pattern has a ( at the beginning and a ) at the end, so all of it is a capturing group. It is the only one in this case; in more general cases the pattern could contain more parentheses, even nested ones, and each pair would be an additional group.

In the replacement string, \1 refers to the first captured group (\2 to the second, if there is one, and so on).
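As a small illustration of numbered references with more than one group, the sketch below captures a run of pattern characters and the digit that follows it. The sample string and the <…|…> output format are made up for the example.

```python
import re

texto = "aBBe8XAZ ZB8we4"  # illustrative sample

# Group 1: a run of pattern characters; group 2: the digit right after it.
pattern = r"([XYZAB]+)(\d)"

# \1 and \2 refer to the groups by position in the pattern.
result = re.sub(pattern, r"<\1|\2>", texto)
print(result)  # aBBe8XAZ <ZB|8>we4
```

Only "ZB8" matches here, because it is the only run that is immediately followed by a digit.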

Here is an example of how this can be useful. Imagine that we want to find and tag the same substrings, but only when they are followed by an 8. The pattern to look for would be [XYZAB]+8, but the 8 should stay outside the tags, so the capturing group leaves it out. Because the 8 is still part of the match, it has to be written back in the replacement string, otherwise it would be dropped from the result. So it would be like this:

    pattern = "([XYZAB]+)8"
    texto2 = re.sub(pattern, r"[target]\1[endtarget]8", texto)
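A runnable sketch of the trailing-8 case follows. Note that the 8 is part of the match, so it only survives if the replacement string writes it back; a lookahead, (?=8), is one alternative that keeps the 8 out of the match altogether. Both function names are hypothetical.

```python
import re

def tag_before_8(texto):
    # The 8 is matched but not captured, so the replacement re-adds it.
    return re.sub(r"([XYZAB]+)8", r"[target]\1[endtarget]8", texto)

def tag_before_8_lookahead(texto):
    # (?=8) requires an 8 after the run without consuming it.
    return re.sub(r"[XYZAB]+(?=8)", r"[target]\g<0>[endtarget]", texto)

texto = "alsikjuyZB8we4 aBBe8XAZ piarBq8 Bq84Z "
print(tag_before_8(texto))
print(tag_before_8_lookahead(texto))
# Both print: alsikjuy[target]ZB[endtarget]8we4 aBBe8XAZ piarBq8 Bq84Z
```

In the sample string only "ZB" is directly followed by an 8, so it is the only substring that gets tagged.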