regular-expressions – How to split a string by regex in Python

Question:

Good day! Please tell me how to create a regular expression and split a string like this:

Ivanov Ivan Ivanovich 02.12.1942 675195, Moscow, st. Ivanovs, house 15, 4512 125345, issued by the Order of the Ivanovs on 11.11.2011.

I need to split it into these columns: name, date, address, passport.

I tried like this:

pattern = "[А-Я]*[0-9]."
df1 = df1.Name.str.split(pattern, expand=True)

but the result comes out wrong. I'm sure there is a much better option.

Answer:

Since the text has no consistent delimiter between the fields, splitting will not work well; a regular expression with named capture groups is the practical option here. Example:

rx = r'^(?P<Name>.*?)\s+(?P<Date>\d{2}\.\d{2}\.\d{4})\s+(?P<Address>\d+,\s*.*?)\s+(?P<Passport>\d{4}\s\d{6}.*)$'

See demo at regex101.com

  • ^ – start of line
  • (?P<Name>.*?) – any 0+ characters other than line breaks (as few as possible)
  • \s+ – 1+ whitespace characters
  • (?P<Date>\d{2}\.\d{2}\.\d{4}) – 2 digits, dot, 2 digits, dot, 4 digits
  • \s+ – 1+ whitespace characters
  • (?P<Address>\d+,\s*.*?) – 1+ digits, a comma, 0+ whitespace characters, then any 0+ characters (as few as possible)
  • \s+ – 1+ whitespace characters
  • (?P<Passport>\d{4}\s\d{6}.*) – 4 digits, a whitespace character, 6 digits and any 0+ characters (as many as possible, to the end of the line)
  • $ – end of line
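
To see what each group captures, the pattern can be tried with the standard re module before moving to pandas. This is only a sketch: the sample text is the string from the question, assuming the birth date is written with dots (02.12.1942), which is what the Date group requires. The variable names text, match and fields are illustrative.

```python
import re

# Pattern from the answer: four named groups anchored to the whole line.
rx = (r'^(?P<Name>.*?)\s+(?P<Date>\d{2}\.\d{2}\.\d{4})\s+'
      r'(?P<Address>\d+,\s*.*?)\s+(?P<Passport>\d{4}\s\d{6}.*)$')

# Assumed sample: the question's string with the date written with dots.
text = ("Ivanov Ivan Ivanovich 02.12.1942 675195, Moscow, st. Ivanovs, "
        "house 15, 4512 125345, issued by the Order of the Ivanovs on 11.11.2011.")

match = re.search(rx, text)

# groupdict() maps each group name to the text it captured.
fields = match.groupdict() if match else {}
for key, value in fields.items():
    print(f"{key}: {value}")
```

Note that the Address group keeps the trailing comma after "house 15", because the lazy .*? stops only at the whitespace that precedes the passport number.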

For this regex to work in pandas, use it with str.extract, which returns one column per capture group:

df1 = df1.Name.str.extract(rx, expand=True)
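
A minimal end-to-end sketch of the same call, for checking the result on a small frame. The data here is hypothetical: a two-row DataFrame whose Name column mirrors df1.Name from the question, with the sample date assumed to use dots. The result is stored in a separate variable, parts, so the original frame is kept.

```python
import pandas as pd

# Pattern from the answer; the group names become the column names.
rx = (r'^(?P<Name>.*?)\s+(?P<Date>\d{2}\.\d{2}\.\d{4})\s+'
      r'(?P<Address>\d+,\s*.*?)\s+(?P<Passport>\d{4}\s\d{6}.*)$')

# Hypothetical frame: one row in the expected layout, one that does not match.
df1 = pd.DataFrame({"Name": [
    "Ivanov Ivan Ivanovich 02.12.1942 675195, Moscow, st. Ivanovs, house 15, "
    "4512 125345, issued by the Order of the Ivanovs on 11.11.2011.",
    "free text without the expected fields",
]})

# One column per named group; rows that do not match are filled with NaN.
parts = df1["Name"].str.extract(rx, expand=True)

print(parts.columns.tolist())
print(parts.loc[0, "Date"])
```

Rows that come back entirely as NaN are a quick way to find records that do not follow the expected layout.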