c# – Extract part of a string between two specified points

Question:

I recently needed to extract everything between two specified points in a string, in this case everything inside the "()" parentheses.

What would be the best or most appropriate way to do this?

string cadena = string.Empty, resultado = string.Empty;

I have an email with a predefined format, in which only the values between () change.

Example of cadena:

Hello, friend X, ..........

bla bla bla bla
.......
('A','B','valorX','valorY',N...) // what I want to extract.
.......
more text...
....

Goodbye, sincerely, Pedro...

Looking for different ways to do it, I found that each of the approaches below solves it:

1- Using Split :

resultado = cadena.Split('(', ')')[1];

or

resultado = cadena.Split("()".ToCharArray())[1]; 

2- With regular expressions, using Regex.Match:

resultado = Regex.Match(cadena, @"\(([^)]*)\)").Groups[1].Value;

3- With Substring applying a bit of math:

int posInicial = cadena.LastIndexOf("(") + 1;
int longitud = cadena.IndexOf(")") - posInicial;

resultado = cadena.Substring(posInicial, longitud);

Each of those ways of doing it yields the same result:

#resultado 'A','B','valorX','valorY',N...

Honestly, I find it hard to understand how regular expressions work. I always see them as a bunch of indecipherable hieroglyphics…

So: what would be the best or most appropriate way to do this?

Answer:

Just do a complexity analysis.

The most efficient algorithm in terms of memory and speed would be the fourth (counting the two Split variants separately, this is the Substring approach). Basically, you have to look at the running time and memory consumption of each algorithm.

In the first algorithm:

cadena.Split('(', ')')[1];

The string is scanned in linear time for the separator characters passed to Split, and for each one the scan runs up to N, where N is the length of the string. It then has to walk the string and create M temporary substrings, one per piece, and build an array of them, which is then indexed in O(1) constant time.

As a result you will get O((N * M) + 1) where N is the length of the string and M the number of substrings generated in each Split operation.

The second algorithm:

cadena.Split("()".ToCharArray())[1];

It is basically the same procedure as the first algorithm, except that it consumes more memory, because it has to create the temporary string "()", iterate over it, and build an array of characters from it.
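To make the temporary substrings visible, here is a minimal sketch that exposes the array Split builds before it is indexed. The class and method names (SplitDemo, Pieces, BetweenParentheses) are hypothetical and only for illustration, and the length check is an addition that is not in the question's code.

```csharp
using System;

public static class SplitDemo
{
    // Returns every piece that Split produces. Each piece is a newly
    // allocated string, and the array holding them is allocated as well.
    public static string[] Pieces(string cadena)
    {
        return cadena.Split('(', ')');
    }

    // For a string with exactly one "(...)" pair, index 1 is the text
    // inside the parentheses. The length check (an addition) avoids an
    // IndexOutOfRangeException when the string contains no separator.
    public static string BetweenParentheses(string cadena)
    {
        string[] pieces = Pieces(cadena);
        return pieces.Length > 1 ? pieces[1] : string.Empty;
    }
}
```

Calling SplitDemo.Pieces on a string with one pair of parentheses yields three pieces (before, inside, after), of which only the middle one is actually needed.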

The third algorithm:

Regex.Match(cadena, @"\(([^)]*)\)").Groups[1].Value;

It is a double-edged sword. Its cost depends on the length and complexity of the pattern. It should only be used when the rule is somewhat complex: validating emails, addresses, number formats, mentions and hashtags, etc. For example, without Regex, finding mentions or hashtags in a string would require writing a very large algorithm and an interval tree to obtain the indices where each mention or hashtag occurs, and on very large strings you would spend a lot of memory collecting all of those substrings. Regular expressions should be used as a validator for complex strings, since they save you from writing that kind of algorithm. In this case, though, it is clearly the option with the greatest complexity and memory consumption.
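Since the pattern itself can look cryptic, here is a short sketch that breaks the expression from the question into its parts. The RegexDemo class and the cached static field are assumptions made for illustration; the pattern is the one from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexDemo
{
    // The pattern from the question, part by part:
    //   \(      a literal "("
    //   (       start of capture group 1
    //   [^)]*   zero or more characters that are not ")"
    //   )       end of capture group 1
    //   \)      a literal ")"
    private static readonly Regex Parentheses = new Regex(@"\(([^)]*)\)");

    public static string BetweenParentheses(string cadena)
    {
        Match m = Parentheses.Match(cadena);
        // Groups[0] would be the whole match including the parentheses;
        // Groups[1] holds only the text captured between them.
        return m.Success ? m.Groups[1].Value : string.Empty;
    }
}
```

Keeping the Regex in a static field means the pattern is parsed once rather than on every call, which matters only if the method is called many times.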

For the fourth algorithm:

int posInicial = cadena.LastIndexOf("(") + 1;
int longitud = cadena.IndexOf(")") - posInicial;
resultado = cadena.Substring(posInicial, longitud);

You have to iterate over the length N of the string twice (once for LastIndexOf and once for IndexOf) and then build the result in at most N, so the complexity would be O((2 * N) + N).
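The snippet above assumes the string contains exactly one pair of parentheses; with several pairs, or with none, the computed length can be negative and Substring throws. A more defensive variant might look like the sketch below. It is not the code from the question: SubstringDemo is a hypothetical name, it searches for the first "(" with IndexOf instead of LastIndexOf, and returning an empty string when a delimiter is missing is an assumed convention.

```csharp
using System;

public static class SubstringDemo
{
    public static string BetweenParentheses(string cadena)
    {
        int open = cadena.IndexOf('(');
        if (open < 0) return string.Empty;      // no "(" in the string

        int start = open + 1;
        // Look for ")" only after the "(", so the length is never negative.
        int close = cadena.IndexOf(')', start);
        if (close < 0) return string.Empty;     // no ")" after the "("

        return cadena.Substring(start, close - start);
    }
}
```

The cost is still two linear scans plus one copy of the extracted text, so the analysis above is unchanged.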

So, ranked from best to worst:

  1. O((2 * N) + N) the fourth algorithm.
  2. O((N * M) + 1) the first algorithm.
  3. O((N * M) + 1) the second algorithm. It consumes more memory than the first algorithm.
  4. O(?) the third algorithm. Regex is the most complicated and the one that consumes the most memory. Its greater complexity can be anticipated from the amount of work that pattern matching involves.

Keep in mind that in your example these times are insignificant (none reaches 1 ms of processing). If you want to see the difference more clearly, you would have to try it with a very long string. This answer is based on my experience with algorithms; if someone is willing to document and contradict me or find an error, I am available to discuss it.
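One rough way to try this on a long string is a Stopwatch loop like the sketch below. Everything in it is an assumption made for illustration: the TimingDemo class, the filler length of 1,000,000 characters, and the 100 iterations. It is not a rigorous benchmark (no warm-up, no memory measurement), so treat the numbers it prints as indicative only.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class TimingDemo
{
    // Builds a long test string: filler, one "(...)" group, more filler.
    public static string BuildInput(int fillerLength)
    {
        string filler = new string('x', fillerLength);
        return filler + "('A','B','valorX','valorY')" + filler;
    }

    // Runs the action the given number of times and returns elapsed milliseconds.
    public static long Measure(Action action, int iterations)
    {
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) action();
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    public static void Run()
    {
        string cadena = BuildInput(1000000);   // arbitrary size
        const int iterations = 100;            // arbitrary repeat count
        string resultado = string.Empty;

        long split = Measure(() => resultado = cadena.Split('(', ')')[1], iterations);
        long regex = Measure(() => resultado = Regex.Match(cadena, @"\(([^)]*)\)").Groups[1].Value, iterations);
        long substring = Measure(() =>
        {
            int posInicial = cadena.LastIndexOf("(") + 1;
            int longitud = cadena.IndexOf(")") - posInicial;
            resultado = cadena.Substring(posInicial, longitud);
        }, iterations);

        Console.WriteLine("Split:     " + split + " ms");
        Console.WriteLine("Regex:     " + regex + " ms");
        Console.WriteLine("Substring: " + substring + " ms");
    }
}
```

Calling TimingDemo.Run() from a console application prints one timing per approach; for trustworthy numbers a dedicated benchmarking tool would be a better choice.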

For more on the analysis of algorithms, you can read "Understanding Big O Notation", or this more complete link.
