Count repeated letters in a string c#

Question:

I'm doing an exercise for class, in which I have to count the letters that are repeated in a string (Without the user doing anything, simply count the letters of the same string, the times that they are repeated), but I can't get it to show on the screen, each letter only once, that is, it shows me the entire string, and I only want each letter to appear once. Also, I have to do a comparison so that the lowercase and uppercase letters are the same, and I can't do a .toupper, let's see if someone can help me, here is the code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.IO;

namespace ConsoleApp8
{
    class Program
    {
        static void Main(string[] args)
        {
            try
            {
                using (StreamReader sr = new StreamReader("exercici1.txt")) {
                    string line;
                    char caracter = ' ';
                    int compt = 0;

                    while ((line = sr.ReadLine()) != null)
                    {
                        for (int x = 0; x < line.Length; x++)
                        {

                            for (int y = 0; y < line.Length; y++)
                            {
                                caracter = line[x];
                                if (caracter == line[y])
                                {
                                    compt++;
                                }
                            }
                            Console.WriteLine("El caracter " + caracter + " apareix " + compt + " vegades");
                            compt = 0;
                        }
                    }
                }
            }catch (Exception e)
            {
                Console.WriteLine("Error general");
                Console.WriteLine(e.Message);
                Environment.Exit(1);
            }
            Console.ReadKey();
        }
    }
}

Answer:

Keep count of characters found

The main problem is that you do not keep track of what characters you have found.

So, for example, if you take the string "aaa", in your loop…

for (int x = 0; x < line.Length; x++)
{
    // ...
}

You find that line[x] is 'a' on the first iteration, then 'a' on the second, and again 'a' on the third.

You need to keep track of what characters you have found.

Let's see what data structure works for you…

  • The order does not matter (you are not going to show the user the content).
  • You need to find elements by value (the character).
  • It will not have duplicates (you will not add a character twice)

What you need is a Set .

.NET gives us HashSet<T> and SortedSet<T> . Either one works.

These are the differences:

  • HashSet<T> exists since .NET 3.5, and SortedSet<T> was introduced in .NET 4.0
  • HashSet<T> is compact in memory, SortedSet<T> is not.
  • SortedSet<T> maintains the order of the data. HashSet<T> no.

Note: you do not need to maintain the order of the data. It's extra work that the computer would do unnecessarily, and it's functionality you would have to maintain (use IComparer ).


Why not use an array or a string ?

Because the search on an array or a string occurs in linear time ( O(n) ), that is, the computer has to compare element by element. It is less efficient than searching over a set that has logarithmic time ( O(log n) ).

I presume this is irrelevant to you, under the assumption that the size of the input data is small.

What linear time and logarithmic time mean is that the time it takes for the computer to perform the search is either a linear function (the computer takes x additional milliseconds for each item saved) or a logarithmic function (the computer takes x additional milliseconds for each order of magnitude of saved items).


There is an additional reason not to use an array. And it is a structure of fixed size. It would require you to know in advance how many different characters you're going to find… or make it big enough for the worst case (considering Unicode, we'd be talking over a million elements).


There is an additional reason not to use string . And is that string are immutable. You don't actually modify a string , but create a new one every time you do an operation. That means that every time you concatenate a character you are creating a new string .

This process creates garbage unnecessarily.

Again, this is not a problem if the input size is small (modern computers have lots of memory, after all). However, it is something worth knowing.


Compare ignoring case

The other thing you mention is that you have to make the comparison case insensitive.

For this you can use ToUpper however I do not recommend it. The reason is that the case conversion is not the same in all languages. The example for this case is Turkish, in which the uppercase version of i is İ instead of I .

Of course, you are not thinking that your code has to work on a computer in Turkish. However, it is good practice to take cultural differences into account when handling strings.

Addenum: for Spanish Ñ is a letter other than N , for other languages ​​it is an N with a diacritical mark.

There are six solutions:

  • Use ToUpper and assume that all comparisons are the language of the computer. Which can hide any flaws that may appear when your program has to run in another part of the world.

  • Use ToUpper and specify the culture. That is, use the ToUpper that receives CultureInfo as the second parameter.

  • Use ToUpperInvariant , which ignores language-specific rules.

  • Do not change case, instead convert to string and use String.Compare . Which has over loads receiving ignoreCase to compare ignoring case.

  • Do not change case, instead convert to string and use String.Compare . Which also has overloads that receive CultureInfo in case you want to specify it.

  • Don't change case, instead convert to string and use String.CompareOrdinal . Which has over loads receiving ignoreCase to compare ignoring case.

Now, between Ordinal and Invariant , you have to answer a question: are á and A the same letter? Your requirement speaks of capital letters, but not of accents. And these are different characters. If your program must treat a and á separately you want an Ordinal comparison.


Integrating the solution

Ok, we had already chosen data structure. It's a HashSet<T> . We need to tell it to ignore case. Since what HashSet<T> receives is an IEqualityComparer<T> , I'd say use StringComparer and that way you save yourself implementing that interface.

We also need to work around the drawback that HashSet<T> expects a collection if we want to stop an IEqualityComparer<T> . This is not a problem, we can pass an empty array.

Then the code would be:

var conjunto = new HashSet<string>(new string[]{}, StringComparer.OrdinalIgnoreCase);

THE

var conjunto = new HashSet<string>(new string[]{}, StringComparer.InvariantCultureIgnoreCase);

THE

var conjunto = new HashSet<string>(new string[]{}, StringComparer.CurrentCultureIgnoreCase);

Depending on whether you want ordinal comparison ( OrdinalIgnoreCase ), or if you don't, but want to ignore language-specific rules ( InvariantCultureIgnoreCase ), or want to use current language-specific rules ( CurrentCultureIgnoreCase ).


You can then use conjunto.Contains to check if an element has been added. And of course conjunto.Add to add.

Notice that conjunto.Add returns bool . This value tells you if the item is new. Which can save you a few lines of code.


Finally, your loops are the same:

for (int x = 0; x < line.Length; x++)
{
    for (int y = 0; y < line.Length; y++)
    {
        // ...
    }
    // ...
}

They both start at the beginning of the line ( 0 ) and end at the end of the line ( line.Length ).

This means that you are going to start by comparing the first character line[x] with the first character line[y] because in the first iteration both x and y are 0 ( int x = 0 and int y = 0 ).

You should only compare with the characters after the current one (because the previous ones you already compared).

Note: I am assuming that you should do the counts line by line of the file.


With the changes mentioned, this would be the code to use:

string line;
string caracter;
int compt = 0;
var comparador = StringComparer.InvariantCultureIgnoreCase;

while ((line = sr.ReadLine()) != null)
{
    var conjunto = new HashSet<string>(new string[]{}, comparador);
    for (int x = 0; x < line.Length; x++)
    {
        caracter = line[x].ToString();
        if (conjunto.Add(caracter))
        {
            var compt = 1; // <--- ya lo encontré, lo coloco en 1
            for (int y = x + 1; y < line.Length; y++)
            {
                if (comparador.Equals(caracter, line[y].ToString()))
                {
                    compt++;
                }
            }
            Console.WriteLine("El caracter " + caracter + " apareix " + compt + " vegades");
        }
    }
}

Grades:

  • I have changed caracter to type string . The reason is that we are doing the comparisons at the string level. This is because we are using StringComparer (there is no CharComparer available in BCL, and I have preferred not to explain how to implement IEqualityComparer<char> , this answer is long enough as it is).

  • When using string also leave caracter uninitialized. It is not what you need, however there is no point in giving it any value.

  • I have taken out a separate comparador , to be able to use it when counting characters. It makes sense that they are counted in the same way as checking if it has already been found.

  • I'm starting y at x + 1 , so y only checks from the next character that x is in.

  • It didn't make sense to give value to caracter in each cycle of y . It only makes sense to give value to caracter when x changes. For this reason I have removed it from the inner loop.

  • Inside the loop, since the current character is found, I can start the count at 1 .

  • Notice how conjunto is created and how Add is used in an if .

In fact, I think I can improve the code even more…

var comparador = StringComparer.InvariantCultureIgnoreCase;

while ((var line = sr.ReadLine()) != null)
{
    var conjunto = new HashSet<string>(new string[]{}, comparador);
    for (int x = 0; x < line.Length; x++)
    {
        var caracter = line[x].ToString();
        if (conjunto.Add(caracter))
        {
            var compt = 1; // <--- ya lo encontré, lo coloco en 1
            for (int y = x + 1; y < line.Length; y++)
            {
                if (comparador.Equals(caracter, line[y].ToString()))
                {
                    compt++;
                }
            }
            Console.WriteLine($"El caracter {caracter} apareix {compt} vegades");
        }
    }
}

Taking advantage of modern C# features.

Note: I have not tested this code.


An alternative

All of the above has been under the assumption that you should report the characters in the order you find them… if not… you can save in a data structure how many times you have found a character. So you loop through the string once (you would have a single loop to loop through the string , instead of two nested loops), adding each character you encounter to the count.

Let's see what data structure works for you…

  • The maybe matters (you are going to show the user the content)
  • You need to find elements by key (the character)
  • It will not have duplicates (you will not add a character twice)
  • You need key (the character) and value (the count).

The data structure is a dictionary. We have Dictionary<TKey, TValue> and SortedDictionary<TKey, TValue> . In this variant of the code, you're going to display the contents of the data structure… I'm going to assume that the order doesn't matter, since it makes things a bit easier.


while ((var line = sr.ReadLine()) != null)
{
    var diccionario = new Dictionary<string, int>(StringComparer.InvariantCultureIgnoreCase);
    for (int x = 0; x < line.Length; x++)
    {
        var caracter = line[x].ToString();
        if (diccionario.ContainsKey(caracter))
        {
            diccionario[caracter] ++;
        }
        else
        {
            diccionario.Add(caracter, 1);
        }
    }
    foreach (var entrada in diccionario)
    {
        Console.WriteLine($"El caracter {entrada.Key} apareix {entrada.Value} vegades");
    }
}

Grades:

  • I no longer need to compare to count characters, so I don't need to declare the comparador variable.

  • When I find a character, I check if it is already in the dictionary. If so, I increase the value. If it is not, I add it with value 1 .


separation of concerns

Your code handles three operations:

  • Access and read the file
  • Process character strings
  • Show the result to the user

If in the future you need to process multiple files, or perhaps process user input etc… you may want to separate these operations.

If in the future you need to make this code work in a web page or in a form, etc… you should separate these operations.

Even if you just want to change the language of the software, you shouldn't have to modify the code that is dedicated to counting characters.

Of course, this is academic work, and those things are not going to happen (or maybe they will, depending on the teacher). However, it's something you're going to run into in many forms (most commonly with the database) if you get to work as a programmer.

So I suggest you separate those operations. What is the criteria? Simple: Isolate the code that is responsible for interacting with external systems (eg the file system, the console) in such a way that it communicates with the rest of the software by exchanging values ​​(eg text strings).

This also makes the code easier to test. For example I can test the code that is responsible for counting characters without using a file.

Another advantage is that it makes it easy to divide the work among several programmers. Thus, one can be in charge of displaying the messages to the user, another of working with files and another of processing the text. It may not sound like an equitable division to you, but it is a way of dividing that prevents "stepping on the ditches" and is – assuming that everyone knows what they are doing – it is faster than letting all the work be done by one person.

I'm only interested in being able to test the code to count characters (so I can post an answer with confidence that it works), so I'm going to extract it. And I'm going to limit myself to that. This is the code:

while ((var line = sr.ReadLine()) != null)
{
    var diccionario = ContarCaracteres(line);
    foreach (var entrada in diccionario)
    {
        Console.WriteLine($"El caracter {entrada.Key} apareix {entrada.Value} vegades");
    }
}

// ...

static Dictionary<string, int> ContarCaracteres(string line)
{
    var diccionario = new Dictionary<string, int>(StringComparer.InvariantCultureIgnoreCase);
    for (int x = 0; x < line.Length; x++)
    {
        var caracter = line[x].ToString();
        if (diccionario.ContainsKey(caracter))
        {
            diccionario[caracter] ++;
        }
        else
        {
            diccionario.Add(caracter, 1);
        }
    }
    return diccionario;
}

I have tested the code of CountCharacters. I can say that it works.


Addendum

Flxtr has an important point (apart from Linq): We can take the string and convert it to uppercase (with ToUpper or ToUpperInvariant ) and then convert it to an array of characters (with ToCharArray ) so that we can directly compare the characters without the need for StringComparer .

This solution is more efficient (because we don't have to use ToString character by character):

static Dictionary<char, int> ContarCaracteres(string line)
{
    var diccionario = new Dictionary<char, int>();
    foreach (var caracter in line.ToUpperInvariant().ToCharArray())
    {
        if (diccionario.ContainsKey(caracter))
        {
            diccionario[caracter] ++;
        }
        else
        {
            diccionario.Add(caracter, 1);
        }
    }
    return diccionario;
}
Scroll to Top