python – How do I find all the duplicate items in a list and the number of times to repeat?

Question:

We need a function that, for example, for a list:

[10, 10, 23, 10, 123, 66, 78, 123] 

will return:

{10: 3, 123: 2}

How can this be done?

Answer:

Implementation

First way:

A = [10, 10, 23, 10, 123, 66, 78, 123]
counter = {}

for elem in A:
    counter[elem] = counter.get(elem, 0) + 1

doubles = {element: count for element, count in counter.items() if count > 1}

print(doubles)

Second way:

from collections import Counter
counter = Counter(A)

Third way:

from collections import defaultdict
counter = defaultdict(int)
for elem in A:
    counter[elem] += 1

Estimating the complexity of algorithms

The cost of compiling a counter list: you need to insert values ​​into the dictionary n times. Insertion consists of two operations: first, checking whether there is such a number in the dictionary and, in fact, inserting – all together O (1) average or O (n) in the worst case for rare cases when all elements have the same hash. That is, the cost of compiling a counter is O (n) on average, O (n ^ 2) at worst.

The next step is to filter out only what you need. In the worst case, you need to go through the entire counter – again n operations in O (1), or in the worst O (n) – take from a dictionary, compare with one, write to a new dictionary. On average O (n).

The total is O (n) on average, or O (n ^ 2) for specially prepared data at worst.

Benchmark results

Update with a large array: A minute of measurements:

import timeit

non_Counter = \
"""counter = {}
for elem in A:
    counter[elem] = counter.get(elem, 0) + 1"""

setup = "import random\n" \
        "A = [random.randint(0, 100) for r in range(10**6)]"

print(timeit.repeat(non_Counter, setup=setup, number=10))

non_Counter = """Counter(A)"""

setup = "import random\n" \
        "from collections import Counter\n"\
        "A = [random.randint(0, 100) for r in range(10**6)]\n"

print(timeit.repeat(non_Counter, setup=setup, number=10))

non_Counter = \
"""counter = defaultdict(int)
for elem in A:
    counter[elem] += 1"""

setup = "import random\n" \
        "from collections import defaultdict\n" \
        "A = [random.randint(0, 100) for r in range(10**6)]"

print(timeit.repeat(non_Counter, setup=setup, number=10))

Result:

[2.461800295429222, 2.456825704148736, 2.50377292183442]
[0.7278253601108151, 0.7268121314832463, 0.7283143209274385]
[1.3038446032438102, 1.3117127258723897, 1.3013156371393428]

As you can see from the results, the solution with Counter is the fastest.

Why such results

Explanation of the failure of the naive solution with a dictionary:

1) In order to get a value from a dictionary, you need a hash of the elem variable. The hash value is needed twice: in order to get the previous value and in order to set a new one. Obviously, computing two hashes is doing double work. Measurements:

non_Counter = \
"""
args = [None, None]
for elem in A:
    hash(elem)
    hash(elem)"""

setup = "import random\n" \
        "A = [random.randint(0, 100) for r in range(10**6)]\n" \
        "counter = {}"

print(timeit.repeat(non_Counter, setup=setup, number=10))
[1.4283945417028974, 1.433934455438878, 1.4188164931286842]

As you can see, an extra calculation eats up 0.7 seconds or 30% of the total time. Unfortunately, there is no standard way to get a value from a dictionary by hash value. In the Counter class, the counting function is written at a lower level ( https://hg.python.org/cpython/file/tip/Modules/_collectionsmodule.c#l2204 ) and calls the _PyDict_GetItem_KnownHash, _PyDict_SetItem_KnownHash functions, which saves a lot of time. 2) Each time the get(elem, 0) method is called, the LOAD_ATTR instruction ( https://hg.python.org/cpython/file/ed76eac4491e/Python/ceval.c#l2253 ) is called, which should find the required method by name. Since the method will not change, you can move the search for it out of a loop:

getter = counter.get
for elem in A:
    counter[elem] = getter(elem, 0) + 1

[1.917134484341348, 1.9207427770511107, 1.9153613342431033]

We managed to save another 0.6 seconds.

Scroll to Top