Question:
We need a function that, for example, for a list:
[10, 10, 23, 10, 123, 66, 78, 123]
will return:
{10: 3, 123: 2}
How can this be done?
Answer:
Implementation
First way:
A = [10, 10, 23, 10, 123, 66, 78, 123]
counter = {}
for elem in A:
counter[elem] = counter.get(elem, 0) + 1
doubles = {element: count for element, count in counter.items() if count > 1}
print(doubles)
Second way:
from collections import Counter
counter = Counter(A)
Third way:
from collections import defaultdict
counter = defaultdict(int)
for elem in A:
counter[elem] += 1
Estimating the complexity of algorithms
The cost of compiling a counter list: you need to insert values into the dictionary n times. Insertion consists of two operations: first, checking whether there is such a number in the dictionary and, in fact, inserting – all together O (1) average or O (n) in the worst case for rare cases when all elements have the same hash. That is, the cost of compiling a counter is O (n) on average, O (n ^ 2) at worst.
The next step is to filter out only what you need. In the worst case, you need to go through the entire counter – again n operations in O (1), or in the worst O (n) – take from a dictionary, compare with one, write to a new dictionary. On average O (n).
The total is O (n) on average, or O (n ^ 2) for specially prepared data at worst.
Benchmark results
Update with a large array: A minute of measurements:
import timeit
non_Counter = \
"""counter = {}
for elem in A:
counter[elem] = counter.get(elem, 0) + 1"""
setup = "import random\n" \
"A = [random.randint(0, 100) for r in range(10**6)]"
print(timeit.repeat(non_Counter, setup=setup, number=10))
non_Counter = """Counter(A)"""
setup = "import random\n" \
"from collections import Counter\n"\
"A = [random.randint(0, 100) for r in range(10**6)]\n"
print(timeit.repeat(non_Counter, setup=setup, number=10))
non_Counter = \
"""counter = defaultdict(int)
for elem in A:
counter[elem] += 1"""
setup = "import random\n" \
"from collections import defaultdict\n" \
"A = [random.randint(0, 100) for r in range(10**6)]"
print(timeit.repeat(non_Counter, setup=setup, number=10))
Result:
[2.461800295429222, 2.456825704148736, 2.50377292183442]
[0.7278253601108151, 0.7268121314832463, 0.7283143209274385]
[1.3038446032438102, 1.3117127258723897, 1.3013156371393428]
As you can see from the results, the solution with Counter is the fastest.
Why such results
Explanation of the failure of the naive solution with a dictionary:
1) In order to get a value from a dictionary, you need a hash of the elem
variable. The hash value is needed twice: in order to get the previous value and in order to set a new one. Obviously, computing two hashes is doing double work. Measurements:
non_Counter = \
"""
args = [None, None]
for elem in A:
hash(elem)
hash(elem)"""
setup = "import random\n" \
"A = [random.randint(0, 100) for r in range(10**6)]\n" \
"counter = {}"
print(timeit.repeat(non_Counter, setup=setup, number=10))
[1.4283945417028974, 1.433934455438878, 1.4188164931286842]
As you can see, an extra calculation eats up 0.7 seconds or 30% of the total time. Unfortunately, there is no standard way to get a value from a dictionary by hash value. In the Counter class, the counting function is written at a lower level ( https://hg.python.org/cpython/file/tip/Modules/_collectionsmodule.c#l2204 ) and calls the _PyDict_GetItem_KnownHash, _PyDict_SetItem_KnownHash functions, which saves a lot of time. 2) Each time the get(elem, 0)
method is called, the LOAD_ATTR instruction ( https://hg.python.org/cpython/file/ed76eac4491e/Python/ceval.c#l2253 ) is called, which should find the required method by name. Since the method will not change, you can move the search for it out of a loop:
getter = counter.get
for elem in A:
counter[elem] = getter(elem, 0) + 1
[1.917134484341348, 1.9207427770511107, 1.9153613342431033]
We managed to save another 0.6 seconds.