
I have a problem with counting distinct values for each key in Python.

I have a list of dictionaries d like

[{"abc":"movies"}, {"abc": "sports"}, {"abc": "music"}, {"xyz": "music"}, {"pqr":"music"}, {"pqr":"movies"},{"pqr":"sports"}, {"pqr":"news"}, {"pqr":"sports"}]

I need to print the number of distinct values for each key.

That means I would want to print

abc 3
xyz 1
pqr 4

Please help.

Thank you

6 Answers


Over 6 years after answering, someone pointed out to me I misread the question. While my original answer (below) counts unique keys in the input sequence, you actually have a different count-distinct problem; you want to count values per key.

To count unique values per key, exactly, you'd have to collect those values into sets first:

values_per_key = {}
for d in iterable_of_dicts:
    for k, v in d.items():
        values_per_key.setdefault(k, set()).add(v)
counts = {k: len(v) for k, v in values_per_key.items()}

which for your input, produces:

>>> values_per_key = {}
>>> for d in iterable_of_dicts:
...     for k, v in d.items():
...         values_per_key.setdefault(k, set()).add(v)
...
>>> counts = {k: len(v) for k, v in values_per_key.items()}
>>> counts
{'abc': 3, 'xyz': 1, 'pqr': 4}

We can still wrap that object in a Counter() instance if you want to make use of the additional functionality this class offers, see below:

>>> from collections import Counter
>>> Counter(counts)
Counter({'pqr': 4, 'abc': 3, 'xyz': 1})
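To print the result in the format the question asks for, here is a minimal sketch of the same set-based idea wrapped in a function. The names count_distinct_values and data are illustrative assumptions, not part of the original answer, and defaultdict(set) simply stands in for the setdefault() call:

```python
from collections import defaultdict


def count_distinct_values(dicts):
    """Return {key: number of distinct values seen for that key}."""
    values_per_key = defaultdict(set)  # each key maps to a set of its values
    for d in dicts:
        for k, v in d.items():
            values_per_key[k].add(v)  # sets silently drop duplicates
    return {k: len(vs) for k, vs in values_per_key.items()}


# Sample input taken from the question (hypothetical variable name).
data = [{"abc": "movies"}, {"abc": "sports"}, {"abc": "music"},
        {"xyz": "music"}, {"pqr": "music"}, {"pqr": "movies"},
        {"pqr": "sports"}, {"pqr": "news"}, {"pqr": "sports"}]

result = count_distinct_values(data)
for key, n in result.items():
    print(key, n)
```

On Python 3.7 and later, dictionaries keep insertion order, so the keys are printed in the order they were first seen.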

The downside is that if your input iterable is very large the above approach can require a lot of memory. In case you don't need exact counts, e.g. when orders of magnitude suffice, there are other approaches, such as a hyperloglog structure or other algorithms that 'sketch out' a count for the stream.

This approach requires you to install a third-party library. As an example, the datasketch project offers both HyperLogLog and MinHash. Here's an HLL example (using the HyperLogLogPlusPlus class, which is a recent improvement to the HLL approach):

from collections import defaultdict
from datasketch import HyperLogLogPlusPlus

counts = defaultdict(HyperLogLogPlusPlus)

for d in iterable_of_dicts:
    for k, v in d.items():
        counts[k].update(v.encode('utf8'))

In a distributed setup, you could use Redis to manage the HLL counts.


My original answer:

Use a collections.Counter() instance, together with some chaining:

from collections import Counter
from itertools import chain

counts = Counter(chain.from_iterable(e.keys() for e in d))

This ensures that dictionaries with more than one key in your input list are counted correctly.

Demo:

>>> from collections import Counter
>>> from itertools import chain
>>> d = [{"abc":"movies"}, {"abc": "sports"}, {"abc": "music"}, {"xyz": "music"}, {"pqr":"music"}, {"pqr":"movies"},{"pqr":"sports"}, {"pqr":"news"}, {"pqr":"sports"}]
>>> Counter(chain.from_iterable(e.keys() for e in d))
Counter({'pqr': 5, 'abc': 3, 'xyz': 1})

or with multiple keys in the input dictionaries:

>>> d = [{"abc":"movies", 'xyz': 'music', 'pqr': 'music'}, {"abc": "sports", 'pqr': 'movies'}, {"abc": "music", 'pqr': 'sports'}, {"pqr":"news"}, {"pqr":"sports"}]
>>> Counter(chain.from_iterable(e.keys() for e in d))
Counter({'pqr': 5, 'abc': 3, 'xyz': 1})

A Counter() has additional, helpful functionality, such as the .most_common() method that lists elements and their counts in reverse sorted order:

for key, count in counts.most_common():
    print('{}: {}'.format(key, count))

# prints
# pqr: 5
# abc: 3
# xyz: 1
answered 2013-05-06T20:13:01.630

There is no need to use a Counter. You can do it this way:

# input list of dictionaries
d=[{"abc":"movies"}, {"abc": "sports"}, {"abc": "music"}, {"xyz": "music"}, {"pqr":"music"}, {"pqr":"movies"},{"pqr":"sports"}, {"pqr":"news"}, {"pqr":"sports"}]

# fetch keys
b=[j[0] for i in d for j in i.items()]

# print output
for k in set(b):
    print("{0}: {1}".format(k, b.count(k)))
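Note that the snippet above counts how many times each key occurs (pqr comes out as 5), not how many distinct values it has. A minimal sketch of one way to get distinct counts, assuming all values are hashable, is to deduplicate the (key, value) pairs first. The names unique_pairs and distinct_counts are illustrative only:

```python
from collections import Counter

d = [{"abc": "movies"}, {"abc": "sports"}, {"abc": "music"},
     {"xyz": "music"}, {"pqr": "music"}, {"pqr": "movies"},
     {"pqr": "sports"}, {"pqr": "news"}, {"pqr": "sports"}]

# A set of (key, value) tuples keeps each distinct pair exactly once,
# so the repeated ("pqr", "sports") pair collapses into a single entry.
unique_pairs = {(k, v) for dic in d for k, v in dic.items()}

# Counting the keys of the remaining pairs gives distinct values per key.
distinct_counts = Counter(k for k, _ in unique_pairs)

for k, n in distinct_counts.items():
    print("{0}: {1}".format(k, n))
```

Because a set is unordered, the order of the printed lines may vary between runs.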
answered 2013-05-06T21:41:32.467

What you're describing--a list with multiple values for each key--would be better visualized by something like this:

{'abc': ['movies', 'sports', 'music'],
 'xyz': ['music'],
 'pqr': ['music', 'movies', 'sports', 'news']
}

In that case, you have to do a bit more work to insert:

  1. Look up the key to see if it already exists
    • If it doesn't exist, create a new key with value [] (empty list)
  2. Retrieve the value (the list associated with the key)
  3. Use a membership test (value in list) to see whether the value being added already exists in the list
  4. If the new value isn't in the list, .append() it

This also leads to an easy way to count the total number of elements stored:

# Pseudo-code
for myKey in myDict.keys():
    print("{0}: {1}".format(myKey, len(myDict[myKey])))
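The insertion steps listed above can be sketched roughly as follows. The helper name add_value and the variables my_dict and pairs are hypothetical, chosen only for illustration:

```python
def add_value(my_dict, key, value):
    """Append value to the list stored under key, skipping duplicates."""
    # Steps 1 and 2: create an empty list if the key is new, then fetch it.
    values = my_dict.setdefault(key, [])
    # Step 3: membership test against the existing list.
    if value not in values:
        # Step 4: only append values that were not seen before.
        values.append(value)


# Hypothetical input: the (key, value) pairs from the question.
pairs = [("abc", "movies"), ("abc", "sports"), ("abc", "music"),
         ("xyz", "music"), ("pqr", "music"), ("pqr", "movies"),
         ("pqr", "sports"), ("pqr", "news"), ("pqr", "sports")]

my_dict = {}
for key, value in pairs:
    add_value(my_dict, key, value)

for key in my_dict:
    print("{0}: {1}".format(key, len(my_dict[key])))
```

The membership test on a list is linear in the list's length, so for keys with many values a set would probably scale better; the list is kept here only to mirror the structure described above.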
answered 2013-05-06T20:13:54.020

>>> d = [{"abc":"movies"}, {"abc": "sports"}, {"abc": "music"}, {"xyz": "music"},
... {"pqr":"music"}, {"pqr":"movies"},{"pqr":"sports"}, {"pqr":"news"}, 
... {"pqr":"sports"}]
>>> from collections import Counter
>>> counts = Counter(key for dic in d for key in dic.keys())
>>> counts
Counter({'pqr': 5, 'abc': 3, 'xyz': 1})
>>> for key in counts:
...     print (key, counts[key])
...
xyz 1
abc 3
pqr 5
answered 2013-05-06T20:12:13.537

Use a collections.Counter, assuming that you have a list of one-item dictionaries:

from collections import Counter
listOfDictionaries = [{'abc':'movies'}, {'abc':'sports'}, {'abc':'music'},
    {'xyz':'music'}, {'pqr':'music'}, {'pqr':'movies'},
    {'pqr':'sports'}, {'pqr':'news'}, {'pqr':'sports'}]
# take the single key of each dictionary and count how often it occurs
Counter(list(d)[0] for d in listOfDictionaries)
answered 2013-05-06T20:16:27.223

This builds on @akashdeep's solution, which uses a set but gives the wrong result because it does not account for the "distinct" requirement mentioned in the question (pqr should be 4, not 5).

# list of dictionaries
d=[{"abc":"movies"}, {"abc": "sports"}, {"abc": "music"}, {"xyz": "music"}, {"pqr":"music"}, {"pqr":"movies"},{"pqr":"sports"}, {"pqr":"news"}, {"pqr":"sports"}]

# merged dictionary
c = {}
for i in d:
    for k,v in i.items():
        try:
            c[k].append(v)
        except KeyError:
            c[k] = [v]

# counting and printing
for k,v in c.items():
    print("{0}: {1}".format(k, len(set(v))))

This gives the correct output:

xyz: 1
abc: 3
pqr: 4
answered 2016-02-19T23:59:36.123