2

I have a piece of code that loops through a set of nodes and counts the path length connecting the given node to each other node in my network. For each node my code returns me a list, b containing integer values giving me the path length for every possible connection. I want to count the number of occurences of given path lengths so I can create a histogram.

local_path_length_hist = {}
for ver in vertices:
    dist = gt.shortest_distance(g, source=g.vertex(ver))
    a = dist.a
    #Delete some erroneous entries
    b = a[a!=2147483647]
    for dist in b:
        if dist in local_path_length_hist:
            local_path_length_hist[dist]+=1
        else:
            local_path_length_hist[dist]=1

This presumably is very crude coding as far as the dictionary update is concerned. Is there a better way of doing this? What is the most efficient way of creating this histogram?

4

4 回答 4

1

The check that element exists in dict is not really necessary. You can just use collections.defaultdict. Its initialization accepts callable object (like function) that will be called if you want to access (or assign something to) element that does not exist to generate the value (i.e. function that generates default value). For your case, it can be just int. I.e.

import collections
local_path_length_hist = collections.defaultdict(int)
# you could say collections.defaultdict(lambda : 0) instead
for ver in vertices:
    dist = gt.shortest_distance(g, source=g.vertex(ver))
    a = dist.a
    #Delete some erroneous entries
    b = a[a!=2147483647]
    for dist in b:
        local_path_length_hist[dist] += 1

You could turn the last two lines in one like that, but there is really no point.

于 2016-09-01T16:37:40.220 回答
1

Since gt.shortest_distance returns an ndarray, numpy math is fastest:

max_dist = len(vertices) - 1
hist_length = max_dist + 2
no_path_dist = max_dist + 1
hist = np.zeros(hist_length) 
for ver in vertices:
    dist = gt.shortest_distance(g, source=g.vertex(ver))
    hist += np.bincount(dist.a.clip(max=no_path_dist))

I use the ndarray method clip to bin the 2147483647 values returned by gt.shortest_distance at the last position of hist. Without use of clip, hist's size would have to be 2147483647 + 1 on 64-bit Python, or bincount would produce a ValueError on 32-bit Python. So the last position of hist will contain a count of all non-paths; you can ignore this value in your histogram analysis.


As the below timings indicate, using numpy math to obtain a histogram is well over an order of magnitude faster than using either defaultdicts or counters (Python 3.4):

# vertices      numpy    defaultdict    counter
    9000       0.83639    38.48990     33.56569
   25000       8.57003    314.24265    262.76025
   50000      26.46427   1303.50843   1111.93898

My computer is too slow to test with 9 * (10**6) vertices, but relative timings seem pretty consistent for varying number of vertices (as we would expect).


timing code:

from collections import defaultdict, Counter
import numpy as np
from random import randint, choice
from timeit import repeat

# construct distance ndarray such that:
# a) 1/3 of values represent no path
# b) 2/3 of values are a random integer value [0, (num_vertices - 1)]
num_vertices = 50000
no_path_length = 2147483647
distances = []
for _ in range(num_vertices):
    rand_dist = randint(0,(num_vertices-1))
    distances.append(choice((no_path_length, rand_dist, rand_dist)))
dist_a = np.array(distances)

def use_numpy_math():
    max_dist = num_vertices - 1
    hist_length = max_dist + 2
    no_path_dist = max_dist + 1
    hist = np.zeros(hist_length, dtype=np.int)
    for _ in range(num_vertices):
        hist += np.bincount(dist_a.clip(max=no_path_dist))

def use_default_dict():
    d = defaultdict(int)
    for _ in range(num_vertices):
        for dist in dist_a:
            d[dist] += 1

def use_counter():
    hist = Counter()
    for _ in range(num_vertices):
        hist.update(dist_a)

t1 = min(repeat(stmt='use_numpy_math()', setup='from __main__ import use_numpy_math',
                repeat=3, number=1))
t2 = min(repeat(stmt='use_default_dict()', setup='from __main__ import use_default_dict',
                repeat= 3, number=1))
t3 = min(repeat(stmt='use_counter()', setup='from __main__ import use_counter',
                repeat= 3, number=1))

print('%0.5f, %0.5f. %0.5f' % (t1, t2, t3))
于 2016-09-01T16:39:52.120 回答
1

There is a utility in the collections module called Counter. This is even cleaner than using a defaultdict(int)

from collections import Counter
hist = Counter()
for ver in vertices:
    dist = gt.shortest_distance(g, source=g.vertex(ver))
    a = dist.a
    #Delete some erroneous entries
    b = a[a!=2147483647]
    hist.update(b)
于 2016-09-02T12:46:58.190 回答
0

I think you can bypass this code entirely. Your question is tagged with . Take a look at this section of their documentation: graph_tool.stats.vertex_hist.

Excerpt from linked documentation:

graph_tool.stats.vertex_hist(g, deg, bins=[0, 1], float_count=True)
Return the vertex histogram of the given degree type or property.

Parameters:
g : Graph  Graph to be used.
deg : string or PropertyMap
 Degree or property to be used for the histogram. It can be either “in”, “out” or “total”, for in-,
 out-, or total degree of the vertices. It can also be a vertex property map.
bins : list of bins (optional, default: [0, 1])
 List of bins to be used for the histogram. The values given represent the edges of the bins
 (i.e. lower and upper bounds). If the list contains two values, this will be used to automatically
 create an appropriate bin range, with a constant width given by the second value, and starting
 from the first value.
float_count : bool (optional, default: True)
 If True, the counts in each histogram bin will be returned as floats. If False, they will be
 returned as integers.

Returns: counts : ndarray
 The bin counts.
bins : ndarray
 The bin edges.

This will return the edges grouped like a histogram in an ndarray. You can then just get the length of the ndarray columns to get your counts to generate the histogram.

于 2016-09-01T23:20:22.990 回答