Disclaimer: I have fairly straightforward/brute-force ways of solving my questions; the goal of the question is to learn better approaches and libraries that assist with these calculations.

I have a fairly sizeable csv (100k+ lines) with people, location, time data, and money spent, amongst other things. Say, something like:

thomas, park, noon, 0
jim, pool, afternoon, 5
sandy, school, noon, 0
alex, mall, night, 20

As I approach this corpus of data, there are a few things I'd like to discover, listed below along with how I currently go about them. At the moment I implement things with a blend of R and Python (and RPy2).

  1. Most active people? Most visited places? Busiest time? A simple count of occurrences, which I tally in a for loop.
  2. Similarity -- people who visit X also visit Y -- given the subset of people who visit a park, what other locations do they visit? This can be applied to the other dimensions as well. Currently I implement this by iterating through the subset and tallying things up. What's better?

    slight digression for 3-4; found libraries but would love to hear better approaches/libraries

  3. Visualization via network graphs to see clusters/concentrations -- each person is defined as a vertex and the shared location is defined as a colored edge. Preprocessing of the data is kind of a pain due to my data format; I can also "cheat" by having edges be both people and locations+time, since that involves less preprocessing. Currently using a weighted graph in R (igraph library).
  4. Cluster analysis to see if data falls into certain bins; right now I'm just using k-means clustering.

So to reiterate, given the nature of my inquiries, what would be good libraries that have prebuilt and optimized functions to answer some of my questions? It just seems that using a bunch of for loops is a really inefficient and inelegant way to gather insight.

1 Answer

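Python has a lot of good stuff built in.

Suppose you store the data in a list of tuples. (I think using collections.namedtuple actually makes the code easier to understand.) With a comprehension you can build a list of the individual items, and then use collections.Counter to count them.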

In [1]: import collections

In [2]: Record = collections.namedtuple('Record', ['person', 'location', 'time', 'amount'])

In [3]: allrecords = []

You should read the records from the CSV file here...
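
A minimal sketch of that loading step, assuming the file has no header row and that the first four columns are person, location, time and a whole-number amount, as in the question's sample. The helper name load_records and the file name data.csv are hypothetical, and io.StringIO stands in for the real file so the example is self-contained.

```python
import collections
import csv
import io

Record = collections.namedtuple('Record', ['person', 'location', 'time', 'amount'])

def load_records(lines):
    """Parse CSV rows of person, location, time, amount into Record tuples."""
    # skipinitialspace drops the blank after each comma in "thomas, park, ..."
    reader = csv.reader(lines, skipinitialspace=True)
    records = []
    for row in reader:
        if not row:  # csv.reader yields an empty list for blank lines
            continue
        # Assumption: only the first four columns matter; extra columns are ignored.
        person, location, time, amount = row[:4]
        # Assumption: amounts are whole numbers; use decimal.Decimal for cents.
        records.append(Record(person, location, time, int(amount)))
    return records

# Stand-in for: open('data.csv', newline='')  (hypothetical file name)
sample = io.StringIO(
    "thomas, park, noon, 0\n"
    "jim, pool, afternoon, 5\n"
    "sandy, school, noon, 0\n"
    "alex, mall, night, 20\n"
)
allrecords = load_records(sample)
```

With a real file, pass the open file object to load_records in place of sample; csv.reader accepts any iterable of lines.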

In [4]: allrecords.append(Record('thomas', 'park', 'noon', 0))

In [5]: allrecords.append(Record('jim', 'pool', 'afternoon', 5))

In [6]: allrecords.append(Record('sandy', 'school', 'noon', 0))

In [7]: allrecords.append(Record('alex', 'mall', 'night', 20))

Now you can filter the data:

In [8]: times = collections.Counter([j.time for j in allrecords])

In [9]: print(times)
Counter({'noon': 2, 'afternoon': 1, 'night': 1})

In [10]: amounts =  collections.Counter([j.amount for j in allrecords])

In [11]: print(amounts)
Counter({0: 2, 5: 1, 20: 1})
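
For the "most active person / most visited place" questions, Counter.most_common returns the entries already sorted by count, so no manual sorting is needed. A small sketch; the last two records are invented for illustration (the question's four sample rows would all tie at one visit each).

```python
import collections

Record = collections.namedtuple('Record', ['person', 'location', 'time', 'amount'])

allrecords = [
    Record('thomas', 'park', 'noon', 0),
    Record('jim', 'pool', 'afternoon', 5),
    Record('sandy', 'school', 'noon', 0),
    Record('alex', 'mall', 'night', 20),
    # Hypothetical extra rows, added only so the ranking is not all ties.
    Record('thomas', 'mall', 'night', 10),
    Record('thomas', 'park', 'afternoon', 0),
]

# A generator expression avoids building an intermediate list.
people = collections.Counter(r.person for r in allrecords)
places = collections.Counter(r.location for r in allrecords)

top_person = people.most_common(1)  # list with one (name, count) pair
top_places = places.most_common(2)  # the two most visited locations
```

Calling most_common() with no argument returns every entry, ordered from most to least frequent.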

Note that you can use an if clause in a list comprehension.

In [12]: query = collections.Counter([j.amount for j in allrecords if j.time in ('afternoon', 'night')])

In [13]: print(query)
Counter({5: 1, 20: 1})
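
The same tools can cover the "people who visit X also visit Y" question: collect the people seen at X in a set, then count the other locations that appear in their records. This is only a sketch, assuming visits are counted per record rather than per distinct person; also_visited is a hypothetical helper name, and the last three records are invented for illustration.

```python
import collections

Record = collections.namedtuple('Record', ['person', 'location', 'time', 'amount'])

def also_visited(records, place):
    """Count the other locations visited by anyone who visited `place`."""
    # Set of everyone with at least one record at `place`.
    visitors = {r.person for r in records if r.location == place}
    # Tally those people's records at every other location.
    return collections.Counter(
        r.location for r in records
        if r.person in visitors and r.location != place
    )

allrecords = [
    Record('thomas', 'park', 'noon', 0),
    Record('jim', 'pool', 'afternoon', 5),
    Record('sandy', 'school', 'noon', 0),
    Record('alex', 'mall', 'night', 20),
    # Hypothetical extra rows so that park visitors show up elsewhere too.
    Record('thomas', 'mall', 'night', 10),
    Record('sandy', 'park', 'afternoon', 0),
    Record('sandy', 'mall', 'noon', 15),
]

park_also = also_visited(allrecords, 'park')
```

This makes two passes over the data with constant-time set lookups, which should be comfortable for a file of 100k+ lines. Swapping location for time or person in the two comprehensions applies the same idea to the other dimensions.
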
answered 2013-04-30T21:20:02.370