Disclaimer: I already have fairly straightforward/brute-force ways of answering my questions; the goal of this question is to learn better approaches and libraries that assist with these calculations.
I have a fairly sizeable csv (100k+ lines) with people, location, time data, and money spent, amongst other things. Say, something like:
thomas, park, noon, 0
jim, pool, afternoon, 5
sandy, school, noon, 0
alex, mall, night, 20
Below are a few things I'd like to discover from this data, along with how I currently go about each one. At the moment I implement things with a blend of R and Python (via RPy2).
- Most active people? Most visited places? Busiest time? These are simple occurrence counts that I currently tally with a for loop.
- Similarity -- people who visit X also visit Y -- given the subset of people who visit a park, what other locations do they visit? The same idea can be applied to the other dimensions as well. Currently I implement this by iterating through the subset and tallying things up. What's a better way?
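To make the first two items concrete, here is a minimal sketch of both tallies using only `collections.Counter` from Python's standard library. The in-memory `rows` list, the two extra rows, and the helper name `also_visited` are assumptions for illustration; a real run would read the rows from the csv instead.

```python
from collections import Counter

# Rows in the same shape as the sample above: (person, location, time, spent).
# The last two rows are made up so that the tallies are not all 1.
rows = [
    ("thomas", "park", "noon", 0),
    ("jim", "pool", "afternoon", 5),
    ("sandy", "school", "noon", 0),
    ("alex", "mall", "night", 20),
    ("thomas", "mall", "night", 10),
    ("sandy", "park", "afternoon", 0),
]

# Item 1: one Counter per column replaces the hand-written tally loop.
people_counts = Counter(person for person, _, _, _ in rows)
place_counts = Counter(location for _, location, _, _ in rows)
time_counts = Counter(time for _, _, time, _ in rows)


def also_visited(rows, place):
    """Hypothetical helper: tally the other locations visited by
    everyone who visited `place` at least once."""
    visitors = {person for person, location, _, _ in rows if location == place}
    return Counter(
        location
        for person, location, _, _ in rows
        if person in visitors and location != place
    )


# Item 2: "people who visit the park also visit ..."
print(people_counts.most_common(2))
print(also_visited(rows, "park"))
```

`most_common(n)` gives the top-n entries directly, and the same `also_visited` pattern works for the time column by swapping which field is filtered and which is counted.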
Slight digression for items 3 and 4: I've found libraries for these, but would love to hear about better approaches/libraries.
- Visualization via network graphs to see clusters/concentrations -- each person is defined as a vertex and the shared location is defined as a colored edge. Preprocessing the data is kind of a pain due to my data format; I can also "cheat" by having edges be both people and locations+time since that involves less preprocessing. Currently using a weighted graph in R (igraph library).
- Cluster analysis to see if data falls into certain bins; right now I'm just using k-means clustering.
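For the graph item, the preprocessing step could be sketched roughly as follows: group people by location, then emit one weighted edge per pair of people who share a location. This is only an illustration under assumptions. The `visits` list is made up, and the weight here simply counts shared locations rather than coloring edges per location.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Made-up (person, location) pairs in the spirit of the sample data.
visits = [
    ("thomas", "park"),
    ("sandy", "park"),
    ("alex", "mall"),
    ("thomas", "mall"),
    ("jim", "pool"),
]

# Group the distinct visitors of each location.
visitors_by_place = defaultdict(set)
for person, place in visits:
    visitors_by_place[place].add(person)

# Every pair of people sharing a location gets an edge; the weight is
# the number of locations the pair has in common.
edge_weights = Counter()
for place, visitors in visitors_by_place.items():
    for a, b in combinations(sorted(visitors), 2):
        edge_weights[(a, b)] += 1

# Flat (person_a, person_b, weight) triples, ready to be written out
# as an edge list for a graph library.
edges = [(a, b, w) for (a, b), w in edge_weights.items()]
print(edges)
```

Sorting the visitors before pairing keeps each undirected edge under a single key, so (a, b) and (b, a) are never counted separately.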
So to reiterate, given the nature of my inquiries, what would be good libraries that have prebuilt and optimized functions to answer some of my questions? It just seems that using a bunch of for loops is a really inefficient and inelegant way to gather insight.