I'm working on a data mining problem for my master's thesis. I'm using Python for the data analysis, but I have no experience with pandas, which I need in order to convert my data to a DataFrame. To do survival regression with a Python package called Lifelines, I need to create a covariate matrix from my experiment_data dict, which contains over 16k dicts with Twitter data about Kickstarter projects (see the example dict below).

16041: {'goal': 1200, 'launch': 1353544772, 'days-before-deadline': 3, 'followers': 149, 'date-funded': 1355887690.9189188, 'id': 52687, 'tweet_ids': [280965208409796608, ... n], 'state': 1, 'deadline': 1356136772, 'retweets': 0, 'favorites': 0, 'duration': 31, 'timestamps': [1355876412.0], 'favourites': 0, 'runtime': 27, 'friends': 127, 'pledges': [0.0, 0.0625, 0.0625, ... n], 'statuses': 7460}

Once I have a pandas DataFrame built from this dict, I can create a covariate matrix with Patsy, for example like this:

X = patsy.dmatrix('friends + followers + retweets + favorites - 1', data, return_type='dataframe')

My question: how do I create a pandas DataFrame from the experiment_data dicts? The keys of the inner dictionaries (goal, launch, followers, etc.) should become the columns, with one row per Kickstarter project (i.e. index 0 to 16041).

Any help would be really appreciated. Thanks in advance!

P.S. If you have experience in Survival Regression using Python and Lifelines, please let me know!

1 Answer

I think you want from_dict with the parameter orient='index', which turns each outer key into a row and the inner keys into columns:

In [31]:
import pandas as pd
d={16041: {'goal': 1200, 'launch': 1353544772, 'days-before-deadline': 3, 'followers': 149, 'date-funded': 1355887690.9189188, 'id': 52687, 'tweet_ids': [280965208409796608], 'state': 1, 'deadline': 1356136772, 'retweets': 0, 'favorites': 0, 'duration': 31, 'timestamps': [1355876412.0], 'favourites': 0, 'runtime': 27, 'friends': 127, 'pledges': [0.0, 0.0625, 0.0625], 'statuses': 7460}}
pd.DataFrame.from_dict(d, orient='index')

Out[31]:
          id  followers  days-before-deadline  statuses  duration  state  \
16041  52687        149                     3      7460        31      1   

       goal             tweet_ids                pledges  favourites  \
16041  1200  [280965208409796608]  [0.0, 0.0625, 0.0625]           0   

         deadline  favorites  retweets  runtime  friends      launch  \
16041  1356136772          0         0       27      127  1353544772   

           timestamps   date-funded  
16041  [1355876412.0]  1.355888e+09 
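
To connect this back to the covariate matrix in the question, here is a minimal sketch that goes from a dict of dicts to a numeric matrix using pandas alone. The two projects and their values are made up for illustration, and selecting the four columns by hand is an assumption standing in for the Patsy formula 'friends + followers + retweets + favorites - 1' (no intercept); it is not a Patsy call itself.

```python
import pandas as pd

# Two made-up projects in the same shape as experiment_data
# (illustrative values only, not real data).
experiment_data = {
    0: {'goal': 1200, 'friends': 127, 'followers': 149, 'retweets': 0,
        'favorites': 0, 'pledges': [0.0, 0.0625, 0.0625]},
    1: {'goal': 5000, 'friends': 300, 'followers': 80, 'retweets': 2,
        'favorites': 1, 'pledges': [0.1, 0.4]},
}

# Outer keys become the row index, inner keys become the columns.
df = pd.DataFrame.from_dict(experiment_data, orient='index')

# Keep only the scalar covariates; list-valued columns such as 'pledges'
# cannot go into a design matrix as they are.
covariates = ['friends', 'followers', 'retweets', 'favorites']
X = df[covariates].astype(float)
print(X)
```

Two things to watch if you pass the full DataFrame to Patsy instead: list-valued columns (tweet_ids, timestamps, pledges) would have to be dropped or aggregated first, and a hyphenated column name such as days-before-deadline needs to be quoted in the formula with Q("days-before-deadline"), since the bare name would be parsed as subtraction.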
Answered 2015-07-22T12:10:54.350