I have a dataset with two fields of interest: docID and category. Note that the actual content is also part of this DataFrame, along with other fields.

JAN001 News, Sports

JAN212 Politics

FEB208 Business, News

I am trying to use pandas to create a new DataFrame that looks like this:

JAN001 News

JAN001 Sports

JAN212 Politics...

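I know I could loop over the DataFrame row by row, but I am new to pandas and suspect there is a more efficient way to do this. I have looked at several related questions and tried various examples, but none has worked so far. I am also curious whether the index is part of the solution, but I have not explored that route yet. Thanks for any help or suggestions.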


Update: here is the code and its output.

{

import pandas as pd

foo = pd.read_csv("dtu_topic.txt", sep = "\t")
foo = foo[:20]

print foo

#    id  dtu_docid                                          dtu_topic  \
#0   21523  2012-1553             Energy Taxation,State & Local Taxation
#1   21522  2012-1552            Legislation & Policy\Financial Services
#2   25470  2010-0227              Quantitative Economics and Statistics
#3   25477  2010-0215                        International Taxation\Asia
#4   21539  2012-1529  Ernst & Young Newsletters\This Week in Tax Reform
#5   25483  2010-0207                             State & Local Taxation
#6   21536  2012-1533             Payroll & Employment Tax\State & Local
#7   21537  2012-1532             Payroll & Employment Tax\State & Local
#8   24943  2010-0929  IRS Practice & Procedure,Tax Quality & Risk Ma...
#9   25500  2010-0185                      Financial Services Industries
#10  21542  2012-1524             Payroll & Employment Tax\State & Local
#11  21551  2012-1507                                   Personal Finance
#12  25523  2010-0159                      International Taxation\Europe
#13  21549  2012-1510             Payroll & Employment Tax\State & Local
#14  21557  2012-1501  Payroll & Employment Tax\Federal,Payroll & Emp...
#15  21558  2012-1498                   Accounting Methods & Inventories
#16  25567  2010-0104                                        Real Estate
#17  25529  2010-0152  Financial Services Industries,International Ta...
#18  21564  2012-1495                           IRS Practice & Procedure
#19  21563  2012-1494                   Payroll & Employment Tax\Federal

#parse dtu_topic into a list of categories
foo["dtu_topic_split"] = foo.dtu_topic.str.replace(',','\\')
foo["dtu_topic_split"] = foo.dtu_topic_split.str.split('\\').tolist()

# from example on stack overflow - get syntax error
dcm = foo[,list(dtu_docid = dtu_docid,
           dtu_topic = unlist(dtu_topic.split),
           by = 1:nrow(foo)]


                 #dt.2 <- dt[, list(Probe.Id = Probe.Id,
                 #                      Gene.Id = unlist(Gene.Id_split),
                 #                      Score.d = Score.d), by = 1:nrow(dt)]

#dcm = unlist(foo.dtu_topic_split)

print dcm

}

1 Answer

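It looks like you are trying to turn a frame of lists into something useful (your example actually has lists in only the one column you are interested in).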

Try something like this:

In [100]: from pandas import DataFrame, Series, concat

In [101]: df = DataFrame(dict(A = [['foo','bar','bah']] * 4, B = [['foo','bah']] * 4, C = [['foo']] * 4), index=range(4))

In [102]: df
Out[102]: 
                 A           B      C
0  [foo, bar, bah]  [foo, bah]  [foo]
1  [foo, bar, bah]  [foo, bah]  [foo]
2  [foo, bar, bah]  [foo, bah]  [foo]
3  [foo, bar, bah]  [foo, bah]  [foo]

In [103]: concat(dict([ (row[0],row[1].apply(lambda y: Series(y))) for row in df.iterrows() ]))
Out[103]: 
       0    1    2
0 A  foo  bar  bah
  B  foo  bah  NaN
  C  foo  NaN  NaN
1 A  foo  bar  bah
  B  foo  bah  NaN
  C  foo  NaN  NaN
2 A  foo  bar  bah
  B  foo  bah  NaN
  C  foo  NaN  NaN
3 A  foo  bar  bah
  B  foo  bah  NaN
  C  foo  NaN  NaN
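
Applied to the two-column layout from the question, the same idea (one row per list element) can be sketched as below. This is only a sketch under assumptions: it relies on `DataFrame.explode`, which exists in pandas 0.25 and later and so is newer than this answer, and the column names `docID` and `category` plus the sample rows are hypothetical stand-ins based on the question's description.

```python
import pandas as pd

# Hypothetical sample mirroring the docID / category layout in the question.
df = pd.DataFrame({
    "docID": ["JAN001", "JAN212", "FEB208"],
    "category": ["News, Sports", "Politics", "Business, News"],
})

# Turn each comma-separated string into a list, then emit one row per element.
long_df = (
    df.assign(category=df["category"].str.split(","))
      .explode("category")
      .reset_index(drop=True)
)

# Remove the whitespace left over around each category after splitting.
long_df["category"] = long_df["category"].str.strip()

print(long_df)
```

Any other columns in the frame are repeated on every exploded row, so the document content would stay attached to each docID/category pair.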
answered 2013-06-29T16:20:06.413