I have a dataset with two fields of interest: docID and category. Note that the actual content is also part of this data frame, along with other fields.
JAN001 news, sports
JAN212 politics
FEB208 business, news
I am trying to use pandas to create a new data frame with one row per docID/category pair, like this:
JAN001 news
JAN001 sports
JAN212 politics...
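I know I could loop over the data frame, but I am new to pandas and assume there is a more efficient way to do this. I have looked at several questions and tried various examples, but so far without success. I am also curious whether indexing is part of the solution, but I have not explored that route yet. Thanks for any help or suggestions.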
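For reference, here is a minimal sketch of the transformation described above, written without an explicit loop. It assumes pandas 0.25 or later (for `DataFrame.explode`), and the column names `docID` and `category` are placeholders for the real ones. It is one possible approach, not necessarily the most efficient.

```python
import pandas as pd

# Toy data mirroring the example rows above; column names are assumed.
df = pd.DataFrame({
    "docID": ["JAN001", "JAN212", "FEB208"],
    "category": ["news, sports", "politics", "business, news"],
})

# Turn each comma-separated string into a list, then emit one row per list item.
long_df = (
    df.assign(category=df["category"].str.split(","))
      .explode("category")
      .reset_index(drop=True)
)

# Remove the whitespace left over around each category name.
long_df["category"] = long_df["category"].str.strip()

print(long_df)
```

Any other columns in the frame are repeated alongside each category, so the content field would be carried into every new row unless it is dropped first.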
Update: here is the code and a sample of the data.
{
import pandas as pd

# Load the tab-separated file and keep only the first 20 rows
foo = pd.read_csv("dtu_topic.txt", sep="\t")
foo = foo[:20]
print(foo)
# id dtu_docid dtu_topic \
#0 21523 2012-1553 Energy Taxation,State & Local Taxation
#1 21522 2012-1552 Legislation & Policy\Financial Services
#2 25470 2010-0227 Quantitative Economics and Statistics
#3 25477 2010-0215 International Taxation\Asia
#4 21539 2012-1529 Ernst & Young Newsletters\This Week in Tax Reform
#5 25483 2010-0207 State & Local Taxation
#6 21536 2012-1533 Payroll & Employment Tax\State & Local
#7 21537 2012-1532 Payroll & Employment Tax\State & Local
#8 24943 2010-0929 IRS Practice & Procedure,Tax Quality & Risk Ma...
#9 25500 2010-0185 Financial Services Industries
#10 21542 2012-1524 Payroll & Employment Tax\State & Local
#11 21551 2012-1507 Personal Finance
#12 25523 2010-0159 International Taxation\Europe
#13 21549 2012-1510 Payroll & Employment Tax\State & Local
#14 21557 2012-1501 Payroll & Employment Tax\Federal,Payroll & Emp...
#15 21558 2012-1498 Accounting Methods & Inventories
#16 25567 2010-0104 Real Estate
#17 25529 2010-0152 Financial Services Industries,International Ta...
#18 21564 2012-1495 IRS Practice & Procedure
#19 21563 2012-1494 Payroll & Employment Tax\Federal
# Parse dtu_topic into a list of categories: normalise commas to
# backslashes, then split on the backslash
foo["dtu_topic_split"] = foo["dtu_topic"].str.replace(',', '\\', regex=False)
foo["dtu_topic_split"] = foo["dtu_topic_split"].str.split('\\')
# Adapted from an R data.table example on Stack Overflow (original kept below); it raises a SyntaxError in Python
dcm = foo[,list(dtu_docid = dtu_docid,
dtu_topic = unlist(dtu_topic.split),
by = 1:nrow(foo)]
#dt.2 <- dt[, list(Probe.Id = Probe.Id,
# Gene.Id = unlist(Gene.Id_split),
# Score.d = Score.d), by = 1:nrow(dt)]
#dcm = unlist(foo.dtu_topic_split)
print(dcm)
}
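A hedged sketch of what the failing `dcm` step seems to be aiming for, using a few rows copied from the printed sample above. It assumes pandas 1.4 or later (for the `regex` argument of `Series.str.split`) and treats both the comma and the backslash as separators, as the code above does. Whether the backslash should really split topics, rather than mark a sub-topic, is an assumption.

```python
import pandas as pd

# A small subset of the printed sample, rebuilt by hand for illustration.
foo = pd.DataFrame({
    "dtu_docid": ["2012-1553", "2012-1552", "2010-0227"],
    "dtu_topic": [
        "Energy Taxation,State & Local Taxation",
        "Legislation & Policy\\Financial Services",
        "Quantitative Economics and Statistics",
    ],
})

# Split on either a comma or a backslash in a single pass.
topics = foo["dtu_topic"].str.split(r"[,\\]", regex=True)

# Keep the document id, attach the lists, and emit one row per topic.
dcm = (
    foo[["dtu_docid"]]
    .assign(dtu_topic=topics)
    .explode("dtu_topic")
    .reset_index(drop=True)
)

print(dcm)
```

Splitting with a character class avoids the intermediate `str.replace` step, and `explode` plays the role that `unlist(...)` with `by = 1:nrow(dt)` plays in the R example.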