
======================== UPDATE #2 ========================

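What a beautiful day. I'm making very slow progress. pandas is very fast and powerful, but it has a steep learning curve and there aren't many good examples (at least for what I'm trying to do).

The latest issue is with one specific line: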

 catfile = infile[infile['dtu_topic_split'].map(lambda x: any(targetcat in x))]

It works in the IPython Notebook but not from a script on Ubuntu with Python 2.7.

Here is the error on Ubuntu:

    Traceback (most recent call last):
      File "scikit2.py", line 27, in <module>
        catfile = infile[infile['dtu_topic_split'].map(lambda x: any(targetcat in x))]
      File "/usr/local/lib/python2.7/dist-packages/pandas-0.11.0-py2.7-linux-x86_64.egg/pandas/core/series.py", line 2408, in map
        mapped = map_f(values, arg)
      File "inference.pyx", line 861, in pandas.lib.map_infer (pandas/lib.c:41822)
      File "scikit2.py", line 27, in <lambda>
        catfile = infile[infile['dtu_topic_split'].map(lambda x: any(targetcat in x))]
    TypeError: 'bool' object is not iterable

And here is the working code and its result in the IPython Notebook:

targetcat = 'Financial Services Industries'
#targetcat = 'Payroll & Employment Tax'
criterion = foo[foo['dtu_topic_split'].map(lambda x: any(targetcat in x))]
print criterion[['dtu_docid','dtu_topic_split']][:10]



     dtu_docid                                    dtu_topic_split
9    2010-0185                    [Financial Services Industries]
17   2010-0152  [Financial Services Industries, International ...
46   2012-1421  [Financial Services Industries, Payroll & Empl...
49   2012-1413  [Financial Services Industries, Payroll & Empl...
66   2012-1370  [Energy Taxation, Financial Services Industrie...
94   2009-1786                    [Financial Services Industries]
144  2012-1170       [Financial Services Industries, Real Estate]
163  2012-1101       [Financial Services Industries, Real Estate]
170  2009-1386                    [Financial Services Industries]
249  2012-0754  [Expatriate Taxation, Financial Services Indus...

This is the Python version in the IPython Notebook:

print sys.version
2.7.4 (default, Apr 19 2013, 18:28:01) 
[GCC 4.7.3]

And from Ubuntu:

>>> import sys
>>> print sys.version
2.7.4 (default, Apr 19 2013, 18:28:01) 
[GCC 4.7.3]
>>> 

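I need help. If I used traditional processing, I'm sure I could finish this data setup and munging. I'm still trying pandas, but it's tough sledding, and the saddest part is that I'm not even sure why the things I do get working actually work. These kinds of errors breed frustration.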

======================== UPDATE #1 ========================

Using the information in the first answer (thanks tshauck), I found a way to solve the problem:

targetcat = 'International Taxation'
criterion = foo[foo['dtu_topic_split'].map(lambda x: any(targetcat in x))]

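This produces the rows where targetcat appears in the dataframe's dtu_topic_split series. Given that I'm new to pandas, is this the best way to handle it? My goal is to build separate training modules for each of 30-50 categories. I'm not sure whether I should iterate over the roughly 100K records in a more traditional Python style or use pandas techniques. Again, any alternatives or suggestions would be greatly appreciated.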


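I'm new to pandas and working hard to learn how to take advantage of its power. Yesterday I posted a strategy for approaching this problem by building separate dataframes. After reading more, I'm not sure it is the most efficient one. I've tried several techniques to select specific rows in a dataframe based on the presence of a particular value in a series field of the dataframe. Below are a sample of the data and my attempts.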

print foo[['dtu_docid','dtu_topic_split']]

/home/davidwaldrop/Dropbox/Miscelaneous/E&Y M&C Project/scikit training
   dtu_docid                                    dtu_topic_split
0  2012-1553          [Energy Taxation, State & Local Taxation]
1  2012-1552         [Legislation & Policy, Financial Services]
2  2010-0227            [Quantitative Economics and Statistics]
3  2010-0215                     [International Taxation, Asia]
4  2012-1529  [Ernst & Young Newsletters, This Week in Tax R...

Here is what I'm doing now, to no avail:

targetcat = ['International Taxation']

criterion = foo['dtu_topic_split'].map(lambda x: x == targetcat)

print foo[criterion]

Empty DataFrame
Columns: [id, dtu_docid, dtu_topic, dtu_content, dtu_topic_split]
Index: []

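What I would like is a dataframe that includes the records where 'International Taxation' is in the series stored in the field dtu_topic_split. In the example above, that is the record at foo[3], which has a dtu_topic_split value of [International Taxation, Asia].

As I mentioned, I'm really trying to learn pandas and think it is very powerful. As a newbie, it is very hard to find not just a way to do what I want, but the best way and the rationale for it. My gut tells me this might best be done with indexing, but I haven't even used that feature yet. Any insight is much appreciated.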

2 Answers

Hopefully I understand your specific use case well enough to provide a decent answer.

Given some data:

data = """
dtu_docid|dtu_topic_split
9|2010-0185|['Financial Services Industries']
17|2010-0152|['Financial Services Industries', 'International']
46|2012-1421|['Financial Services Industries', 'Payroll & Employment Tax']
49|2012-1413|['Financial Services Industries', 'Payroll & Employment Tax']
66|2012-1370|['Energy Taxation', 'Financial Services Industries']
94|2009-1786|['Financial Services Industries']
144|2012-1170|['Financial Services Industries', 'Real Estate']
163|2012-1101|['Financial Services Industries', 'Real Estate']
170|2009-1386|['Financial Services Industries']
249|2012-0754|['Expatriate Taxation', 'Financial Services Industries']
""".split('\n')

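And considering this from the question:

"What I would like is a dataframe that includes the records where 'International Taxation' is in the series stored in the field dtu_topic_split"

You might put that into a DataFrame: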

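import ast

import pandas as pd

# drop the empty strings left over from splitting on newlines
rows = [row for row in data if len(row) > 0]

cleaned = []
for i, row in enumerate(rows):
    row = row.split('|')
    if i == 0:
        headers = row
    else:
        row = row[1:] # get rid of the index
        # parse the list literal; literal_eval is a safer stand-in for eval
        row[1] = ast.literal_eval(row[1])
        cleaned.append(row)

df = pd.DataFrame(cleaned, columns=headers)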

Which looks like this:

print df
   dtu_docid                                    dtu_topic_split
0  2010-0185                    [Financial Services Industries]
1  2010-0152     [Financial Services Industries, International]
2  2012-1421  [Financial Services Industries, Payroll & Empl...
3  2012-1413  [Financial Services Industries, Payroll & Empl...
4  2012-1370   [Energy Taxation, Financial Services Industries]
5  2009-1786                    [Financial Services Industries]
6  2012-1170       [Financial Services Industries, Real Estate]
7  2012-1101       [Financial Services Industries, Real Estate]
8  2009-1386                    [Financial Services Industries]
9  2012-0754  [Expatriate Taxation, Financial Services Indus...

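Now you have this awkward dtu_topic_split column, where each value is a Python list. That's a bit tricky to deal with.

To select the rows that contain the item you're interested in, you can use apply with a lambda function. For example: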

print df.dtu_topic_split.apply(lambda x: 'Energy Taxation' in x)

This gives you a boolean Series:

0    False
1    False
2    False
3    False
4     True
5    False
6    False
7    False
8    False
9    False
Name: dtu_topic_split, dtype: bool

You can then pass that to df[...] using subscript notation.

energy = df[df.dtu_topic_split.apply(lambda x: 'Energy Taxation' in x)]

print energy
   dtu_docid                                   dtu_topic_split
4  2012-1370  [Energy Taxation, Financial Services Industries]
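
A side note on the TypeError in the question: `targetcat in x` already evaluates to a single bool, so the built-in `any()` has nothing to iterate over. The following is only a minimal sketch. The three-row frame is made up for illustration, and the idea that the notebook's `any` was NumPy's version (for example through pylab mode) is a guess that the question does not confirm.

```python
import numpy as np
import pandas as pd

# Made-up sample in the same shape as the question's data (illustration only).
df = pd.DataFrame({
    'dtu_docid': ['2010-0185', '2012-1370', '2012-1170'],
    'dtu_topic_split': [
        ['Financial Services Industries'],
        ['Energy Taxation', 'Financial Services Industries'],
        ['Financial Services Industries', 'Real Estate'],
    ],
})
targetcat = 'Energy Taxation'

# `in` on a list returns one bool per row, which is all the mask needs.
mask = df['dtu_topic_split'].apply(lambda topics: targetcat in topics)
energy = df[mask]

# The built-in any() expects an iterable, so passing it a bool raises TypeError.
try:
    any(targetcat in df['dtu_topic_split'].iloc[0])
    builtin_error = None
except TypeError as exc:
    builtin_error = str(exc)

# numpy.any accepts a scalar bool. If `any` had been shadowed by NumPy's
# version (an assumption, not stated in the question), the original lambda
# would run without error.
numpy_result = bool(np.any(targetcat in df['dtu_topic_split'].iloc[1]))
```

Dropping `any()` from the lambda, as in the `apply` call above, sidesteps the difference between the two environments.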

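Another approach that might work better is to convert the data to long format.

Going back to the cleaned variable (a list of lists), you can write a small function to "stack" the rows that have multiple topics.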

def make_long(cleaned):
    lng = []
    for row in cleaned:
        # row is a list of length 2
        topics = row[1] # second item is the list of topics
        dtu_docid = row[0]
        for topic in topics:
            lng.append([dtu_docid, topic])

    return lng

In this case, cleaned is 10 rows long. When you call make_long, you end up with 17 rows, because any row with more than one topic shows up more than once.

make_long(cleaned)
Out[208]: 
[['2010-0185', 'Financial Services Industries'],
 ['2010-0152', 'Financial Services Industries'],
 ['2010-0152', 'International'],
 ['2012-1421', 'Financial Services Industries'],
 ['2012-1421', 'Payroll & Employment Tax'],
 ['2012-1413', 'Financial Services Industries'],
 ['2012-1413', 'Payroll & Employment Tax'],
 ['2012-1370', 'Energy Taxation'],
 ['2012-1370', 'Financial Services Industries'],
 ['2009-1786', 'Financial Services Industries'],
 ['2012-1170', 'Financial Services Industries'],
 ['2012-1170', 'Real Estate'],
 ['2012-1101', 'Financial Services Industries'],
 ['2012-1101', 'Real Estate'],
 ['2009-1386', 'Financial Services Industries'],
 ['2012-0754', 'Expatriate Taxation'],
 ['2012-0754', 'Financial Services Industries']]

Then you can stick that into a DataFrame and you should be in business.

lng = pd.DataFrame(make_long(cleaned),
    columns=['dtu_docid', 'dtu_topic_split'])

print lng
    dtu_docid                dtu_topic_split
0   2010-0185  Financial Services Industries
1   2010-0152  Financial Services Industries
2   2010-0152                  International
3   2012-1421  Financial Services Industries
4   2012-1421       Payroll & Employment Tax
5   2012-1413  Financial Services Industries
6   2012-1413       Payroll & Employment Tax
7   2012-1370                Energy Taxation
8   2012-1370  Financial Services Industries
9   2009-1786  Financial Services Industries
10  2012-1170  Financial Services Industries
11  2012-1170                    Real Estate
12  2012-1101  Financial Services Industries
13  2012-1101                    Real Estate
14  2009-1386  Financial Services Industries
15  2012-0754            Expatriate Taxation
16  2012-0754  Financial Services Industries
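
In more recent pandas releases (0.25 and later, so not the 0.11.0 shown in the question's traceback), `DataFrame.explode` can do the same reshaping without a helper function. This is a sketch of that alternative rather than the method used above, and the sample rows are made up for illustration.

```python
import pandas as pd

# Made-up sample rows (illustration only), one list of topics per document.
df = pd.DataFrame({
    'dtu_docid': ['2010-0185', '2010-0152', '2012-1170'],
    'dtu_topic_split': [
        ['Financial Services Industries'],
        ['Financial Services Industries', 'International'],
        ['Financial Services Industries', 'Real Estate'],
    ],
})

# explode() emits one row per list element and repeats the other columns;
# reset_index gives the long frame a fresh 0..n-1 index.
lng = df.explode('dtu_topic_split').reset_index(drop=True)
```

Three documents with five topics between them become five rows, mirroring the 10-to-17 expansion above.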

This way, you can use the isin method on the pd.Series object to select the rows for one or more topics at once:

selected = ['Financial Services Industries', 'Real Estate']
print lng[lng.dtu_topic_split.isin(selected)]

    dtu_docid                dtu_topic_split
0   2010-0185  Financial Services Industries
1   2010-0152  Financial Services Industries
3   2012-1421  Financial Services Industries
5   2012-1413  Financial Services Industries
8   2012-1370  Financial Services Industries
9   2009-1786  Financial Services Industries
10  2012-1170  Financial Services Industries
11  2012-1170                    Real Estate
12  2012-1101  Financial Services Industries
13  2012-1101                    Real Estate
14  2009-1386  Financial Services Industries
16  2012-0754  Financial Services Industries
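
If the goal is one subset per category, as the question describes for its 30-50 categories, the long format also works with `groupby`. The following is a rough sketch under that assumption. The sample rows and the `by_topic` and `real_estate_ids` names are hypothetical.

```python
import pandas as pd

# Made-up long-format rows (illustration only): one (document, topic) pair each.
lng = pd.DataFrame(
    [
        ['2010-0185', 'Financial Services Industries'],
        ['2012-1370', 'Energy Taxation'],
        ['2012-1370', 'Financial Services Industries'],
        ['2012-1170', 'Financial Services Industries'],
        ['2012-1170', 'Real Estate'],
    ],
    columns=['dtu_docid', 'dtu_topic_split'],
)

# groupby yields (topic, sub-frame) pairs, so a dict comprehension collects
# one DataFrame per category without a separate isin call for each.
by_topic = {topic: group for topic, group in lng.groupby('dtu_topic_split')}

# Document ids that belong to the 'Real Estate' category.
real_estate_ids = by_topic['Real Estate']['dtu_docid'].tolist()
```

Each value in `by_topic` is an ordinary DataFrame, so it can be joined back to the full records on dtu_docid if the document text is needed.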

Hope some of that helps!

answered 2013-07-20T05:20:07.917

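This may not be the exact cause of your problem, but one thing that stands out to me is that you are comparing two lists for exact equality, when (if I understand correctly) you want to test whether targetcat is present in dtu_topic_split, which I'm guessing is a list of topics.

Assuming that's the case, this might work: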

targetcat = ['International Taxation']

criterion = foo['dtu_topic_split'].map(lambda possiblecat: \
    any([t in p for t in targetcat for p in possiblecat]))

I haven't tested this, but I think it should return True if any category in targetcat is contained in possiblecat.
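
To make the difference between equality and membership concrete, here is a small sketch on made-up rows (the docid '2012-0001' is invented for illustration). Note that the version above evaluates `t in p` against each topic string, which is a substring match. The sketch tests list membership instead, so only whole topics match.

```python
import pandas as pd

# Made-up sample (illustration only).
foo = pd.DataFrame({
    'dtu_docid': ['2012-1553', '2010-0215', '2012-0001'],
    'dtu_topic_split': [
        ['Energy Taxation', 'State & Local Taxation'],
        ['International Taxation', 'Asia'],
        ['International Taxation'],
    ],
})
targetcat = ['International Taxation']

# Exact list equality: only a row whose list is exactly targetcat matches.
exact = foo['dtu_topic_split'].map(lambda possiblecat: possiblecat == targetcat)

# Membership: any() consumes a generator of bools, one per target topic.
contains = foo['dtu_topic_split'].map(
    lambda possiblecat: any(t in possiblecat for t in targetcat))

matches = foo[contains]
```

The equality test misses the [International Taxation, Asia] row, while the membership test keeps it.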

answered 2013-06-30T14:15:19.187