python - Filtering a frame using vectorized logical not in pandas

Question

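I have a pandas DataFrame that I want to trim down. I want to take out the rows where section is 2 and the identifier does not start with a digit. First, I want to count them. If I run this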

len(analytic_events[analytic_events['section']==2].index)

I get the result 1247669

When I narrow it down and run this

len(analytic_events[(analytic_events['section']==2) & ~(analytic_events['identifier'][0].isdigit())].index)

I get exactly the same answer: 1247669

For example, I know that 10 of the rows have this as their identifier

.help.your_tools.subtopic2

which does not start with a digit, and 15,000 rows have this as their identifier

240.1007

which does start with a digit.

Why does my filter pass all of the rows, rather than only those whose identifier does not start with a digit?

score 1 · Accepted Answer

Use str to access the text functions and str[0] to get the first character of each string; a final sum then counts the True values:

mask= ((analytic_events['section']==2) & 
       ~(analytic_events['identifier'].str[0].str.isdigit()))

print (mask.sum())
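As a minimal sketch of the difference between counting and filtering, the snippet below rebuilds the same kind of mask on a small made-up frame (the sample values are assumptions for illustration, not the asker's data) and then uses it both ways:

```python
import pandas as pd

# Hypothetical sample data standing in for analytic_events.
analytic_events = pd.DataFrame({
    "section": [2, 2, 2, 3, 2],
    "identifier": ["4hj", "8hj", "gh", "th", "h6h"],
})

# Elementwise test: is the first character of each identifier a digit?
starts_with_digit = analytic_events["identifier"].str[0].str.isdigit()

# Both conditions are Series of booleans, so & and ~ work row by row.
mask = (analytic_events["section"] == 2) & ~starts_with_digit

count = int(mask.sum())           # number of matching rows
matching = analytic_events[mask]  # the matching rows themselves

print(count)
print(matching)
```

Summing the mask avoids building the filtered frame when only the count is needed; indexing with the mask is the step to use once you actually want the rows.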

If performance is important and there are no missing values, use a list comprehension:

import numpy as np

arr = ~np.array([x[0].isdigit() for x in analytic_events['identifier']])
mask = ((analytic_events['section']==2) & arr)
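The list comprehension fails on a missing value (None or NaN cannot be indexed with [0]) and on an empty string. If the column may contain either, one possible alternative is a regex match with an explicit na argument. This is only a sketch on made-up data: treating missing identifiers as "does not start with a digit" is an assumption you may want to change, and \d can disagree with str.isdigit() for some non-ASCII characters.

```python
import pandas as pd

# Hypothetical sample data, including a missing value and an empty string.
df = pd.DataFrame({
    "section": [2, 2, 2, 3, 2, 2],
    "identifier": ["4hj", "8hj", "gh", "th", None, ""],
})

# str.match anchors the pattern at the start of each string; na=False makes
# missing values count as "no match" instead of propagating as NaN.
starts_with_digit = df["identifier"].str.match(r"\d", na=False)

mask = (df["section"] == 2) & ~starts_with_digit
print(mask.sum())
```

This will likely be slower than the list comprehension on a large frame, so it is a trade of speed for tolerance of dirty data.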

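EDIT:

Why does my filter pass all of the rows, rather than only those whose identifier does not start with a digit?

Testing the output of your solution on sample data: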

import pandas as pd

analytic_events = pd.DataFrame(
                        {'section':[2,2,2,3,2],
                         'identifier':['4hj','8hj','gh','th','h6h']})

print (analytic_events)
   section identifier
0        2        4hj
1        2        8hj
2        2         gh
3        3         th
4        2        h6h

Get the first value of the column (a single string, not a Series):

print ((analytic_events['identifier'][0]))
4hj

Call isdigit on that scalar, then invert it:

print ((analytic_events['identifier'][0].isdigit()))
False

print (~(analytic_events['identifier'][0].isdigit()))
-1

When chained with the first mask, the scalar -1 is treated as True for every row:

print ((analytic_events['section']==2) & ~(analytic_events['identifier'][0].isdigit()))
0     True
1     True
2     True
3    False
4     True
Name: section, dtype: bool

So it behaves as if the second mask did not exist:

print (analytic_events['section']==2)
0     True
1     True
2     True
3    False
4     True
Name: section, dtype: bool
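The behaviour above can be reproduced in isolation. The sketch below uses the same sample frame and puts the scalar path from the question next to the elementwise path. It writes ~int(...) rather than ~ on a bool because recent Python versions emit a deprecation warning for the latter; the value is the same, since bool is a subclass of int.

```python
import pandas as pd

analytic_events = pd.DataFrame({
    "section": [2, 2, 2, 3, 2],
    "identifier": ["4hj", "8hj", "gh", "th", "h6h"],
})
section_mask = analytic_events["section"] == 2

# Scalar path, as in the question: one Python bool for the whole column.
first_is_digit = analytic_events["identifier"].iloc[0].isdigit()  # "4hj".isdigit()
inverted_scalar = ~int(first_is_digit)  # bitwise NOT of 0 is -1, every bit set

# AND-ing with -1 leaves each element of the mask unchanged.
broken = (section_mask & inverted_scalar).astype(bool)

# Elementwise path: one bool per row, so ~ flips each row separately.
per_row = analytic_events["identifier"].str[0].str.isdigit()
fixed = section_mask & ~per_row

print(inverted_scalar)
print(broken.equals(section_mask))
print(fixed.tolist())
```

For a single Python bool, not is the logical negation; ~ is the elementwise operator intended for Series and arrays.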

score 1 · Accepted Answer

You should try using the str attribute of the identifier Series, like this:

len(analytic_events[(analytic_events['section']==2) & ~(analytic_events['identifier'].str[0].str.isdigit())].index)
