python - How to filter rows in pandas by regex

Question

I would like to cleanly filter a dataframe using a regex on one of the columns.

For a contrived example:

In [210]: foo = pd.DataFrame({'a' : [1,2,3,4], 'b' : ['hi', 'foo', 'fat', 'cat']})
In [211]: foo
Out[211]: 
   a    b
0  1   hi
1  2  foo
2  3  fat
3  4  cat

I want to filter the rows down to those whose b value starts with f, using a regex. First attempt:

In [213]: foo.b.str.match('f.*')
Out[213]: 
0    []
1    ()
2    ()
3    []

That's not terribly useful. However, this gives me my boolean index:

In [226]: foo.b.str.match('(f.*)').str.len() > 0
Out[226]: 
0    False
1     True
2     True
3    False
Name: b

So I could then do my restriction by:

In [229]: foo[foo.b.str.match('(f.*)').str.len() > 0]
Out[229]: 
   a    b
1  2  foo
2  3  fat

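That forces me to put an artificial group into the regex, though, and it doesn't seem like the clean way to go. Is there a better way to do this?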

score 240 · Accepted Answer

Use contains instead:

In [10]: foo.b.str.contains('^f')
Out[10]: 
0    False
1     True
2     True
3    False
Name: b, dtype: bool
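
To make the last step explicit, here is a minimal sketch of using that boolean Series to select rows. It assumes the foo frame from the question; mask and filtered are illustrative names, not part of the original answer.

```python
import pandas as pd

foo = pd.DataFrame({'a': [1, 2, 3, 4], 'b': ['hi', 'foo', 'fat', 'cat']})

# str.contains searches anywhere in the string, so '^' anchors the pattern at the start
mask = foo['b'].str.contains('^f')

# Boolean indexing keeps only the rows where the mask is True
filtered = foo[mask]
print(filtered)
```

No capture group is needed, which addresses the concern raised in the question.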

score 42 · Answer

There is already a string handling function, Series.str.startswith(). You should try foo[foo.b.str.startswith('f')].

Result:

    a   b
1   2   foo
2   3   fat

I think that is what you expect.

Alternatively, you can use contains with the regex option. For example:

foo[foo.b.str.contains('oo', regex=True, na=False)]

Result:

    a   b
1   2   foo

na=False prevents errors when the column contains NaN, null, or similar missing values.
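
As a rough illustration of the na=False point, the sketch below uses a hypothetical frame whose string column contains a missing value. The data is made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a missing value in the string column
foo = pd.DataFrame({'a': [1, 2, 3], 'b': ['foo', np.nan, 'boo']})

# na=False makes missing entries count as non-matches, so the mask is purely boolean
mask = foo['b'].str.contains('oo', regex=True, na=False)

filtered = foo[mask]
print(filtered)
```

The row with the missing value is simply dropped from the result.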

score 23 · Answer

It may be a bit late, but this is now easier to do in Pandas by calling Series.str.match. The docs explain the difference between match, fullmatch and contains.

Note that in order to use the results for indexing, set the na=False argument (or na=True if you want to include NaN values in the results).
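
A small sketch of the difference between the three methods, assuming a pandas version that provides Series.str.fullmatch (1.1 or later). The Series s is invented for the example.

```python
import pandas as pd

s = pd.Series(['foo', 'afoo', 'f', None])

# match: the pattern must match at the start of the string
starts = s.str.match('f', na=False)

# fullmatch: the pattern must match the entire string
whole = s.str.fullmatch('f', na=False)

# contains: the pattern may match anywhere in the string
anywhere = s.str.contains('f', na=False)

print(pd.DataFrame({'value': s, 'match': starts, 'fullmatch': whole, 'contains': anywhere}))
```

Each result is a boolean Series, so any of them can be passed straight to s[...] or to a DataFrame for filtering.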

score 20 · Answer

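Multiple column search with a dataframe (note that the pattern must begin with '.*' rather than '*.', and the Windows path needs a raw string with its backslashes and dot escaped):

frame[frame.filename.str.match('.*' + MetaData + '.*') & frame.file_path.str.match(r'C:\\test\\test\.txt')]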
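
A self-contained sketch of the same idea follows. The data and the value of MetaData are hypothetical, and it swaps in str.contains and str.fullmatch together with re.escape in place of the str.match calls above, on the assumption that both values are meant as literal text and not as patterns.

```python
import re
import pandas as pd

# Hypothetical data; the column names follow the answer above
frame = pd.DataFrame({
    'filename': ['report_2020.csv', 'notes.txt', 'report_2021.csv'],
    'file_path': [r'C:\test\test.txt', r'C:\test\test.txt', r'C:\other\test.txt'],
})
MetaData = 'report'

# re.escape makes literal text safe to embed in a pattern (it escapes '\' and '.')
name_mask = frame['filename'].str.contains(re.escape(MetaData), na=False)
path_mask = frame['file_path'].str.fullmatch(re.escape(r'C:\test\test.txt'), na=False)

# Combine the per-column masks element-wise with & (and) or | (or)
result = frame[name_mask & path_mask]
print(result)
```

Building each mask on its own line also avoids the operator-precedence surprises that come from writing comparisons inline around &.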

score 16 · Answer

Building on the great answer by user3136169, here is an example of how that might be done while also removing NoneType values.

import re

def regex_filter(val):
    # `regex` is a pattern defined elsewhere; None and empty values never match
    if val:
        return bool(re.search(regex, val))
    return False

df_filtered = df[df['col'].apply(regex_filter)]

You can also pass the regex as an argument (the keyword given to apply must match the parameter name):

def regex_filter(val,myregex):
    ...

df_filtered = df[df['col'].apply(regex_filter,myregex=myregex)]
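
One possible complete version of the parameterised filter is sketched below. The function body is an assumption, since the answer elides it, and the sample data is invented; it relies on Series.apply forwarding extra keyword arguments to the function.

```python
import re
import pandas as pd

def regex_filter(val, myregex):
    # Treat None, NaN and other non-string values as non-matches
    if not isinstance(val, str):
        return False
    return re.search(myregex, val) is not None

df = pd.DataFrame({'col': ['foo', None, 'bar', 'fat']})

# Series.apply forwards extra keyword arguments to the function,
# so the keyword must match the parameter name (myregex)
df_filtered = df[df['col'].apply(regex_filter, myregex='^f')]
print(df_filtered)
```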

score 12 · Answer

Write a Boolean function that checks the regex, and use apply on the column:

foo[foo['b'].apply(regex_function)]
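
The answer does not show regex_function, so the following is only a sketch of what such a predicate might look like. The pattern and the name STARTS_WITH_F are assumptions chosen to match the question.

```python
import re
import pandas as pd

# Compile once so the pattern is not re-parsed for every row
STARTS_WITH_F = re.compile(r'^f')

def regex_function(value):
    # Hypothetical predicate: True only for strings that match the pattern
    return isinstance(value, str) and STARTS_WITH_F.search(value) is not None

foo = pd.DataFrame({'a': [1, 2, 3, 4], 'b': ['hi', 'foo', 'fat', 'cat']})
result = foo[foo['b'].apply(regex_function)]
print(result)
```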

score 1 · Answer

Using Python's built-in support for lambda expressions, we can filter by an arbitrary regex operation as follows:

import re  

# with foo being our pd dataframe
foo[foo['b'].apply(lambda x: True if re.search('^f', x) else False)]

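With re.search you can filter by complex regex-style queries, which in my opinion is more powerful (str.contains is rather limited).

Also worth mentioning: you want the string to start with a lowercase "f". With re.search, the regex f.* matches an f at any position in the text. The ^ symbol states explicitly that the match must be at the beginning of the content, so ^f is probably the better choice :)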
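
The anchoring point can be checked with a short sketch. The Series s is invented for the example, and the last line is an addition showing that Series.str.match, unlike re.search, is already anchored at the start of the string.

```python
import re
import pandas as pd

s = pd.Series(['foo', 'raft', 'cat'])

# re.search scans the whole string, so 'f.*' also matches the 'f' inside 'raft'
unanchored = s.apply(lambda x: bool(re.search('f.*', x)))

# '^f' only matches when the string begins with 'f'
anchored = s.apply(lambda x: bool(re.search('^f', x)))

# Series.str.match is anchored at the start, like re.match, so no '^' is needed
matched = s.str.match('f.*')

print(pd.DataFrame({'value': s, 'f.*': unanchored, '^f': anchored, 'str.match': matched}))
```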

score 1 · Answer

Using str slicing:

foo[foo.b.str[0]=='f']
Out[18]: 
   a    b
1  2  foo
2  3  fat
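
For comparison, a brief sketch showing that the slicing approach and Series.str.startswith select the same rows on the question's data. The names by_slice and by_prefix are illustrative.

```python
import pandas as pd

foo = pd.DataFrame({'a': [1, 2, 3, 4], 'b': ['hi', 'foo', 'fat', 'cat']})

# .str[0] takes the first character of each string; comparing gives a boolean mask
by_slice = foo[foo['b'].str[0] == 'f']

# startswith expresses the same test without indexing into the string
by_prefix = foo[foo['b'].str.startswith('f')]

print(by_slice.equals(by_prefix))
```

Neither form involves a regex, which is worth considering when the condition is a fixed prefix.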

Answered 2018-12-30T03:12:39.840
