I am trying to clean up a column:

df:
+-----+------------------+--------------------+--------------------+--------------+--------------+
|     | league           | home_team          | away_team          | home_score   | away_score   |
+=====+==================+====================+====================+==============+==============+
|   0 | Champions League | APOEL              | Qarabag            | 1            | 2            |
+-----+------------------+--------------------+--------------------+--------------+--------------+
|   1 | Champions League | FC Copenhagen      | TNS                | 1            | 0            |
+-----+------------------+--------------------+--------------------+--------------+--------------+
|   2 | Champions League | AIK                | Maribor            | 3            | 2 ET         |
+-----+------------------+--------------------+--------------------+--------------+--------------+

Expected:

df:
+-----+------------------+--------------------+--------------------+--------------+--------------+
|     | league           | home_team          | away_team          | home_score   | away_score   |
+=====+==================+====================+====================+==============+==============+
|   0 | Champions League | APOEL              | Qarabag            | 1            | 2            |
+-----+------------------+--------------------+--------------------+--------------+--------------+
|   1 | Champions League | FC Copenhagen      | TNS                | 1            | 0            |
+-----+------------------+--------------------+--------------------+--------------+--------------+
|   2 | Champions League | AIK                | Maribor            | 3            | 2            |
+-----+------------------+--------------------+--------------------+--------------+--------------+

I tried:

df['away_score'] = df['away_score'].astype(str).str.replace(r'(\s?\w+)$', '', regex=True)

(This works on regex101 but not in pandas.)

But every value in the column gets replaced, and the scores that had no suffix end up empty:

+-----+------------------+--------------------+--------------------+--------------+--------------+
|     | league           | home_team          | away_team          | home_score   | away_score   |
+=====+==================+====================+====================+==============+==============+
|   0 | Champions League | APOEL              | Qarabag            | 1            |              |
+-----+------------------+--------------------+--------------------+--------------+--------------+
|   1 | Champions League | FC Copenhagen      | TNS                | 1            |              |
+-----+------------------+--------------------+--------------------+--------------+--------------+
|   2 | Champions League | AIK                | Maribor            | 3            | 2            |
+-----+------------------+--------------------+--------------------+--------------+--------------+

What would the correct regular expression be?

2 Answers

I tried this regular expression and it worked:

df['away_score'] = df['away_score'].astype(str).str.replace('[a-zA-Z]', '', regex=True)
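
As a rough sketch of why the pattern from the question empties the plain scores: \w also matches digits, so '(\s?\w+)$' matches the entire string '2', not just a trailing 'ET'. The small Series below is an assumption that mirrors the away_score column from the question.

```python
import pandas as pd

# Assumed sample values, mirroring the away_score column in the question
scores = pd.Series(['2', '0', '2 ET'])

# \w matches digits as well as letters, so for '2' the whole string is
# "a run of word characters at the end" and is replaced with ''
original = scores.str.replace(r'(\s?\w+)$', '', regex=True)

# Restricting the match to letters leaves the digits untouched
letters_only = scores.str.replace(r'[a-zA-Z]', '', regex=True)

print(original.tolist())
print(letters_only.tolist())
```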
answered 2021-06-14T04:29:28.617

To remove the text completely (including whitespace), you should use:

df['away_score'] = df['away_score'].astype(str).str.replace(r'[a-zA-Z\s]', '', regex=True)

This way you also remove the whitespace before the letters, for example the space before ET in '2 ET'.
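
A minimal sketch of the difference, assuming a small Series that stands in for the away_score column: without \s in the character class, the space in front of ET is left behind.

```python
import pandas as pd

# Assumed sample values standing in for df['away_score']
scores = pd.Series(['2', '0', '2 ET'])

# Letters only: the space that preceded 'ET' survives
without_space_class = scores.str.replace(r'[a-zA-Z]', '', regex=True)

# Letters and whitespace: the value is reduced to the bare digit
with_space_class = scores.str.replace(r'[a-zA-Z\s]', '', regex=True)

print(without_space_class.tolist())  # ['2', '0', '2 ']
print(with_space_class.tolist())     # ['2', '0', '2']
```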

If you want to remove not only letters but every non-digit character, including symbols (leaving only the digits), you can use:

df['away_score'] = df['away_score'].astype(str).str.replace(r'\D', '', regex=True)
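
One caveat worth sketching: because \D deletes everything that is not a digit, digits that were separated by other characters get glued together. The last value below is hypothetical (it does not appear in the question) and is only there to show the effect. Extracting the leading number with str.extract is shown as a possible alternative, not as the approach used above.

```python
import pandas as pd

# The first three values mirror the question; the last one is hypothetical
scores = pd.Series(['2', '0', '2 ET', '3 (5-4 pen)'])

# Removing every non-digit merges '3', '5' and '4' into one string
digits_only = scores.str.replace(r'\D', '', regex=True)

# Alternative sketch: keep only the leading run of digits, then cast to int
leading = scores.str.extract(r'^(\d+)', expand=False).astype(int)

print(digits_only.tolist())  # ['2', '0', '2', '354']
print(leading.tolist())      # [2, 0, 2, 3]
```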
answered 2021-06-14T04:53:48.683