python - What is the Python regex to remove all text from a column?

Question

I am trying to clean up a column:

df:
+-----+------------------+--------------------+--------------------+--------------+--------------+
|     | league           | home_team          | away_team          | home_score   | away_score   |
+=====+==================+====================+====================+==============+==============+
|   0 | Champions League | APOEL              | Qarabag            | 1            | 2            |
+-----+------------------+--------------------+--------------------+--------------+--------------+
|   1 | Champions League | FC Copenhagen      | TNS                | 1            | 0            |
+-----+------------------+--------------------+--------------------+--------------+--------------+
|   2 | Champions League | AIK                | Maribor            | 3            | 2 ET         |
+-----+------------------+--------------------+--------------------+--------------+--------------+

Expected:

df:
+-----+------------------+--------------------+--------------------+--------------+--------------+
|     | league           | home_team          | away_team          | home_score   | away_score   |
+=====+==================+====================+====================+==============+==============+
|   0 | Champions League | APOEL              | Qarabag            | 1            | 2            |
+-----+------------------+--------------------+--------------------+--------------+--------------+
|   1 | Champions League | FC Copenhagen      | TNS                | 1            | 0            |
+-----+------------------+--------------------+--------------------+--------------+--------------+
|   2 | Champions League | AIK                | Maribor            | 3            | 2            |
+-----+------------------+--------------------+--------------------+--------------+--------------+

I am trying:

df['away_score'] = df['away_score'].astype(str).str.replace('(\s?\w+)$', '', regex=True)

(it works on regex101 but not in pandas)

But then all the plain numeric values in the column are replaced with empty strings:

+-----+------------------+--------------------+--------------------+--------------+--------------+
|     | league           | home_team          | away_team          | home_score   | away_score   |
+=====+==================+====================+====================+==============+==============+
|   0 | Champions League | APOEL              | Qarabag            | 1            |              |
+-----+------------------+--------------------+--------------------+--------------+--------------+
|   1 | Champions League | FC Copenhagen      | TNS                | 1            |              |
+-----+------------------+--------------------+--------------------+--------------+--------------+
|   2 | Champions League | AIK                | Maribor            | 3            | 2            |
+-----+------------------+--------------------+--------------------+--------------+--------------+

What should the correct regex be?

score 2 · Accepted Answer

I tried this regex and it worked.

df['away_score'] = df['away_score'].astype(str).str.replace('[a-zA-Z]', '', regex=True)
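
For context, here is a minimal sketch of why the pattern from the question emptied the plain scores while this one does not. The `scores` Series is a made-up sample that only mirrors the `away_score` values shown in the question.

```python
import pandas as pd

# Hypothetical sample mirroring the away_score column from the question
scores = pd.Series(["2", "0", "2 ET"])

# \w matches digits as well as letters, so a bare "2" is matched and removed
question_result = scores.str.replace(r"(\s?\w+)$", "", regex=True)

# [a-zA-Z] matches letters only, so the digits survive
answer_result = scores.str.replace(r"[a-zA-Z]", "", regex=True)

print(question_result.tolist())  # ['', '', '2']
print(answer_result.tolist())    # ['2', '0', '2 ']
```

Note that the letters-only pattern leaves the space that preceded `ET` in place.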

score 1 · Accepted Answer

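To remove the text completely, including whitespace, you should use:

df['away_score'] = df['away_score'].astype(str).str.replace(r'[a-zA-Z\s]', '', regex=True)

This way you also remove the whitespace before the letters, for example the space before ET in "2 ET".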
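
A small sketch of the difference, assuming a sample Series (`away_score` here is illustrative data, not the asker's full frame):

```python
import pandas as pd

# Hypothetical sample mirroring the away_score column from the question
away_score = pd.Series(["2", "0", "2 ET"])

# Letters only: the space before "ET" is left behind
letters_only = away_score.str.replace(r"[a-zA-Z]", "", regex=True)

# Letters and whitespace: nothing but the digit remains
letters_and_space = away_score.str.replace(r"[a-zA-Z\s]", "", regex=True)

print(letters_only.tolist())       # ['2', '0', '2 ']
print(letters_and_space.tolist())  # ['2', '0', '2']
```

The raw-string prefix `r'...'` keeps Python from treating `\s` as a string escape; the pattern itself is unchanged.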

If you want to remove not only letters but every non-digit character, including symbols (leaving only the digits), you can use:

df['away_score'] = df['away_score'].astype(str).str.replace(r'\D', '', regex=True)
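
As a rough sketch of where this might lead, the cleaned strings can then be converted to numbers. The small `df` below is a made-up stand-in for the frame in the question, and the `pd.to_numeric` step is an assumption about what you want next, not part of the original answer.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame in the question
df = pd.DataFrame({
    "home_score": ["1", "1", "3"],
    "away_score": ["2", "0", "2 ET"],
})

# Strip every non-digit character, leaving only the digits
digits = df["away_score"].astype(str).str.replace(r"\D", "", regex=True)

# errors="coerce" gives NaN for any value that cannot be parsed as a number
df["away_score"] = pd.to_numeric(digits, errors="coerce")

print(df["away_score"].tolist())  # [2, 0, 2]
```

Keep in mind that `\D` also strips minus signs and decimal points, so this only suits columns of non-negative whole numbers such as scores.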
