python - Problem replacing text that contains apostrophes in a Pandas DataFrame

Question

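I am working with a Pandas DataFrame that I read from Excel, and I want to find and replace contractions in the text (e.g., don't -> do not). The code I am using works when replacing text that does not contain an apostrophe, but not for words that do contain one.

I have defined a dictionary that specifies which replacements to make. An example is provided below, along with the code that performs the replacement.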

import pandas as pd

contractions_dict = {
'ain\'t': 'is not', 'aren\'t': 'are not', 'can\'t': 'can not', '\'cause': "because",
'coz': "because", 'cos': "because", 'could\'ve': "could have", 'couldn\'t': "could not",
'didn\'t': "did not", 'doesn\'t': "does not", 'don\'t': 'do not',
'no contractions': 'TEST'
}

regex_dict = {r"(\b){}(\b)".format(k):r"\1{}\2".format(v) for k,v in contractions_dict.items()}
regex_dict


data = {'Text_with_contractions': ['Text with no contractions', "Text with contractions doesn't work", 'More text']}
df = pd.DataFrame(data)

df['Text_with_no_contractions'] = df['Text_with_contractions'].replace(regex_dict, regex=True)
df['Text_with_contractions'].iloc[1]

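Strangely, the code above works when I test it on a DataFrame I created manually, but it does not work on the DataFrame I read from Excel. Any idea why?

Here is the manually created DataFrame on which it works: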

data = {'Text_with_contractions': ['Text with no contractions', "Text with contractions doesn't work", 'More text']}
df = pd.DataFrame(data)

Here is the code I use to read in the DataFrame on which it does not work:

df = pd.read_excel(path + "output.xlsx", encoding = "UTF-8")

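I have tried putting an escape character before the apostrophe (as above). I have also tried both double-quoted and single-quoted strings for the apostrophe.

I would be grateful if anyone could help identify why this does not work on the data read from Excel and suggest a solution.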

score 0 · Accepted Answer

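OK, so I found the problem. The dictionary uses the straight apostrophe ' (U+0027), but the DataFrame contains a different, look-alike apostrophe character (presumably the typographic ’, U+2019), so the patterns never matched.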
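To check which apostrophe character a string actually contains, printing the Unicode code points can help. This is only a minimal sketch: `describe_quote_like` is a hypothetical helper, and the `from_excel` value is an assumed example of what a cell read from Excel might hold, not data from the question.

```python
import unicodedata

def describe_quote_like(text):
    """List (code point, Unicode name) for every ASCII apostrophe or non-ASCII character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ch == "'" or ord(ch) > 127
    ]

typed = "doesn't"            # straight apostrophe, as typed in the dictionary
from_excel = "doesn\u2019t"  # assumed cell value with a typographic apostrophe

print(describe_quote_like(typed))       # [('U+0027', 'APOSTROPHE')]
print(describe_quote_like(from_excel))  # [('U+2019', 'RIGHT SINGLE QUOTATION MARK')]
print(typed == from_excel)              # False
```

The two strings look the same on screen but compare unequal, which is why a pattern built from one never matches the other.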

Everything works now.
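One way to make the replacement robust against this is to normalize look-alike apostrophes to the ASCII one before matching. The following is a sketch under assumptions, not necessarily what I ended up doing: the list of characters in `APOSTROPHE_TABLE` is a guess at what might appear in the data, `expand_contractions` is a hypothetical helper, and the dictionary is a shortened version of the one in the question.

```python
import re

# Apostrophe look-alikes mapped to the ASCII apostrophe (U+0027).
# This list is an assumption; extend it with whatever the data actually contains.
APOSTROPHE_TABLE = str.maketrans({
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u02bc": "'",  # MODIFIER LETTER APOSTROPHE
    "\u0060": "'",  # GRAVE ACCENT
    "\u00b4": "'",  # ACUTE ACCENT
})

# Shortened version of the dictionary from the question.
contractions = {"doesn't": "does not", "don't": "do not", "can't": "can not"}

# One alternation of escaped keys, longest first, bounded by word boundaries.
pattern = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(contractions, key=len, reverse=True))
    + r")\b"
)

def expand_contractions(text):
    """Normalize apostrophes, then replace each known contraction."""
    normalized = text.translate(APOSTROPHE_TABLE)
    return pattern.sub(lambda m: contractions[m.group(1)], normalized)

print(expand_contractions("Text with contractions doesn\u2019t work"))
# Text with contractions does not work
```

On a DataFrame, the same table could probably be applied with `df['Text_with_contractions'].str.translate(APOSTROPHE_TABLE)` before calling `.replace(regex_dict, regex=True)` as in the question.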
