python - Filter out non-English sentences from a list in Python Pandas

Question

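I have an Excel file that I read with pandas and stored in a DataFrame, 'df'. The file has 24 columns, which are the "questions", and 631 rows, which are the "responses/answers".

I converted one of these questions to a list so that I can tokenize it and apply further NLP tasks to it.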

df_lst = df['Q8 Why do you say so ?'].values.tolist()

This gives me a list of 631 sentences, some of which are not in English. I want to filter out the non-English sentences so that I end up with a list containing only English sentences.

What I have:

df_lst = ["The excecutive should be able to understand the customer's problem", 'Customers should get correct responses to their queries', 'This text is in a random non english language'...]

Output (what I want):

english_words = ["The excecutive should be able to understand the customer's problem", 'Customers should get correct responses to their queries', ...]

I also read about a Python library called pyenchant that should be able to do this, but it is not compatible with 64-bit Windows and Python 3. Is there any other way to do this?

Thanks!

score 1 · Accepted Answer

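There is another library (closely related to nltk), TextBlob. It was originally geared toward sentiment analysis, but you can still use it for translation; see the documentation here: https://textblob.readthedocs.io/en/dev/quickstart.html

See the section "Translation and Language Detection".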


score 0 · Accepted Answer

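Have you considered using the number of English "stop words" in each sentence? Take a look at the nltk package. You can inspect the English stop words with the following code:

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')  # only needed once, if you just installed the package
set(stopwords.words('english'))  # the English stop words as a set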

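You can add a new column that records the number of English stop words in each sentence. The presence of stop words can serve as a predictor that the sentence is in English.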
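A minimal sketch of the stop-word count, under some assumptions: EN_STOPWORDS is a small hand-picked stand-in so the example runs without downloading anything (with nltk installed, set(stopwords.words('english')) could be used instead), and count_stopwords is a hypothetical helper name, not part of any library.

```python
import re

# Hand-picked stand-in for illustration only (an assumption); the full
# nltk list from set(stopwords.words('english')) could replace it.
EN_STOPWORDS = {
    "the", "a", "an", "and", "or", "to", "of", "in", "is", "are",
    "be", "should", "their", "this", "that", "it", "for", "with", "on",
}

def count_stopwords(sentence, stopwords=EN_STOPWORDS):
    """Return how many tokens of `sentence` appear in `stopwords`."""
    if not isinstance(sentence, str):  # e.g. NaN from an empty Excel cell
        return 0
    # Lowercase first, then split into runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return sum(token in stopwords for token in tokens)

responses = [
    "The excecutive should be able to understand the customer's problem",
    "Customers should get correct responses to their queries",
]
stopword_counts = [count_stopwords(s) for s in responses]
```

With the DataFrame from the question, the same function could fill a new column, for example df['stopword_count'] = df['Q8 Why do you say so ?'].apply(count_stopwords), where 'stopword_count' is just an illustrative column name.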

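Another possible approach, if you know that most of the answers are in English to begin with, is to rank words by frequency (possibly separately for each question in your data). In your example, the word "customer" seems to appear consistently for the question under study. So you could engineer a variable that indicates whether very frequent words are present in an answer. This can also serve as a predictor. Don't forget to convert all words to lowercase or uppercase and to handle plurals and 's, so that you don't count "customer", "Customer", "customers", "Customers", "customer's" and "customers'" as different words.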
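The frequency idea might look roughly like the sketch below. The helper names (normalize, word_frequencies, contains_frequent_word) are hypothetical, and the plural handling is deliberately naive: it only strips a trailing "s" from longer words, so a real stemmer or lemmatizer would do better.

```python
import re
from collections import Counter

def normalize(word):
    """Lowercase a word and strip possessive and simple plural endings."""
    word = word.lower()
    if word.endswith("'s"):          # "customer's" -> "customer"
        word = word[:-2]
    word = word.strip("'")           # "customers'" -> "customers"
    # Naive plural rule (an assumption): drop one trailing "s" on longer words.
    if len(word) > 4 and word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    return word

def tokenize(sentence):
    """Return normalized word tokens; non-string input yields no tokens."""
    if not isinstance(sentence, str):  # e.g. NaN from an empty Excel cell
        return []
    raw = re.findall(r"[A-Za-z']+", sentence)
    return [w for w in (normalize(t) for t in raw) if w]

def word_frequencies(sentences):
    """Count normalized words across all sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(tokenize(sentence))
    return counts

def contains_frequent_word(sentence, frequent_words):
    """True if any normalized token of `sentence` is in `frequent_words`."""
    return any(token in frequent_words for token in tokenize(sentence))

answers = [
    "The customer's problem",
    "Customers should get correct responses",
    "customer",
]
freq = word_frequencies(answers)
top_words = {word for word, _ in freq.most_common(1)}
flags = [contains_frequent_word(s, top_words) for s in answers]
```

How many top words to keep is a judgment call; taking only the single most common word here is purely for illustration.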

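After engineering the variables above, you can set a threshold above which you consider a sentence to be written in English, or you can do something fancier with unsupervised learning.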
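As a rough illustration of the threshold step, the sketch below keeps a sentence when the share of stop words among its tokens reaches a cutoff. Everything here is an assumption made for the example: the small EN_STOPWORDS set is a stand-in for a full stop-word list, the function names are hypothetical, and the 0.2 cutoff is arbitrary and would need tuning on the real responses.

```python
import re

# Hand-picked stand-in for a full English stop-word list (an assumption).
EN_STOPWORDS = {
    "the", "a", "an", "and", "or", "to", "of", "in", "is", "are",
    "be", "should", "their", "this", "that", "it", "for", "with", "on",
}

def stopword_ratio(sentence, stopwords=EN_STOPWORDS):
    """Fraction of tokens in `sentence` that are stop words (0.0 if none)."""
    if not isinstance(sentence, str):  # e.g. NaN from an empty Excel cell
        return 0.0
    tokens = re.findall(r"[a-z']+", sentence.lower())
    if not tokens:  # empty text or a non-Latin script
        return 0.0
    return sum(token in stopwords for token in tokens) / len(tokens)

def keep_english(sentences, min_ratio=0.2):
    """Keep sentences whose stop-word ratio is at least `min_ratio`."""
    return [s for s in sentences if stopword_ratio(s) >= min_ratio]

responses = [
    "The excecutive should be able to understand the customer's problem",
    "Customers should get correct responses to their queries",
    "Der Kunde sollte richtige Antworten bekommen",
]
english_only = keep_english(responses)
```

A ratio is used rather than a raw count so that long and short answers are judged on the same scale. Very short answers, or languages that share short words with English, can still be misclassified.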
