python-3.x - Extract acronyms and Māori (non-English) words in a dataframe, and put them in adjacent columns within the dataframe

Question

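Regular expressions have a steep learning curve for me. I have a dataframe that contains text (up to 300,000 rows). The text, held in the outcome column of a dummy file named foo_df.csv, is a mix of English words, acronyms and Māori words. foo_df.csv looks like this: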

    outcome
0   I want to go to DHB
1   Self Determination and Self-Management Rangatiratanga
2   mental health wellness and AOD counselling
3   Kai on my table
4   Fishing
5   Support with Oranga Tamariki Advocacy
6   Housing pathway with WINZ
7   Deal with personal matters
8   Referral to Owaraika Health services

The result I want is in the form of the table below, with Abbreviation and Māori_word columns:

    outcome                                                 Abbreviation     Māori_word             
0   I want to go to DHB                                     DHB      
1   Self Determination and Self-Management Rangatiratanga                    Rangatiratanga
2   mental health wellness and AOD counselling              AOD              
3   Kai on my table                                                          Kai
4   Fishing                                                                  
5   Support with Oranga Tamariki Advocacy                                    Oranga Tamariki
6   Housing pathway with WINZ                               WINZ             
7   Deal with personal matters                                               
8   Referral to Owaraika Health services                                     Owaraika

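My approach is to use a regular expression to extract the abbreviations and the nltk module to extract the Māori words.

I have been able to extract the abbreviations with a regular expression using the following code: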

pattern = '(\\b[A-Z](?:[\\.&]?[A-Z]){1,7}\\b)'
foo_df['Abbreviation'] = foo_df.outcome.str.extract(pattern)

I have been able to extract the non-English words from a single sentence using the following code:

import nltk
nltk.download('words')
from nltk.corpus import words

words = set(nltk.corpus.words.words())

sent = "Self Determination and Self-Management Rangatiratanga"
" ".join(w for w in nltk.wordpunct_tokenize(sent) \
         if not w.lower() in words or not w.isalpha())

However, when I try to run the above code over the dataframe, I get the error TypeError: expected string or bytes-like object. My attempt is as follows:

def no_english(text):
  words = set(nltk.corpus.words.words())
  " ".join(w for w in nltk.wordpunct_tokenize(text['outcome']) \
         if not w.lower() in words or not w.isalpha())

foo_df['Māori_word'] = foo_df.apply(no_english, axis = 1)
print(foo_df)

Any help in Python 3 would be much appreciated. Thanks.

score 1 · Accepted Answer

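You cannot magically tell whether a word is English, Māori or an abbreviation with a short, simple regular expression. In fact, some words are quite likely to belong to more than one category, so the task itself is not binary (or, in this case, ternary).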
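To illustrate the ambiguity, here is a minimal sketch that runs the pattern from the question on a made-up sentence (the sentence is an assumption, not a row from foo_df.csv). Any run of two to eight capital letters matches, so an English word typed in capitals is picked up alongside a genuine acronym.

```python
import re

# Pattern taken from the question, written as a raw string.
pattern = r'(\b[A-Z](?:[\.&]?[A-Z]){1,7}\b)'

# Hypothetical input: "URGENT" is an ordinary English word in capitals.
sentence = "URGENT referral to DHB"

matches = re.findall(pattern, sentence)
print(matches)
```

The regex only sees letter case, so it cannot separate the two cases without extra knowledge such as a list of accepted acronyms.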

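What you want to do is natural language processing, and there are Python language-detection libraries for it. They give you the probability that the input is in a given language. They are usually run on full text, but you can apply them to single words.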

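Another approach is to use dictionaries of Māori words and of abbreviations (that is, exhaustive or hand-picked word lists) and write a function that tells whether a word is in one of them, assuming English otherwise.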
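The dictionary approach could be sketched roughly as follows. The two word lists and the function name split_outcome are hypothetical: the lists contain only the terms visible in the sample rows of the question, and a real project would load far larger ones. The function returns its result explicitly and tolerates non-string cells such as NaN, which are a likely cause of the TypeError mentioned in the question.

```python
import re

# Hypothetical, hand-curated lists built from the sample rows only.
MAORI_WORDS = {"rangatiratanga", "kai", "oranga", "tamariki", "owaraika"}
KNOWN_ABBREVIATIONS = {"DHB", "AOD", "WINZ"}

# Runs of letters; Unicode-aware, so macronised vowels stay inside a token.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def split_outcome(text):
    """Return (abbreviations, maori_words), each as a space-joined string."""
    if not isinstance(text, str):  # e.g. NaN from an empty CSV cell
        return "", ""
    tokens = TOKEN_RE.findall(text)
    # Abbreviations are matched case-sensitively, Māori words case-insensitively.
    abbreviations = [t for t in tokens if t in KNOWN_ABBREVIATIONS]
    maori = [t for t in tokens if t.lower() in MAORI_WORDS]
    return " ".join(abbreviations), " ".join(maori)


print(split_outcome("Support with Oranga Tamariki Advocacy"))
print(split_outcome("Housing pathway with WINZ"))
```

With pandas, one way to fill both columns at once is foo_df['Abbreviation'], foo_df['Māori_word'] = zip(*foo_df['outcome'].map(split_outcome)). Everything not found in either list is treated as English, so the quality of the result depends entirely on how complete the lists are.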
