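Regular expressions seem to have a steep learning curve for me. I have a dataframe containing text (up to 300,000 rows). The text, held in the outcome column of a dummy file named foo_df.csv, is a mix of English words, acronyms and Māori words. foo_df.csv looks like this: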
outcome
0 I want to go to DHB
1 Self Determination and Self-Management Rangatiratanga
2 mental health wellness and AOD counselling
3 Kai on my table
4 Fishing
5 Support with Oranga Tamariki Advocacy
6 Housing pathway with WINZ
7 Deal with personal matters
8 Referral to Owaraika Health services
The result I want is the table below, with Abbreviation and Māori_word columns:
   outcome                                                Abbreviation  Māori_word
0  I want to go to DHB                                    DHB
1  Self Determination and Self-Management Rangatiratanga                Rangatiratanga
2  mental health wellness and AOD counselling             AOD
3  Kai on my table                                                      Kai
4  Fishing
5  Support with Oranga Tamariki Advocacy                                Oranga Tamariki
6  Housing pathway with WINZ                              WINZ
7  Deal with personal matters
8  Referral to Owaraika Health services                                 Owaraika
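My approach is to extract the abbreviations with a regular expression and the Māori words with the nltk module.
I have been able to extract the abbreviations with a regular expression using the following code: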
pattern = '(\\b[A-Z](?:[\\.&]?[A-Z]){1,7}\\b)'
foo_df['Abbreviation'] = foo_df.outcome.str.extract(pattern)
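As a quick check of what this pattern picks up, here is a minimal sketch that runs it over a few of the sample rows with the standard re module instead of pandas. The helper name first_abbreviation and the samples list are made up for illustration; the capture group is dropped because re.search does not need it, whereas str.extract does.

```python
import re

# Same pattern as above, written as a raw string and without the capture group.
ABBREVIATION = re.compile(r"\b[A-Z](?:[.&]?[A-Z]){1,7}\b")


def first_abbreviation(text):
    """Return the first abbreviation found in text, or '' if there is none.

    Non-string input (for example NaN from an empty cell) also gives ''.
    """
    if not isinstance(text, str):
        return ""
    match = ABBREVIATION.search(text)
    return match.group(0) if match else ""


# A few rows from the sample data (hypothetical stand-in for the outcome column).
samples = [
    "I want to go to DHB",
    "mental health wellness and AOD counselling",
    "Kai on my table",
    "Housing pathway with WINZ",
]

for s in samples:
    print(repr(s), "->", repr(first_abbreviation(s)))
```

A single capital such as the "I" in the first row is not matched, because the pattern requires at least two capital letters in a row.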
I have been able to extract the non-English words from a sentence using the following code:
import nltk
nltk.download('words')
from nltk.corpus import words
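# Set of English words for fast membership tests
words = set(nltk.corpus.words.words())
sent = "Self Determination and Self-Management Rangatiratanga"
# Keep tokens that are not English words, plus any non-alphabetic tokens
" ".join(w for w in nltk.wordpunct_tokenize(sent)
         if w.lower() not in words or not w.isalpha())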
However, I get an error when I try to apply the above code across the dataframe:
TypeError: expected string or bytes-like object
The iteration I tried is as follows:
def no_english(text):
    words = set(nltk.corpus.words.words())
    " ".join(w for w in nltk.wordpunct_tokenize(text['outcome']) \
        if not w.lower() in words or not w.isalpha())

foo_df['Māori_word'] = foo_df.apply(no_english, axis = 1)
print(foo_df)
Any help in Python 3 would be greatly appreciated. Thanks.
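Two things stand out in no_english as posted. It has no return statement, so apply would fill the column with None. It also rebuilds the word set on every row and passes each cell straight to the tokenizer, which may be where the TypeError comes from if some cells are not strings (for example NaN from empty rows). The following is a minimal sketch of a per-cell function that addresses both points, under stated assumptions: it uses only the standard library so that it runs on its own, the regex is meant to mirror the word/punctuation split that nltk.wordpunct_tokenize performs, and toy_vocabulary is a tiny hypothetical stand-in for set(nltk.corpus.words.words()). Unlike the original filter, it also drops punctuation tokens such as the hyphen.

```python
import re

# Split into runs of word characters or runs of punctuation
# (assumed to match what nltk.wordpunct_tokenize does).
TOKEN = re.compile(r"\w+|[^\w\s]+")


def non_english(text, vocabulary):
    """Return the alphabetic tokens of text that are not in vocabulary.

    Non-string input (for example NaN) gives an empty string rather than
    raising a TypeError.
    """
    if not isinstance(text, str):
        return ""
    tokens = TOKEN.findall(text)
    return " ".join(
        w for w in tokens
        if w.isalpha() and w.lower() not in vocabulary
    )


# Hypothetical toy vocabulary; in practice this would be the nltk word set,
# built once outside the function.
toy_vocabulary = {
    "self", "determination", "and", "management",
    "on", "my", "table", "fishing",
}

print(non_english("Self Determination and Self-Management Rangatiratanga",
                  toy_vocabulary))
print(non_english("Kai on my table", toy_vocabulary))
print(repr(non_english(float("nan"), toy_vocabulary)))
```

With pandas, this could be applied to the column itself rather than row by row, along the lines of foo_df['outcome'].apply(lambda t: non_english(t, words)). Abbreviations such as DHB would likely also come out as non-English here, so they may need to be removed using the Abbreviation column.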