
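I used Python regular expressions to identify acronyms in my text. Some of them end with an 's' or a '.'. To clean up the text, I am building a dictionary. I need to strip the '.' from the end of each acronym, strip the 's' that appears at the end of an acronym, and remove any ordinary English words from the dictionary entirely.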

Input dictionary:

{'ceos': 'CEOs', 'cis': 'CIS', 'ceo': 'CEO', 'cios': 'CIOs', 'cio.': 'CIO.', 'cio': 'CIO','info': 'INFO', 'update': 'UPDATE', 'additional': 'ADDITIONAL', '.': '.', 'kpis': 'KPIs'}

Desired output dictionary:

{'ceos': 'CEO', 'cis': 'CIS', 'ceo': 'CEO', 'cios': 'CIO', 'cio.': 'CIO', 'cio': 'CIO', '.': '', 'kpis': 'KPI'}

How should I code this in Python?


1 Answer


Never mind, I found a rather long solution, but any suggestions for shortening it are welcome:

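from nltk.corpus import words  # needs the corpus: nltk.download('words')

# Input dictionary from the question.
overall_dict = {'ceos': 'CEOs', 'cis': 'CIS', 'ceo': 'CEO', 'cios': 'CIOs', 'cio.': 'CIO.', 'cio': 'CIO', 'info': 'INFO', 'update': 'UPDATE', 'additional': 'ADDITIONAL', '.': '.', 'kpis': 'KPIs'}

# Only lower-case words match in words.words(), so the keys are lower case.
# Build the set once: calling words.words() inside the loop is very slow.
english_words = set(words.words())

# Acronyms that are also English words and must stay in the dictionary.
keep = {'ai','it','us','es','coo','lan','ea','aer','coe','eu','bot','sa','ma','roi','pa','dod','doe','cad','ope','soc','aum','mot','da','ae','ca','swot','iso','ba','sla','mou','dit','ist','wa','ram','wog','la','ad','os','sis','sow','lam','sop','bod','pst','ga','mo'}

overall_dict_1 = overall_dict.copy()

# Strip a trailing 's' or '.' from each value, remove the '.' entry,
# and remove most English words from the dictionary.
for key, value in overall_dict.items():
    if value.endswith(('s', '.')):
        overall_dict_1[key] = value[:-1]

    # One combined check, so a key is never popped twice.
    if key == '.' or (key not in keep and key in english_words):
        overall_dict_1.pop(key)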
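
As one possible way to shorten this, the same steps could be folded into a small function that builds a new dictionary instead of copying and popping. This is only a sketch under a few assumptions: `clean_acronyms`, `toy_words`, and `sample` are made-up names for illustration, and `toy_words` is a tiny stand-in for `set(words.words())` so the example runs without NLTK. Unlike the loop above, it keeps the '.' key with an empty value, which is what the desired output in the question shows.

```python
def clean_acronyms(acronyms, english_words, keep=frozenset()):
    """Return a cleaned copy of an acronym dictionary (illustrative helper)."""
    cleaned = {}
    for key, value in acronyms.items():
        # Skip ordinary English words unless they are whitelisted acronyms.
        if key in english_words and key not in keep:
            continue
        # Remove trailing periods, then a single trailing lower-case 's'.
        value = value.rstrip('.')
        if value.endswith('s'):
            value = value[:-1]
        cleaned[key] = value
    return cleaned


# Input dictionary from the question.
sample = {'ceos': 'CEOs', 'cis': 'CIS', 'ceo': 'CEO', 'cios': 'CIOs',
          'cio.': 'CIO.', 'cio': 'CIO', 'info': 'INFO', 'update': 'UPDATE',
          'additional': 'ADDITIONAL', '.': '.', 'kpis': 'KPIs'}

# Assumption: a toy word list standing in for set(words.words()).
toy_words = {'info', 'update', 'additional'}

result = clean_acronyms(sample, toy_words)
print(result)
```

With the real corpus you would pass `set(words.words())` as `english_words` and the whitelist of acronyms as `keep`. Because only a lower-case 's' is stripped, 'CIS' is left untouched.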
answered 2019-11-30T12:06:19.670