python - How do I parse a sentence (or other longer string) in Python (ProblemSetQuestion)?

Question

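OK, so I searched this site for parsing long strings (or sentences, if you will) in Python without success. If a question of the same nature has been answered before, please redirect me to it! Anyway, hi! I'm a beginner programmer (teaching myself Python from the internet), and I'm looking for help with a (seemingly simple) problem. If you have any input on it, don't hesitate to answer however you see fit, but it would really help me if you explained your solution or coding example! Also, the only idea I have for solving this is to remove all the punctuation using ASCII values, which would take a very long if statement, and then split the remaining text on the remaining spaces while appending the pieces to a list. To save you time and let me learn something new, I'd rather not see that longest-possible-expression approach! Also keep in mind that this is a function that returns a list, so don't bother converting the return value to a string or a different data type such as a dictionary. Thanks in advance for any help you can provide!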

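Without further ado, here is the problem:

Parsing a String

Create a function that takes a string as input and returns a list of all the words in the string. It should remove all punctuation, and replace dashes with spaces.

Example (calls):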

    >>> parse("Listen, strange women lyin' in ponds distributin' swords is no basis for a system of government. Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony.") 
    [Listen, strange, women, lyin, in, ponds, distributin, swords, is, no, basis, for, a, system, of, government, Supreme, executive, power, derives, from, a, mandate, from, the, masses, not, from, some, farcical, aquatic, ceremony]
    >>> parse("What... is the air-speed velocity of an unladen swallow?") 
    [What, is, the, air, speed, velocity, of, an, unladen, swallow]

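I'm very sorry about the length of all this! Anyway, I think you can all tell what should be done from the problem itself. Any suggestions or unique/efficient solutions are absolutely welcome! - Winkleson

P.S. Sorry for the run-on sentences and the "wall of text". I'm a bit chatty... anyway, thanks again for your help!

Please note that the output is not a list! The answer must not contain any additional symbols! Please don't forget! Thanks again for your help! Sorry, the question's author does not match the answers!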

score 3 · Accepted Answer

It's really easy with the Natural Language Toolkit (nltk).

import nltk, string
text = "Listen, strange women lyin' in ponds distributin' swords is no basis for a system of government. Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony."

tokens = nltk.word_tokenize(text)

# remove punctuation
tokens = [word.replace("-"," ") for word in tokens if word not in string.punctuation]

In use:

>>> text = "Listen, strange women lyin' in ponds distributin' swords is no basis for a system of government. Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony."
>>> tokens = nltk.word_tokenize(text)
>>> tokens = [word.replace("-"," ") for word in tokens if word not in string.punctuation]
>>> tokens
['Listen', 'strange', 'women', 'lyin', 'in', 'ponds', 'distributin', 'swords', 'is', 'no', 'basis', 'for', 'a', 'system', 'of', 'government.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a', 'mandate', 'from', 'the', 'masses', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony']

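Admittedly, your desired output is quite unclear, but if you are looking for a string version of that output, you can take the tokens variable and do the following: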

print('[' + ', '.join(tokens) + ']')

Like so:

>>> print('[' + ', '.join(tokens) + ']')
[Listen, strange, women, lyin, in, ponds, distributin, swords, is, no, basis, for, a, system, of, government., Supreme, executive, power, derives, from, a, mandate, from, the, masses, not, from, some, farcical, aquatic, ceremony]

Your "wall of text" really does make it hard to figure out what you want.

score 2 · Accepted Answer

In [133]: punc = set('.,<>!@#$%^&*()-_+=]}{[\\|')

In [134]: [''.join(char for char in word if char not in punc) for word in "Listen, strange women lyin' in ponds distributin' swords is no basis for a system of government. Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony.".split()]
Out[134]: 
['Listen',
 'strange',
 'women',
 "lyin'",
 'in',
 'ponds',
 "distributin'",
 'swords',
 'is',
 'no',
 'basis',
 'for',
 'a',
 'system',
 'of',
 'government',
 'Supreme',
 'executive',
 'power',
 'derives',
 'from',
 'a',
 'mandate',
 'from',
 'the',
 'masses',
 'not',
 'from',
 'some',
 'farcical',
 'aquatic',
 'ceremony']
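
Note that the comprehension above keeps the apostrophes (they are not in `punc`) and deletes hyphens instead of turning them into spaces. Below is a minimal sketch of a function that follows the problem statement more closely. It assumes Python 3 and that `string.punctuation` (ASCII punctuation only) is an acceptable definition of "punctuation"; the name `parse` comes from the question, everything else is illustrative rather than the answerer's own code.

```python
import string


def parse(text):
    """Return the words in text, dropping punctuation and treating dashes as spaces."""
    # Replace dashes with spaces first, so "air-speed" becomes two words.
    text = text.replace("-", " ")
    # Build a translation table that deletes every remaining ASCII punctuation character.
    table = str.maketrans("", "", string.punctuation)
    # split() with no arguments splits on any run of whitespace.
    return text.translate(table).split()


print(parse("What... is the air-speed velocity of an unladen swallow?"))
# ['What', 'is', 'the', 'air', 'speed', 'velocity', 'of', 'an', 'unladen', 'swallow']
```

Doing the dash replacement before deleting punctuation matters here: in the other order, "air-speed" would collapse into the single word "airspeed".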

score 1 · Accepted Answer

I'd suggest using a regular expression, like this

import re

re.findall(r'[a-zA-Z]+',input_string)

Or, if you are processing multiple strings, compile the regular expression first

regexp=re.compile(r'[a-zA-Z]+')
regexp.findall(test)

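Essentially, this asks for every run of consecutive letters, with each run returned as one word. If you want to include contractions, for example, you just add ' to the character class, like so (note the double-quoted string, so that the ' inside it does not end the literal):

re.findall(r"[a-zA-Z']+",input_string)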
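
To tie this back to the question, here is a minimal sketch that wraps the first pattern in the `parse` function the problem asks for. The constant name `WORD_RE` is just an illustrative choice. Because the pattern only matches ASCII letters, punctuation and dashes act as separators automatically; as a side effect, digits and accented letters are dropped as well, which may or may not be what you want.

```python
import re

# Compile once; one or more consecutive ASCII letters count as a word.
WORD_RE = re.compile(r"[a-zA-Z]+")


def parse(text):
    """Return every run of ASCII letters in text as a list of words."""
    return WORD_RE.findall(text)


print(parse("What... is the air-speed velocity of an unladen swallow?"))
# ['What', 'is', 'the', 'air', 'speed', 'velocity', 'of', 'an', 'unladen', 'swallow']
print(parse("strange women lyin' in ponds distributin' swords"))
# ['strange', 'women', 'lyin', 'in', 'ponds', 'distributin', 'swords']
```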
