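Okay, so I had no luck searching the site for how to parse a long string (or a sentence, if you prefer) in Python. If a question of the same nature has already been answered, please redirect me to it! Anyway, hi! I'm a beginner programmer (teaching myself Python from the internet), and I'm looking for help with a (seemingly simple) problem. If you have any input, don't hesitate to answer however you see fit, but it would really help me if you explained your solution or code example! Also, my only idea for solving this was to strip all the punctuation by ASCII value using a very long if statement, then split the remaining text on the remaining spaces while appending the pieces to a list. To save you time and let me learn something new, I would rather not see the longest expression statement! Also keep in mind that this is a function that returns a list, so don't bother converting the return value to a string or a different data type such as a dictionary. Thanks in advance for any help you can offer!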

Without further ado, here is the problem:


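Parsing a String

Create a function that takes a string as input and returns a list of all the words in the string. It should remove all punctuation and replace dashes with spaces.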


Examples (calls):

    >>> parse("Listen, strange women lyin' in ponds distributin' swords is no basis for a system of government. Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony.") 
    [Listen, strange, women, lyin, in, ponds, distributin, swords, is, no, basis, for, a, system, of, government, Supreme, executive, power, derives, from, a, mandate, from, the, masses, not, from, some, farcical, aquatic, ceremony]
    >>> parse("What... is the air-speed velocity of an unladen swallow?") 
    [What, is, the, air, speed, velocity, of, an, unladen, swallow]

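I'm very sorry about the run-on length of the code! Anyway, I think you can all tell what needs to be done from the problem itself. Any suggestions or unique/efficient solutions are absolutely welcome! -Winkleson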

P.S. Sorry for the run-on sentences and the "wall of text". I'm a bit talkative... Anyway, thanks again for your help!

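Please note that the output is not a list! The answer must not contain any other symbols! Please don't forget! Thanks again for your help! Sorry for the mix-up: the problem's author did not match the answer!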

3 Answers

This is really easy with the Natural Language Toolkit (nltk).

import nltk, string
text = "Listen, strange women lyin' in ponds distributin' swords is no basis for a system of government. Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony."

tokens = nltk.word_tokenize(text)

# remove punctuation
tokens = [word.replace("-"," ") for word in tokens if word not in string.punctuation]

In use:

>>> text = "Listen, strange women lyin' in ponds distributin' swords is no basis for a system of government. Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony."
>>> tokens = nltk.word_tokenize(text)
>>> tokens = [word.replace("-"," ") for word in tokens if word not in string.punctuation]
>>> tokens
['Listen', 'strange', 'women', 'lyin', 'in', 'ponds', 'distributin', 'swords', 'is', 'no', 'basis', 'for', 'a', 'system', 'of', 'government.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a', 'mandate', 'from', 'the', 'masses', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony']

Your desired output is quite unclear, but if you are looking for a string version of that output, you can take the tokens variable and do this:

print '[' + ', '.join(tokens) + ']'

Like so:

>>> print '['+', '.join(tokens)+']'
[Listen, strange, women, lyin, in, ponds, distributin, swords, is, no, basis, for, a, system, of, government., Supreme, executive, power, derives, from, a, mandate, from, the, masses, not, from, some, farcical, aquatic, ceremony]
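
The `print` statement above is Python 2 syntax. As a minimal sketch, assuming Python 3 (the answer itself targets Python 2) and a short made-up `tokens` list in place of the tokenizer output, the same bracketed string could be built like this:

```python
# Hypothetical token list standing in for the tokenizer output.
tokens = ['What', 'is', 'the', 'air', 'speed']

# Join the words with ", " and wrap them in square brackets,
# giving a list-like string without quotes around each word.
formatted = '[' + ', '.join(tokens) + ']'

# print is a function in Python 3, so the parentheses are required.
print(formatted)  # [What, is, the, air, speed]
```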

Your "wall of text" really does make it hard to figure out what you want.

answered 2012-11-20T20:08:10.813
In [133]: punc = set('.,<>!@#$%^&*()-_+=]}{[\\|')

In [134]: [''.join(char for char in word if char not in punc) for word in "Listen, strange women lyin' in ponds distributin' swords is no basis for a system of government. Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony.".split()]
Out[134]: 
['Listen',
 'strange',
 'women',
 "lyin'",
 'in',
 'ponds',
 "distributin'",
 'swords',
 'is',
 'no',
 'basis',
 'for',
 'a',
 'system',
 'of',
 'government',
 'Supreme',
 'executive',
 'power',
 'derives',
 'from',
 'a',
 'mandate',
 'from',
 'the',
 'masses',
 'not',
 'from',
 'some',
 'farcical',
 'aquatic',
 'ceremony']
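
Note that the comprehension above keeps apostrophes (`'` is not in `punc`) and deletes dashes instead of turning them into spaces, so "air-speed" would come out as one word. A minimal sketch of a variant that follows the problem statement more closely, assuming Python 3 and using `string.punctuation` as the set of characters to drop (the function name `parse` is taken from the question's examples):

```python
import string


def parse(text):
    """Return the words in text, with punctuation removed."""
    # Replace dashes with spaces first so "air-speed" splits into two words.
    text = text.replace("-", " ")
    # Translation table that deletes every ASCII punctuation character.
    table = str.maketrans("", "", string.punctuation)
    # Strip the punctuation, then split on runs of whitespace.
    return text.translate(table).split()


print(parse("What... is the air-speed velocity of an unladen swallow?"))
```

This only handles ASCII punctuation; typographic quotes or dashes would need to be added to the table separately.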
answered 2012-11-20T19:50:12.557

I would suggest using a regular expression, like this:

import re

re.findall(r'[a-zA-Z]+',input_string)

Or, if you are processing multiple strings, compile the regular expression first:

regexp=re.compile(r'[a-zA-Z]+')
regexp.findall(input_string)

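Essentially, this asks for every run of consecutive letters, with each run returned as one match. If you want to keep contractions, for example, just add ' to the character class, like this (note the double-quoted string, so the apostrophe does not end the literal):

re.findall(r"[a-zA-Z']+",input_string)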
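
To tie this back to the question, here is a minimal sketch that wraps the letters-only pattern in a function. The name `parse` comes from the question's examples; `WORD_RE` is a name chosen for illustration. Because dashes and apostrophes are not letters, they act as separators or are dropped, which matches the expected output in the question.

```python
import re

# Compile once: one or more consecutive ASCII letters.
WORD_RE = re.compile(r"[a-zA-Z]+")


def parse(text):
    """Return every run of ASCII letters in text as a list of words."""
    return WORD_RE.findall(text)


print(parse("What... is the air-speed velocity of an unladen swallow?"))
```

One caveat of this pattern is that digits and non-ASCII letters are also treated as separators, so it suits plain English text like the examples.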
answered 2012-11-20T20:09:45.493