0

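I'm new to Python and can't find a way to remove unwanted text. The main goal is to keep the words I want and delete all the others. At this stage I can go through my in_data and find the words I want: if sentence.find(wordToCheck) returns a non-negative index, the word is kept. in_data holds one sentence per row, but the current output is one word per line. What I want is to preserve the format: find the words in each row and remove the rest.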

import Orange
import orange

word = ['roaming','overseas','samsung']
out_data = []

for i in range(len(in_data)):
    for j in range(len(word)):
        sentence = str(in_data[i][0])
        wordToCheck = word[j]
        if(sentence.find(wordToCheck) >= 0):
            print(wordToCheck)

Output

roaming
overseas
roaming
overseas
roaming
overseas
samsung
samsung

in_data contains sentences like this:

contacted vodafone about going overseas and asked about roaming charges. The customer support officer says there isn't a charge but while checking my usage overseas.

I would like the output to look like this:

overseas roaming overseas

4 Answers

3

You can use a regular expression for this:

>>> import re
>>> word = ['roaming','overseas','samsung']
>>> s =  "Contacted vodafone about going overseas and asked about roaming charges. The customer support officer says there isn't a charge but while checking my usage overseas."
>>> pattern = r'|'.join(map(re.escape, word))
>>> re.findall(pattern, s)
['overseas', 'roaming', 'overseas']
>>> ' '.join(_)
'overseas roaming overseas'
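
One thing to watch with the pattern above: it has no word boundaries, so it also matches inside longer words. A minimal sketch of a stricter variant, assuming whole-word matches are what is wanted. The sample sentence `s2` is made up for illustration, and the code is written for Python 3:

```python
import re

word = ['roaming', 'overseas', 'samsung']

# Hypothetical sample text: "overseasX" should not count as a match.
s2 = "Going overseas soon; overseasX is not a word, roaming is."

# \b anchors the alternation to word boundaries on both sides.
pattern = r'\b(?:' + '|'.join(map(re.escape, word)) + r')\b'

matches = re.findall(pattern, s2)
print(' '.join(matches))  # overseas roaming
```

Without the `\b` anchors, the same text would yield `overseas` twice, because the second hit comes from inside `overseasX`.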

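A non-regex approach is to use str.join with str.strip and a generator expression. The strip() call is needed to get rid of surrounding punctuation such as '.', ',' and so on.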

>>> from string import punctuation
>>> ' '.join(y for y in (x.strip(punctuation) for x in s.split()) if y in word)
'overseas roaming overseas'
answered 2014-06-04T07:51:27.687
2

Here is a simpler approach:

>>> import re
>>> i
"Contacted vodafone about going overseas and asked about roaming charges. The customer support officer says there isn't a charge but while checking my usage overseas."
>>> words
['roaming', 'overseas', 'samsung']
>>> [w for w in re.findall(r"[\w']+", i) if w in words]
['overseas', 'roaming', 'overseas']
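
Since the question asks for one output line per input sentence, the list comprehension above can be applied row by row. A rough sketch, assuming the rows are available as plain strings (in the question they would presumably come from `str(in_data[i][0])`). The `rows` list and the `keep_words` helper are hypothetical names introduced here, not part of Orange, and the code is written for Python 3:

```python
import re

words = {'roaming', 'overseas', 'samsung'}

def keep_words(sentence, wanted):
    """Return only the wanted words from sentence, in their original order."""
    tokens = re.findall(r"[\w']+", sentence)
    return ' '.join(w for w in tokens if w in wanted)

# Hypothetical input: one sentence per row.
rows = [
    "contacted vodafone about going overseas and asked about roaming charges.",
    "my samsung phone works fine.",
    "nothing relevant here",
]

# One output string per input row, so the row structure is preserved.
out_data = [keep_words(row, words) for row in rows]
for line in out_data:
    print(line)
```

Rows without any wanted word produce an empty string, which keeps the output aligned with the input rows.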
answered 2014-06-04T07:55:58.990
2

You can do it even more simply, like this:

for w in in_data.split():
    if w in word:
        print(w)

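Here we first split in_data on whitespace, which returns a list of words. Then we loop over each word in in_data and check whether it is one of the words you are looking for. If it is, we print it.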

Also, for faster lookups, make the word list a set: membership tests on a set are much faster than on a list.
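
A small sketch of the set suggestion above. Membership tests on a set take roughly constant time on average, while a list is scanned element by element; with only three words the difference is negligible, but it grows with the size of the word list. The sentence is a shortened, punctuation-free version of the one in the question, and the code is written for Python 3:

```python
word = ['roaming', 'overseas', 'samsung']
wanted = set(word)  # build the set once, outside the loop

in_data = "contacted vodafone about going overseas and asked about roaming charges"

# Each "w in wanted" check is a hash lookup rather than a list scan.
found = [w for w in in_data.split() if w in wanted]
print(' '.join(found))  # overseas roaming
```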

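In addition, if you want to handle punctuation and symbols, you need to either use a regular expression or go through the characters of each string and keep only the letters. So, to get the output you want: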

import string
in_words = ('roaming','overseas','samsung')
out_words = []

for w in in_data.split():
    # keep only ASCII letters, dropping punctuation and digits
    w = "".join([c for c in w if c in string.ascii_letters])
    if w in in_words:
        out_words.append(w)
print(" ".join(out_words))
answered 2014-06-04T07:49:01.433
0

The answers that use split will trip over punctuation. You need a regular expression to break the text into words.

import re

in_data = "contacted vodafone about going overseas and asked about roaming charges. The customer support officer says there isn't a charge but while checking my usage overseas."

word = ['roaming','overseas','samsung']
out_data = []

# split on runs of characters that are neither word characters nor apostrophes
word_re = re.compile(r"[^\w']+")
for check_word in word_re.split(in_data):
    if check_word in word:
        print(check_word)
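
To make the punctuation problem concrete, here is a brief comparison of plain `str.split` with the regex split used above, run on the sentence from the question. This is only an illustrative sketch, written for Python 3; the names `plain` and `regex` are introduced here:

```python
import re

in_data = ("contacted vodafone about going overseas and asked about roaming "
           "charges. The customer support officer says there isn't a charge "
           "but while checking my usage overseas.")
word = ['roaming', 'overseas', 'samsung']

# Plain whitespace split leaves "overseas." with its trailing full stop,
# so the final occurrence is missed.
plain = [w for w in in_data.split() if w in word]

# Splitting on runs of non-word characters (keeping apostrophes) drops
# the punctuation before the comparison.
word_re = re.compile(r"[^\w']+")
regex = [w for w in word_re.split(in_data) if w in word]

print(plain)  # ['overseas', 'roaming']
print(regex)  # ['overseas', 'roaming', 'overseas']
```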
answered 2014-06-04T08:04:08.383