python - What is the best way to iterate over a Python list, excluding certain values and printing out the result

Question

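I am new to Python and have a question.
I have checked similar questions, the Dive Into Python tutorial, the Python documentation, Google and Bing, similar Stack Overflow questions, and a dozen other tutorials.
I have a piece of Python code that reads a text file containing 20 tweets. I can extract those 20 tweets with the following code: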

import json

data = []
with open('output.txt') as fp:
    for line in iter(fp.readline, ''):
        Tweets = json.loads(line)
        data.append(Tweets.get('text'))
    i = 0
    while i < len(data):
        print data[i]
        i = i + 1

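The while loop above iterates perfectly and prints the 20 tweets (lines) from output.txt. However, these 20 lines contain non-English character data such as "Los ladillo a los dos, soy maaaala o maloooooooooooo", URLs such as "http://t.co/57LdpK", the string "None", and photos with a URL such as "Photo: http://t.co/kxpaaaaa" (I have edited this for privacy).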

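I would like to clean up this output (which is a list) and exclude the following:

None entries
Anything beginning with the string "Photo:"
It would also be a bonus if I could exclude non-unicode data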

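I have tried the following:

Using data.remove("None:"), but I get the error list.remove(x): x not in list.
Reading the items I do not want into a set and then comparing it against the output, but no luck.
Researching list comprehensions, but I wonder whether I am looking at the right solution here.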

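I come from an Oracle background, where there are functions to cut out any wanted or unwanted section of output, so I have really gone round in circles for the last 2 hours. Any help is greatly appreciated!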

score 3 · Accepted Answer

Try something like this:

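def legit(string):
    # data can hold real None objects (from .get('text')), so test for
    # those first; calling .startswith() on None would raise an error.
    if string is None or string.startswith("Photo:") or "None" in string:
        return False
    return True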

whatyouwant = [x for x in data if legit(x)]

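I'm not sure whether this will work with your data, but you get the idea. In case you are not familiar with it, [x for x in data if legit(x)] is called a list comprehension.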
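For illustration, here is a small self-contained sketch of the same filtering idea run against made-up data. The sample list and the name keep are my own assumptions, not taken from your file, and it is written for Python 3, so print is a function:

```python
# Hypothetical sample standing in for the list built from output.txt.
sample = [
    "Just landed in Madrid",
    None,                              # a tweet that had no 'text' key
    "Photo: http://t.co/kxpaaaaa",
    "None",
    "Coffee time",
]

def keep(text):
    """Return True for entries worth printing."""
    if text is None or text == "None":
        return False
    return not text.startswith("Photo:")

# The list comprehension builds a new list and leaves sample untouched.
cleaned = [text for text in sample if keep(text)]

for text in cleaned:
    print(text)
```

Unlike data.remove(), which deletes a single matching item and raises an error when there is none, the comprehension simply skips whatever the predicate rejects.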

score 2 · Accepted Answer

First of all, only add Tweets.get('text') if there is a text entry:

with open ('output.txt') as fp:
    for line in iter(fp.readline,''):   
        Tweets=json.loads(line)
        if 'text' in Tweets:
            data.append(Tweets['text'])

This will not add None entries (.get() returns None if the 'text' key is not present in the dictionary).
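To make the difference between .get() and a membership test concrete, here is a minimal sketch in Python 3. The two input lines and the helper name collect_texts are hypothetical:

```python
import json

def collect_texts(lines):
    """Collect the 'text' value of every JSON line that has one."""
    texts = []
    for line in lines:
        tweet = json.loads(line)
        if 'text' in tweet:          # skip records without the key
            texts.append(tweet['text'])
    return texts

# Hypothetical input: the second record has no 'text' key.
lines = ['{"text": "hello"}', '{"id": 2}']

missing = json.loads(lines[1]).get('text')   # .get() falls back to None
texts = collect_texts(lines)

print(missing)
print(texts)
```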

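I am assuming here that you want to process the data list you are building further. If not, you can dispense with the for entry in data: loops below and stick to one loop with if statements. Tweets['text'] is the same value as entry in the for entry in data loops.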

Next, you are looping over Python unicode values, so use the methods provided on those objects to filter out what you don't want:

for entry in data:
    if not entry.startswith("Photo:"):
        print entry

You can use a list comprehension here; the following would also print all entries in one go:

print '\n'.join([entry for entry in data if not entry.startswith("Photo:")])

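In this case that does not buy you much, since you are building one big string just to print it; you may as well print the individual strings and avoid the cost of building the string.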

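Note that all your data is Unicode data. What you perhaps want is to filter out text that uses code points beyond the ASCII range. You can use a regular expression to detect whether your text contains code points beyond ASCII: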

import re
nonascii = re.compile(ur'[^\x00-\x7f]', re.UNICODE)  # all codepoints beyond 0x7F are non-ascii

for entry in data:
    if entry.startswith("Photo:") or nonascii.search(entry):
        continue  # skip the rest of this iteration, continue to the next
    print entry

A short demo of the non-ASCII expression:

>>> import re
>>> nonascii = re.compile(ur'[^\x00-\x7f]', re.UNICODE)
>>> nonascii.search(u'All you see is ASCII')
>>> nonascii.search(u'All you see is ASCII plus a little more unicode, like the EM DASH codepoint: \u2014')
<_sre.SRE_Match object at 0x1086275e0>
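The code above is Python 2 (ur'' literals and the print statement). As a rough sketch of the same filter under Python 3, where every str is already Unicode, something like the following should do; the entries list and the name ascii_only are made up for the example:

```python
import re

# Any code point above 0x7F lies outside ASCII.
NON_ASCII = re.compile(r'[^\x00-\x7f]')

def ascii_only(entries):
    """Keep entries that are not photo posts and contain only ASCII."""
    return [e for e in entries
            if not e.startswith("Photo:") and not NON_ASCII.search(e)]

# Hypothetical entries standing in for the data list.
entries = [
    "All you see is ASCII",
    "An EM DASH sneaks in: \u2014",
    "Photo: http://t.co/kxpaaaaa",
]

result = ascii_only(entries)
for entry in result:
    print(entry)
```

On Python 3.7 and later, str.isascii() gives the same answer as the regular expression without needing the re module.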

score 1 · Accepted Answer

I would suggest the following:

# use itertools.ifilter to remove items from a list according to a function
from itertools import ifilter
import json
import re

# write a function to filter out entries you don't want
def my_filter(value):
    if not value or value.startswith('Photo:'):
        return False

    # exclude unwanted chars
    # re.search scans the whole string; re.match would only test its start
    if re.search('[^\x00-\x7F]', value):
        return False

    return True

# Reading the data can be simplified with a list comprehension
with open('output.txt') as fp:
    data = [json.loads(line).get('text') for line in fp]

# do the filtering
data = list(ifilter(my_filter, data))

# print the output
for line in data:
    print line

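Regarding unicode: assuming you are using Python 2.x, the open function does not read the data as unicode; it reads it as the str type. If you know the encoding, you may want to convert the data, or read the file with the given encoding using codecs.open.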
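As a hedged sketch of reading with an explicit encoding, the snippet below uses io.open rather than codecs.open; io.open takes an encoding argument on Python 2.6+ and is the same function as the built-in open on Python 3. The UTF-8 encoding, the temporary file, and the name read_texts are assumptions for the example:

```python
import io
import json
import os
import tempfile

def read_texts(path, encoding="utf-8"):
    """Decode the file with the given encoding and return the 'text' values."""
    with io.open(path, encoding=encoding) as fp:
        return [json.loads(line).get('text') for line in fp if line.strip()]

# Write a tiny hypothetical file so the sketch runs on its own.
fd, path = tempfile.mkstemp(suffix=".txt")
with io.open(fd, "w", encoding="utf-8") as fp:
    fp.write(u'{"text": "ma\u00f1ana"}\n{"id": 2}\n')

texts = read_texts(path)
os.remove(path)
print(texts)
```

The record without a 'text' key still comes back as None here, so the filtering step is needed afterwards.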

score 1 · Accepted Answer

with open ('output.txt') as fp:
    for line in fp.readlines():
        Tweets=json.loads(line)
        if 'text' not in Tweets:
            continue

        txt = Tweets.get('text')
        if txt.replace('.', '').replace('?','').replace(' ','').isalnum():
            data.append(txt)
            print txt

Small and simple.
The basic principle: one loop, and if the data matches your "OK" criteria, add it and print it.

As Martijn pointed out, 'text' might not be in all the tweet data.

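A regular-expression replacement for the .replace() chain would be something along the lines of: if re.match('^[\w-\ ]+$', txt) is not None: (it will not work for blank spaces etc., so yes, as mentioned below..)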
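For what it's worth, here is one way such a whitelist check could look in Python 3. This is only a sketch: the allowed character set (ASCII letters, digits, underscore, space, '.', '?' and '-') is my assumption and is slightly wider than the .replace() chain above, and is_plain is a made-up name:

```python
import re

# Assumed whitelist: ASCII letters, digits, underscore, space, '.', '?', '-'.
# re.ASCII restricts \w to ASCII so accented letters are rejected.
ALLOWED = re.compile(r'[\w .?-]+', re.ASCII)

def is_plain(txt):
    """True if txt is non-empty and built only from whitelisted characters."""
    return ALLOWED.fullmatch(txt) is not None

# Hypothetical sample values.
samples = ["Is this OK?", "Photo: http://t.co/kxpaaaaa", "ma\u00f1ana", ""]
plain = [s for s in samples if is_plain(s)]
print(plain)
```

The photo entry drops out simply because ':' and '/' are not on the whitelist, so no separate startswith check is needed with this approach.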

score 1 · Accepted Answer

Try this:

with open('output.txt') as fp:
    for line in iter(fp.readline, ''):
        Tweets = json.loads(line)
        data.append(Tweets.get('text'))

i = 0
while i < len(data):
    # Skip entries matching your first two conditions. Advance i before
    # continue, otherwise the loop would never move past them.
    if data[i] is None or data[i].startswith("Photo"):
        i = i + 1
        continue
    print data[i]
    i = i + 1
