python - How to get the difference between two lists based on a substring of each string in a separate list

Question

I have two long lists. One comes from a log file that contains lines in the following format:

201001050843 blah blah blah <email@site.com> blah blah

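The second comes from a file in csv format. I need to generate a list of all the entries in file2 that do not contain an email address from the log file, while keeping the csv format.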

Example
The log file contains:

201001050843 blah blah blah <email@site.com> blah blah
201001050843 blah blah blah <email2@site.com> blah blah

File 2 contains:

156456,bob,sagget,email@site.com,4564456
156464,bob,otherguy,email@anothersite.com,45644562

The output should be:

156464,bob,otherguy,email@anothersite.com,45644562

Currently I grab the emails from the log and load them into another list:

sent_emails =[]
for line in sent:
    try:
        temp1= line.index('<')
        temp2 = line.index('>')
        sent_emails.append(line[temp1+1:temp2])
    except ValueError:
        pass

and then compare against file2:

lista = mail_lista.readlines()
for line in lista:
    temp = line.split()
    for thing in temp:
        try:
            if thing.index('@'):
                if thing in sent_emails:
                    lista.remove(temp)
        except ValueError:
            pass
newa.writelines(lista)

or:

for line in mail_listb:
    temp = line.split()
    for thing in temp:
        try:
            if thing.index('@'):
                if thing not in sent_emails:
                    newb.write(line)
        except ValueError:
            pass

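However, both return all of file2!

Thanks for any help.

Edit: Thanks for the suggestion to use sets. It made a bigger speed difference than I expected. Hash tables are the way to go! I will definitely use sets more often from now on.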

score 1 · Accepted Answer

line.split() splits on whitespace. Use line.split(',') instead.
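
As a small illustration of why the comparison in the question never matches, here is what the two calls return for the first sample row of file2 (the variable names are only for this sketch):

```python
line = "156456,bob,sagget,email@site.com,4564456\n"

# split() with no argument splits on whitespace only, so the whole
# comma-separated row comes back as a single token
whitespace_parts = line.split()

# split(',') yields the individual CSV fields (newline stripped first)
comma_parts = line.strip().split(",")

print(whitespace_parts)
print(comma_parts)
```

Since the only token is the entire row, it can never be equal to a bare address in sent_emails, so no row is filtered out.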

Also: does the order of the lines matter? If not, you should really use a set() instead of a list. That will make the code faster.
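
A minimal sketch of the set-based comparison, assuming the rows are plain comma-separated lines with no quoted fields. The function name filter_unsent is hypothetical, and the sample data is taken from the question:

```python
def filter_unsent(csv_lines, sent_emails):
    """Return the CSV lines in which no field matches an address in sent_emails."""
    sent = set(sent_emails)  # set membership is a hash lookup, not a linear scan
    kept = []
    for line in csv_lines:
        fields = line.strip().split(",")
        if not any(field in sent for field in fields):
            kept.append(line)
    return kept


log_emails = ["email@site.com", "email2@site.com"]
rows = [
    "156456,bob,sagget,email@site.com,4564456\n",
    "156464,bob,otherguy,email@anothersite.com,45644562\n",
]
result = filter_unsent(rows, log_emails)
print("".join(result), end="")
```

Only the log addresses go into the set. The CSV rows are still processed in order, so the output keeps the original line order.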

score 1 · Accepted Answer

You can build the set of emails the way you already do, and then:

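import fileinput

# emails is a set of emails
for line in fileinput.input("csvfile.csv", inplace=True):
    parts = line.split(',')
    # with inplace=True, printed output replaces the file's contents
    if parts[3] not in emails:
        print(line, end='')  # line already ends with a newline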

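This only works if the email in the CSV file is always the fourth field (index 3).

fileinput enables in-place editing.

And use a set of emails instead of a list, as Aaron said, not only for speed but also to eliminate duplicates.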
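
As a rough sketch of both points, the example below collects the log addresses into a set (so a repeated address is stored once) and then filters on the fourth CSV field. It uses the standard csv module and in-memory io.StringIO buffers instead of real files. The third and fourth log lines are made-up sample data added for illustration:

```python
import csv
import io

log_lines = [
    "201001050843 blah blah blah <email@site.com> blah blah",
    "201001050843 blah blah blah <email2@site.com> blah blah",
    "201001050900 blah blah blah <email@site.com> blah blah",  # repeated address (made up)
    "201001050901 a line with no address",                     # made up
]

# Collect the text between "<" and ">" from each log line into a set
emails = set()
for line in log_lines:
    start = line.find("<")
    end = line.find(">", start + 1)
    if start != -1 and end != -1:
        emails.add(line[start + 1:end])

csv_text = (
    "156456,bob,sagget,email@site.com,4564456\n"
    "156464,bob,otherguy,email@anothersite.com,45644562\n"
)

# Keep only the rows whose fourth field is not a logged address
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
for row in csv.reader(io.StringIO(csv_text)):
    if row[3] not in emails:
        writer.writerow(row)

print(sorted(emails))
print(out.getvalue(), end="")
```

With real files, the StringIO objects would be replaced by open file handles. The csv module also copes with quoted fields that contain commas, which a plain split(',') does not.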

score 0 · Accepted Answer

Here is another way, which finds the email address with a simple check instead of relying on its position.

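import fileinput

# Collect every address found between "<" and ">" in the log
emails = set()
with open("file1") as log:
    for line in log:
        start = line.find("<")
        end = line.find(">")
        if start != -1 and end != -1:
            emails.add(line[start + 1:end])

# Rewrite file2 in place, keeping only rows whose address is not in the log
for line in fileinput.input("file2", inplace=True):
    for item in line.strip().split(","):
        if "@" in item and item not in emails:
            print(line.strip())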

Output

$ ./python.py
156464,bob,otherguy,email@anothersite.com,45644562
