3

I'm new to Python and programming, and I need some help with a Python script. I have two files, each containing email addresses (more than 5000 lines). The input file contains the email addresses that I want to search for in the data file (which also contains email addresses). I then want to write the matches to a file or display them on the console. I searched for scripts and was able to modify one, but I'm not getting the desired results. Can you please help me?

dfile1 (50K lines)
yyy@aaa.com
xxx@aaa.com
zzz@aaa.com


ifile1 (10K lines)
ccc@aaa.com
vvv@aaa.com
xxx@aaa.com
zzz@aaa.com

Output file
xxx@aaa.com
zzz@aaa.com



datafile = 'C:\\Python27\\scripts\\dfile1.txt'
inputfile = 'C:\\Python27\\scripts\\ifile1.txt'

with open(inputfile, 'r') as f:
names = f.readlines()

outputlist = []

with open(datafile, 'r') as fd:
  for line in fd:
    name = fd.readline()
    if name[1:-1] in names:
        outputlist.append(line)
    else:
        print "Nothing found"
 print outputlist

New Code

with open(inputfile, 'r') as f:
    names = f.readlines()
outputlist = []

with open(datafile, 'r') as f:
    for line in f:
        name = f.readlines()
        if name in names:
            outputlist.append(line)
        else:
            print "Nothing found"
    print outputlist
4

5 Answers

2

mitan8 explains the problem you have, but this is what I would do instead:

with open(inputfile, "r") as f:
    # A set gives fast membership tests; strip() drops the trailing newline.
    names = set(i.strip() for i in f)

with open(datafile, "r") as f:
    for name in f:
        if name.strip() in names:
            print name.strip()

This avoids reading the larger datafile into memory.

If you want to write to an output file, you could do this for the second with statement:

with open(datafile, "r") as i, open(outputfile, "w") as o:
    for name in i:
        if name.strip() in names:
            o.write(name)
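
As a self-contained sketch of the same idea in Python 3 syntax (the question itself uses Python 2), the helper below wraps both steps in one function. The name write_matches and its three path parameters are hypothetical, chosen only for illustration.

```python
def write_matches(input_path, data_path, output_path):
    """Copy every address in data_path that also appears in input_path."""
    # Build the lookup set from the smaller file; set membership tests
    # take constant time on average, unlike scanning a list.
    with open(input_path, "r") as f:
        wanted = set(line.strip() for line in f)

    count = 0
    # Stream the larger file line by line instead of loading it whole.
    with open(data_path, "r") as src, open(output_path, "w") as dst:
        for line in src:
            address = line.strip()
            if address and address in wanted:
                dst.write(address + "\n")
                count += 1
    return count
```

Stripping both sides before comparing also covers a last line that has no trailing newline.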
answered 2013-11-12T16:16:34.070
2

Maybe I'm missing something, but why not use a pair of sets?

#!/usr/local/cpython-3.3/bin/python

data_filename = 'dfile1.txt'
input_filename = 'ifile1.txt'

with open(input_filename, 'r') as input_file:
    input_addresses = set(email_address.rstrip() for email_address in input_file.readlines())

with open(data_filename, 'r') as data_file:
    data_addresses = set(email_address.rstrip() for email_address in data_file.readlines())

print(input_addresses.intersection(data_addresses))
answered 2013-11-12T16:24:32.343
1

I think you can remove name = fd.readline(), since the for loop already gives you each line. The extra call reads another line on top of the one the for loop reads on every iteration, so lines get skipped. Also, I think name[1:-1] should be name, since you don't want to drop the first and last characters when searching. Note that with automatically closes the files it opened.
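
The skipped-line effect can be reproduced without any files. The sketch below uses Python 3 and io.StringIO as a stand-in for the data file (an assumption made only to keep the example self-contained). Under Python 2, mixing for line in fd with fd.readline() on a real file may instead raise a ValueError.

```python
import io

# In-memory stand-in for the data file (illustrative addresses).
fd = io.StringIO("yyy@aaa.com\nxxx@aaa.com\nzzz@aaa.com\nwww@aaa.com\n")

loop_lines = []      # lines handed out by the for loop
readline_lines = []  # lines consumed by the extra readline() call

for line in fd:
    loop_lines.append(line.strip())
    readline_lines.append(fd.readline().strip())

print(loop_lines)      # every other line, starting with the first
print(readline_lines)  # the lines in between
```

Each address ends up in only one of the two lists, so any check that uses just one of the variables sees half of the file.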

PS: How I'd do it:

with open("dfile1") as dfile, open("ifile") as ifile:
    # & keeps only the lines present in both files
    lines = "\n".join(set(dfile.read().splitlines()) & set(ifile.read().splitlines()))
print(lines)
with open("ofile", "w") as ofile:
    ofile.write(lines)

In the above solution, I'm taking the intersection (the elements that are in both sets) of the lines of the two files to find the common lines.
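
On Python sets, the & operator computes the intersection and | computes the union. A minimal sketch using the sample addresses from the question:

```python
data = {"yyy@aaa.com", "xxx@aaa.com", "zzz@aaa.com"}
wanted = {"ccc@aaa.com", "vvv@aaa.com", "xxx@aaa.com", "zzz@aaa.com"}

common = data & wanted   # intersection: addresses present in both sets
either = data | wanted   # union: addresses present in at least one set

print(sorted(common))
print(len(either))
```

Sets are unordered, so sort the result if the output order matters.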

answered 2013-11-12T15:58:15.717
1

I think your issue stems from the following:

name = fd.readline()
if name[1:-1] in names:

name[1:-1] slices each email address so that you skip the first and last characters. While it might be good in general to skip the last character (a newline '\n'), when you load the names from the input file (the "ifile")

with open(inputfile, 'r') as f:
    names = f.readlines()

you are including newlines. So, don't slice the lines read from the "dfile" at all, i.e.

if name in names:
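
The mismatch is easy to see on a single line. This is a small sketch in Python 3 syntax; the list literal stands in for what f.readlines() would return for the input file.

```python
raw = "xxx@aaa.com\n"                          # a line as read from the data file
names = ["ccc@aaa.com\n", "xxx@aaa.com\n"]     # readlines() keeps the newlines

sliced = raw[1:-1]                 # drops the first character and the newline
found_sliced = sliced in names     # the sliced string no longer matches
found_raw = raw in names           # the unsliced line matches, newline included

# Stripping both sides is more robust, e.g. when the last line has no newline.
stripped_names = {n.strip() for n in names}
found_stripped = raw.strip() in stripped_names

print(repr(sliced), found_sliced, found_raw, found_stripped)
```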
answered 2013-11-12T16:01:54.923
1

Here's what I would do:

names=[]
with open(inputfile) as f:
    for line in f:
        names.append(line.rstrip("\n"))

myEmails=set(names)

# read the data file (not the output file) and check each line against the set
with open(datafile) as fd, open("emails.txt", "w") as output:
    for line in fd:
        c=line.rstrip("\n")
        if c in myEmails:
            print c #for console
            output.write(c + "\n") #for writing to file
answered 2013-11-12T16:03:44.383