python - Python struggling with data?

Question

The basis of the programme is to convert postcodes (the UK version of ZIP codes) into co-ordinates. So I have a file with a load of postcodes (and other attached data such as house prices) and another file with all of the UK postcodes and their corresponding co-ordinates.

I turn both of these into lists and then use a for loop inside a for loop to iterate over and compare the postcodes in either file. If postcodes in file1 == postcodes in file2 then the co-ordinates are taken and appended to the relevant file.

I've got my code up and running as I want it to. All of my tests output exactly what I want, which is great.

The only problem is that it will only work with small batches of data (I've been testing with .csv files holding ~100 rows - creating lists of 100 inner lists).

Now I want to apply my programme to my entire data set. I ran it once, and nothing happened. I went away, watched some tv and still nothing happened. IDLE wouldn't let me quit the programme or anything. So I restarted and tried again, this time adding in a counter to see if my code was running. I run the code and the counter starts going. Until it hits 78902, the size of my dataset. Then it stops and does nothing. I can't do anything nor can I close the window.

The annoying thing is that it doesn't even get past reading the CSV file, so I haven't been able to manipulate my data whatsoever.

Here is the code where it gets stuck (the very first part of the code):

    #empty variable to put the list into    
    lst = []
    # List function enables use for all files
    def create_list():

        #find the file
        file2 = input('enter filepath:')
        #read the file and iterate over it to append into the list
        with open(file2, 'r') as f:
            reader = csv.reader(f, delimiter=',')
            for row in reader:
                lst.append(row)
        return lst

So does anyone know a way for me to make my data more manageable?

EDIT: for those interested here is my full code:

from tkinter.filedialog import asksaveasfile
import csv

new_file = asksaveasfile()

lst = []
# List function enables use for all files
def create_list():
    #empty variable to put the list into
    #find the file
    file2 = input('enter filepath:')
    #read the file and iterate over it to append into the list
    with open(file2, 'r') as f:
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            lst.append(row)
    return lst


def remove_space(lst):
    '''(lst)->lst
    Returns the postcode value without any whitespace

    >>> ac45 6nh
    ac456nh
    The above would occur inside a list inside a list
    '''
    filetype = input('Is this a sale or crime?: ')
    num = 0
    #check the filetype to find the position of the postcodes
    if filetype == 'sale':
        num = 3
        #iterate over the postcode to add all characters but the space
    for line in range(len(lst)):        
        pc = ''
        for char in lst[line][num]:
            if char != ' ':
                pc = pc+char
        lst[line][num] = pc

def write_new_file(lst, new_file):
    '''(lst)->.CSV file
    Takes a list and writes it into a .CSV file.
    '''
    writer = csv.writer(new_file, delimiter=',')
    writer.writerows(lst)
    new_file.close()


#conversion function
def find_coord(postcode):

    lst = create_list()
    #create python list for conversion comparison
    print(lst[0])
    #empty variables
    long = 0
    lat = 0
    #iterate over the list of postcodes, when the right postcode is found,
    # return the co-ordinates.
    for row in lst:
        if row[1] == postcode:
            long = row[2]
            lat = row[3]
    return str(long)+' '+str(lat)

def find_all_coord(postcode, file):

    #empty variables
    long = 0
    lat = 0
    #iterate over the list of postcodes, when the right postcode is found,
    # return the co-ordinates.
    for row in file:
        if row[1] == postcode:
            long = row[2]
            lat = row[3]
    return str(long)+' '+str(lat)

def convert_postcodes():
    '''
    take a list of lst = []
    #find the file
    file2 = input('enter filepath:')
    #read the file and iterate over it to append into the list
    with open(file2, 'r') as f:
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            lst.append(row)
    '''
    #save the files into lists so that they can be used
    postcodes = []
    with open(input('enter postcode key filepath:'), 'r') as f:
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            postcodes.append(row)
    print('enter filepath to be converted:')
    file = []
    with open(input('enter filepath to be converted:'), 'r') as f:
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            file.append(row)
    #here is the conversion code
    long = 0
    lat = 0
    matches = 0
    for row in range(len(file)):
        for line in range(len(postcodes)):
            if file[row][3] == postcodes[line][1]:
                long = postcodes[line][2]
                lat = postcodes[line][3]
                file[row].append(str(long)+','+str(lat))
                matches = matches+1
                print(matches)
    final_file = asksaveasfile()
    write_new_file(file, final_file)

I call the functions individually from IDLE so I can test it before making the programme run them itself.

score 3 · Accepted Answer

Your problem is that checking every postcode in one file against every row of the other makes a huge number of comparisons.

You could try saving the key file in a dict, with the postal code as the key.
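
A minimal sketch of that idea, assuming the key file keeps the postcode in column 1 and the co-ordinates in columns 2 and 3, as in the question's code. The `build_lookup` name and the sample rows are made up for illustration:

```python
def build_lookup(key_rows):
    """Map postcode -> (longitude, latitude) from rows of the key file."""
    lookup = {}
    for row in key_rows:
        lookup[row[1]] = (row[2], row[3])
    return lookup

# Made-up sample rows standing in for the real postcode key file.
key_rows = [
    ["1", "AB101AA", "-2.09", "57.14"],
    ["2", "AB101AB", "-2.10", "57.15"],
]

lookup = build_lookup(key_rows)
coords = lookup.get("AB101AB")   # one hash lookup instead of scanning every row
missing = lookup.get("ZZ999ZZ")  # None when the postcode is not in the key file
```

Building the dict costs one pass over the key file; after that, each postcode in the other file is resolved without an inner loop.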

score 2 · Accepted Answer

Perhaps you should use the sqlite3 module, load the csv files in there, and use SQL to do the join?
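
A rough sketch of what that could look like with an in-memory database. The table names, column names, and sample rows are assumptions for illustration; in practice the rows would come from `csv.reader`:

```python
import sqlite3

# Made-up sample rows; real data would be read from the two CSV files.
key_rows = [("AB101AA", "-2.09", "57.14"), ("AB101AB", "-2.10", "57.15")]
sale_rows = [("250000", "AB101AB"), ("180000", "ZZ999ZZ")]

conn = sqlite3.connect(":memory:")
# PRIMARY KEY gives the postcode column an index, so the join does not scan.
conn.execute("CREATE TABLE postcodes (code TEXT PRIMARY KEY, lng TEXT, lat TEXT)")
conn.execute("CREATE TABLE sales (price TEXT, code TEXT)")
conn.executemany("INSERT INTO postcodes VALUES (?, ?, ?)", key_rows)
conn.executemany("INSERT INTO sales VALUES (?, ?)", sale_rows)

# LEFT JOIN keeps sales whose postcode has no match; their co-ordinates are None.
joined = conn.execute(
    "SELECT s.price, s.code, p.lng, p.lat "
    "FROM sales AS s LEFT JOIN postcodes AS p ON p.code = s.code "
    "ORDER BY s.rowid"
).fetchall()
conn.close()
```

Passing a file path instead of `":memory:"` to `sqlite3.connect` would keep the loaded tables on disk between runs.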

answered 2013-09-30T20:52:25.967

score 2 · Accepted Answer

Looping through all of this data is inefficient.

One quick and dirty solution is to use SQLite or some other relational data store, to which you can apply an index (if this doesn't fix your problem straight out).

For this and other solutions, you can write a quick test using timeit on each option and increase the data size to see how the running time responds.
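
As an illustration of that kind of test, here is a sketch that times a linear scan against a dict lookup on synthetic rows of growing size. The row layout mirrors the question's key file; the helper names and the data are made up:

```python
import timeit

def make_rows(n):
    # Synthetic key rows: [id, postcode, longitude, latitude]
    return [[str(i), "PC%06d" % i, str(i), str(-i)] for i in range(n)]

def scan(rows, code):
    # What the nested loop does for each postcode: walk the whole list.
    for row in rows:
        if row[1] == code:
            return row[2], row[3]
    return None

for n in (1000, 10000, 100000):
    rows = make_rows(n)
    index = {row[1]: (row[2], row[3]) for row in rows}
    target = rows[-1][1]  # last row: the worst case for the scan
    t_scan = timeit.timeit(lambda: scan(rows, target), number=20)
    t_dict = timeit.timeit(lambda: index.get(target), number=20)
    print(n, "rows: scan %.5fs, dict %.5fs" % (t_scan, t_dict))
```

The absolute numbers depend on the machine; what matters is how each column grows as the row count increases.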

score 2 · Accepted Answer

Your code will be much more efficient if you use dict() instead of list(). General algorithm:

Load the data into two dictionaries: one for the coordinates and another for your info. Both have the postcode as the key.
Iterate through the shorter of these dictionaries and, for each postcode, look up the item with the same postcode in the larger dictionary. Save the matched postcode and coordinates somewhere.

The point is that dict() has O(1) average time complexity for a lookup by key, whereas searching a list() is O(n) (which is effectively another loop). For big data this makes a huge difference; in fact, you do not need the double loop at all.
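
The steps above can be sketched roughly as follows. The `match_coordinates` name and the sample data are hypothetical, and the sketch assumes each postcode appears only once per file:

```python
def match_coordinates(coords, info):
    """Return {postcode: (info_row, coordinate)} for postcodes in both dicts."""
    # Walk the smaller dict and probe the larger one by key.
    if len(coords) <= len(info):
        small, large = coords, info
    else:
        small, large = info, coords
    matched = {}
    for code in small:
        if code in large:
            matched[code] = (info[code], coords[code])
    return matched

# Made-up sample data keyed by postcode.
coords = {
    "AB101AA": ("-2.09", "57.14"),
    "AB101AB": ("-2.10", "57.15"),
    "AB101AC": ("-2.11", "57.16"),
}
info = {"AB101AB": ["250000"], "ZZ999ZZ": ["180000"]}

matched = match_coordinates(coords, info)
```

One caveat: a dict keeps a single value per key, so if the info file can hold several rows for the same postcode, each value would need to be a list of rows rather than one row.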

score 2 · Accepted Answer

Your main bottleneck is in your convert_postcodes function:

for row in range(len(file)):
    for line in range(len(postcodes)):

If there are N items in file and M items in postcodes then this double-loop requires M*N iterations.

Instead, loop over the items in postcodes once and save the data mapping postcodes to longitude/latitude in a dict. Then loop over file once and use this dict to supply the desired data for each item in file. This takes only M+N iterations:

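import csv

def convert_postcodes(postcode_path, file_path, output_path):
    # Build the lookup once: postcode -> [longitude, latitude]
    postcodes = dict()
    # Python 3: the csv module expects text mode with newline=''
    with open(postcode_path, 'r', newline='') as f:
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            code, lng, lat = row[1:4]
            postcodes[code] = [lng, lat]
    with open(file_path, 'r', newline='') as fin, \
            open(output_path, 'w', newline='') as fout:
        reader = csv.reader(fin, delimiter=',')
        writer = csv.writer(fout, delimiter=',')
        for row in reader:
            code = row[3]
            # Assumption: rows with an unknown postcode get empty
            # coordinate columns instead of raising KeyError
            row.extend(postcodes.get(code, ['', '']))
            writer.writerow(row)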
