
I have two huge XML files. One is around 40 GB, the other around 2 GB. Assume the XML format is something like this:

<xml>
    ...
    <page>
        <id>123</id>
        <title>ABC</title>
        <text> .....
            .....
            .....
        </text>
    </page>
    ...
</xml>

I have created an index file for each of the two files using mmap.
Each index file follows this format:

Id  <page>_byte_position    </page>_byte_position   

So, given an Id, the index file tells me the byte position where the <page> tag for that Id starts and the byte position where it ends.
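For example (values hypothetical), if the <page> tag for id 123 starts at byte 4096 and its matching </page> tag ends at byte 8192, the index line would be:

123 4096 8192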

Now, what I need to do is:

- figure out, for each id in the smaller index file (for the 2 GB file), whether that id exists in the larger index file
- if it exists, get the <page>_byte_position and </page>_byte_position for that id from the larger index file (for the 40 GB file)

My current code is awfully slow. I guess I am using an O(m*n) algorithm, where m is the number of lines in the larger index file and n the number in the smaller one.

# This re-reads the entire larger index file for every line of the
# smaller one, which is what makes it O(m*n).
with open(smaller_idx_file, "r") as f_small_idx:
    for line in f_small_idx.readlines():
        split = line.split(" ")
        with open(larger_idx_file, "r") as f_large_idx:
            for line2 in f_large_idx.readlines():
                split2 = line2.split(" ")
                if split2[0] == split[0]:  # compare ids only
                    print split[0]
                    print split2[1] + "  " + split2[2]

This is AWFULLY slow!
Any better suggestions?

Basically, given two huge files, how do you check whether each word in a particular column of the smaller file exists in the larger file, and if it does, extract the other relevant fields from that line as well?

Any suggestions would be greatly appreciated! :)


1 Answer


I don't have time for an elaborate answer right now, but this should work (assuming the temporary dict fits into memory):

  1. Iterate over the smaller file and put all the words of the relevant column into a dict (lookup in a dict has an average-case performance of O(1)).
  2. Iterate over the larger file and look up each word in the dict, storing the relevant information either directly in the dict entries or elsewhere (see the sketch after this list).
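Here is a minimal sketch of that approach, assuming the index-line layout from the question (id, <page> byte position, </page> byte position, whitespace-separated); the file names are hypothetical, and the code runs under both Python 2 and 3:

smaller_idx_file = "small.idx"  # hypothetical paths
larger_idx_file = "large.idx"

# Step 1: collect every id from the smaller index file into a dict.
wanted = {}
with open(smaller_idx_file, "r") as f_small_idx:
    for line in f_small_idx:
        page_id = line.split()[0]
        wanted[page_id] = None  # byte positions filled in below

# Step 2: stream the larger index file once; each dict lookup is O(1).
with open(larger_idx_file, "r") as f_large_idx:
    for line in f_large_idx:
        fields = line.split()
        if fields[0] in wanted:
            # remember the <page> / </page> byte positions for this id
            wanted[fields[0]] = (fields[1], fields[2])

# Report the ids that were found in both files.
for page_id, positions in sorted(wanted.items()):
    if positions is not None:
        print("%s  %s  %s" % (page_id, positions[0], positions[1]))

Each file is read exactly once, so the total work is O(m + n) instead of O(m*n).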

If this does not work, I would suggest sorting (or filtering) the files first so that chunks can then be processed independently (i.e. compare only everything that starts with A, then B, ...).
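For instance, once both index files have been sorted by their id column (say with Unix sort -k1,1), a single-pass merge join needs only constant memory. This is a sketch under that assumption; note that it compares ids as strings, so the sort order must be lexicographic:

# Merge join over two index files that are both sorted by their first
# column (e.g. with `sort -k1,1 file`). Ids are compared as strings.
def merge_join(small_sorted, large_sorted):
    with open(small_sorted) as fs, open(large_sorted) as fl:
        small_line = fs.readline()
        large_line = fl.readline()
        while small_line and large_line:
            small_id = small_line.split()[0]
            large_fields = large_line.split()
            if small_id == large_fields[0]:
                # id exists in both files: emit its byte positions
                print("%s  %s  %s" % (large_fields[0],
                                      large_fields[1],
                                      large_fields[2]))
                small_line = fs.readline()
                large_line = fl.readline()
            elif small_id < large_fields[0]:
                small_line = fs.readline()
            else:
                large_line = fl.readline()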

answered 2013-05-13T16:33:02.743