I have two huge XML files. One is around 40GB, the other around 2GB. Assume the XML format is something like this:
<xml>
...
  <page>
    <id>123</id>
    <title>ABC</title>
    <text> .....
    .....
    .....
    </text>
  </page>
...
</xml>
I have created an index file for both file 1 and file 2 using mmap.
Each of the index files complies with this format:
Id <page>_byte_position </page>_byte_position
So, given an Id, the index file tells me the byte position where the <page> tag for that Id starts and where the </page> tag ends.
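For example (the byte offsets here are made up), a single index line would look like this and split into three fields:

# hypothetical index line: id, <page> start offset, </page> offset
line = "123 10456 20987"
page_id, start_pos, end_pos = line.split(" ")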
Now, what I need to do is:
- figure out, for each id in the smaller index file (for the 2GB file), whether that id exists in the larger index file
- if the id exists, get the <page>_byte_pos and </page>_byte_pos for that id from the larger index file (for the 40GB file)
My current code is awfully slow. I guess I am doing an O(m*n) algorithm, where m is the size of the larger index and n the size of the smaller one.
with open(smaller_idx_file, "r+b") as f_small_idx:
    for line in f_small_idx.readlines():
        split = line.split(" ")
        # re-reads the entire larger index for every line of the smaller one
        with open(larger_idx_file, "r+b") as f_large_idx:
            for line2 in f_large_idx.readlines():
                split2 = line2.split(" ")
                if split[0] in split2:
                    print split[0]
                    print split2[1] + " " + split2[2]
This is awfully slow. Any better suggestions?
Basically, given two huge files, how do you check whether each value in a particular column of the smaller file exists in the larger file, and if it does, extract the other relevant fields from the matching line?
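What I was considering is building an id -> (start, end) dict from the larger index in one pass, then doing a single pass over the smaller index with O(1) lookups, assuming the larger index itself fits in memory (it only holds an id and two byte offsets per page, not the 40GB of XML). A rough sketch of what I mean, not tested at this scale:

# build an id -> (start_pos, end_pos) lookup table from the larger index
large_index = {}
with open(larger_idx_file, "r") as f_large_idx:
    for line in f_large_idx:
        page_id, start_pos, end_pos = line.split()
        large_index[page_id] = (start_pos, end_pos)

# single pass over the smaller index: one dict lookup per id instead of
# re-scanning the larger index every time
with open(smaller_idx_file, "r") as f_small_idx:
    for line in f_small_idx:
        page_id = line.split()[0]
        if page_id in large_index:
            start_pos, end_pos = large_index[page_id]
            print(page_id)
            print(start_pos + " " + end_pos)

Is this the right direction, or is there a better / more memory-friendly way to do it?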
Any suggestions would be greatly appreciated!! : )