
I am using Python to process data from very large text files (~52 GB, 800 million lines, each with 30 columns of data). I am trying to find an efficient way to pull out specific lines; luckily, the string I am searching for is always in the first column.

The whole thing works; memory is not a problem (I'm not loading the file into memory, just opening and closing it as needed), and I run it on a cluster anyway. It's more about speed: the script takes days to run!

The data looks something like this:

scaffold126     1       C       0:0:20:0:0:0     0:0:1:0:0:0     0:0:0:0:0:0     
scaffold126     2       C       0:0:10:0:0:0     0:0:1:0:0:0     0:0:0:0:0:0
scaffold5112     2       C       0:0:10:0:0:0     0:0:1:0:0:0     0:0:0:0:0:0
scaffold5112     2       C       0:0:10:0:0:0     0:0:1:0:0:0     0:0:0:0:0:0

and I am searching for all the lines that start with a particular string in the first column. I want to process those lines and send a summary to an output file. Then I search all the lines again for another string, and so on...

I am using something like this:

for thisScaff in AllScaffs:
    InFile = open(sys.argv[2], 'r')
    for line in InFile:
        LineList = line.split()
        currentScaff = LineList[0]
        if thisScaff == currentScaff:
            pass  # Then do this stuff...
    InFile.close()

The main problem seems to be that all 800 million lines have to be looked through to find those that match the current string, and once I move on to another string, all 800 million have to be looked through again. I have been exploring grep options, but is there another way?

Many thanks in advance!


3 Answers


Clearly you only want to read the file once. It's very expensive to read it over and over again. To speed searching, make a set of the strings you're looking for. Like so:

looking_for = set(AllScaffs)
with open(sys.argv[2]) as f:
    for line in f:
        if line.split(None, 1)[0] in looking_for:
pass  # bingo! found one

line.split(None, 1) splits on whitespace, but at most 1 split is done. For example,

>>> "abc def ghi".split(None, 1)
['abc', 'def ghi']

This is significantly faster than splitting 29 times (which will happen if each line has 30 whitespace-separated columns).
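
If you want to check the difference on your own machine (exact timings depend on your hardware and line length), a quick comparison with timeit, using a made-up 30-column line, might look like this:

import timeit

# A made-up line with 30 whitespace-separated columns, roughly like the data above.
line = "scaffold126 1 C " + " ".join(["0:0:20:0:0:0"] * 27)

print(timeit.timeit(lambda: line.split()[0], number=1000000))         # 29 splits per call
print(timeit.timeit(lambda: line.split(None, 1)[0], number=1000000))  # 1 split per call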

An alternative:

        if line[:line.find(' ')] in looking_for:

That's probably faster still, since no list at all is created. It searches for the leftmost blank, and takes the initial slice of line up to (but not including) that blank.
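
For example, on a space-separated line like the sample data above (note this assumes the first field is followed by a space; if your columns were tab-separated you would search for '\t' instead):

>>> line = "scaffold126     1       C       0:0:20:0:0:0"
>>> line[:line.find(' ')]
'scaffold126'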

answered 2013-10-23T17:59:28.463

Create an index. It will require a lot of disk space, so use it only if you have to perform these scaffold lookups many times.

Building it is a one-time job and will take a good amount of time, but it will definitely serve you in the long run.

Your Index will be of the form:

scaffold126:3|34|234443|4564564|3453454
scaffold666:1|2
scaffold5112:4|23|5456456|345345|234234

where 3, 34, etc. are line numbers. Make sure the final file is sorted alphabetically (to allow a binary search). Let's call this index Index_Primary.

Now you will create a secondary index to make the search faster; let's call it Index_Second. Say Index_Primary contains a hundred thousand lines, each representing one scaffold. Index_Second will give us jump points into it. It can look like:

scaffold1:1
scaffold1000:850
scaffold2000:1450

This says that information about scaffold2000 is present in line number 1450 of Index_Primary.

So now let's say you want to find the lines with scaffold1234: you go to Index_Second, which tells you that scaffold1234 is somewhere between line 850 and line 1450 of Index_Primary. Load that block and start from its middle, i.e., line 1150. Find the required scaffold using binary search and voila! You get the line numbers of the lines containing that scaffold, possibly within milliseconds!
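
A rough sketch of how the primary index could be built and queried follows. The function names are made up, the build step holds all the line numbers in memory (which may or may not be acceptable for 800 million lines), and for simplicity the lookup loads the whole sorted index instead of using the secondary index to jump into a slice of it:

import bisect
from collections import defaultdict

def build_primary_index(data_path, index_path):
    # One-time pass over the data: record which line numbers contain
    # each scaffold, then write the entries sorted alphabetically.
    line_numbers = defaultdict(list)
    with open(data_path) as data:
        for lineno, line in enumerate(data, start=1):
            line_numbers[line.split(None, 1)[0]].append(str(lineno))
    with open(index_path, 'w') as index:
        for scaff in sorted(line_numbers):
            index.write(scaff + ':' + '|'.join(line_numbers[scaff]) + '\n')

def lookup(index_path, scaff):
    # Binary search the alphabetically sorted index and return the
    # line numbers of the data lines containing this scaffold.
    with open(index_path) as index:
        entries = index.read().splitlines()
    keys = [entry.split(':', 1)[0] for entry in entries]
    pos = bisect.bisect_left(keys, scaff)
    if pos < len(keys) and keys[pos] == scaff:
        return [int(n) for n in entries[pos].split(':', 1)[1].split('|')]
    return []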

answered 2013-10-23T18:01:59.763

My first instinct would be to load your data into a database, making sure to create an index on column 0, and then query as needed.
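
For example, a minimal sketch using Python's built-in sqlite3 module (the database file name, table name and column names here are made up, and the database will need disk space comparable to the original file):

import sqlite3

conn = sqlite3.connect('scaffolds.db')
conn.execute('CREATE TABLE IF NOT EXISTS rows (scaffold TEXT, line TEXT)')
with open('data.txt') as f:  # your 52GB data file
    conn.executemany('INSERT INTO rows VALUES (?, ?)',
                     ((line.split(None, 1)[0], line) for line in f))
conn.execute('CREATE INDEX IF NOT EXISTS idx_scaffold ON rows (scaffold)')
conn.commit()

# Later, pull out every line for one scaffold and summarize it:
for (line,) in conn.execute('SELECT line FROM rows WHERE scaffold = ?',
                            ('scaffold126',)):
    pass  # summarize the line here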

For a plain-Python approach without a database, try this:

wanted_scaffs = set(['scaffold126', 'scaffold5112'])
files = {name: open(name + '.txt', 'w') for name in wanted_scaffs}
for line in big_file:  # big_file: the open 52GB data file
    curr_scaff = line.split(' ', 1)[0]  # minimal splitting
    if curr_scaff in wanted_scaffs:
        files[curr_scaff].write(line)
for f in files.values():
    f.close()

Then do your summary reports:

for scaff in wanted_scaffs:
    with open(scaff + '.txt', 'r') as f:
        ... # summarize your data
answered 2013-10-23T17:25:26.890