I'm trying to do some calculations with Python, but I'm running out of memory. So I want to read/write a file to free up memory. I need something like a very large list object, so I thought of writing one line per object to a file and reading/writing those lines instead of keeping everything in memory. Line ordering is important to me because I will use the line numbers as indexes. So I'd like to know how I can replace a line in a file in Python without moving the other lines around (actually, moving lines is fine, as long as they come back to the positions I expect).

EDIT

I'm trying to help a friend of mine who is worse than, or at best equal to, me at Python. This code is supposed to find the biggest prime divisor of a given non-prime number. The code works for numbers up to around one million, but after that it dies: my memory runs out while trying to build the list of numbers.

# a comes from a user input
primes_upper_limit = (a+1) / 2
counter = 3L
numbers = list()
# build the list of odd candidates 3, 5, 7, ... up to primes_upper_limit
while counter <= primes_upper_limit:
    numbers.append(counter)
    counter += 2L

# sieve: zero out the multiples of each number still marked as prime
counter = 3
i = 0
half = (primes_upper_limit + 1) / 2 - 1   # number of odd candidates in the list
root = primes_upper_limit ** 0.5
while counter < root:
    if numbers[i]:
        # index j corresponds to the odd number 2*j + 3
        j = int((counter*counter - 3) / 2)
        numbers[j] = 0
        while j < half:
            numbers[j] = 0
            j += counter
    i += 1
    counter = 2*i + 3

# collect the survivors and report the largest one that divides a
primes = [2] + [num for num in numbers if num]
for numb in reversed(primes):
    if a % numb == 0:
        print numb
        break
ANOTHER EDIT

What about writing a different file for each index? For example, a billion files with long-integer file names, each file holding just a single number?

4 Answers

You want to find the largest prime divisor of a (Project Euler Question 3). Your current choice of algorithm and implementation do this by:

  1. Generate a list numbers of all candidate primes in the range 3 <= n <= sqrt(a) (or (a+1)/2, as you currently do)
  2. Sieve the numbers list to get a list of primes {p} <= sqrt(a)
  3. Trial Division: test the divisibility of a by each p. Store all prime divisors {q} of a.
  4. Print all divisors {q}; we only want the largest.

My comments on this algorithm are below. Sieving and trial division are seriously non-scalable algorithms, as Owen and I point out. For large a (a billion, or a trillion) you really should use NumPy. Anyway, some comments on implementing this algorithm:

  1. Did you know you only need to test up to √a, int(math.sqrt(a)), not (a+1)/2 as you do?
  2. There is no need to build a huge list of candidate numbers and then sieve it for primeness - the numbers list is not scalable. Just construct the list primes directly. You can use while/for-loops and xrange(3, int(sqrt(a))+2, 2) (which gives you an iterator). As you mention, xrange() overflows at 2**31, but combined with the sqrt observation you can still successfully factor up to 2**62.
  3. In general this is inferior to getting the prime decomposition of a, i.e. every time you find a prime divisor p | a, you only need to continue sieving the remaining factor a/p (or a/p², a/p³, according to its multiplicity). Except for the rare case of very large primes (or pseudoprimes), this will greatly reduce the magnitude of the numbers you are working with.
  4. Also, you only ever need to generate the list of primes {p} once; thereafter store it and do lookups, not regenerate it. So I would separate out generate_primes(a) from find_largest_prime_divisor(a). Decomposition helps greatly.

Here is my rewrite of your code, but performance still falls off in the billions (a > 10**11 +1) due to keeping the sieved list. We can use collections.deque instead of list for primes, to get a faster O(1) append() operation, but that's a minor optimization.

# Prime Factorization by trial division

from math import ceil, sqrt
from collections import deque

# Global list of primes (strictly we should use a class variable, not a global)
#primes = deque()
primes = []

def is_prime(n):
    """Test whether n is divisible by any prime known so far"""
    global primes
    for p in primes:
        if n % p == 0:
            return False # n was divisible by p
    return True # either n is prime, or divisible by some p larger than our list

def generate_primes(a):
    """Generate sieved list of primes (up to sqrt(a)) as we go"""
    global primes
    primes_upper_limit = int(sqrt(a))
    # We get a huge speedup by using xrange() instead of range(), so we have to seed the list with 2
    primes.append(2)
    print "Generating sieved list of primes up to", primes_upper_limit, "...",
    # Consider odd prime candidates 3,5,7,9... in increasing increments of 2
    for number in xrange(3, primes_upper_limit+2, 2):
        if is_prime(number): # use global 'primes'
            primes.append(number) # Found a new prime larger than our list
    print "done"

def find_largest_prime_factor(x, debug=False):
    """Find all prime factors of x, and return the largest."""
    global primes
    # First we need the list of all primes <= sqrt(x)
    generate_primes(x)
    to_factor = x # running value of the remaining quantity we need to factor
    largest_prime_factor = None
    for p in primes:
        if debug: print "Testing divisibility by", p
        if to_factor % p != 0:
            continue
        if debug: print "...yes it is"
        largest_prime_factor = p
        # Divide out all factors of p in x (may have multiplicity)
        while to_factor % p == 0:
            to_factor /= p
        # Stop when all factors have been found
        if to_factor == 1:
            break
    else:
        # The loop exhausted all primes <= sqrt(x) without reducing to 1,
        # so the remaining cofactor is itself a single prime > sqrt(x)
        print "Tested all primes up to sqrt(a); remaining factor must be a single prime > sqrt(a):", to_factor
        largest_prime_factor = to_factor
    print "\nLargest prime factor of x is", largest_prime_factor
    return largest_prime_factor
answered 2011-08-27 at 19:58

If I'm understanding you correctly, this is not an easy task. The way I interpreted it, you want to keep a file handle open and use the file as a place to store character data.

Say you had a file like

a
b
c

and you wanted to replace 'b' with 'bb'. That's going to be a pain, because the file actually looks like a\nb\nc -- you can't just overwrite the b, you need another byte.
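
If you really did want the in-place replacement the question asks for, the usual workaround (a sketch of mine, not something proposed in this thread) is to pad every line to a fixed width, so that record i always starts at byte i * (WIDTH + 1) and can be overwritten with seek(). The file name and width below are made up for illustration:

WIDTH = 20   # assumed fixed record width; every value must fit in WIDTH digits

# create a file of 10 fixed-width records, one per index
f = open("numbers.dat", "w+b")
for i in range(10):
    f.write(str(0).rjust(WIDTH) + "\n")

def write_record(f, index, value):
    # every record is exactly WIDTH + 1 bytes, so we can seek straight to it
    f.seek(index * (WIDTH + 1))
    f.write(str(value).rjust(WIDTH) + "\n")

def read_record(f, index):
    f.seek(index * (WIDTH + 1))
    return long(f.read(WIDTH))

write_record(f, 5, 12345L)
print read_record(f, 5)    # -> 12345, and no other line has moved
f.close()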

My advice would be to try and find a way to make your algorithm work without using a file for extra storage. If you got a stack overflow, chances are you didn't really run out of memory; you overran the call stack, which is much smaller.

You could try reworking your algorithm to not be recursive. Sometimes you can use a list as a substitute for the call stack -- but there are many things you could do, and I can't give much general advice without seeing your algorithm; a generic sketch of the list-as-stack trick follows below.
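
As a generic illustration (the function and data here are hypothetical, not taken from the question), this is the usual way to turn a recursive traversal into a loop, with a plain list playing the role of the call stack:

# hypothetical example: sum the values in a nested list without recursion
def nested_sum(items):
    total = 0
    stack = [items]                 # the list stands in for the call stack
    while stack:
        current = stack.pop()
        if isinstance(current, list):
            stack.extend(current)   # "recurse" by pushing children
        else:
            total += current
    return total

print nested_sum([1, [2, 3], [4, [5, 6]]])   # prints 21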

edit

Ah I see what you mean... when the list

while counter <= primes_upper_limit:
    numbers.append(counter)
    counter += 2L

grows really big, you could run out of memory. So I guess you're basically doing a sieve, and that's why you have the big list numbers? It makes sense. If you want to keep doing it this way, you could try a numpy bool array, because it will use substantially less memory per cell:

import numpy as np

numbers = np.repeat(True, a/2)
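
As a sketch of how the question's sieve might look with that bool array standing in for the big list (my illustration, not the answer's code; a is the user input as before, and index i represents the odd number 2*i + 3):

import numpy as np

primes_upper_limit = (a + 1) / 2
# flags[i] is True while the odd number 2*i + 3 is still a prime candidate
flags = np.repeat(True, (primes_upper_limit - 1) // 2)

for i in xrange(len(flags)):
    p = 2 * i + 3
    if p * p > primes_upper_limit:
        break
    if flags[i]:
        # cross off odd multiples of p, starting at p*p
        flags[(p * p - 3) // 2::p] = False

primes = [2] + [2 * i + 3 for i in np.nonzero(flags)[0]]
for numb in reversed(primes):
    if a % numb == 0:
        print numb
        break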

Or (and maybe this is not appealing) you could go with an entirely different approach that doesn't use a big list, such as factoring the number entirely and picking the biggest factor.

Something like:

factors = [ ]
tail = a

# repeatedly strip off the smallest remaining factor;
# tail shrinks until only 1 is left
while tail > 1:
    j = 2
    while 1:
        if tail % j == 0:
            factors.append(j)
            tail = tail / j
            print('%s %s' % (factors, tail))
            break
        else:
            j += 1

i.e. say you were factoring 20: tail starts out as 20; you find 2, so tail becomes 10; you find 2 again, so tail becomes 5; finally you find 5 and tail becomes 1.

This is not terribly efficient and will become way too slow for a large (billions) prime number, but it's OK for numbers with small factors.

I mean your sieve is good too, until you start running out of memory ;). You could give numpy a shot.

answered 2011-08-27 at 19:33

pytables is excellent for working with and storing huge amounts of data. But first, start by implementing the suggestions in smci's answer to minimize the amount of numbers you need to store.
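
For instance (a minimal sketch only: the file name and array layout are mine, and it assumes the PyTables 2.x API with openFile/createEArray), an extendable on-disk array of integers looks like:

import tables

# create an HDF5 file holding one extendable array of 64-bit integers
fileh = tables.openFile("numbers.h5", mode="w")
numbers = fileh.createEArray(fileh.root, "numbers",
                             tables.Int64Atom(), shape=(0,))

numbers.append([3, 5, 7, 9, 11])   # rows live on disk, not in RAM
print numbers[2]                   # read back a single element: 7

fileh.close()

Indexing and slicing then read from disk instead of memory, so the working set stays small no matter how large the array grows.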

answered 2011-08-27 at 22:27

For a number with only twelve digits, as in Project Euler #3, no fancy integer factorization method is needed, and there is no need to store intermediate results on disk. Use this algorithm to find the factors of n:

  1. Set f = 2.
  2. If n = 1, stop.
  3. If f * f > n, print n and stop.
  4. Divide n by f, keeping both the quotient q and the remainder r.
  5. If r = 0, print f, set n = q, and go to Step 2.
  6. Otherwise, increase f by 1 and go to Step 3.

This just does trial division by every integer until it reaches the square root, which indicates that the remaining cofactor is prime. Each factor is printed as it is found.
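
A direct Python transcription of those six steps (my own, written in the Python 2 style used elsewhere on this page):

def print_factors(n):
    f = 2                       # Step 1
    while n > 1:                # Step 2: stop when n = 1
        if f * f > n:           # Step 3: remaining cofactor is prime
            print n
            break
        q, r = divmod(n, f)     # Step 4: quotient and remainder
        if r == 0:              # Step 5: f divides n, so print it
            print f
            n = q
        else:                   # Step 6: try the next trial divisor
            f += 1

print_factors(600851475143)     # Project Euler #3; the largest factor, 6857, is printed last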

answered 2011-08-28 at 00:08