I have a text file containing numbers that looks like this:

[mpz(0), mpz(0), mpz(0), mpz(0), mpz(4), mpz(54357303843626),...]

Is there a simple way to parse it directly into a list of integers? It doesn't matter whether the target data type is an mpz integer or a plain Python integer.

What I have tried so far, and what works, is plain parsing (note: the target array y_val3 needs to be initialized with zeros in advance, since it may be larger than the list in the text file):

import re

text_file = open("../prod_sum_copy.txt", "r")
content = text_file.read()[1:-1]  # strip the surrounding brackets
text_file.close()
content_list = content.split(",")
y_val3 = [0]*10000
print(content_list)
for idx, element in enumerate(content_list):
    m = re.search(r'mpz\(([0-9]+)\)', element)
    y_val3[idx] = int(m.group(1))
print(y_val3)

Although this approach works, I am not sure whether it is best practice or whether a more elegant way exists than just plain parsing.

To make things easier: here is the original text file on GitHub. Note: this text file might grow in the future, which brings aspects such as performance and scalability into play.

2 Answers

I tried to look at a more elegant solution from both the human-readability perspective and the performance perspective.

Caveats:

  • There is a lot going on here
  • I do not have the original file, so the numbers below will not match any numbers you might get on your device
  • It is too much work to benchmark every part, so I tried to focus on several of the biggest components

The breakouts and timings below seem to show an order-of-magnitude difference between several of the approaches, so they may still be useful for gauging the level of computational effort.

My first approach was to measure how much overhead the file read added to the process, so that we could see how much of the computational effort went into just the data-processing step.

To do this, I made a function that included the file read and measured the whole process end to end, to see how long it took with my mini example file. I did this using %timeit in a Jupyter notebook.

I then broke the file-reading step out into its own function and used %timeit on just the data-processing step, to show us:

  • how much time was spent on file reads vs. data processing in the original approach
  • how much time was spent on data processing in the improved approach

Original Approach (in a function)

import re

def original():
    text_file = open("../prod_sum_copy.txt", "r")
    content = text_file.read()[1:-1]
    text_file.close()
    content_list = content.split(",")

    y_val3 = [0]*10000

    for idx, element in enumerate(content_list):
        m = re.search(r'mpz\(([0-9]+)\)', element)
        y_val3[idx] = int(m.group(1))
    return y_val3

I presume that a significant portion of the processing time for my really short example data is simply the time spent opening the file on disk, reading the data into memory, closing the file, and so on.

%timeit original()
140 µs ± 10.2 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Separate Readfile from Data Processing Approach

This approach includes a minor improvement to the file-reading process. The timing test does not include the file read, so we won't know how much that minor change affects the overall process. For the record, I eliminated the manual call to the .close() method by wrapping the read in a with context manager (which handles closing in the background), as this is a Python best practice for reading files.

import re

def read_filea():
    with open("../prod_sum_copy.txt", "r") as text_file:
        content = text_file.read()[1:-1]
        return content

content = read_filea()
print(content)
def a():
    y_val3 = [0]*10000
    content_list = content.split(",")
    for idx, element in enumerate(content_list):
        m = re.search(r'mpz\(([0-9]+)\)', element)
        y_val3[idx] = int(m.group(1))
    return y_val3

Timing the file read on its own supports our prediction that file I/O accounts for a noticeable share of the end-to-end time in this simple test case. Subtracting it from the end-to-end measurement also gives us an idea of how long the data-processing portion alone takes. Let's look at another approach to see if we can trim that time down a bit.

%timeit read_filea()
21.5 µs ± 185 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
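
For completeness, the data-processing step on its own would be timed the same way; the measurement for this call is not preserved here, and the result will vary with your machine and file:

%timeit a()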

Simplified Data Processing Approach (and Separate Readfile)

Here we will try some Python best practices and tools to cut down the overall time, including:

  • list comprehension
  • use of the re.findall() method to eliminate the direct, repeated calls to re.search() and m.group() (NOTE: findall() is likely doing some of that work in the background, and I honestly don't know whether avoiding those calls helps performance, BUT I find this approach more readable than the original; see the quick demo right after this list)
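
As a quick demo of what re.findall() gives us here (the sample string below is made up, mimicking the file's format): with a single capture group in the pattern, it returns a list of just the captured digit strings.

import re

sample = "[mpz(0), mpz(4), mpz(54357303843626)]"
# With one capture group, findall() returns only the group matches
print(re.findall(r'mpz\(([0-9]+)\)', sample))
# ['0', '4', '54357303843626']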

Let's look at the code:

import re

def read_fileb():
    with open("../prod_sum_copy.txt", "r") as text_file:
        content = text_file.read()[1:-1]
    return content

content = read_fileb()

def b():
    y_val3 = [int(element) for element in re.findall(r'mpz\(([0-9]+)\)', content)]
    return y_val3

The data processing portion of this approach is about 10 times faster than the data processing steps in the original approach.

%timeit b()
2.89 µs ± 210 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)  
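
If the file keeps growing, one further tweak worth trying (my own suggestion, not benchmarked here) is precompiling the pattern with re.compile(), so the compiled regex is reused directly instead of being looked up in the module cache on every call:

import re

MPZ_PATTERN = re.compile(r'mpz\(([0-9]+)\)')

def c():
    # Same semantics as b(), just with a precompiled pattern
    return [int(element) for element in MPZ_PATTERN.findall(content)]
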
answered 2022-01-23T13:33:39.630

There is one clever trick for converting data from Python's printed format back into the original objects: just do obj = eval(string); full example below.

You can use this eval solution for almost any Python object, even complex ones that were printed to a file through print(python_object) or similar. Basically, anything that is valid Python code can be converted from a string by eval().

eval() lets you avoid string processing/parsing functions entirely: no regular expressions or anything else.

Beware that eval() doesn't check what string it runs, so if the string came from an unknown source it can contain malicious code, and that code can do anything to your PC. Only use eval() with trusted strings of code.
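
If you still want eval() but with slightly less exposure, a common mitigation (my addition, and it is only a mitigation: it reduces the attack surface but is not a real sandbox and can be escaped) is to pass explicit globals/locals so that only the mpz name is visible:

from gmpy2 import mpz

line = '[mpz(12), mpz(34), mpz(56)]'
# Empty __builtins__ and expose only mpz; this limits, but does NOT
# eliminate, what a malicious string could do
nums = eval(line, {'__builtins__': {}}, {'mpz': mpz})
print(nums)  # [mpz(12), mpz(34), mpz(56)]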

The code below uses a text string with example file content. I used a string rather than a file so that my code is fully runnable by Stack Overflow visitors, without dependencies. With a file f opened for reading, you just replace for line in text.split('\n'): with for line in f: and the code works.

Try it online!

from gmpy2 import mpz

text = '''
[mpz(12), mpz(34), mpz(56)]
[mpz(78), mpz(90), mpz(21)]
'''

nums = []
for line in text.split('\n'):
    if not line.strip():
        continue  # skip blank lines
    # eval() re-creates each printed line as a list of mpz objects
    nums.append(eval(line))

print(nums)

Output:

[[mpz(12), mpz(34), mpz(56)], [mpz(78), mpz(90), mpz(21)]]
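
For reference, the file-based variant described above would look like this (the path is taken from the question, so adjust it to your setup):

from gmpy2 import mpz

nums = []
with open('../prod_sum_copy.txt', 'r') as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines
        nums.append(eval(line))

print(nums)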
answered 2022-01-24T11:02:23.987