
Read on:

I have a list which, while trying to edit it, I convert into one very long string which, as you can gather, is called tempString. It works so far, it just takes a very long time to run, probably because it is several different regex subs. They are as follows:

import re
# join the list into one long comma-separated string
tempString = ','.join(str(n) for n in coords)
# collapse runs of 2-6 commas into a single underscore
tempString = re.sub(r',{2,6}', '_', tempString)
# turn anything that is not a digit, minus, dot or underscore into a comma
tempString = re.sub(r'[^0-9\-._]', ',', tempString)
# collapse repeated commas
tempString = re.sub(r',+', ',', tempString)
# pull out every "number,number,number" triple
clean1 = re.findall(r'[-+]?[0-9]*\.?[0-9]+,[-+]?[0-9]*\.?[0-9]+,'
                    r'[-+]?[0-9]*\.?[0-9]+', tempString)
tempString = '_'.join(str(n) for n in clean1)
# replace the remaining commas with spaces
tempString = re.sub(',', ' ', tempString)

Basically it is one long string containing commas and roughly 1-5 million sets of 4 floats/ints (possibly a mix of both), like:

-5.65500020981,6.88999986649,-0.454999923706,1,,,-5.65500020981,6.95499992371,-0.454999923706,1,,,

I don't need/want the 4th number in each set; I really just want to split the string into a list with 3 floats in each entry, separated by spaces.

The code above works flawlessly, but as you can imagine it is quite time-consuming on a large string.

I have done a lot of research here looking for a solution, but everything seems to be aimed at words, i.e. swapping one word for another.


Edit: OK, here is the solution I am currently using:

def getValues(s):
    output = []
    while s:
        # take the three values you want, discard the unwanted 4th value and
        # the two empty fields from the ",,," separator, keep the remainder
        v1, v2, v3, _, _, _, s = s.split(',', 6)
        output.append("%s %s %s" % (v1.strip(), v2.strip(), v3.strip()))         
    return output
coords = getValues(tempString)

Does anyone have any suggestions to speed this up further? After running some tests, it still takes longer than I would like.

I have been looking at NumPy, but honestly I have no idea how to use it. I know that once the above is done and the values are cleaned up I could work with them much more efficiently using NumPy, but I am not sure how NumPy could apply to the situation above.

The above takes around 20 minutes to clean through 50k sets; I can't imagine how long it would take on my full string of 1 million sets. It's just surprising that the program that originally exported the data took only around 30 seconds for the full 1 million sets.


2 Answers


Based on your sample data:

>>> s = "-5.65500020981,6.88999986649,-0.454999923706,1,,,-5.65500020981,6.95499992371,-0.454999923706,1,,,"
>>> def getValues(s):
...     output = []
...     while s:
...         # take the three values you want, discard the unwanted 4th value
...         # and the two empty fields, and keep the rest of the string
...         v1, v2, v3, _, _, _, s = s.split(',', 6)
...         output.append("%s %s %s" % (v1, v2, v3))
...         
...     return output
>>> getValues(s)
['-5.65500020981 6.88999986649 -0.454999923706', '-5.65500020981 6.95499992371 -0.454999923706']

...once you have those parsed values as strings in a list you can do whatever else you need to do.
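For example (my own follow-up illustration, not part of the original answer), converting those strings into numeric tuples is a one-liner:

>>> points = [tuple(float(x) for x in triple.split()) for triple in getValues(s)]
>>> points[0]
(-5.65500020981, 6.88999986649, -0.454999923706)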

Or if you prefer, use a generator so you don't need to build the entire result list at once:

>>> def getValuesGen(s):
...     while s:
...         v1, v2, v3, _, _, _, s = s.split(',', 6)
...         yield "%s %s %s" % (v1, v2, v3)
>>> for v in getValuesGen(s):
...     print(v)
...
-5.65500020981 6.88999986649 -0.454999923706
-5.65500020981 6.95499992371 -0.454999923706

You may also want to try an approach that pre-splits your long string on the ",,," set of commas instead of continually rebuilding and re-processing a shrinking string, like:

>>> def getValues(s):
...     # split your long string into a list of chunked strings
...     strList = s.split(",,,")
...     for chunk in strList:
...         if chunk:
...             # ...then just parse apart each individual set of data values
...             vals = chunk.split(',')
...             yield "%s %s %s" % (vals[0], vals[1], vals[2])
>>> for v in getValues(s10):
...     print(v)
-5.1 6.8 -0.454
-5.1 6.8 -0.454
-5.1 6.8 -0.454
-5.1 6.8 -0.454
-5.1 6.8 -0.454
-5.1 6.8 -0.454
-5.1 6.8 -0.454
-5.1 6.8 -0.454
-5.1 6.8 -0.454
-5.1 6.8 -0.454

At some point when you're dealing with huge data sets like this and have speed issues it starts to make sense to push things down into modules that are doing the hard work in C, like NumPy.
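As a minimal sketch of that idea (my own illustration, not the answerer's code; it assumes the string looks exactly like the sample above, i.e. groups of four comma-separated numbers with ",,," between groups and a trailing ",,,"):

import numpy as np

def to_array(tempString):
    # collapse the ",,," group separators to single commas, drop the
    # trailing separator, and parse everything as one flat float array
    tokens = tempString.replace(',,,', ',').strip(',').split(',')
    flat = np.array(tokens, dtype=float)
    # each group holds four values; keep only the first three of each
    return flat.reshape(-1, 4)[:, :3]

The reshape and slice run in C inside NumPy, so after the single split the per-value Python overhead disappears, and the result is already a float array ready for further numeric work.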

Answered 2012-10-19T21:51:48.117

One way to reduce the memory drain without having to change anything in the regex would be to use the re.finditer() method instead of re.findall(). This would iterate through the values one-by-one rather than reading the entire string into a single list object. http://docs.python.org/library/re.html#re.finditer
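For instance, here is a sketch of the question's findall step rewritten with finditer (same pattern as the question; only the way the matches are consumed changes):

import re

pattern = re.compile(r'[-+]?[0-9]*\.?[0-9]+,[-+]?[0-9]*\.?[0-9]+,'
                     r'[-+]?[0-9]*\.?[0-9]+')

def iter_triples(tempString):
    # yield one "x,y,z" match at a time instead of building the whole list
    for match in pattern.finditer(tempString):
        yield match.group(0)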

Answered 2012-10-19T21:51:40.940