python - Handling a specific case with split in Python

Question

Even when I use split, a few of the numbers do not get split apart:

707 K   -7 -7 -6 -8 -2 -5 -8 -8  2 -5 -4 -7 -6  6 -8 -6 -7 -4  8 -6
708 L    0  0 -2 -3 -3  1  3 -3  0 -1 -3  4 -2 -5 -2  0  0 -3 -2  0
709 V   -7-10 -9-10 12 -9-10 -9 -9 -8 -8 -9 -8 -9 -9 -5 -7 -9 -4 -7
710 P   -1 -2  2  1 -2  1  2 -1  1 -4 -4  2 -3 -4  3  1  0 -1 -3 -4
711 E   -3 -3 -3  1 -6  1  5 -3  2  0 -1 -1 -1 -1 -5 -2 -1 -4  0  0
712 C   -7-10 -9-10 12 -9-10 -9 -9 -8 -8 -9 -8 -9 -9 -4 -7 -9 -9 -7
713 S   -4 -4  1  1 -5 -2 -1  6 -1 -8 -7 -3 -7 -4 -2 -1 -3 -4 -4 -7

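This matrix comes from a text file and is very large (both the number of lines per file and the number of such files).

I managed to read the lines in Python and split them like this: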

m = fp.readline()        # fp is the file object; reading is done in a loop
m = m.split()            # split the line on whitespace
m = map(int, m[2:22])    # convert the strings at indices 2 to 21 to integers

But this raises an error for lines 709 and 712 (see the leftmost columns of the matrix):

Traceback (most recent call last):
  File "<pyshell#150>", line 1, in <module>
    map(int,m[2:22].split())
ValueError: invalid literal for int() with base 10: '-7-10'

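That is because, due to the malformed formatting of the file, '-7-10' is not split into '-7' and '-10' as expected.

So the question is how to handle this formatting error so that these lines are split and converted to integers like the other rows of the matrix. Keep in mind that this has to work for very long files and many of them, so fixing the formatting by hand is not feasible, even though there are fewer than 100 such errors in any single file. Please help. Thanks.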

score 2 · Accepted Answer

You can run replace on every minus sign so that each one is preceded by at least one space:

m = "-7-10 -9-10 12 -9-10 -9 -9 -8 -8 -9 -8 -9 -9 -5 -7 -9 -4 -7"
print(m.replace("-", " -").split())

Result:

['-7', '-10', '-9', '-10', '12', '-9', '-10', '-9', '-9', '-8', '-8', '-9', '-8', '-9', '-9', '-5', '-7', '-9', '-4', '-7']

Of course, this only helps when the value running into its neighbor is negative. If you have colliding values such as:

707 K   -1 -2 -3
707 K   -4123 -6

then you cannot easily separate -4 from 123.
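
As a rough sketch of how the replace approach might be applied to a whole row in Python 3 (where map returns an iterator rather than a list): the helper name parse_row is hypothetical, and it assumes the first two whitespace-separated tokens are the row number and the letter, as in the sample data.

```python
def parse_row(line):
    """Split one matrix row into (index, letter, scores)."""
    # Hypothetical helper: assumes the row starts with a number and a letter.
    # Pad every minus sign with a space so fused values like '-7-10' separate.
    tokens = line.replace("-", " -").split()
    return int(tokens[0]), tokens[1], [int(t) for t in tokens[2:]]


row = "709 V   -7-10 -9-10 12 -9-10 -9 -9 -8 -8 -9 -8 -9 -9 -5 -7 -9 -4 -7"
index, letter, scores = parse_row(row)
print(index, letter, len(scores))
```

This shares the limitation described above: two fused values can only be separated when the second one is negative.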

score 2 · Accepted Answer

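You can (as Simon pointed out in the comments) simply parse the numbers by fixed position. The list comprehension below does this by looping from 7 up to 67 in steps of 3 and converting each substring to an integer (the fields in your example appear to start at position 7 and are 3 characters wide each).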

>>> m
'709 V   -7-10 -9-10 12 -9-10 -9 -9 -8 -8 -9 -8 -9 -9 -5 -7 -9 -4 -7\n'

>>> a = [int(m[pos:pos+3]) for pos in range(7,67,3)]

>>> a
[-7, -10, -9, -10, 12, -9, -10, -9, -9, -8, -8, -9, -8, -9, -9, -5, -7, -9, -4, -7]
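
The same idea could be wrapped in a small function and applied to every row. This is only a sketch: parse_fixed and its default parameters (start=7, width=3, count=20) are assumptions read off the sample data, not something the file format guarantees.

```python
def parse_fixed(line, start=7, width=3, count=20):
    """Slice `count` fields of `width` characters, beginning at column `start`."""
    # int() ignores surrounding whitespace, so ' -7' and ' 12' convert cleanly.
    return [int(line[pos:pos + width])
            for pos in range(start, start + width * count, width)]


rows = [
    "708 L    0  0 -2 -3 -3  1  3 -3  0 -1 -3  4 -2 -5 -2  0  0 -3 -2  0\n",
    "709 V   -7-10 -9-10 12 -9-10 -9 -9 -8 -8 -9 -8 -9 -9 -5 -7 -9 -4 -7\n",
]
matrix = [parse_fixed(r) for r in rows]
print(matrix[1][:5])
```

Because it slices by column, this also copes with a positive value fused to its neighbor, provided every field really is exactly 3 characters wide. A line shorter than expected raises ValueError, which at least makes a malformed row visible.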

score 0 · Accepted Answer

You can use the re module to find all substrings that match a particular regular expression pattern:

>>> x = "-7-10 -9-10 12 -9-10 -9 -9 -8 -8 -9 -8 -9 -9 -4 -7 -9 -9 -7"
>>> import re
>>> re.findall( r"-?[0-9]+", x )
['-7', '-10', '-9', '-10', '12', '-9', '-10', '-9', '-9', '-8', '-8', '-9', '-8', '-9', '-9', '-4', '-7', '-9', '-9', '-7']

Of course, if 7, 123 might end up formatted as 7123, then your only option is to split the string by index rather than by content pattern.
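
A sketch of how the pattern might be used while looping over a file: read_matrix is a hypothetical name, and io.StringIO stands in for the real file object here. The row number and letter are split off first so that the pattern does not pick up the row number as a score.

```python
import io
import re

INT_PATTERN = re.compile(r"-?[0-9]+")


def read_matrix(fp):
    """Yield (index, letter, scores) for each non-blank line of a file-like object."""
    for line in fp:
        if not line.strip():
            continue  # skip blank lines
        # Separate the two leading columns from the score fields.
        index, letter, rest = line.split(None, 2)
        yield int(index), letter, [int(tok) for tok in INT_PATTERN.findall(rest)]


# Stand-in for open("matrix.txt"); the file name is an assumption.
sample = io.StringIO(
    "709 V   -7-10 -9-10 12 -9-10 -9 -9 -8 -8 -9 -8 -9 -9 -5 -7 -9 -4 -7\n"
    "712 C   -7-10 -9-10 12 -9-10 -9 -9 -8 -8 -9 -8 -9 -9 -4 -7 -9 -9 -7\n"
)
parsed = list(read_matrix(sample))
print(len(parsed), parsed[0][2][:4])
```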

score 0 · Accepted Answer

Since the data you are dealing with is in a fixed format, it is better to parse it with the struct module rather than with split:

>>> from struct import *
>>> line="712 C   -7-10 -9-10 12 -9-10 -9 -9 -8 -8 -9 -8 -9 -9 -4 -7 -9 -9 -7"
>>> format_string = "<3s2s2s"+"3s"*20
>>> unpack(format_string,line)
('712', ' C', '  ', ' -7', '-10', ' -9', '-10', ' 12', ' -9', '-10', ' -9', ' -9', ' -8', ' -8', ' -9', ' -8', ' -9', ' -9', ' -4', ' -7', ' -9', ' -9', ' -7')
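
The session above passes a str to unpack, which works in Python 2. In Python 3, unpack expects a bytes-like object, so a variant along these lines may be needed. It is a sketch that assumes the line is plain ASCII and exactly 67 characters long once the trailing newline is removed.

```python
from struct import calcsize, unpack

line = "712 C   -7-10 -9-10 12 -9-10 -9 -9 -8 -8 -9 -8 -9 -9 -4 -7 -9 -9 -7\n"
format_string = "<3s2s2s" + "3s" * 20  # 3 + 2 + 2 + 20 * 3 = 67 bytes

# unpack() needs exactly calcsize(format_string) bytes, so drop the newline first.
data = line.rstrip("\n").encode("ascii")
assert len(data) == calcsize(format_string)

fields = unpack(format_string, data)
# The first three fields are the row number, the letter, and padding.
scores = [int(f.decode("ascii")) for f in fields[3:]]
print(scores[:5])
```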
