python - Parsing data by event in Python

Question

I am trying to use Python to parse data from a text file formatted as follows:

<event>  
A   0.8
B    0.4  0.3 -0.5  0.3
</event>
<event>  
A   0.2
B    0.3  0.2 -0.5  0.8
C    0.1  0.3 -0.3  0.2
C   -0.2  0.4 -0.1  0.9
</event>
<event>  
A   0.4
B    0.4  0.3 -0.5  0.3
C    0.3  0.7  0.6  0.5
</event>

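Variables A and B are always present in every event, but as you can see, the C variable can appear up to twice in an event and sometimes does not appear at all. There are roughly 10,000+ events in total.

I want to format all of this so that I can access each piece of data individually (e.g. column 2 of variable B in event 3) as well as in groups (e.g. plot column 0 of variable A across all events), but the repeated C variable is throwing me off. Ideally, I would like one column of data for C variable #1 and one for C variable #2, where the data can simply be 0 when an event has only one or zero C variables.

My code is currently far from elegant, and the output is not in the format it needs to be, so I would welcome suggestions on how to simplify and improve it.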

M = 10000        # number of events 
file = open('data.txt')
a_lines = open('a.txt','w')
b_lines = open('b.txt','w')
c1_lines = open('c1.txt','w')
c2_lines = open('c2.txt','w') 
c1 = []
c2 = []

for i in range(M):
    for line in file: 
        if not line.strip():
            continue  
        if line.startswith("</event>"):
            break 
        elif line.startswith("<event>"):
            a = file.next()
            print >>a_lines,i,a

for i in range(M):
    for line in file: 
        if line.startswith("B"):
            print >>b_lines,i,line.strip()
            nextline=file.next().strip()
            c1.append(nextline)
            nextline2=file.next().strip()
            c2.append(nextline2)
            break

# Parsing the duplicate C columns...
# I've formatted it so the 0 is aligned with the other data

for i in range(M):
    if "C" in c1[i]:
        print >>c1_lines, i, c1[i]
    else: 
        print >>c1_lines, i, "C    0" 


for i in range(M):
    if "C" in c2[i]:
        print >>c2_lines, i, c2[i]
    else: 
        print >>c2_lines, i, "C    0"

#  Sample variable formatting attempt: 

b_event_num,b_0,b_1,b_2,b_3=loadtxt("b.txt",usecols=(0,1,2,3,4),unpack=True)
b_0=array(b_0)
b_1=array(b_1)
b_2=array(b_2)
b_3=array(b_3)
b_0=b_0.reshape((len(b_0)),1)
b_1=b_1.reshape((len(b_1)),1)
b_2=b_2.reshape((len(b_2)),1)
b_3=b_3.reshape((len(b_3)),1)
b_points=np.hstack((b_0,b_1,b_2,b_3))

The extracted data itself looks fine, but when I try to load it by column I get the following error, and I don't know why:

vals = [vals[i] for i in usecols]
IndexError: list index out of range

Any help would be greatly appreciated. Thanks!

score 0 · Accepted Answer

The IndexError is coming from trying to access vals[0] when vals = []. If you expand your code the error might make more sense:

vals = []
for i in usecols:
    vals[i] = i

The error happens on the first iteration of the loop because vals[0] isn't in the list. I would suggest a fix, but I'm not sure what you're trying to do. If you just want vals to be the list [0,1,2,3,4], you can just use the following (on Python 3, wrap it in list() to get an actual list):

vals = range(5)
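
Another possible cause, offered as an assumption since the question does not show the full traceback: the line `vals = [vals[i] for i in usecols]` reads existing entries rather than assigning them, so it would raise the same IndexError whenever a row has fewer fields than the largest index in usecols. A plain-Python sketch of that situation follows; the sample rows and the `pick_columns` helper are made up for illustration and are not part of any library.

```python
usecols = (0, 1, 2, 3, 4)

def pick_columns(line, usecols):
    """Split a line on whitespace and keep only the requested fields."""
    vals = line.split()
    return [vals[i] for i in usecols]

full_row = "0 0.4 0.3 -0.5 0.3"   # five fields: every index in usecols exists
short_row = "0 C    0"            # three fields: indices 3 and 4 are missing

print(pick_columns(full_row, usecols))

try:
    pick_columns(short_row, usecols)
except IndexError as err:
    print("IndexError:", err)
```

If this is what is happening, checking that every line written to b.txt has the same number of whitespace-separated fields should confirm it. Note also that the question's code writes each line to b.txt as the event number followed by the letter B, so column 1 of that file is not numeric.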

Edit: On a side note, I don't think saving the data to separate files is necessary. It would be a lot better to store it directly in lists, like:

M = 10000        # number of events 
file = open('data.txt')
a = []
b = []
c1 = []
c2 = []

def parseLine(line, section):
    line = line.split()
    line = line[1:]  # To take out the letter at the start
    section.append(line)

next(file)    # Skip the first <event> line
for i in range(M):
    parseLine(next(file), a)
    parseLine(next(file), b)
    nextLine = next(file)
    if nextLine.startswith("C"):
        parseLine(nextLine, c1)
        nextLine = next(file)
        if nextLine.startswith("C"):
            parseLine(nextLine, c2)
            next(file)    # To get to the end of the event
        else:
            c2.append([0])
    else:
        c1.append([0])
        c2.append([0])
    next(file, None)    # Skip the next <event> line; None avoids StopIteration after the last event

Be careful, though: the lists are zero-indexed, so to get the 2nd column of the 8th event for b you would use b[7][1]. In general it's b[event-1][column-1].
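
The same idea can be written without a fixed event count by reading until the input ends. The following is a minimal sketch in Python 3, not the code from the answer above: `parse_events` and the inline `sample` text are hypothetical names, values are converted to float, and a missing C line is padded with four zeros (that width is an assumption based on the sample data in the question).

```python
import io

def parse_events(lines, c_width=4):
    """Collect A, B and up to two C rows per <event> block as lists of floats."""
    a, b, c1, c2 = [], [], [], []
    c_rows = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue                      # skip blank lines
        tag = fields[0]
        if tag == "<event>":
            c_rows = []                   # start collecting C rows for a new event
        elif tag == "A":
            a.append([float(x) for x in fields[1:]])
        elif tag == "B":
            b.append([float(x) for x in fields[1:]])
        elif tag == "C":
            c_rows.append([float(x) for x in fields[1:]])
        elif tag == "</event>":
            # Keep at most two C rows and pad with zero rows up to two
            found = c_rows[:2]
            while len(found) < 2:
                found.append([0.0] * c_width)
            c1.append(found[0])
            c2.append(found[1])
    return a, b, c1, c2

sample = """<event>
A   0.8
B    0.4  0.3 -0.5  0.3
</event>
<event>
A   0.2
B    0.3  0.2 -0.5  0.8
C    0.1  0.3 -0.3  0.2
C   -0.2  0.4 -0.1  0.9
</event>
<event>
A   0.4
B    0.4  0.3 -0.5  0.3
C    0.3  0.7  0.6  0.5
</event>
"""

a, b, c1, c2 = parse_events(io.StringIO(sample))

print(b[2][1])                 # event 3, variable B, column index 1 (zero-based)
print([row[0] for row in a])   # variable A, column 0, across all events
print(c2)                      # second C row per event, zero-padded
```

For a real file, an open file object can be passed in place of `io.StringIO(sample)`. Because every event ends up with rows of the same width, lists such as b or c1 can then be handed to numpy.array if array slicing is preferred for plotting.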
