I currently have a Python program that reads a text file. For a couple of reasons the text loses its formatting while it is held in memory, but the program keeps the line and column of each piece of text. I would like to use this line and column information to reproduce the file as it was originally read. It is OK if the columns do not match the original in the exact number of spaces or tabs, as long as the result is consistent throughout the new file.

One naive solution that occurred to me was to keep a pointer to line 1, column 1 and emit \n and spaces according to the line and column information, but I was wondering whether there is a better way to do that in Python (in fact, I don't know how to keep such a pointer to the first line and column either).

A possible solution would be some method that takes four parameters: a string, the line, the column, and the file. I am unsure what would happen in that case if (line, column) is already occupied (this would never occur in my situation, so it is not really a concern).

Edit: The information is stored in a complicated 'structure', but it suffices to say that I can extract it as a list of strings, where each string has an associated line and column. I would then use this 'method' to take each string with its line and column and add it to the file at the right position.

Edit 2: The only assumption is that the words are extracted in exactly the same order as they appear in the original file. That is to say, if the original file is "The cat jumped \n but did not die", then I expect to be given the strings 'The', 'cat', 'jumped', 'but', 'did', 'not', 'die' together with their associated lines and columns. In that case 'but', 'did', 'not' and 'die' will have line 2 instead of 1, and every word has its associated column (columns may or may not coincide between words, since they can be on different lines).

Thank you.

2 Answers

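Not sure whether this works, and I'm sure it needs some work. I've used the cat example to simulate the supporting data and then put it back together as text. There is no error checking, but I think this is the basic principle: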

import re

# Simulate the stored data: for each line keep its length and a
# mapping of (start, end) column spans to the words found there.
test = "The cat jumped \n but did not die"
lines = test.splitlines()
line_ref = []
for line in lines:
    words = re.finditer(r'\S+', line)
    line_ref.append((len(line), {m.span(): m.group() for m in words}))

# Rebuild each line by writing every word into a buffer of spaces.
output = []
for length, spans in line_ref:
    # default=0 keeps lines without any words from raising an error
    last = max((end for _, end in spans), default=0)
    textlist = [' '] * max(last, length)
    for (start, end), word in spans.items():
        textlist[start:end] = word
    output.append(''.join(textlist))

print('\n'.join(output))
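
The same buffer idea can be adapted to the input described in the question. The following is only a sketch: the function name `render_chunks` and the `(line, column, text)` tuple layout with 1-based numbering are assumptions for illustration, not something the question specifies.

```python
from collections import defaultdict

def render_chunks(chunks):
    """Rebuild text from (line, column, text) tuples (1-based, assumed layout)."""
    rows = defaultdict(list)  # line number -> list of characters
    for line, column, text in chunks:
        row = rows[line]
        start = column - 1
        end = start + len(text)
        # Grow the row with spaces until the chunk fits.
        if len(row) < end:
            row.extend(' ' * (end - len(row)))
        # Overwrite the slice with the chunk's characters.
        row[start:end] = text
    last_line = max(rows, default=0)
    # Lines that received no chunks come out empty.
    return '\n'.join(''.join(rows.get(n, [])) for n in range(1, last_line + 1))

chunks = [(1, 1, 'The'), (1, 5, 'cat'), (1, 9, 'jumped'),
          (2, 2, 'but'), (2, 6, 'did'), (2, 10, 'not'), (2, 14, 'die')]
print(render_chunks(chunks))
```

If two chunks claimed the same position, the later one would simply overwrite the earlier one, which should be acceptable given that the question rules that case out.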
answered 2012-07-22T23:34:51.113

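You need to sort the rows in memory by line number (y). Then, for i in the range (1..N), where N is the number of lines per page in the original file, you would: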

- if there are rows with that y:
    - sort all rows with that y in that page using their x
    - start with j = 0, and for each text chunk:
       - write (x - j) spaces
       - write the chunk
       - set j equal to x plus the length of the chunk
- output a carriage return and continue
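
The steps above can be sketched in Python roughly as follows. This is a minimal sketch for a single page, assuming each chunk is a `(y, x, text)` tuple with a 1-based line number `y` and a 0-based column `x`; the name `rebuild` and that tuple layout are illustrative assumptions.

```python
from itertools import groupby
from operator import itemgetter

def rebuild(chunks):
    """Rebuild text from (y, x, text) tuples: y is a 1-based line, x a 0-based column."""
    # Sort by line first, then by column within each line.
    ordered = sorted(chunks, key=itemgetter(0, 1))
    out = []
    current = 1  # next line number to be written
    for y, group in groupby(ordered, key=itemgetter(0)):
        # Emit empty lines for any line numbers that have no chunks.
        out.extend([''] * (y - current))
        parts = []
        j = 0  # column reached so far on this line
        for _, x, text in group:
            gap = max(x - j, 0)  # never pad by a negative amount
            parts.append(' ' * gap)
            parts.append(text)
            j += gap + len(text)
        out.append(''.join(parts))
        current = y + 1
    return '\n'.join(out)

print(rebuild([(2, 1, 'but'), (1, 0, 'The'), (1, 4, 'cat')]))
```

Sorting up front means the chunks do not have to arrive in file order, although the question says they do.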

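This will rebuild an acceptable version of the text. A slight modification using modulo 8 would even let you replace some of those (x - j) spaces with tabs.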
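
One way the modulo-8 tab variant might look is sketched below. It assumes tab stops every 8 columns and 0-based columns; the helper name `pad` is hypothetical.

```python
TAB = 8  # assumed tab stop width

def pad(j, x, tab=TAB):
    """Return whitespace that moves the cursor from column j to column x (0-based)."""
    out = []
    while j < x:
        # A tab jumps to the next multiple of `tab`.
        next_stop = (j // tab + 1) * tab
        if next_stop <= x:
            # The tab does not overshoot the target column, so use it.
            out.append('\t')
            j = next_stop
        else:
            # Fill the remainder with plain spaces.
            out.append(' ' * (x - j))
            j = x
    return ''.join(out)

print(repr(pad(0, 10)))
```

The output only lines up when the file is viewed with the same tab width, so plain spaces remain the safer choice if the viewer is unknown.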

answered 2012-07-22T22:43:19.740