python - Removing multiple EOLs in a file

Question

I have a tab-delimited file with \n EOL characters that looks something like this:

User Name\tCode\tTrack\tColor\tNote\n\nUser Name2\tCode2\tTrack2\tColor2\tNote2\n

I'm taking this input file and reformatting it into a nested list using split('\t'). The list should look like this:

[['User Name','Code','Track','Color','Note'],
 ['User Name2','Code2','Track2','Color2','Note2']]

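The software that generates the file allows the user to press "enter" any number of times while filling in the "Note" field. It also allows the user to press "enter" to create any number of newlines without typing any visible text in the "Note" field.

Finally, the user may press "enter" any number of times in the middle of the "Note" to create multiple paragraphs, but this is rare from an operational standpoint, and I'm willing to leave it unaddressed if handling it complicates the code much. This possibility is really very low priority.

As the sample above shows, these actions can produce a run of "\n\n..." codes of any length preceding, trailing, or replacing the "Note" field. Put another way, the following replacements are required before I can place the file contents into a list: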

\t\n\n... preceding "Note" must become \t
\n\n... trailing "note" must become \n
\n\n... in place of "note" must become \n
\n\n... in the middle of the text note must become a single whitespace, if easy to do

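I have tried using the strip() and replace() methods without success. Does the file object need to be copied into something else first before using the replace() method?

I have experience with Awk, but I'm hoping regular expressions aren't needed for this, as I'm very new to Python. This is the code I need to improve so that it copes with multiple newlines: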

marker = [i.strip() for i in open('SomeFile.txt', 'r')]

marker_array = []
for i in marker:
    marker_array.append(i.split('\t'))

for i in marker_array:
    print(i)

score 4 · Accepted Answer

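Count the tabs; if you assume that the note field never contains 4 tabs on a single line, you can keep collecting the note until you find a line that does contain 4 tabs: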

def collapse_newlines(s):
    # Collapse multiple consecutive newlines into one; removes trailing newlines
    return '\n'.join(filter(None, s.split('\n')))

def read_tabbed_file(filename):
    with open(filename) as f:
        row = None
        for line in f:
            if line.count('\t') < 4:   # Note continuation
                if row is not None:    # ignore stray lines before the first record
                    row[-1] += line
                continue

            if row is not None:
                row[-1] = collapse_newlines(row[-1])
                yield row

            row = line.split('\t')

        if row is not None:
            row[-1] = collapse_newlines(row[-1])
            yield row

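The generator function above does not yield a row until it has determined that the next line is not a note continuation, so it is effectively looking ahead.

Now use the read_tabbed_file() function as a generator and loop over the results: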

for row in read_tabbed_file(yourfilename):
    # row is a list of elements
    print(row)

Demo:

>>> open('/tmp/test.csv', 'w').write('User Name\tCode\tTrack\tColor\tNote\n\nUser Name2\tCode2\tTrack2\tColor2\tNote2\n')
>>> for row in read_tabbed_file('/tmp/test.csv'):
...     print(row)
... 
['User Name', 'Code', 'Track', 'Color', 'Note']
['User Name2', 'Code2', 'Track2', 'Color2', 'Note2']
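
The question also asked that newlines in the middle of a note become a single space, whereas collapse_newlines() above keeps one newline between paragraphs. The following is only a rough sketch of that variant, under the same assumption that a record line always contains at least 4 tabs. The name merge_note_lines and its n_tabs parameter are made up for illustration and are not part of the answer above.

```python
def merge_note_lines(lines, n_tabs=4):
    """Group physical lines into records.

    A line with at least n_tabs tabs starts a new record; any other
    non-blank line is treated as a continuation of the previous Note
    field and is appended to it with a single space.
    """
    rows = []
    for line in lines:
        text = line.rstrip('\r\n')
        if text.count('\t') >= n_tabs:
            rows.append(text.split('\t'))
        elif rows and text.strip():
            # Continuation of the previous record's Note field;
            # the outer strip() avoids a leading space when Note was empty.
            rows[-1][-1] = (rows[-1][-1] + ' ' + text.strip()).strip()
        # Blank lines, and stray text before the first record, are ignored.
    return rows


sample = [
    'User Name\tCode\tTrack\tColor\t\n',
    '\n',
    'First paragraph\n',
    '\n',
    'Second paragraph\n',
    '\n',
    'User Name2\tCode2\tTrack2\tColor2\tNote2\n',
]
for row in merge_note_lines(sample):
    print(row)
```

Because it accepts any iterable of lines, an open file object can be passed in directly in place of the sample list.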

score 1 · Accepted Answer

The first issue you're running into is `in` - it tries to be helpful and reads one line of text from the file at a time.

>>> [i for i in open('SomeFile.txt', 'r') ]
['User Name\tCode\tTrack\tColor\tNote\n', '\n', 'User Name2\tCode2\tTrack2\tColor2\tNote2\n', '\n']

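Adding the call to .strip() does strip the whitespace from each line, but that leaves you with empty strings - it doesn't take those empty elements out of the list.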

>>> [i.strip() for i in open('SomeFile.txt', 'r') ]
['User Name\tCode\tTrack\tColor\tNote', '', 'User Name2\tCode2\tTrack2\tColor2\tNote2', '']

However, you can add an if clause to the list comprehension so that it drops the lines that contain only a newline:

>>> [i.strip() for i in open('SomeFile.txt', 'r') if len(i) >1 ]
['User Name\tCode\tTrack\tColor\tNote', 'User Name2\tCode2\tTrack2\tColor2\tNote2']
>>>
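
To get all the way to the nested list the question asks for, the same filter can be combined with split('\t'). This is a small sketch rather than a full solution: the helper name rows_from_lines is invented here, and it only drops blank lines, so a note that spans several lines would still end up as separate short rows.

```python
def rows_from_lines(lines):
    """Return one list of fields per non-blank line."""
    return [
        line.rstrip('\r\n').split('\t')  # drop only the EOL, keep empty trailing fields
        for line in lines
        if line.strip()                  # skip lines that hold nothing but whitespace
    ]


sample = [
    'User Name\tCode\tTrack\tColor\tNote\n',
    '\n',
    'User Name2\tCode2\tTrack2\tColor2\tNote2\n',
    '\n',
]
for row in rows_from_lines(sample):
    print(row)
```

Using rstrip('\r\n') instead of strip() on the kept lines is a deliberate choice: strip() would also remove a trailing tab, which silently drops an empty Note field from the row.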

score 0 · Accepted Answer

I think the csv module will help you.

For example, see: Parsing CSV / tab-delimited txt file with Python.
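
As a rough illustration of that suggestion, and assuming Python 3, csv.reader can be given delimiter='\t'; it returns an empty list for a blank line, so those rows are easy to filter out. The function name read_rows is hypothetical, and io.StringIO stands in for a real file here. Note that this does not merge a note spread over several lines unless the field was quoted by the program that wrote the file.

```python
import csv
import io


def read_rows(fileobj):
    """Parse tab-delimited text and drop the empty rows produced by blank lines."""
    reader = csv.reader(fileobj, delimiter='\t')
    return [row for row in reader if row]


# io.StringIO stands in for a file opened with open(path, newline='')
data = io.StringIO(
    'User Name\tCode\tTrack\tColor\tNote\n'
    '\n'
    'User Name2\tCode2\tTrack2\tColor2\tNote2\n'
)
for row in read_rows(data):
    print(row)
```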
