python - Splitting a file into lines in Python using re.split

Question

I am trying to split a file with a list comprehension, using code similar to this:

lines = [x for x in re.split(r"\n+", file.read()) if not re.match(r"com", x)]

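However, the lines list always has an empty string as its last element. Does anyone know a way to avoid this (other than the kludge of calling pop() afterwards)?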

score 9 · Accepted Answer

Put the regular expression hammer away :-)

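You can iterate over a file directly; readlines() is almost obsolete these days.
Read about str.strip() (and its friends, lstrip() and rstrip()).
Don't use file as a variable name. It's bad form, because file is a built-in name (a built-in type in Python 2).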
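
As a small illustration of the second point, here is a sketch of what the three strip methods do to a line as it comes out of a file. The sample string and variable names are made up for the example.

```python
# Hypothetical sample line, as it would come from iterating over a file.
raw = "  com one \n"

stripped = raw.strip()            # removes whitespace on both sides
left_stripped = raw.lstrip()      # removes leading whitespace only
right_stripped = raw.rstrip()     # removes trailing whitespace, including "\n"
newline_only = raw.rstrip("\n")   # removes only the trailing newline

print(repr(stripped))
print(repr(left_stripped))
print(repr(right_stripped))
print(repr(newline_only))
```

Note that raw.startswith("com") is False here because of the leading spaces, so whether you test before or after stripping matters.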

You can write your code as:

lines = []
f = open(filename)
for line in f:
    if not line.startswith('com'):
        lines.append(line.strip())

If you still get blank lines in there, you can add a test:

lines = []
f = open(filename)
for line in f:
    if line.strip() and not line.startswith('com'):
        lines.append(line.strip())

If you really want it on one line:

lines = [line.strip() for line in open(filename) if line.strip() and not line.startswith('com')]

Finally, if you are using Python 2.6, look at the with statement to improve things a little more.
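
A minimal sketch of the with-statement version, assuming the same filtering rules as above. The function name read_lines and the temporary sample file are invented for the demonstration.

```python
import os
import tempfile


def read_lines(filename):
    """Return stripped, non-empty lines that do not start with 'com'."""
    with open(filename) as f:  # the file is closed when the block exits
        return [line.strip() for line in f
                if line.strip() and not line.startswith("com")]


# Hypothetical sample data written to a temporary file for the demo.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("alpha\ncomment\n\nbeta\n")
    path = tmp.name

result = read_lines(path)
os.remove(path)
print(result)
```

The benefit over a bare open() is that the file handle is closed even if an exception is raised while reading.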

score 3 · Accepted Answer

lines = file.readlines()

Edit: or if you don't want blank lines in there, you can do

lines = filter(lambda a:(a!='\n'), file.readlines())

Edit^2: to remove the trailing newlines, you can do

lines = [re.sub('\n','',line) for line in filter(lambda a:(a!='\n'), file.readlines())]
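
For comparison, the same result can probably be had without re or filter. This is only a sketch: io.StringIO stands in for an open file, and the sample text is made up. It also sidesteps the fact that filter() returns an iterator rather than a list in Python 3.

```python
import io

# io.StringIO stands in for an open text file (hypothetical sample data).
f = io.StringIO("one\n\ntwo\n")

# Skip lines that are only a newline, and drop the trailing newline from the rest.
lines = [line.rstrip("\n") for line in f if line != "\n"]
print(lines)
```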

score 1 · Accepted Answer

Another handy trick, especially when you need the line number, is to use enumerate:


fp = open("myfile.txt", "r")
for n, line in enumerate(fp.readlines()):
    dosomethingwith(n, line)

I only discovered enumerate recently, but it has come in handy quite a few times since then.
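
A variation on the same idea, as a sketch: enumerate accepts any iterable, so the file object can be passed directly without readlines(), and the start argument gives 1-based line numbers. io.StringIO and the sample text are stand-ins for a real file.

```python
import io

fp = io.StringIO("first\nsecond\n")  # stands in for open("myfile.txt")

numbered = []
for n, line in enumerate(fp, start=1):  # count from 1 instead of 0
    numbered.append((n, line.rstrip("\n")))
print(numbered)
```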

score 0 · Accepted Answer

This should work, and it gets rid of the regular expression as well:

all_lines = (line.rstrip()
             for line in open(filename)
             if "com" not in line)
# filter out the empty lines
lines = filter(lambda x : x, all_lines)

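Since you are using a list comprehension rather than a generator expression (so the whole file gets loaded into memory anyway), here is a shortcut that avoids the trailing empty string without a separate filtering step. Note that blank lines inside the file are still kept: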

lines = [line
     for line in open(filename).read().splitlines()
     if "com" not in line]
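
To see where the trailing empty string in the question comes from, here is a small comparison on an in-memory string. The sample text and variable names are assumptions for illustration only.

```python
import re

# Hypothetical file contents ending with a newline, as most text files do.
text = "a\n\nb\ncom c\n"

# re.split treats the final "\n" as a separator, so an empty string follows it.
via_re = re.split(r"\n+", text)

# str.splitlines treats "\n" as a line terminator, so nothing follows the last one,
# but the blank line in the middle is preserved.
via_splitlines = text.splitlines()

# Dropping blank lines and lines containing "com" in a single pass.
cleaned = [line for line in via_splitlines if line and "com" not in line]

print(via_re)
print(via_splitlines)
print(cleaned)
```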
