4

Suppose I have this file:

1
17:02,111
Problem report related to
router

2
17:05,223
Restarting the systems

3
18:02,444
Must erase hard disk
now due to compromised data

I want this output:

1
17:02,111
Problem report related to router

2
17:05,223
Restarting the systems

3
18:02,444
Must erase hard disk now due to compromised data

I have been trying with bash and found a solution that comes close, but I don't know how to do this in Python.

Thanks in advance.

3 Answers

4

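If you want to remove the extra lines:

To do that, you can check two conditions for each line: either the line is not followed by an empty line, or the line before it matches the following regex: ^\d{2}:\d{2},\d{3}\s$

So, to get access to the next line in each iteration, you can create a second file object named temp from the main file object using itertools.tee and apply the next function to it. Use re.match to match the regex.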

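from itertools import tee
import re

with open('ex.txt') as f, open('new.txt', 'w') as out:
    # temp is a second iterator over the same file, kept one line ahead of f.
    temp, f = tee(f)
    next(temp, None)
    pre = ''  # the previous line; empty before the first iteration
    try:
        for line in f:
            # Keep the line if the next line is not blank,
            # or if the previous line is a timestamp.
            if next(temp) != '\n' or re.match(r'^\d{2}:\d{2},\d{3}\s$', pre):
                out.write(line)
            pre = line
    except StopIteration:
        # temp runs out one line before f, so the last line is never written.
        pass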

Result:

1
17:02,111
Problem report related to

2
17:05,223
Restarting the systems

3
18:02,444
Must erase hard disk
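
The look-ahead pairing used above can be tried on an in-memory list instead of a file. This is only a minimal sketch, not the answer's code: the helper name with_next and the sample lines are made up for illustration, and unlike the loop above it also yields the final line (paired with None).

```python
from itertools import tee, zip_longest

def with_next(iterable):
    """Yield (item, following_item) pairs; the last item is paired with None."""
    current, ahead = tee(iterable)
    next(ahead, None)  # advance the second iterator by one position
    return zip_longest(current, ahead)

# Hypothetical sample lines, shaped like the lines of ex.txt.
lines = ["Problem report related to\n", "router\n", "\n", "2\n"]
pairs = list(with_next(lines))
for line, following in pairs:
    print(repr(line), "is followed by", repr(following))
```

Pairing the last item with None avoids the StopIteration that the file-based loop relies on to stop.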

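If you want to join the rest onto the third line:

If you want to join the lines after the third line onto the third line, you can use the following regex to find all blocks that are followed by \n\n or by the end of the file ($):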

r"(.*?)(?=\n\n|$)"

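Then split each block on the line with the time format and write the parts to the output file, but note that you need to replace the newlines in the third part with spaces: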

ex.txt:

1
17:02,111
Problem report related to
router
another line


2
17:05,223
Restarting the systems

3
18:02,444
Must erase hard disk
now due to compromised data
line 5
line 6
line 7

Demo:

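import re

def splitter(s):
    # Yield each non-empty block that is followed by a blank line or the end of the text.
    for x in re.finditer(r"(.*?)(?=\n\n|$)", s, re.DOTALL):
        # Strip surrounding newlines so stray blank lines never form a block of their own.
        g = x.group(0).strip()
        if g:
            yield g

with open('ex.txt') as f, open('new.txt', 'w') as out:
    for block in splitter(f.read()):
        # The capturing group makes re.split keep the timestamp line as the second part.
        first, second, third = re.split(r'(\d{2}:\d{2},\d{3}\n)', block)
        out.write(first + second + third.replace('\n', ' ') + '\n')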

Result:

1
17:02,111
Problem report related to router another line
2
17:05,223
Restarting the systems
3
18:02,444
Must erase hard disk now due to compromised data line 5 line 6 line 7
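
To see why the block is unpacked into exactly three parts, here is a small sketch of re.split with a capturing group applied to one block. The sample string is the first record from the question; the variable names are illustrative only.

```python
import re

block = "1\n17:02,111\nProblem report related to\nrouter"

# The parentheses make re.split keep the matched timestamp line in the result.
parts = re.split(r"(\d{2}:\d{2},\d{3}\n)", block)
first, second, third = parts

# Only the text after the timestamp has its newlines replaced with spaces.
joined = first + second + third.replace("\n", " ")
print(parts)
print(joined)
```

The unpacking into three names assumes the block contains exactly one timestamp line; a block with none or several would raise a ValueError.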

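Note

In this answer the splitter function returns a generator, which is efficient when you are dealing with large files because it avoids storing unneeded blocks in memory (although f.read() still loads the whole file as one string).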
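
If memory is the concern, one possible variation is to group the lines as they are read instead of calling f.read(). The sketch below swaps in itertools.groupby for the regex, so it is a different technique from the answer's; the name join_records is hypothetical, and it assumes the first two lines of every record are the number and the timestamp.

```python
from itertools import groupby

def join_records(lines):
    """Yield one joined record at a time from an iterable of lines."""
    # groupby splits the stream into runs of blank and non-blank lines.
    for is_blank, group in groupby(lines, key=lambda line: not line.strip()):
        if is_blank:
            continue
        record = [line.rstrip("\n") for line in group]
        # Assumption: the first two lines are the number and the timestamp.
        head, tail = record[:2], record[2:]
        yield "\n".join(head + ([" ".join(tail)] if tail else []))

# Hypothetical in-memory sample; an open file object can be passed the same way.
sample = ["1\n", "17:02,111\n", "Problem report related to\n", "router\n",
          "\n", "2\n", "17:05,223\n", "Restarting the systems\n"]
result = list(join_records(sample))
print("\n\n".join(result))
```

Because an open file is itself an iterable of lines, passing it directly to join_records would process one record at a time without holding the whole file in memory.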

answered 2015-06-19T09:39:03.097
3

This works if and only if the file matches the sample you gave.

Note:

There may be a faster way if regex is used, and it might also be simpler, but I wanted to do this in a logical way.

Code:

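# Read the whole file and split it into lines.
with open("output.txt", "r") as inp:
    lines = inp.read().split("\n")

tempString = ""  # text lines collected for the current record
output = []

for s in lines:
    if s:
        if any(c.isalpha() for c in s):
            # A line containing letters is message text: add it to the pending text.
            tempString = tempString + " " + s
        else:
            # A number or timestamp line: flush pending text, then keep the line as-is.
            if tempString:
                output.append(tempString.strip())
                tempString = ""
            output.append(s)
    else:
        # A blank line ends the record.
        if tempString:
            output.append(tempString.strip())
            tempString = ""
        output.append("")
# Flush text left over when the file does not end with a blank line.
if tempString:
    output.append(tempString.strip())

print("\n".join(output))
with open("newoutput.txt", "w") as out:
    out.write("\n".join(output))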

Input:

1
17:02,111
Problem report related to
2 router

2
17:05,223
Restarting the systems

3
18:02,444
Must erase hard disk
now due to compromised data

4
17:02,111
Problem report related to
router

Output:

1
17:02,111
Problem report related to 2 router

2
17:05,223
Restarting the systems

3
18:02,444
Must erase hard disk now due to compromised data

4
17:02,111
Problem report related to router

answered 2015-06-19T10:58:02.217
1
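import re

x="""1
17:02,111
Problem report related to
router

2
17:05,223
Restarting the systems

3
18:02,444
Must erase hard disk
now due to compromised data
or something"""
def repl(matchobj):
    # Keep the number and timestamp lines; join all remaining lines with spaces.
    ll=matchobj.group().split("\n")
    return "\n".join(ll[:2]+[" ".join(ll[2:])])
# Each match is one record: number, timestamp, then text up to a blank line or the end.
print(re.sub(r"\b\d+\n\d+:\d+,\d+\b[\s\S]*?(?=\n{2}|$)",repl,x))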

You can use re.sub with your own custom replacement function.
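
As a smaller illustration of passing a function to re.sub, the sketch below replaces each match with whatever the function returns. The function name shout and the sample text are made up for this example.

```python
import re

def shout(matchobj):
    # re.sub calls this once per match; the return value replaces the matched text.
    return matchobj.group().upper()

text = "restart router now"
result = re.sub(r"\brouter\b", shout, text)
print(result)
```

The function receives a match object, so it can inspect groups or split the matched text before deciding on the replacement.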

answered 2015-06-19T10:04:11.960