Here is an example of the input file:
<html xml:lang="en" lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
</head>
<body>
HERE IS A LOT OF TEXT, THAT IS NOT INTERESTING
<br>
<div id="text"><div id="text-interesting1">11/222-AA</div>
<h2>This is the title</h2>
<P>Here is some multiline desc-<br>
cription about what is <br><br>
going on here
</div>
<div id="text2"><div id="text-interesting2">IV-VI</div>
<br>
<h1> Some really interesting text</h1>
</body>
</html>
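Now I would like to grep several blocks from this file: the text between <div id="text-interesting1"> and </div>, then between <h2> and </h2>, then between <P> and </div>, then between <div id="text-interesting2"> and </div>, and more. The point is that I want to retrieve multiple values.
I want to write these values to a file, for example comma-separated. How can I do that?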
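To make the target format concrete, here is a minimal sketch of the kind of output line I am after. The values in `row` and the helper name `to_csv_line` are hypothetical, made up for illustration; the sketch only assumes Python's standard `csv` module.

```python
import csv
import io

# Hypothetical values standing in for the blocks extracted from one input file.
row = ["11/222-AA", "This is the title", "Here is some multiline description", "IV-VI"]

def to_csv_line(values):
    """Return the values as one comma-separated line, quoting fields where needed."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(values)
    return buffer.getvalue()

line = to_csv_line(row)
print(line, end="")
```

Going through `csv.writer` rather than joining with `","` means a value that itself contains a comma is quoted instead of splitting into two columns.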
Based on the example Luke provided, I wrote the following:
import os, re
path = 'C:/Temp/Folder1/allTexts'
listing = os.listdir(path)
for infile in listing:
    text = open(path + '/' + infile).read()
    match = re.search('<div id="text-interesting1">', text)
    if match is None:
        continue
    start = match.end()
    end = re.search('</div>', text).start()
    print(text[start:end])
    match = re.search('<h2>', text)
    if match is None:
        continue
    start = match.end()
    end = re.search('</h2>', text).start()
    print(text[start:end])
    match = re.search('<P>', text)
    if match is None:
        continue
    start = match.end()
    end = re.search('</div>', text).start()
    print(text[start:end])
    match = re.search('<div id="text-interesting2">', text)
    if match is None:
        continue
    start = match.end()
    end = re.search('</div>', text).start()
    print(text[start:end])
    match = re.search('<h1>', text)
    if match is None:
        continue
    start = match.end()
    end = re.search('</h1>', text).start()
    print(text[start:end])
    print('--------------------------------------')
The output is:
11/222-AA
This is the title
Some really interesting text
--------------------------------------
22/4444-AA
22222 This is the title2
22222222222222222222222
--------------------------------------
33/4444-AA
3333 This is the title3
333333333333333333333333
--------------------------------------
Why does the <P> part not work?
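A reduced test case, offered only as a sketch, seems to point at the cause: `re.search('</div>', text)` always scans from the beginning of the text, so for `<P>` the `end` index it returns lies before `start` and the slice comes out empty. Looking for the closing tag only after the opening tag, using the `pos` argument of a compiled pattern's `search`, returns the expected block here. The shortened sample string and the helper name `between` are made up for illustration.

```python
import re

# Shortened sample, modelled on the input file above.
text = ('<div id="text"><div id="text-interesting1">11/222-AA</div>\n'
        '<h2>This is the title</h2>\n'
        '<P>Here is some description\n'
        '</div>\n')

def between(text, open_pat, close_pat):
    """Return the text between open_pat and the first close_pat after it, or None."""
    opening = re.search(open_pat, text)
    if opening is None:
        return None
    # Pattern.search takes a start position, so the closing tag is
    # searched for only after the end of the opening tag.
    closing = re.compile(close_pat).search(text, opening.end())
    if closing is None:
        return None
    return text[opening.end():closing.start()]

# What the loop above does: it finds the first </div> in the whole text,
# which sits before <P>, so end < start and the slice is empty.
start = re.search('<P>', text).end()
end = re.search('</div>', text).start()
naive = text[start:end]

fixed = between(text, '<P>', '</div>')
print(repr(naive))
print(repr(fixed))
```

This is still plain string matching on HTML, so it only holds up as long as the files keep this exact layout.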