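I have the following code snippet. It takes a URL, opens it, parses out only the text, and then searches for widgets. It detects a widget by looking for the word widget1 and then endwidget, which marks the end of the widget.
Basically, the code starts writing every line of text to a file as soon as it finds the word widget1 and stops when it reads endwidget. However, my code indents every line after the first widget1 line.
This is my output: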
widget1 this is a really cool widget
    it does x, y and z
    and also a, b and c
    endwidget
What I want is:
widget1 this is a really cool widget
it does x, y and z
and also a, b and c
endwidget
Why am I getting this indentation? Here is my code...
for url in urls:
    page = mech.open(url)
    html = page.read()
    soup = BeautifulSoup(html)
    text = soup.prettify()
    texts = soup.findAll(text=True)
    def visible(element):
        if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
            # If the parent of your element is any of those ignore it
            return False
        elif re.match('<!--.*-->', str(element)):
            # If the element is an HTML comment, ignore it
            return False
        else:
            # Otherwise, return True as these are the elements we need
            return True
    visible_texts = filter(visible, texts)
    inwidget = 0
    # open a file for write
    for line in visible_texts:
        # if line doesn't contain .widget1 then ignore it
        if ".widget1" in line and inwidget == 0:
            match = re.search(r'\.widget1 (\w+)', line)
            line = line.split(".widget1")[1]
            # make the next word after .widget1 the name of the file
            filename = "%s" % match.group(1) + ".txt"
            textfile = open(filename, 'w+b')
            textfile.write("source:" + url + "\n\n")
            textfile.write(".widget1" + line)
            inwidget = 1
        elif inwidget == 1 and ".endwidget" not in line:
            print line
            textfile.write(line)
        elif ".endwidget" in line and inwidget == 1:
            textfile.write(line)
            inwidget = 0
        else:
            pass
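
One possible explanation, judging only from the code above: in the first branch, line.split(".widget1")[1] throws away everything before .widget1, including any leading whitespace, while the other two branches write the text node exactly as BeautifulSoup returned it, so whatever indentation the HTML source had ends up in the file. The sketch below shows one way to normalize this by stripping each line before it is stored. It is only an illustration under assumptions: extract_widgets is a hypothetical helper (not part of the original code), it works on a plain list of strings instead of BeautifulSoup text nodes, it returns the text instead of writing files, it skips blank lines, and it adds its own newline after each line.

```python
import re


def extract_widgets(lines):
    """Hypothetical helper: collect widget text keyed by '<name>.txt'.

    `lines` is any iterable of strings (for example the visible text nodes).
    """
    widgets = {}
    current = None  # file name of the widget being collected, or None
    for raw in lines:
        # Strip whitespace carried over from the HTML source so that
        # every stored line starts at column 0.
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are not wanted in the output
        if current is None:
            # Only start a widget when the name after .widget1 can be read.
            match = re.search(r'\.widget1 (\w+)', line)
            if match:
                current = match.group(1) + ".txt"
                widgets[current] = [line[line.index(".widget1"):]]
        else:
            widgets[current].append(line)
            if ".endwidget" in line:
                current = None  # widget finished
    # Join with explicit newlines, since strip() removed the original ones.
    return {name: "\n".join(body) + "\n" for name, body in widgets.items()}


sample = [
    "  .widget1 cool this is a really cool widget\n",
    "      it does x, y and z\n",
    "\n",
    "   and also a, b and c\n",
    "  .endwidget\n",
    "text outside any widget",
]
result = extract_widgets(sample)
```

In the original loop the same idea would amount to calling line.strip() before each textfile.write(...) and appending "\n" explicitly. Whether that matches the real page depends on how the whitespace is laid out in its HTML.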