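I have the following code snippet, which takes a URL, opens it, parses out only the text, and then searches for widgets. It detects a widget by looking for the word widget1 and then endwidget, which marks the end of the widget.

Basically, the code starts writing every line of text to a file as soon as it finds the word widget1, and stops when it reads endwidget. However, my code indents every line after the first widget1 line.

This is my output: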

widget1 this is a really cool widget
       it does x, y and z 
       and also a, b and c
       endwidget

What I want is:

widget1 this is a really cool widget
it does x, y and z 
and also a, b and c
endwidget

Why am I getting this indentation? Here is my code:

    for url in urls:
        page = mech.open(url)
        html = page.read()
        soup = BeautifulSoup(html)
        text= soup.prettify()
        texts = soup.findAll(text=True) 

        def visible(element):
            if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
                # If the parent of the element is any of these, ignore it
                return False
            elif re.match('<!--.*-->', str(element)):
                # If the element is an HTML comment, ignore it
                return False
            else:
                # Otherwise, return True, as these are the elements we need
                return True

        visible_texts = filter(visible, texts)

        inwidget=0
        # open a file for write
        for line in visible_texts:
        # if line doesn't contain .widget1 then ignore it
            if ".widget1" in line and inwidget==0:
                match = re.search(r'\.widget1 (\w+)', line)
                line = line.split (".widget1")[1]   
                # make the next word after .widget1 the name of the file
                filename = "%s" % match.group(1) + ".txt"
                textfile = open (filename, 'w+b')
                textfile.write("source:" + url + "\n\n")
                textfile.write(".widget1" + line)
                inwidget = 1
            elif inwidget == 1 and ".endwidget" not in line:
                print line
                textfile.write(line)
            elif ".endwidget" in line and inwidget == 1:
                textfile.write(line)
                inwidget= 0
            else:
                pass

2 Answers

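Every line except the first is indented because the first line is the only one you edit: you write it with textfile.write(".widget1" + line), whereas the remaining lines are taken directly from the HTML file, indentation included. You can remove the unwanted whitespace by calling str.strip() on each line, that is, by changing textfile.write(line) to textfile.write(line.strip()).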
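One caveat: str.strip() removes the trailing newline along with the leading indentation, so a newline probably needs to be written back or the lines will run together in the file. The sketch below shows the idea on an in-memory list rather than on the asker's file handling. The function name `strip_widget_lines` and the sample lines are made up for illustration; only the `.widget1` and `.endwidget` markers come from the question.

```python
def strip_widget_lines(lines, start=".widget1", end=".endwidget"):
    """Return the lines from `start` through `end`, each stripped of
    surrounding whitespace and terminated by a single newline.

    Hypothetical helper for illustration; not part of the original code.
    """
    out = []
    inside = False
    for line in lines:
        if not inside and start in line:
            inside = True
        if inside:
            # strip() drops the indentation inherited from the HTML, but it
            # also drops the trailing newline, so add one back.
            out.append(line.strip() + "\n")
            if end in line:
                break
    return out


# Made-up sample input resembling the text extracted from the page.
sample = [
    "some header text\n",
    "   .widget1 demo this is a really cool widget\n",
    "       it does x, y and z \n",
    "       and also a, b and c\n",
    "       .endwidget\n",
    "footer\n",
]
cleaned = strip_widget_lines(sample)
```

In the original loop, the equivalent change would be textfile.write(line.strip() + "\n") in both branches that write a line unchanged.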

Answered 2012-11-22T14:45:16.297

To go from your output to the output you want, do the following:

# a is your output
a = '\n'.join(line.strip() for line in a.split('\n'))
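As a quick check, here is the same approach applied to the output shown in the question. The variable names `raw` and `fixed` are illustrative, and the sketch uses splitlines() instead of split('\n'), a small substitution that avoids an empty final element when the text ends with a newline.

```python
# The indented output from the question, as a single string.
raw = (
    "widget1 this is a really cool widget\n"
    "       it does x, y and z \n"
    "       and also a, b and c\n"
    "       endwidget\n"
)

# Strip leading and trailing whitespace from every line, then rejoin.
fixed = "\n".join(line.strip() for line in raw.splitlines())
```

This also removes trailing spaces and any indentation that was intentional, so it fits best when the widget text carries no meaningful leading whitespace.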
Answered 2012-11-22T14:34:42.000