0

I have a page with the following code:

<HTML>
<HEAD>
<TITLE>smth</TITLE>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
</HEAD>
<BODY>
<div id="doc" class="searchN">
<div id="hd" style="border-bottom:0;">
    <a id="logo" class="logoN" href="/" alt="logo" title="open project"></a>

</div> 
    <div id="bd-cross">    
        <ol class="site" start=1>

            <li class="">
                <a href="url/">Smth</a>
                <div class="ref">
                <a href="News_and_Media/">Regional: Europe:</a>
          </div>    
            </li>

            <li class="">
                <a href="url2">Descr3</a> 
                <div class="ref">
                <a href="url3">Descr3</a>   
          </div>    
            </li>
....
</BODY>
</HTML>

I need to check whether an <li class=""> tag exists on the page. I am using Python + RegExp:

import re
import urllib2
url = 'url'
#Parse it
MainPage = urllib2.urlopen(url).read()
Li = re.findall("<div id=\"bd-cross\">*<li class=\"\">*</li>", MainPage)
try:
    if Li:
        print "Li tag on " +url+ ": Yes"
    else:
        print "Li tag on " +url+ ": No"
except:
    print "Error"

The output is No, but it should be Yes because the page contains that tag. If I print Li, it outputs '[]'.


2 Answers

2

You should use a package such as BeautifulSoup or lxml.html.soupparser; it will make your life much easier. Using the latter, you can do the following:

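>>> import urllib2
>>> import lxml.html.soupparser
>>> MainPage = urllib2.urlopen(url).read()
>>> # Parse the (possibly malformed) HTML into an element tree
>>> HtmlDoc = lxml.html.soupparser.fromstring(MainPage)
>>> # Select every <li class=""> nested anywhere inside <div id="bd-cross">
>>> Elements = HtmlDoc.xpath('//div[@id="bd-cross"]//li[@class=""]')
>>> if len(Elements) > 0:
...     print 'Yes'
... else:
...     print 'No'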
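
If installing a third-party package is not an option, roughly the same check can be sketched with the standard library's html.parser module. This is only an illustration under some assumptions: it targets Python 3 (the question uses Python 2, where the module was named HTMLParser), the class name EmptyClassLiFinder and the inline SAMPLE_HTML are made up for the example, and it expects reasonably well-nested div tags.

```python
from html.parser import HTMLParser


class EmptyClassLiFinder(HTMLParser):
    """Counts <li class=""> tags nested inside <div id="bd-cross">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0  # greater than 0 while inside the target div
        self.count = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # Start tracking at the target div, then follow nested divs too.
            if self.div_depth > 0 or attrs.get("id") == "bd-cross":
                self.div_depth += 1
        elif tag == "li" and self.div_depth > 0 and attrs.get("class") == "":
            self.count += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1


# Hypothetical stand-in for the downloaded page.
SAMPLE_HTML = """
<div id="hd"><li class="">outside the target div</li></div>
<div id="bd-cross">
  <ol class="site" start=1>
    <li class=""><a href="url/">Smth</a><div class="ref"><a href="News_and_Media/">Regional</a></div></li>
    <li class=""><a href="url2">Descr3</a></li>
  </ol>
</div>
"""

finder = EmptyClassLiFinder()
finder.feed(SAMPLE_HTML)
print("Yes" if finder.count > 0 else "No")
```

The depth counter is what lets the nested <div class="ref"> close without ending the search early; a real parser library handles that bookkeeping for you, which is the main argument for using one.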
Answered 2013-02-07T09:07:35.363
1

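Assuming you don't want to use an HTML parser like BeautifulSoup, and assuming the "bd-cross" div tag is somewhere in your HTML but not in your excerpt, I'd bet that your regex isn't matching across newline boundaries.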

In fact, you are missing the . character in your regex: a bare * applies to the preceding character, so ">*" means "zero or more > characters" rather than "any text". I would also suggest using a regex tester to verify that your regex does what you think it should.

To fix this, write .* (or the non-greedy .*?) wherever you currently have a bare *, and pass flags=re.DOTALL to re.findall as an additional argument.

See the documentation

Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
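
As a rough sketch of the fix described above, the snippet below compares the original pattern, the pattern with .*? added, and the same pattern with re.DOTALL. The inline SAMPLE_HTML is a made-up stand-in for the downloaded page, and the code is written for Python 3 rather than the Python 2 used in the question.

```python
import re

# Hypothetical stand-in for the page returned by urlopen(url).read().
SAMPLE_HTML = """<div id="bd-cross">
    <ol class="site" start=1>
        <li class="">
            <a href="url/">Smth</a>
        </li>
        <li class="">
            <a href="url2">Descr3</a>
        </li>
    </ol>
</div>"""

# Original pattern: ">*" means "zero or more '>' characters", not "any text".
BROKEN = r'<div id="bd-cross">*<li class="">*</li>'
# Fixed pattern: ".*?" lazily matches any run of characters.
FIXED = r'<div id="bd-cross">.*?<li class="">.*?</li>'

broken_matches = re.findall(BROKEN, SAMPLE_HTML)
# Without DOTALL, "." stops at the first newline, so this still finds nothing.
no_flag_matches = re.findall(FIXED, SAMPLE_HTML)
# With DOTALL, "." also matches newlines and the pattern spans several lines.
fixed_matches = re.findall(FIXED, SAMPLE_HTML, flags=re.DOTALL)

print("Li tag found: " + ("Yes" if fixed_matches else "No"))
```

Both changes are needed here: adding the dot alone is not enough because the tags sit on separate lines, and re.DOTALL alone has no effect on a pattern that contains no dot.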

Answered 2013-02-07T09:10:30.543