python - Check whether a tag exists on a page using Python

Question

I have a page with the following code:

<HTML>
<HEAD>
<TITLE>smth</TITLE>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
</HEAD>
<BODY>
<div id="doc" class="searchN">
<div id="hd" style="border-bottom:0;">
    <a id="logo" class="logoN" href="/" alt="logo" title="open project"></a>

</div> 
    <div id="bd-cross">    
        <ol class="site" start=1>

            <li class="">
                <a href="url/">Smth</a>
                <div class="ref">
                <a href="News_and_Media/">Regional: Europe:</a>
          </div>    
            </li>

            <li class="">
                <a href="url2">Descr3</a> 
                <div class="ref">
                <a href="url3">Descr3</a>   
          </div>    
            </li>
....
</BODY>
</HTML>

I need to check whether a <li class=""> tag exists on the page. I am using Python with a regular expression:

import re
import urllib2
url = 'url'
#Parse it
MainPage = urllib2.urlopen(url).read()
Li = re.findall("<div id=\"bd-cross\">*<li class=\"\">*</li>", MainPage)
try:
    if Li:
        print "Li tag on " +url+ ": Yes"
    else:
        print "Li tag on " +url+ ": No"
except:
    print "Error"

The output is No, but it should be Yes because the page contains that tag. If I print Li, it outputs '[]'.

score 2 · Accepted Answer

You should use a package such as BeautifulSoup or lxml.html.soupparser; it will make your life easier. With the latter, you can do the following:

>>> import urllib2
>>> import lxml.html.soupparser
>>> MainPage = urllib2.urlopen(url).read()  # url as defined in the question
>>> HtmlDoc = lxml.html.soupparser.fromstring(MainPage)
>>> # Select every <li class=""> nested anywhere inside <div id="bd-cross">
>>> Elements = HtmlDoc.xpath('//div[@id="bd-cross"]//li[@class=""]')
>>> if len(Elements) > 0:
...     print 'Yes'
... else:
...     print 'No'
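
If installing a third-party parser is not an option, the same "parse, don't regex" idea can be roughed out with the standard library. This is only a sketch under stated assumptions: it is written for Python 3 (where the module is html.parser), it is a substitute for the lxml approach rather than the same technique, and the names LiFinder, has_empty_class_li, and sample are made up for illustration.

```python
from html.parser import HTMLParser


class LiFinder(HTMLParser):
    """Records whether <li class=""> occurs inside <div id="bd-cross">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0   # <div> nesting depth counted from the target div; 0 = outside it
        self.found = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # Start counting at the target div, then track every nested div
            if self.div_depth or attrs.get("id") == "bd-cross":
                self.div_depth += 1
        elif tag == "li" and self.div_depth and attrs.get("class") == "":
            self.found = True

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1


def has_empty_class_li(html):
    """Hypothetical helper: True if the target <li> is present in the markup."""
    finder = LiFinder()
    finder.feed(html)
    finder.close()
    return finder.found


# Hypothetical sample standing in for the downloaded page
sample = """<div id="doc"><div id="hd"></div>
<div id="bd-cross">
  <ol class="site" start=1>
    <li class=""><a href="url/">Smth</a><div class="ref"></div></li>
  </ol>
</div></div>"""

print("Yes" if has_empty_class_li(sample) else "No")
```

The depth counter is a simplification that relies on <div> tags being balanced; badly broken markup is exactly where BeautifulSoup or lxml would cope better.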

score 1 · Accepted Answer

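Assuming you don't want to use an HTML parser like BeautifulSoup, and assuming you do have the "bd-cross" div tag somewhere in your HTML, just not in your excerpt, I'd wager that your regex is not seeing past the newline boundaries.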

In fact, your regex is missing the . character: as written, >* means "zero or more > characters", not "any text". I would also suggest using a regex tester to verify that your regex does what you think it should.

To fix this, add flags=re.DOTALL to the end of the re.findall function as another argument.

See the documentation for re.DOTALL:

Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
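
To make the two fixes concrete, here is a small sketch that compares the question's pattern with a corrected one. It assumes Python 3 syntax and uses a short inline string (the hypothetical variable html) in place of the downloaded page; the non-greedy .*? is one reasonable choice of quantifier, not the only one.

```python
import re

# Hypothetical sample standing in for the downloaded page
html = """<div id="bd-cross">
    <ol class="site" start=1>
        <li class="">
            <a href="url/">Smth</a>
        </li>
    </ol>
</div>"""

# Pattern from the question: `>*` repeats the '>' character, it is not a wildcard
broken = re.findall(r'<div id="bd-cross">*<li class="">*</li>', html)

# `.*?` matches any run of characters, as short as possible
pattern = r'<div id="bd-cross">.*?<li class="">.*?</li>'

# Without DOTALL, '.' stops at the first newline, so the pattern cannot span lines
without_flag = re.findall(pattern, html)

# With DOTALL, '.' also matches newlines
with_flag = re.findall(pattern, html, flags=re.DOTALL)

print(bool(broken), bool(without_flag), bool(with_flag))
```

Only the last call finds a match, which shows that both changes are needed: the . before each * and the re.DOTALL flag.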
