My code returns only the text from a web page's body. I am trying to exclude the text inside the class="menu" elements from this page body:

<div id="pre-header-links-inner" class="header-links"><ul id="menu-top-bar" class="menu"><li id="menu-item-22" class="menu-item menu-item-type-custom menu-item-object-custom menu-item-22"><a href="tel:000-000-0000">Main Line: +1 000-000-0000</a></li>
<li id="menu-item-23" class="menu-item menu-item-type-custom menu-item-object-custom menu-item-23"><a href="tel:100000000000">Sales: tel:000-000-0000</a></li>
<li id="menu-item-24" class="menu-item menu-item-type-custom menu-item-object-custom menu-item-24"><a href="mailto:info@example.com">Email: info@example.com</a></li>
</ul></div>         
        </div>
        </div>
        </div>
        <!-- #pre-header -->

        <div id="header">
        <div id="header-core">

            <div id="logo">
            <a href="https://www.example.com/" class="custom-logo-link" rel="home" itemprop="url"><img width="253" height="50" src="https://www.example.com/logo.png" class="custom-logo" alt="Domain" itemprop="logo" /></a>           </div>

            <div id="header-links" class="main-navigation">
            <div id="header-links-inner" class="header-links">

                <ul id="menu-main-navigation" class="menu"><li id="menu-item-71" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-home current-menu-item page_item page-item-2 current_page_item"><a href="https://www.example.com/"><span>Home</span></a></li>
<li id="menu-item-70" class="menu-item menu-item-type-post_type menu-item-object-page"><a href="https://www.example.com"><span>About Us</span></a></li>
<li id="menu-item-108" class="menu-item menu-item-type-post_type menu-item-object-page"><a href="https://www.example.com/services/"><span>Services</span></a></li>
<li id="menu-item-124" class="menu-item menu-item-type-post_type menu-item-object-page"><a href="https://www.example.com/api/"><span>API</span></a></li>
<li id="menu-item-68" class="menu-item menu-item-type-post_type menu-item-object-page"><a href="https://www.example.com/contact-us/"><span>Contact Us</span></a></li>
</ul>               

            </div>
            </div>
            <!-- #header-links .main-navigation -->

            <div id="header-nav"><a class="btn-navbar" data-toggle="collapse" data-target=".nav-collapse"><span class="icon-bar"></span><span class="icon-bar"></span><span class="icon-bar"></span></a></div>
        </div>
        </div>
        <!-- #header -->

        <div id="header-responsive"><div id="header-responsive-inner" class="responsive-links nav-collapse collapse"><ul id="menu-main-navigation-1" class=""><li id="res-menu-item-71" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-home current-menu-item page_item page-item-2 current_page_item"><a href="https://example.com/"><span>Home</span></a></li>
<li id="res-menu-item-70" class="menu-item menu-item-type-post_type menu-item-object-page"><a href="https://www.example.com/about-us/"><span>About Us</span></a></li>
<li id="res-menu-item-108" class="menu-item menu-item-type-post_type menu-item-object-page"><a href="https://www.example.com/services/"><span>Services</span></a></li>
<li id="res-menu-item-124" class="menu-item menu-item-type-post_type menu-item-object-page"><a href="https://www.example.com/api/"><span>API</span></a></li>
<li id="res-menu-item-68" class="menu-item menu-item-type-post_type menu-item-object-page"><a href="https://www.example.com/contact-us/"><span>Contact Us</span></a></li>
</ul></div></div>
                <div id="header-sticky">
        <div id="header-sticky-core">

            <div id="logo-sticky">
            <a href="https://www.example.com/" class="custom-logo-link" rel="home" itemprop="url"><img width="253" height="50" src="https://www.example.com/logo.png" class="custom-logo" alt="Logo" itemprop="logo" /></a>         </div>

            <div id="header-sticky-links" class="main-navigation">
            <div id="header-sticky-links-inner" class="header-links">

                <ul id="menu-main-navigation-2" class="menu"><li id="menu-item-71" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-home current-menu-item page_item page-item-2 current_page_item"><a href="https://www.example.com/"><span>Home</span></a></li>
<li id="menu-item-70" class="menu-item menu-item-type-post_type menu-item-object-page"><a href="https://www.example.com/about-us/"><span>About Us</span></a></li>
<li id="menu-item-108" class="menu-item menu-item-type-post_type menu-item-object-page"><a href="https://www.example.com/services/"><span>Services</span></a></li>
<li id="menu-item-124" class="menu-item menu-item-type-post_type menu-item-object-page"><a href="https://www.example.com/api/"><span>API</span></a></li>
<li id="menu-item-68" class="menu-item menu-item-type-post_type menu-item-object-page"><a href="https://www.example.com/contact-us/"><span>Contact Us</span></a></li>
</ul>   

Strangely, when I call the following line:

text = "".join(tree.xpath("//body//*[not(@class='menu')]//text()")).strip()

it returns the whole plain-text source as is (that is, including the text inside the class="menu" elements).

However, when I remove the not keyword:

text = "".join(tree.xpath("//body//*[(@class='menu')]//text()")).strip()

...it correctly identifies the class="menu" elements and isolates their text perfectly:

Main Line: +000-000-0000
Sales: +1 000-000-0000
Email: info@example.com
Home
About Us
Services
API
Contact Us
Home
About Us
Services
API
Contact Us

What am I doing wrong? I want it to return the text from everything except the class='menu' elements.

1 Answer

"it returns the whole plain text source"

You need to be clear about the difference between what an XPath expression SELECTS and what the application processing the XPath result DISPLAYS.

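XPath returns a set of nodes, and it is very common for the calling application to display each node by showing the whole subtree rooted at that node. But it is not XPath that does this; it is the calling application. Your selection criteria determine which nodes the XPath expression selects, but they have no effect on which descendants of those selected nodes the calling application displays.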
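For the expression in the question, the same select-versus-descend distinction applies inside the path itself. The predicate not(@class='menu') is tested on each element separately, so a wrapper div that merely contains the menu passes the test, and the trailing //text() then walks down into the menu anyway. One way around this is to filter the text nodes by their ancestors instead. The following is only a minimal sketch, assuming lxml is installed; the SAMPLE markup and the clean helper are hypothetical stand-ins, not the page or code from the question.

```python
from lxml import html

# Hypothetical, trimmed-down stand-in for the page in the question.
SAMPLE = """
<html><body>
  <div class="header-links">
    <ul id="menu-top-bar" class="menu">
      <li class="menu-item"><a href="mailto:info@example.com">Email: info@example.com</a></li>
    </ul>
  </div>
  <div id="content"><p>Welcome to the page.</p></div>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# Expression from the question: the outer div, the li and the a all lack
# class="menu", so they match, and //text() descends into the menu text.
original = tree.xpath("//body//*[not(@class='menu')]//text()")

# Alternative: select text nodes directly and keep only those that have
# no ancestor element whose class attribute is exactly "menu".
filtered = tree.xpath("//body//text()[not(ancestor::*[@class='menu'])]")


def clean(nodes):
    """Join text nodes and collapse runs of whitespace (hypothetical helper)."""
    return " ".join(" ".join(nodes).split())


print(clean(original))
print(clean(filtered))
```

Note that @class='menu' is an exact string comparison, so an element with class="menu other" would not be excluded by this sketch; a contains()-based test would be needed for that case.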

answered 2017-09-15T08:45:58.443