python - Grabbing the title tag with lxml's iterparse</h1> <div id="body"><p>I'm running into a problem using lxml's <code>iterparse</code> on my HTML. I'm trying to get the <code><title></code>'s text but this simple f</a></h1> </div> <div class="d-flex fw-wrap pb8 mb16 bb bc-black-075"> <div class="flex--item ws-nowrap mr16 mb8"> <span class="fc-light mr2"></span> </div> <div class="flex--item ws-nowrap mr16 mb8" title="2022-04-17 15:46:40Z"> <span class="fc-light mr2">Asked</span> <time itemprop="dateCreated" datetime="2012-04-24T01:16:58.927">2012-04-24T01:16:58.927</time> </div> <div class="flex--item ws-nowrap mb8" title="Viewed 6 times"> <span class="fc-light mr2"></span> Viewed 1235 times </div> </div> <div id="mainbar" role="main" aria-label="question and answers"> <div class="question" data-questionid="4" data-position-on-page="0" data-score="763" id="question"> <div class="post-layout"> <div class="votecell post-layout--left"> <div class="js-voting-container d-flex jc-center fd-column ai-stretch gs4 fc-black-200" data-post-id="4"> <div class="js-vote-count flex--item d-flex fd-column ai-center fc-black-500 fs-title" 
itemprop="upvoteCount" data-value=""> 3 </div> </div> </div> <div class="postcell post-layout--right"> <div class="s-prose js-post-body" itemprop="text"> </div> <div class="mt24 mb12"> <div class="post-taglist d-flex gs4 gsy fd-column"> <div class="d-flex ps-relative fw-wrap"> <a href="/tags/python" class="post-tag js-gps-track" title="show questions tagged 'python'" rel="tag">python</a><a href="/tags/dom" class="post-tag js-gps-track" 
title="show questions tagged 'dom'" rel="tag">dom</a><a href="/tags/web-scraping" class="post-tag js-gps-track" title="show questions tagged 'web-scraping'" rel="tag">web-scraping</a><a href="/tags/lxml" class="post-tag js-gps-track" title="show questions tagged 'lxml'" rel="tag">lxml</a><a href="/tags/iterparse" class="post-tag js-gps-track" title="show questions tagged 'iterparse'" rel="tag">iterparse</a> </div> </div> </div> </div> <span class="d-none" itemprop="commentCount">4</span> </div> </div> <div id="answers"> <a name="tab-top"></a> <div id="answers-header"> <div class="answers-subheader d-flex ai-center mb8"> <div class="flex--item fl1"> <h2 class="mb0" data-answercount=""> 1 Answer <span style="display:none;" itemprop="answerCount">1</span> </h2> </div> </div> </div> <a name="7"></a> <div id="answer-7" class="answer js-answer accepted-answer" data-answerid="7" data-parentid="4" data-score="506" data-position-on-page="1" data-highest-scored="1" data-question-has-accepted-highest-score="1" itemprop="suggestedAnswer" itemscope="" itemtype="https://schema.org/Answer"> <div class="post-layout"> <div class="votecell post-layout--left"> <div class="js-voting-container d-flex jc-center fd-column ai-stretch gs4 fc-black-200" data-post-id="7"> <button class="js-vote-up-btn flex--item s-btn s-btn__unset c-pointer " data-controller="s-tooltip" data-s-tooltip-placement="right" aria-pressed="false" aria-label="Up vote" data-selected-classes="fc-theme-primary" data-unselected-classes="" aria-describedby="--stacks-s-tooltip-dgvag2l3"> <svg aria-hidden="true" class="svg-icon iconArrowUpLg" width="36" height="36" viewBox="0 0 36 36"><path d="M2 25h32L18 9 2 25Z"></path></svg> 
</button><div id="--stacks-s-tooltip-dgvag2l3" class="s-popover s-popover__tooltip pe-none" aria-hidden="true" role="tooltip">This answer is useful<div class="s-popover--arrow"></div></div> <div class="js-vote-count flex--item d-flex fd-column ai-center fc-black-500 fs-title" itemprop="upvoteCount" data-value="2"> 2 </div> </div> </div> <div class="answercell post-layout--right"> <div class="s-prose js-post-body" itemprop="text"> <p>You may want to post at least part of the data you are actually trying to parse. Without that information, here is a guess. If the <code><html></code> element defines a default XML namespace, you need to use that namespace when looking up elements. For example, consider this simple document:</p> <pre><code><?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
    "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.w3.org/MarkUp/SCHEMA/xhtml11.xsd"
      xml:lang="en">
  <head>
    <title>Document Title</title>
  </head>
  <body>
  </body>
</html>
</code></pre> <p>Given this input, the following returns no results:</p> <pre><code>>>> from lxml import etree
>>> doc = etree.parse('foo.html')
>>> doc.xpath('//title')
[]
</code></pre> <p>This fails because we are looking for a <code><title></code> element without specifying a namespace, and without a namespace the lookup will not find a match (because <code>foo:title</code> is different from <code>bar:title</code>, assuming <code>foo:</code> and <code>bar:</code> are defined XML namespace prefixes).</p> <p>You can specify the namespace explicitly through the ElementTree interface, like this:</p> <pre><code>>>> doc.xpath('//html:title', ... 
namespaces={'html': 'http://www.w3.org/1999/xhtml'})
[<Element {http://www.w3.org/1999/xhtml}title at 0x1087910>]
</code></pre> <p>And there is our match.</p> <p>You can also pass the namespace-qualified tag name to the <code>tag</code> parameter of <code>iterparse</code>:</p> <pre><code>>>> from io import BytesIO
>>> data = open('foo.html', 'rb').read()  # the document above, as bytes
>>> titleIter = etree.iterparse(BytesIO(data),
...     tag='{http://www.w3.org/1999/xhtml}title')
>>> list(titleIter)
[('end', <Element {http://www.w3.org/1999/xhtml}title at 0x7fddb7c4b8c0>)]
</code></pre> <p>If this does not solve your problem, post some sample input and we will work from there.</p> </div> <div class="mt24"> <div class="user-action-time" style="color:#999;text-align:right;">answered 2012-04-24T01:40:07.603</div> </div> </div> </div> </div></div> </div> </div> </div> </div> </div> <footer id="footer" class="site-footer js-footer" role="contentinfo"> <div class="site-footer--container"> <div class="site-footer--logo"> <a href="https://stackoverflow.com"><svg aria-hidden="true" class="native svg-icon iconLogoGlyphMd" 
width="32" height="37" viewBox="0 0 32 37"><path d="M26 33v-9h4v13H0V24h4v9h22Z" fill="#BCBBBB"/><path d="m21.5 0-2.7 2 9.9 13.3 2.7-2L21.5 0ZM26 18.4 13.3 7.8l2.1-2.5 12.7 10.6-2.1 2.5ZM9.1 15.2l15 7 1.4-3-15-7-1.4 3Zm14 10.79.68-2.95-16.1-3.35L7 23l16.1 2.99ZM23 30H7v-3h16v3Z" fill="#F48024"/></svg></a> </div> <nav class="site-footer--nav"> <div class="site-footer--col"> <h5 class="-title"><a href="https://stackoverflow.org.cn" class="js-gps-track" data-gps-track="footer.click({ location: 3, link: 15})">Stack Overflow Chinese site</a></h5> <p>Licensed under the CC BY-SA Creative Commons license.</p> </div> </nav> </div> </footer> </body> </html>