I am trying to write an HTML parser, but I don't want to query the website on every test run, so I saved the page as a local HTML file.
To read it, I use:
urltext = urllib.request.urlopen(urlfile).read().decode("utf-8")
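For context, here is a minimal sketch of how that call might be pointed at a local copy. It assumes `urlfile` is a `file://` URL; the helper `read_html`, the file name `saved_page.html`, and the sample markup written to disk are hypothetical and exist only to make the example self-contained.

```python
import tempfile
import urllib.request
from pathlib import Path


def read_html(urlfile: str) -> str:
    """Fetch urlfile (http://, https://, or file://) and decode it as UTF-8."""
    with urllib.request.urlopen(urlfile) as response:
        return response.read().decode("utf-8")


# Hypothetical local copy, written here only so the sketch runs on its own.
sample = (
    '<h2 class="article-title">\n'
    '<span class="headline-intro">Intro:</span> '
    '<span class="headline">Main Text</span></h2>\n'
)
local_path = Path(tempfile.mkdtemp()) / "saved_page.html"
local_path.write_bytes(sample.encode("utf-8"))

# Path.as_uri() builds the file:// URL that urlopen accepts for a local file.
urltext = read_html(local_path.as_uri())
print(urltext)
```

Read this way, the string is just the decoded bytes of the file on disk, so comparing it with the file opened in a text editor may help show whether the extra markup is already in the saved file.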
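Reading directly from the website gives me a correct string to parse, but when I open the copy saved on my local machine, it seems to be decoded incorrectly: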
<span id="line845"></span> </span><span><<span class="start-tag">h2</span> <span class="attribute-name">class</span>="<a class="attribute-value">article-title</a>"></span><span>
<span id="line846"></span> </span><span><<span class="start-tag">span</span> <span class="attribute-name">class</span>="<a class="attribute-value">headline-intro</a>"></span><span>Intro:</span><span></<span class="end-tag">span</span>></span><span> </span><span><<span class="start-tag">span</span> <span class="attribute-name">class</span>="<a class="attribute-value">headline</a>"></span><span>Main text</span><span></<span class="end-tag">span</span>></span><span></span><span></<span class="end-tag">h2</span>></span><span>
Originally it should look like this:
<h2 class="article-title">
<span class="headline-intro">Intro:</span> <span class="headline">Main Text</span></h2>
Any idea what I am doing wrong?
Thanks
Kev