python - Making a basic web scraper in Python with only built-in libraries

Question

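I'm learning Python and trying to make a web scraper without any 3rd-party libraries, so that the process isn't simplified for me and I know what I am doing. I looked through several online resources, but all of them left me confused about certain things.

The html looks something like this,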

<html>
<head>...</head>
<body>
    *lots of other <div> tags*
<div class = "want" style="font-family:verdana;font-size:12px;letter-spacing:normal"">
<form class ="subform">...</form>
<div class = "subdiv1" >...</div>
<div class = "subdiv2" >...</div>
    *lots of other <div> tags*
</body>
</html>

I want the scraper to extract the <div class = "want"...>*content*</div> and save it to an html file.

I have a very basic idea of how I need to go about this.

import urllib
from urllib import request
#import re
#from html.parser import HTMLParser

response = urllib.request.urlopen("http://website.com")
html = response.read()

#Some how extract that wanted data

f = open('page.html', 'w')
f.write(data)
f.close()

score 4 · Accepted Answer

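The standard library comes with a variety of structured markup processing tools, which you can use to parse the HTML and then search it to extract your div.

There are a lot of choices there. Which one should you use?

html.parser looks like the obvious choice, but I'd actually start with ElementTree. It's a very nice and very powerful API, there's plenty of documentation and sample code all over the web to help you get started, and a lot of experts use it daily and can help you with your problems. If it turns out that etree can't parse your HTML, you will have to use something else... but try it first.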
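If etree does reject a page, html.parser is one possible fallback, since it does not require well-formed XML. What follows is only a minimal sketch of that idea, not a complete HTML handler: the class name DivExtractor and the helper extract_div are made up for illustration, and it simply re-assembles the markup between the first matching <div> and its closing tag.

```python
from html.parser import HTMLParser


class DivExtractor(HTMLParser):
    """Collect the markup of the first <div> carrying a given class.

    Hypothetical helper for illustration; not part of the standard library.
    """

    def __init__(self, wanted_class):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.wanted_class = wanted_class
        self.depth = 0      # how many <div> levels deep we are inside the target
        self.chunks = []    # pieces of markup collected so far
        self.done = False   # True once the target <div> has been closed

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            classes = (dict(attrs).get("class") or "").split()
            if tag == "div" and self.wanted_class in classes:
                self.depth = 1
                self.chunks.append(self.get_starttag_text())
            return
        if tag == "div":
            self.depth += 1
        self.chunks.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if self.depth:
            self.chunks.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth == 0:
            return
        self.chunks.append("</%s>" % tag)
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.chunks.append("&%s;" % name)

    def handle_charref(self, name):
        if self.depth:
            self.chunks.append("&#%s;" % name)


def extract_div(html_text, wanted_class):
    """Return the target <div> as a string, or None if it never closed."""
    parser = DivExtractor(wanted_class)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.chunks) if parser.done else None
```

Calling extract_div(html_text, "want") on a decoded page would return the div as a string. The sketch counts only nested <div> tags, so markup with unbalanced divs inside the target will not come out right.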

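To try ElementTree, for example, here is your HTML with a few minor fixes so that it's actually valid, and with some text actually worth getting out of your div: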

<html>
<head>...</head>
<body>
    *lots of other <div /> tags*
<div class = "want" style="font-family:verdana;font-size:12px;letter-spacing:normal">spam spam spam
<form class ="subform">...</form>
<div class = "subdiv1" >...</div>
<div class = "subdiv2" >...</div>
    *lots of other <div /> tags*
</div>
</body>
</html>

You can use code like this (I'm assuming you know, or are willing to learn, XPath):

from xml.etree import ElementTree

# page holds the HTML shown above, as a str or bytes object
tree = ElementTree.fromstring(page)
mydiv = tree.find('.//div[@class="want"]')

Now you've got a reference to the div with class "want". You can get its direct text with this:

print(mydiv.text)
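Putting the steps so far together, a self-contained version might look like the sketch below. The helper name find_div and the shortened sample page are assumptions made for illustration. Two things are worth knowing: fromstring raises ParseError when the markup is not well-formed (which is the case for the HTML exactly as posted in the question), and the [@class="want"] predicate matches only when the whole attribute value is exactly "want".

```python
from xml.etree import ElementTree

# A shortened copy of the repaired sample, used only for illustration.
page = """<html>
<head>...</head>
<body>
<div class="want" style="font-size:12px">spam spam spam
<form class="subform">...</form>
<div class="subdiv1">...</div>
</div>
</body>
</html>"""


def find_div(markup, wanted_class):
    """Return the first <div> whose class attribute equals wanted_class.

    Returns None when no such div exists. Raises ElementTree.ParseError
    if the markup is not well-formed XML.
    """
    tree = ElementTree.fromstring(markup)
    return tree.find('.//div[@class="%s"]' % wanted_class)


mydiv = find_div(page, "want")
# .text is only the text before the first child element.
print(mydiv.text.strip())

try:
    # The doubled quote makes this tag invalid, as in the original question.
    find_div('<body><div class="want"">x</div></body>', "want")
except ElementTree.ParseError as err:
    print("not well-formed:", err)
```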

But if you want to extract the whole subtree, that's even easier:

data = ElementTree.tostring(mydiv)
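One detail when saving the result: by default tostring returns bytes, while a file opened with mode 'w' expects str. A minimal sketch of one way around that is below, assuming you want text output. The helper name save_element is made up for illustration. Passing encoding="unicode" makes tostring return a str, and method="html" is an optional choice that writes empty elements as <div></div> rather than <div />.

```python
from xml.etree import ElementTree


def save_element(element, path):
    """Serialize an element (with its children) and write it to path."""
    # encoding="unicode" returns str instead of the default bytes.
    data = ElementTree.tostring(element, encoding="unicode", method="html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(data)
    return data
```

With the mydiv from earlier, save_element(mydiv, "page.html") would write the div and everything inside it. Keeping the default bytes result and opening the file with 'wb' would work as well.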

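If you want to wrap that up in a valid <html> and <body>, and/or remove the <div> itself, you'll have to do that part manually. The documentation explains how to build up elements using the simple tree API: you create a head and a body to put in the html, then put the div in the body, then tostring the html, and that's about it.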
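The wrapping step just described can be sketched roughly as follows. The function name wrap_in_page is hypothetical, and the empty <head> is only a placeholder; a real page would probably want a title or charset in there.

```python
from xml.etree import ElementTree


def wrap_in_page(div):
    """Build <html><head/><body>div</body></html> and return it as a str."""
    html = ElementTree.Element("html")
    ElementTree.SubElement(html, "head")          # left empty in this sketch
    body = ElementTree.SubElement(html, "body")
    body.append(div)                              # the extracted element
    return ElementTree.tostring(html, encoding="unicode", method="html")
```

The returned string can be written straight to page.html. Any whitespace that followed the div in the source page (its tail) is carried along with it.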
