python - How do I display the values of all IDs on a web page using Python 2.7?

Question

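I need to display the values of all IDs on a given website. Is there a function in urllib or urllib2 that will let me read the site and then print only the values after "id="? Any help with this would be greatly appreciated.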

score 2 · Accepted Answer

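I would use BeautifulSoup and requests to do this. I put together a simple example using this page and posted it on Github.

Note that the real work here is in the return statement; most of the rest is boilerplate.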

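from bs4 import BeautifulSoup as BS
import requests as r

def get_ids_from_page(page):
    # Download the page and parse its <body>; naming the parser explicitly
    # avoids bs4's "no parser was explicitly specified" warning
    response = r.get(page)
    soup = BS(response.content, 'html.parser').body

    # Collect the id attribute of every tag that has one, sorted alphabetically
    return sorted([x.get('id') for x in soup.find_all() if x.get('id') is not None])

if __name__ == '__main__':
    # In response to the question at the URL below - in short "How do I get the
    #   ids from all objects on a page in Python?"
    ids = get_ids_from_page('http://stackoverflow.com/questions/14347086/')

    for val in ids:
        # The parenthesized form works on both Python 2.7 and Python 3
        print(val)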
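
Since the question asks about urllib/urllib2 specifically, here is a rough standard-library-only sketch of the same idea (collect every id attribute, then sort). It is not part of the answer above: the names `IdCollector`, `get_ids_from_html`, and `sample_html` are made up for illustration, and the sample markup is hypothetical.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class IdCollector(HTMLParser):
    """Collects the value of every id attribute seen in a start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None when the
        # attribute is written without a value
        for name, value in attrs:
            if name == 'id' and value is not None:
                self.ids.append(value)


def get_ids_from_html(html_text):
    """Return the sorted id values found in an HTML string."""
    parser = IdCollector()
    parser.feed(html_text)
    parser.close()
    return sorted(parser.ids)


# Hypothetical sample markup, used only for illustration
sample_html = """
<html>
  <body id="main">
    <div id="header" data-id="ignored">Title</div>
    <p id="content">some text</p>
  </body>
</html>
"""

ids = get_ids_from_html(sample_html)
for val in ids:
    print(val)
```

On Python 2.7 the HTML text could come from `urllib2.urlopen(url).read()` instead of the inline string. Unlike the BeautifulSoup version, this sketch looks at every tag in the document rather than only those inside `<body>`.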

score 0 · Accepted Answer

You can use a regular expression:

import re

# html_text is assumed to already hold the page source as a string
id_list = re.findall('id="(.*?)"', html_text)

Or, slightly more elaborate (to make sure you only pick up ids from inside HTML tags):

id_list = re.findall('<[^>]*? id="(.*?)"', html_text)

This makes it easy to pick out only certain kinds of IDs (those matching some particular pattern).
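
As a minimal sketch of that last point, the snippet below narrows the capture group to ids that follow an assumed "comment-<number>" naming scheme. The sample markup and the naming scheme are hypothetical. It also uses `\s` before `id` instead of a literal space, a small variation on the pattern above, so that attributes such as `data-id` are not matched.

```python
import re

# Hypothetical sample markup, used only for illustration
html_text = """
<div id="comment-12" data-id="999">
  <p id="intro">hello</p>
  <span class="x" id="comment-7">hi</span>
</div>
"""

# Every double-quoted id attribute inside a tag; the whitespace required
# before "id" keeps data-id="..." from being picked up
all_ids = re.findall(r'<[^>]*?\sid="(.*?)"', html_text)

# Only ids that follow the assumed "comment-<number>" scheme
comment_ids = re.findall(r'<[^>]*?\sid="(comment-\d+)"', html_text)

print(all_ids)
print(comment_ids)
```

Both patterns only recognize double-quoted attribute values and return at most one id per tag, so markup written as `id='x'` would be missed; a real HTML parser is more robust for pages you do not control.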

score 0 · Accepted Answer

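There is the obvious (but ugly) regex solution, where you fetch the page with urllib or urllib2, or with the more convenient requests library, and then apply a regular expression. I would recommend the pyquery package instead. It is like jquery, but for python, and uses css selectors to get nodes.

For your problem: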

from pyquery import PyQuery

page = """
<html>
  <body id='test'>
    <p id='test2'>some text</p>
  </body>
</html>
"""

doc = PyQuery(page)
for node in doc("*[id]").items():
    print(node.attr.id)

This will produce:

test
test2

And to download the page:

import requests
page = requests.get("http://www.google.fr").text

pyquery can even open URLs itself, using urllib or requests.
