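I decided to build this small project to learn how to use mechanize. Right now it goes to urbandictionary, fills in the word "skid" in the search form, submits it, and prints out the HTML.

What I want it to do is find the first definition and print it out. How would I go about doing that?

Here is my source code so far: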

import mechanize

br = mechanize.Browser()
page = br.open("http://www.urbandictionary.com/")

br.select_form(nr=0)
br["term"] = "skid"
br.submit()

print br.response().read()

This is where the definition is stored:

<div class="definition">Canadian definition: Commonly used to refer to someone   who      stopped evolving, and bathing, during the 80&#x27;s hair band era.  Generally can be found wearing AC/DC muscle shirts, leather jackets, and sporting a <a href="/define.php?term=mullet">mullet</a>.  The term &quot;skid&quot; is in part derived from &quot;skid row&quot;, which is both a band enjoyed by those the term refers to, as well as their address.  See also <a href="/define.php?term=white%20trash">white trash</a> and <a href="/define.php?term=trailer%20park%20trash">trailer park trash</a></div><div class="example">The skid next door got drunk and beat up his old lady.</div>

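You can see that it is stored inside the div with class "definition". I know how to search the source for the div, but I don't know how to take everything between the tags and display it.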

3 Answers

I think a regular expression is enough for this task (going by your description). Try this code:

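import re

import mechanize

br = mechanize.Browser()
br.open("http://www.urbandictionary.com/")

# Fill in and submit the first form on the page (the search box).
br.select_form(nr=0)
br["term"] = "skid"
br.submit()

source = br.response().read()

# Non-greedy match of everything inside a definition div.
# re.DOTALL lets "." also match line breaks inside the definition.
pattern = re.compile(r'<div class="definition">(.+?)</div>', re.DOTALL)
matches = pattern.findall(source)
print matches[0]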

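This prints the content between the tags (without the example part, although that can be extracted in exactly the same way). I don't know how you want to handle the tags inside that content. If you want to keep them, you are done. If you want to remove them, you can use something like re.sub().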
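As a rough illustration of the tag-removal step, here is a minimal sketch using re.sub() together with html.unescape() from the standard library. It assumes Python 3, and the helper name strip_tags and the sample string (an excerpt of the definition quoted in the question) are only for illustration. A simple pattern like this is fine for a small, known fragment but is not a general HTML parser.

```python
import html
import re

# Excerpt of the definition markup quoted in the question.
raw_definition = (
    'Generally can be found wearing AC/DC muscle shirts, leather jackets, '
    'and sporting a <a href="/define.php?term=mullet">mullet</a>.  '
    'The term &quot;skid&quot; is in part derived from &quot;skid row&quot;'
)


def strip_tags(fragment):
    """Remove tags, decode entities, and collapse runs of whitespace."""
    no_tags = re.sub(r"<[^>]+>", "", fragment)  # drop <a ...> and </a>
    decoded = html.unescape(no_tags)            # &quot; becomes "
    return " ".join(decoded.split())            # normalise whitespace


print(strip_tags(raw_definition))
```

The same helper could be applied to the captured regex group before printing it.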

answered 2013-08-23T15:42:09.720

Since it was mentioned, I thought I would provide a BeautifulSoup answer. Use whichever approach works best for you.

import urllib2

import bs4

# Use urllib2 to get the HTML from the web
url     = r"http://www.urbandictionary.com/define.php?term={term}"
request = url.format(term="skid")
raw     = urllib2.urlopen(request).read()

# Convert it into a soup, naming the parser explicitly
soup    = bs4.BeautifulSoup(raw, "html.parser")

# Find the requested info. get_text() is used rather than .string because
# .string is None when the div contains nested tags such as <a> links.
for word_def in soup.find_all(class_='definition'):
    print word_def.get_text()

answered 2013-08-23T19:35:01.933

You can use lxml to parse the HTML fragment:

import lxml.html as html
import mechanize

br = mechanize.Browser()
page = br.open("http://www.urbandictionary.com/")

br.select_form(nr=0)
br["term"] = "skid"
br.submit()

fragment = html.fromstring(br.response().read())

print fragment.find_class('definition')[0].text_content()

However, this solution strips the tags inside the div and flattens the text.
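For comparison, the same "flatten the text" behaviour can be sketched without any third-party package, using html.parser from the Python 3 standard library. This is a swapped-in technique rather than what lxml does internally; the class name FirstDefinitionParser, the function first_definition_text, and the shortened sample markup are all made up for illustration.

```python
from html.parser import HTMLParser


class FirstDefinitionParser(HTMLParser):
    """Collect the text of the first <div class="definition">, dropping nested tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0      # greater than 0 while inside the target div
        self.done = False   # set once the first definition div has closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            if tag == "div":
                self.depth += 1  # track nested divs so the right </div> ends it
        elif tag == "div" and "definition" in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def first_definition_text(page_source):
    parser = FirstDefinitionParser()
    parser.feed(page_source)
    parser.close()
    return "".join(parser.parts)


# Hypothetical, shortened sample in the shape of the markup from the question.
sample = (
    '<div class="definition">Sporting a '
    '<a href="/define.php?term=mullet">mullet</a>. '
    'The term &quot;skid&quot;.</div>'
    '<div class="example">The skid next door.</div>'
)
print(first_definition_text(sample))
```

Like text_content(), this keeps only the text: the link markup is gone and entities such as &quot; are decoded.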

answered 2013-08-23T16:14:43.857