
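I'm trying to scrape some text from a wiki page, specifically this one. I'm using BeautifulSoup, or at least trying to... I'm not really experienced with web scraping. Here is my code so far: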

import urllib
import urllib.request
from bs4 import BeautifulSoup

soup =BeautifulSoup(urllib.request.urlopen('http://yugioh.wikia.com/wiki/Card_Tips:Blue-Eyes_White_Dragon').read())

for row in soup('span', {'class' : 'mw-headline'})[0].tbody('tr'):
      tds = row('td')
      print(tds[0].string, tds[1].string, tds[2].string)

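I just want to get each heading (Searchable by, Special Summoned from the hand by, etc.) and every card listed under each category. Can anyone give me some advice?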


2 Answers


If you inspect the HTML, you will find:

<div class="mw-content-ltr" dir="ltr" id="mw-content-text" lang="en">
 ...
 <h3>
  <span class="mw-headline" id="Searchable_by">
   Searchable by
  </span>
 ...
 </h3>
 <ul>
  <li>
   "
   <a href="/wiki/Summoner%27s_Art" title="Summoner's Art">
    Summoner's Art
   </a>
   "
  </li>
  <li>
   "
   <a href="/wiki/The_White_Stone_of_Legend" title="The White Stone of Legend">
    The White Stone of Legend
   </a>
   "
  ...
  </li>
 </ul>
 ...
</div>

The snippet above shows that:

  • A div with id="mw-content-text" contains the wiki content.
  • Each heading is in the first (and only) span of an h3 tag.
  • A ul tag contains the bulleted list.

So, in Python:

from bs4 import BeautifulSoup

# The web page was saved locally as stack.htm
with open('stack.htm') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')
main_tag = soup.find_all('div', {'id': 'mw-content-text'})[0]

headers = main_tag.find_all('h3')
ui_list = main_tag.find_all('ul')
# Assumes the i-th <ul> belongs to the i-th <h3>
for i in range(len(headers)):
    print(headers[i].span.get_text())
    print('\n -'.join(ui_list[i].get_text().split('\n')))
# Alternatively, build (heading, list text) pairs lazily
sections = zip((x.span.get_text() for x in headers), ('\n -'.join(x.get_text().split('\n')) for x in ui_list))
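
Pairing headers[i] with ui_list[i] only works while the page has exactly one ul per h3, in the same order. As a rough sketch of a variant that does not rely on the indices lining up, each h3 can be matched with the next ul sibling that follows it. SAMPLE_HTML and sections_by_sibling are made-up names for illustration, and the markup is a trimmed stand-in modeled on the snippet above, not the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for the saved page (not the real wiki HTML).
SAMPLE_HTML = """
<div id="mw-content-text">
 <h3><span class="mw-headline" id="Searchable_by">Searchable by</span></h3>
 <ul>
  <li>"<a href="/wiki/Summoner%27s_Art">Summoner's Art</a>"</li>
  <li>"<a href="/wiki/The_White_Stone_of_Legend">The White Stone of Legend</a>"</li>
 </ul>
 <h3><span class="mw-headline" id="Special_Summoned">Special Summoned from the hand by</span></h3>
 <ul>
  <li>"<a href="/wiki/Ancient_Rules">Ancient Rules</a>"</li>
 </ul>
</div>
"""


def sections_by_sibling(html):
    """Map each h3 heading to the items of the next <ul> sibling after it."""
    soup = BeautifulSoup(html, "html.parser")
    main_tag = soup.find("div", id="mw-content-text")
    result = {}
    for h3 in main_tag.find_all("h3"):
        ul = h3.find_next_sibling("ul")
        if h3.span is None or ul is None:
            continue  # skip headings without a title span or a following list
        title = h3.span.get_text(strip=True)
        # Each <li> wraps the card name in literal quote characters; drop them.
        result[title] = [li.get_text(strip=True).strip('"') for li in ul.find_all("li")]
    return result


for title, items in sections_by_sibling(SAMPLE_HTML).items():
    print(title)
    for item in items:
        print(" -", item)
```

One caveat: if a heading has no list of its own, find_next_sibling will pick up the list of a later section, so the real page may need an extra check.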
answered 2013-04-04T15:21:16.263

You want to find the <ul> element that follows each heading, then list the links under it to get the cards:

for headline in soup('span', {'class' : 'mw-headline'}):
    print(headline.text)
    links = headline.find_next('ul').find_all('a')
    for link in links:
        print('*', link.text)        

This prints:

Searchable by
* Summoner's Art
* The White Stone of Legend
* Deep Diver
Special Summoned from the hand by
* Ancient Rules
* Red-Eyes Darkness Metal Dragon
* King Dragun
* Kaibaman

and so on.
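
If you would rather keep the results than print them, here is a minimal sketch of the same find_next approach that collects each card's name and absolute URL into a dict. It runs on a small inline sample instead of the live page; SAMPLE_HTML, PAGE_URL and cards_by_headline are names made up for this example, and the sample markup is only an approximation of the real page:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# URL from the question, used only to resolve the relative hrefs.
PAGE_URL = "http://yugioh.wikia.com/wiki/Card_Tips:Blue-Eyes_White_Dragon"

# Hypothetical, trimmed-down stand-in for the page (not the real wiki HTML).
SAMPLE_HTML = """
<h3><span class="mw-headline" id="Searchable_by">Searchable by</span></h3>
<ul>
 <li>"<a href="/wiki/Summoner%27s_Art" title="Summoner's Art">Summoner's Art</a>"</li>
 <li>"<a href="/wiki/Deep_Diver" title="Deep Diver">Deep Diver</a>"</li>
</ul>
<h3><span class="mw-headline" id="Special_Summoned">Special Summoned from the hand by</span></h3>
<ul>
 <li>"<a href="/wiki/Kaibaman" title="Kaibaman">Kaibaman</a>"</li>
</ul>
"""


def cards_by_headline(html, base_url=PAGE_URL):
    """Map each headline to a list of (card name, absolute URL) tuples."""
    soup = BeautifulSoup(html, "html.parser")
    cards = {}
    for headline in soup.find_all("span", class_="mw-headline"):
        ul = headline.find_next("ul")
        if ul is None:
            continue  # headline with no list after it
        cards[headline.get_text(strip=True)] = [
            (a.get_text(strip=True), urljoin(base_url, a["href"]))
            for a in ul.find_all("a", href=True)
        ]
    return cards


for title, links in cards_by_headline(SAMPLE_HTML).items():
    print(title)
    for name, url in links:
        print("*", name, url)
```

To run it against the real page, pass the HTML you downloaded with urllib.request.urlopen(...).read() instead of SAMPLE_HTML.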

answered 2013-04-04T15:35:50.310