
I've run into an interesting problem with Python and BeautifulSoup4. My method fetches today's menus of the local student restaurants for a given restaurant (a dictionary key), then displays them.

import urllib
from bs4 import BeautifulSoup

def fetchFood(restaurant):
    # Restaurant ids, keyed by lowercase restaurant name
    restaurants = {'assari': 'restaurant_aghtdXJraW5hdHIaCxISX1Jlc3RhdXJhbnRNb2RlbFYzGMG4Agw',
                   'delica': 'restaurant_aghtdXJraW5hdHIaCxISX1Jlc3RhdXJhbnRNb2RlbFYzGPnPAgw',
                   'ict': 'restaurant_aghtdXJraW5hdHIaCxISX1Jlc3RhdXJhbnRNb2RlbFYzGPnMAww',
                   'mikro': 'restaurant_aghtdXJraW5hdHIaCxISX1Jlc3RhdXJhbnRNb2RlbFYzGOqBAgw',
                   'tottisalmi': 'restaurant_aghtdXJraW5hdHIaCxISX1Jlc3RhdXJhbnRNb2RlbFYzGMK7AQw'}

    if restaurant.lower() in restaurants:
        soup = BeautifulSoup(urllib.urlopen("http://murkinat.appspot.com"))
        meal_div = soup.find(id=restaurants[restaurant.lower()]).find_all("td", "mealName hyphenate")
        mealstring = "%s: " % restaurant
        for meal in meal_div:
            mealstring += "%s / " % meal.string.strip()
        mealstring = "%s @ %s" % (mealstring[:-3], "http://murkinat.appspot.com")
        return mealstring
    else:
        return "Restaurant not found"

It will be part of my IRC bot, but at the moment it only works on my test machine (Ubuntu 12.04, Python 2.7.3); on the other machine that actually runs the bot (Xubuntu, Python 2.6.5) it fails.

After the line

soup = BeautifulSoup(urllib.urlopen("http://murkinat.appspot.com"))

>>> type(soup)
<class 'bs4.BeautifulSoup'>

I can print it and it shows everything it should, but it cannot find anything. If I do:

>>> print soup.find(True)
None

>>> soup.get_text()
u'?xml version="1.0" encoding="utf-8" ?'

It stops reading after the first line, even though on the other machine it reads everything perfectly.
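To see what each machine is actually doing, one quick diagnostic is to let bs4 pick its preferred installed parser and then inspect which tree builder it chose; differing names on the two machines would point at differing parser installations. This sketch relies on `soup.builder`, an internal BeautifulSoup attribute, purely for diagnostics:

```python
from bs4 import BeautifulSoup

# Construct a soup without naming a parser, so bs4 chooses whichever
# installed parser it prefers, then inspect the tree builder it picked.
# (soup.builder is an internal bs4 attribute, used here only to diagnose.)
soup = BeautifulSoup("<html><body></body></html>")
print(type(soup.builder).__name__)  # e.g. LXMLTreeBuilder or HTML5TreeBuilder
```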

The output should look like this (from the working machine on this date, with the restaurant argument "Tottisalmi"):

    Tottisalmi: Sveitsinleike, kermaperunat / Jauheliha-perunamusaka / Uuniperuna, kylmäsavulohitäytettä / Kermainen herkkusienikastike @ http://murkinat.appspot.com

I'm completely clueless here. I have several similar BeautifulSoup parsing methods that work just fine in the bot (it parses URL titles and Wikipedia stuff), but this one keeps bugging me.

Does anyone have any idea? All I can think of is that it has something to do with my Python version, which sounds odd since BeautifulSoup4 works fine everywhere else.


1 Answer


I believe you have different parsers installed on the two machines. The html5lib parser chokes on the given markup, producing the behavior you see, while the lxml and html.parser parsers handle it correctly.

When writing code that will be run on multiple machines, it's best to explicitly state which parser you want to use:

BeautifulSoup(data, "lxml")

This way you'll get an error if the appropriate parser isn't installed.
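A defensive variant of that idea first asks for lxml and falls back to the parser bundled with Python if bs4 raises `FeatureNotFound` (both names are real bs4 APIs; the sample markup below is made up for illustration):

```python
from bs4 import BeautifulSoup, FeatureNotFound

markup = '<html><body><td class="mealName hyphenate"> Soup of the day </td></body></html>'

try:
    # Prefer lxml when it is installed...
    soup = BeautifulSoup(markup, "lxml")
except FeatureNotFound:
    # ...otherwise fall back to the stdlib html.parser, which ships with Python.
    soup = BeautifulSoup(markup, "html.parser")

print(soup.find("td", "mealName hyphenate").string.strip())  # -> Soup of the day
```

Falling back silently trades the answer's "fail loudly" advice for robustness, so which approach fits depends on whether you control the machines the bot runs on.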

answered 2012-07-12T20:38:04.740