python - BeautifulSoup in Python is not parsing correctly

Question

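I'm running Python 2.7.5 and using the built-in html parser for what I'm about to describe.

The task I'm trying to accomplish is to take a chunk of html that is essentially a recipe. Here is an example.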

html_chunk = "<h1>Miniature Potato Knishes</h1>Posted by bettyboop50 at recipegoldmine.com May 10, 2001Makes about 42 miniature knishesThese are just yummy for your tummy!3 cups mashed potatoes (about     2 very large potatoes) 2 eggs, slightly beaten 1 large onion, diced 2 tablespoons margarine 1 teaspoon salt (or to taste) 1/8 teaspoon black pepper 3/8 cup Matzoh meal 1 egg yolk, beaten with 1 tablespoon waterPreheat oven to 400 degrees F.Sauté diced onion in a small amount of butter or margarine until golden brown.In medium bowl, combine mashed potatoes, sautéed onion, eggs, margarine, salt, pepper, and Matzoh meal.Form mixture into small balls about the size of a walnut. Brush with egg yolk mixture and place on a well-greased baking sheet and bake for 20 minutes or until well browned."

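The goal is to separate out the header, junk, ingredients, instructions, serving, and number of ingredients.

Here is my code that accomplishes that: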

from bs4 import BeautifulSoup

def list_to_string(list):
   joined = ""
   for item in list:
      joined += str(item)
   return joined

def get_ingredients(soup):
   for p in soup.find_all('p'):
      if p.find('br'):
         return p

def get_instructions(p_list, ingredient_index):
   instructions = []
   instructions += p_list[ingredient_index+1:]
   return instructions

def get_junk(p_list, ingredient_index):
   junk = []
   junk += p_list[:ingredient_index]
   return junk

def get_serving(p_list):
   for item in p_list:
      item_str = str(item).lower()
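      # Test each keyword on its own; ("yield" or "make" or ...) evaluates to just "yield".
      if any(word in item_str for word in ("yield", "make", "serve", "serving")):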
         yield_index = p_list.index(item)
         del p_list[yield_index]
         return item

def ingredients_count(ingredients):
   ingredients_list = ingredients.find_all(text=True)
   return len(ingredients_list)

def get_header(soup):
   return soup.find('h1')

def html_chunk_splitter(soup):
   ingredients = get_ingredients(soup)
   if ingredients is None:
      error = 1
      header = ""
      junk_string = ""
      instructions_string = ""
      serving = ""
      count = ""
   else:
      p_list = soup.find_all('p')
      serving = get_serving(p_list)
      ingredient_index = p_list.index(ingredients)
      junk_list = get_junk(p_list, ingredient_index)
      instructions_list = get_instructions(p_list, ingredient_index)
      junk_string = list_to_string(junk_list)
      instructions_string = list_to_string(instructions_list)
      header = get_header(soup)
      error = ""
      count = ingredients_count(ingredients)
   return (header, junk_string, ingredients, instructions_string, 
   serving, count, error)

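It works well except when a chunk contains a string like "Sauté", because soup = BeautifulSoup(html_chunk) causes the é in Sauté to come out garbled. This is a problem because I have a huge csv file of recipes like html_chunk, and I'm trying to structure them nicely and then load the output back into a database. I tried checking whether Sauté comes out right using this html previewer, but it still comes out garbled. I don't know what to do about it.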

What's strange is that when I do what BeautifulSoup's documentation shows

BeautifulSoup("Sacr&eacute; bleu!")
# <html><head></head><body>Sacré bleu!</body></html>

I get

# Sacr├⌐ bleu!

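But my colleague tried it on his Mac, running from the terminal, and he got exactly what the documentation shows.

I'd really appreciate your help. Thank you.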

score 0 · Accepted Answer

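BeautifulSoup tries to guess the encoding and sometimes gets it wrong. However, you can specify the encoding by adding the from_encoding parameter, for example: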

soup = BeautifulSoup(html_text, from_encoding="UTF-8")

The encoding is usually available in the header of the web page.
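As a minimal sketch of this suggestion, assuming the markup arrives as raw bytes (for example, read from a file opened in binary mode), the snippet below passes from_encoding so that BeautifulSoup does not have to guess. The sample bytes and variable names are illustrative and not taken from the question. Note that from_encoding only has an effect on byte input; if the markup is already a decoded unicode string, there is no encoding left to detect and the argument is ignored.

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# Illustrative input: UTF-8 encoded bytes, as they might come from a file
# opened in binary mode. The u"" literal keeps this valid on Python 2 and 3.
html_bytes = u"<p>Sauté diced onion until golden brown.</p>".encode("utf-8")

# Tell BeautifulSoup which encoding the bytes use instead of letting it guess.
soup = BeautifulSoup(html_bytes, "html.parser", from_encoding="utf-8")

# The extracted text is a unicode string with the é intact.
paragraph_text = soup.p.get_text()
```

If the real data is encoded differently (for instance cp1252), that name would go in from_encoding instead; the value has to match how the bytes were actually written.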

score 0 · Accepted Answer

This is not a parsing problem; it is about encoding.
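To make the encoding point concrete, here is a small sketch that uses only the standard library. It assumes the text is stored as UTF-8 and that the console in the question uses code page 437 (an assumption, although it matches the ├⌐ output shown there). The two bytes that UTF-8 uses for é are displayed as two unrelated characters when they are interpreted with the wrong code page.

```python
# -*- coding: utf-8 -*-
text = u"Sacré bleu!"
utf8_bytes = text.encode("utf-8")  # é becomes the two bytes 0xC3 0xA9

# Interpreting the same bytes with the wrong code page garbles the é.
as_cp437 = utf8_bytes.decode("cp437")    # assumed Windows console code page
as_cp1252 = utf8_bytes.decode("cp1252")  # common Western European Windows code page

# Decoding with the encoding that was actually used restores the text.
round_trip = utf8_bytes.decode("utf-8")
```

The bytes are identical in all three cases; only their interpretation differs. That would also explain why a terminal configured for UTF-8, as is typical on a Mac, shows the output from the documentation.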

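Whenever you deal with text that may contain non-ASCII characters (or write a Python program that contains such characters, e.g. in comments or docstrings), you should put an encoding cookie on the first line or, after a shebang line, on the second line: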

#!/usr/bin/env python
# -*- coding: utf-8 -*-

...and make sure this matches your file encoding (in vim: :set fenc=utf-8).
