python - Scraping HTML with Python regular expressions

Question

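I have some problems with regular expressions in Python. I have some HTML pages that contain information useful to me. When the pages were saved, the encoding charset was something ISO..., and all the typical German letters were saved in encoded form, e.g. Früchte became "Fr%C3%BCchte", and so on. The structure of the HTML is pretty bad, so the only reasonable way is to scrape it with regular expressions.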

I have this regular expression in Python:

re.compile('<a\s+href="javascript.*?\(\'(\w+).*?\s.(\d+.+\d+).*?(.*)\'\)\">')

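Unfortunately, this does not do what I want, because the encoded words are split across capture groups and the fields do not line up consistently. The result is: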

[('showSubGroups', "160500', 'Fr%C3", '%BCchte in Alkohol'),
 ('showSubGroups', '160400', "', 'Rumtopf"),
 ('showSubGroups', '160300', "', 'Spirituosen (Bio)"),
 ('showSubGroups', '160200', "', 'Spirituosen zur Verarbeitung in der Confiserie"),
 ('showSubGroups', '160100', "', 'Spirituosen, allgemein")]

Maybe I am just tired, but I cannot see where the mistake is.

This is the HTML I am working with:

<td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160500', 'Fr%C3%BCchte in Alkohol')">Früchte in Alkohol</a></td>
       </tr>
       <tr valign="top">
        <td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td>
       </tr>       <tr valign="top">
        <td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160400', 'Rumtopf')">Rumtopf</a></td>
       </tr>
       <tr valign="top">
        <td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td>
       </tr>       <tr valign="top">
        <td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160300', 'Spirituosen (Bio)')">Spirituosen (Bio)</a></td>
       </tr>
       <tr valign="top">
        <td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td>
       </tr>       <tr valign="top">
        <td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160200', 'Spirituosen zur Verarbeitung in der Confiserie')">Spirituosen zur Verarbeitung in der Confiserie</a></td>
       </tr>
       <tr valign="top">
        <td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td>
       </tr>       <tr valign="top">
        <td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160100', 'Spirituosen, allgemein')">Spirituosen, allgemein</a></td>
       </tr>
       <tr valign="top">
        <td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td>
       </tr>                </tbody></table>
            </td>
        </tr>

score 1 · Accepted Answer

Try this:

f = re.compile(r"sendForm\('[^']*', '([^']*)', '([^']*)'\)")

With your text as input, it gives the following:

In [7]: f.findall(txt)
Out[7]:  [('160500', 'Fr%C3%BCchte in Alkohol'), ('160400', 'Rumtopf'), ('160300', 'Spirituosen (Bio)'), ('160200', 'Spirituosen zur Verarbeitung in der Confiserie'), ('160100', 'Spirituosen, allgemein')]
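For reference, here is a self-contained sketch of the same idea for Python 3. It anchors every argument on its surrounding single quotes, so a comma or parenthesis inside an argument (as in 'Spirituosen, allgemein' or 'Spirituosen (Bio)') cannot shift the groups. The names `SEND_FORM` and `extract_calls` are made up for this illustration, the sample input is a subset of the question's HTML, and the pattern assumes the arguments never contain a single quote.

```python
import re

# Sample input: three of the anchors from the question's HTML.
html = '''
<td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160500', 'Fr%C3%BCchte in Alkohol')">Früchte in Alkohol</a></td>
<td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160300', 'Spirituosen (Bio)')">Spirituosen (Bio)</a></td>
<td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160100', 'Spirituosen, allgemein')">Spirituosen, allgemein</a></td>
'''

# Each argument is "a quote, anything except a quote, a quote".
# Assumption: no argument contains a single quote itself.
SEND_FORM = re.compile(r"sendForm\('([^']*)',\s*'([^']*)',\s*'([^']*)'\)")


def extract_calls(text):
    """Return a list of (action, group_id, label) tuples found in text."""
    return SEND_FORM.findall(text)


calls = extract_calls(html)
for action, group_id, label in calls:
    print(action, group_id, label)
```

Using `[^']*` instead of `.*` keeps each group inside its own pair of quotes, which is what the pattern in the question fails to do.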

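As for decoding %C3%BC (for 'ü'), it appears to be just the UTF-8 encoding of a character from the Latin-1 block, with a '%' added in front of each byte: if you replace each '%' with '\x', it decodes: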

In [39]: '\xC3\xBC'.decode('utf-8')
Out[39]: u'\xfc'

U+00FC is the Unicode code point for ü.
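The '%XX' form is ordinary percent-encoding (URL encoding) of the UTF-8 bytes, so there is no need to swap characters by hand. The session above is Python 2, where `str` has a `.decode` method; a minimal Python 3 sketch using the standard library's `urllib.parse.unquote` could look like this (the variable names are only for illustration):

```python
from urllib.parse import unquote

encoded = "Fr%C3%BCchte in Alkohol"

# unquote() decodes the %XX escapes as UTF-8 by default.
decoded = unquote(encoded)
print(decoded)

# The same two bytes decoded by hand, mirroring the session above.
u_umlaut = bytes.fromhex("C3BC").decode("utf-8")
print(hex(ord(u_umlaut)))
```

Strings without any '%' escapes, such as 'Rumtopf', pass through `unquote` unchanged, so it can be applied to every extracted label.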

score 0 · Accepted Answer

Beautiful Soup is a great HTML parsing library.

Once you have extracted the hrefs from the HTML, applying a regular expression to them should be easy.
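The two-step approach (collect the href values first, then run a regex on them) can be sketched as follows. To keep the example runnable without third-party packages, it uses the standard library's `html.parser` as a stand-in for Beautiful Soup, so this is not Beautiful Soup's API. The class name `HrefCollector`, the pattern, and the sample input are assumptions made for illustration.

```python
import re
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)


# Sample input: two rows taken from the question's HTML.
html = '''
<td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160500', 'Fr%C3%BCchte in Alkohol')">Früchte in Alkohol</a></td>
<td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td>
<td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160400', 'Rumtopf')">Rumtopf</a></td>
'''

collector = HrefCollector()
collector.feed(html)

# Step 2: the regex only has to deal with the short href strings.
pattern = re.compile(r"sendForm\('[^']*',\s*'([^']*)',\s*'([^']*)'\)")
results = []
for href in collector.hrefs:
    match = pattern.search(href)
    if match:
        results.append(match.groups())

print(results)
```

Separating the steps means the parser deals with the messy markup, while the regex only ever sees the `javascript:sendForm(...)` strings.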
