python - Python regex doesn't find a substring, but it should

Question

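I'm trying to parse HTML with BeautifulSoup to extract the title of a web page. Sometimes this doesn't work because the site is badly written, for example a bad end tag. When it doesn't work, I fall back to a manual regex.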

I have this text:

<html xmlns="http://www.w3.org/1999/xhtml"\n      xmlns:og="http://ogp.me/ns#"\n      xmlns:fb="https://www.facebook.com/2008/fbml">\n<head>\n    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>\n    <title>\n                    .@wolfblitzercnn prepping questions for the Cheney intvw. @CNNSitRoom today. 5p. \n            </title>\n    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />...

I'm trying to get the value between the <title> and </title> tags. It should be fairly simple, but it isn't working. Here is my Python code.

result = re.search('\<title\>(.+?)\</title\>', html)
if result is not None:
    title = result.group(0)

For whatever reason, this doesn't work on this text. It returns result.group() as None, or I get an AttributeError. AttributeError: 'NoneType' object has no attribute 'group'

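I've copied and pasted this text into an online Python regex tester and tried all the options (re.match, re.findall, re.search). They work there, but for whatever reason my script can't find anything between those tags. I've even tried other regexes, such as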

<title>(.*?)</title>

etc.

score 5 · Accepted Answer

You should also use the DOTALL flag so that . matches newlines.

result = re.search('\<title\>(.+?)\</title\>', html, re.DOTALL)

As the documentation says:

...without this flag, '.' will match anything except a newline
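
A minimal sketch of the difference, assuming a shortened stand-in for the HTML in the question (the `html` string below is illustrative, not the full page). It also uses `group(1)`, which returns only the captured text, whereas `group(0)` returns the whole match including the tags:

```python
import re

# Shortened stand-in for the page in the question; the title spans several lines.
html = "<head>\n    <title>\n        prepping questions for the Cheney intvw.\n    </title>\n</head>"

pattern = r"<title>(.+?)</title>"

# Without re.DOTALL, '.' cannot cross the newline right after <title>, so nothing matches.
no_flag = re.search(pattern, html)

# With re.DOTALL, '.' also matches '\n', so the multi-line title is found.
with_flag = re.search(pattern, html, re.DOTALL)

# group(1) is the text between the tags; strip() drops the surrounding whitespace.
title = with_flag.group(1).strip() if with_flag else None

print(no_flag)  # None
print(title)    # prepping questions for the Cheney intvw.
```

This would also explain why the pattern seemed to work in an online tester: if the pasted text had the title on a single line, or the tester had a "dot matches newline" option enabled, the match would succeed there.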

score 2 · Accepted Answer

If you want to get the text between the <title> and </title> tags, you should use this regex:

pattern = "<title>([^<]+)</title>"

re.findall(pattern, html_string)
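
A short sketch of why this works without any flag, assuming a made-up `html_string` for illustration: the negated class `[^<]` matches any character other than `<`, newlines included. The trade-off is that a title containing a literal `<` would not match.

```python
import re

# Illustrative input; the title is spread over several lines.
html_string = "<head>\n<title>\n    Some page title \n</title>\n</head>"

pattern = "<title>([^<]+)</title>"

# findall returns the captured group for each match; strip the padding whitespace.
titles = [t.strip() for t in re.findall(pattern, html_string)]
print(titles)  # ['Some page title']

# Limitation: a '<' inside the title stops [^<]+ early, so nothing is found.
no_match = re.findall(pattern, "<title>a < b</title>")
print(no_match)  # []
```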
