I am parsing HTML with BeautifulSoup. Given the following HTML:
<!DOCTYPE html>
<html>
<body>
<p>An absolute URL: https://www.w3schools.com</p>
</body>
</html>
I would like to convert it to:
<!DOCTYPE html>
<html>
<body>
<p>An absolute URL: <a href="https://www.w3schools.com" target="_blank">https://www.w3schools.com</a></p>
</body>
</html>
The code I have written so far:
def detect_urls_and_update_target(self, root):  # root is the soup object
    for tag in root.find_all(True):
        if tag.name == 'a':
            if not tag.has_attr('target'):
                tag.attrs['target'] = '_blank'
        elif tag.string is not None:
            # self.url_regex detects URLs; this part works
            for url in re.findall(self.url_regex, tag.string):
                new_tag = root.new_tag("a", href=url, target="_blank")
                new_tag.string = url
                tag.append(new_tag)
This adds the required anchor tag, but I can't figure out how to remove the original URL from the tag.
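
One possible direction, offered as a rough sketch rather than a definitive fix: instead of appending the new tag to the parent, split the text node around each URL match, insert the pieces before the original text node, and then remove that node. The names `URL_RE` and `linkify` are hypothetical, and the pattern below is a deliberately simple stand-in for `self.url_regex`, which is not shown in the question.

```python
import re

from bs4 import BeautifulSoup, NavigableString

# Hypothetical, simplified URL pattern (an assumption); substitute your own regex.
# Note that it will also swallow trailing punctuation such as a final period.
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def linkify(soup):
    """Wrap bare URLs found in text nodes in <a target="_blank"> tags."""
    # Snapshot the text nodes first, because the tree is modified in the loop.
    for text_node in list(soup.find_all(string=True)):
        # Skip special nodes (doctype, comments, ...) and text already in a link.
        if type(text_node) is not NavigableString:
            continue
        if text_node.find_parent("a") is not None:
            continue

        text = str(text_node)
        pieces = []
        last = 0
        for match in URL_RE.finditer(text):
            # Keep the plain text that precedes this URL.
            if match.start() > last:
                pieces.append(NavigableString(text[last:match.start()]))
            link = soup.new_tag("a", href=match.group(0), target="_blank")
            link.string = match.group(0)
            pieces.append(link)
            last = match.end()

        if not pieces:
            continue  # no URL in this text node
        # Keep any plain text that follows the last URL.
        if last < len(text):
            pieces.append(NavigableString(text[last:]))

        # Put the new pieces in front of the old node, then drop the old node,
        # so the raw URL text no longer appears next to the link.
        for piece in pieces:
            text_node.insert_before(piece)
        text_node.extract()

    # Anchors without a target get target="_blank", as in the question's code.
    for anchor in soup.find_all("a"):
        if not anchor.has_attr("target"):
            anchor["target"] = "_blank"
    return soup


html = "<html><body><p>An absolute URL: https://www.w3schools.com</p></body></html>"
soup = linkify(BeautifulSoup(html, "html.parser"))
print(soup.p)
```

Working on text nodes rather than on `tag.string` also covers tags that mix text with child elements, where `tag.string` is `None`.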