python - Parsing HTML to edit links

Question

I am trying to parse an HTML file (demo.html) so that all relative links become absolute. This is how I tried to do it in a Python script:

from bs4 import BeautifulSoup
f = open('demo.html', 'r')
html_text = f.read()
f.close()
soup = BeautifulSoup(html_text)
for a in soup.findAll('a'):
    for x in a.attrs:
        if x == 'href':
            temp = a[x]
            a[x] = "http://www.esplanade.com.sg" + temp
for a in soup.findAll('link'):
    for x in a.attrs:
        if x == 'href':
            temp = a[x]
            a[x] = "http://www.esplanade.com.sg" + temp
for a in soup.findAll('script'):
    for x in a.attrs:
        if x == 'src':
            temp = a[x]
            a[x] = "http://www.esplanade.com.sg" + temp
f = open("demo_result.html", "w")
f.write(soup.prettify().encode("utf-8"))

However, the output file demo_result.html contains many unexpected changes. For example,

<script type="text/javascript" src="/scripts/ddtabmenu.js" />  
/***********************************************
* DD Tab Menu script- (c) Dynamic Drive DHTML code library (www.dynamicdrive.com)
* + Drop Down/ Overlapping Content- 
* This notice MUST stay intact for legal use
* Visit Dynamic Drive at http://www.dynamicdrive.com/ for full source code
***********************************************/   
</script>

is changed to

 <script src="http://www.esplanade.com.sg/scripts/ddtabmenu.js" type="text/javascript">
 </script>
</head>
<body>
 <p>
  /***********************************************
   * DD Tab Menu script- (c) Dynamic Drive DHTML code library (www.dynamicdrive.com)
   * + Drop Down/ Overlapping Content- 
   * This notice MUST stay intact for legal use
   * Visit Dynamic Drive at http://www.dynamicdrive.com/ for full source code
   ***********************************************/

Can someone tell me where I went wrong?

Thanks and warmest regards.

score 1 · Accepted Answer

It seems to be a problem with Beautiful Soup 4. Just downgrade Beautiful Soup to version 3 and your problem will be solved.

import  BeautifulSoup      #This is version 3 not version 4
f = open('demo.html', 'r')
html_text = f.read()
f.close()
soup = BeautifulSoup.BeautifulSoup(html_text)
print soup.contents
for a in soup.findAll('a'):
    for x in a.attrs:
        if x == 'href':
            temp = a[x]
            a[x] = "http://www.esplanade.com.sg" + temp
for a in soup.findAll('link'):
    for x in a.attrs:
        if x == 'href':
            temp = a[x]
            a[x] = "http://www.esplanade.com.sg" + temp
for a in soup.findAll('script'):
    for x in a.attrs:
        if x == 'src':
            temp = a[x]
            a[x] = "http://www.esplanade.com.sg" + temp
f = open("demo_result.html", "w")
f.write(soup.prettify().encode("utf-8"))

score 0 · Accepted Answer

Your HTML is a bit messy. You have already closed the script tag with />, and then you close it again:

<script type="text/javascript" src="/scripts/ddtabmenu.js" /></script>

That breaks the DOM. Just remove the / from the end of <script type="text/javascript" src="/scripts/ddtabmenu.js" />
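
If editing demo.html by hand is not practical, the same fix could be applied to the text before it is handed to the parser. The following is only a minimal sketch: the function name fix_self_closed_scripts and the regular expression are my own, not part of any library, and a regex over HTML is brittle (it assumes no ">" appears inside an attribute value of the script tag).

```python
import re

# Hypothetical helper, not part of BeautifulSoup: matches an opening
# <script ...> tag that was written in self-closing form (<script ... />).
SELF_CLOSED_SCRIPT = re.compile(r"<script\b([^>]*?)\s*/>", re.IGNORECASE)


def fix_self_closed_scripts(html_text):
    """Drop the trailing "/" from self-closed <script> tags."""
    # \1 keeps the attributes exactly as they were written.
    return SELF_CLOSED_SCRIPT.sub(r"<script\1>", html_text)


if __name__ == "__main__":
    broken = '<script type="text/javascript" src="/scripts/ddtabmenu.js" /></script>'
    print(fix_self_closed_scripts(broken))
```

The cleaned string can then be passed to BeautifulSoup in place of the raw file contents. Script tags that are already written correctly are left untouched.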

score 0 · Accepted Answer

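As mentioned, going back to BeautifulSoup 3 solves the problem. In addition, prepending the URL like this causes problems for HTML anchors and javascript references, so I changed the code: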

import re
import BeautifulSoup

with open("demo.html", "r") as file_h:
    soup = BeautifulSoup.BeautifulSoup(file_h.read())

url = "http://www.esplanade.com.sg/"
health_check = lambda x: bool(re.search("^(?!javascript:|http://)[/\w]", x))
replacer = lambda x: re.sub("^(%s)?/?" % url, url, x)

for soup_tag in soup.findAll(lambda x: x.name  in ["a", "img", "link", "script"]):

    if(soup_tag.has_key("href") and  health_check(soup_tag["href"])):
        soup_tag["href"] = replacer(soup_tag["href"])

    if(soup_tag.has_key("src") and health_check(soup_tag["src"])):
        soup_tag["src"] = replacer(soup_tag["src"])

with open("demo_result.html", "w") as file_h:
    file_h.write(soup.prettify().encode("utf-8"))
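
As an alternative to the hand-written regular expressions, the standard library can do the joining. This is a sketch under a few assumptions: it targets Python 3, where urljoin and urlsplit live in urllib.parse (in Python 2 they are in the urlparse module), and the helper name make_absolute is mine. It leaves in-page anchors and anything that already has a scheme (http:, javascript:, mailto:) untouched, which is the same concern the health check above addresses.

```python
from urllib.parse import urljoin, urlsplit

BASE_URL = "http://www.esplanade.com.sg/"


def make_absolute(link, base=BASE_URL):
    """Return an absolute URL for link, leaving special links untouched.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    # Empty values and in-page anchors such as "#top" stay as they are.
    if not link or link.startswith("#"):
        return link
    # Anything with its own scheme (http:, javascript:, mailto:) is
    # already absolute or is not a path at all.
    if urlsplit(link).scheme:
        return link
    # Relative paths, with or without a leading slash, are resolved
    # against the base URL.
    return urljoin(base, link)


if __name__ == "__main__":
    for sample in ["/scripts/ddtabmenu.js", "#top", "javascript:void(0)"]:
        print(sample, "->", make_absolute(sample))
```

Inside the loop over tags, the assignment would then become something like soup_tag["href"] = make_absolute(soup_tag["href"]), with no separate health check needed.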
