python - Remove all internal links from a bunch of .html files with Python string replacement

Question

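I want to remove all internal links from a bunch of .html files. The basic idea is that anything starting with <a href= is a link, and if it does not start with <a href="http it is an internal link.

I'm trying to write a small Python script to do this. Right now it handles the first half of each file perfectly, but it always breaks down on the same link. I have of course checked for typos or a missing </a>, but I don't see any. If I rerun the script, the "problem link" is removed, but its </a> remains. More and more links seem to get removed each time I rerun the script, but I want all internal links stripped in a single run.

Does anyone have a suggestion as to what I'm doing wrong? See below for the code I'm using.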

tList = [r"D:\@work\projects_2013\@websites\pythonforspss\a44\@select-variables-having-pattern-in-names.html"]
for path in tList:
    readFil = open(path,"r")
    writeFil = open(path[:path.rfind("\\") +1] + "@" + path[path.rfind("\\") + 1:],"w")
    flag = 0
    for line in readFil:
        for ind in range(len(line)):
            if flag == 0:
                try:
                    if line[ind:ind + 8].lower() == '<a href=' and line[ind:ind + 13].lower() != '<a href="http':
                      flag = 1
                      sLine = line[ind:]
                      link = sLine[:sLine.find(">") + 1]
                      line = line.replace(link,"")
                      print link
                except:
                    pass
            if flag == 1:
                try:
                    if line[ind:ind + 4].lower() == '</a>':
                        flag = 0
                        line = line.replace('</a>',"")
                        print "</a>"
                except:
                    pass
        writeFil.write(line)
    readFil.close()
    writeFil.close()

score 1 · Accepted Answer

Use an HTML parser such as BeautifulSoup or lxml. With lxml, you might do something like this:

import lxml.html as LH

url = 'http://stackoverflow.com/q/15186769/190597'
doc = LH.parse(url)

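# Save a copy of the original just to compare with the altered version, below.
# LH.tostring returns bytes, so the files are opened in binary mode.
with open('/tmp/orig.html', 'wb') as f:
    f.write(LH.tostring(doc))

# Select every <a> whose href does not start with "http" (this also matches
# <a> tags that have no href at all) and remove the whole element.
# Note: remove() discards the link text and the tail text that follows the tag.
for atag in doc.xpath('//a[not(starts-with(@href,"http"))]'):
    parent = atag.getparent()
    parent.remove(atag)

with open('/tmp/altered.html', 'wb') as f:
    f.write(LH.tostring(doc))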

The equivalent in BeautifulSoup looks like this:

import bs4 as bs
import urllib2

url = 'http://stackoverflow.com/q/15186769/190597'
soup = bs.BeautifulSoup(urllib2.urlopen(url))

with open('/tmp/orig.html', 'w') as f:
    f.write(str(soup))

for atag in soup.find_all('a', {'href':True}):
    if not atag['href'].startswith('http'):
        atag.extract()

with open('/tmp/altered.html', 'w') as f:
    f.write(str(soup))
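
As a complement to the two snippets above, here is a minimal sketch that removes only the internal <a ...> and </a> tags and keeps the link text, which is what the question's script attempts. It is not the code from this answer: it swaps in the standard library's html.parser (Python 3) so it runs without third-party packages. The names is_internal, InternalLinkStripper and strip_internal_links are made up for illustration, and the "does not start with http" rule is taken from the question. Processing instructions and other unusual markup are not handled.

```python
from html.parser import HTMLParser


def is_internal(href):
    """Question's rule: an href that does not start with 'http' is internal."""
    return not href.lower().startswith("http")


class InternalLinkStripper(HTMLParser):
    """Re-emit the HTML as parsed, but unwrap <a> tags that point to internal links."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written
        super().__init__(convert_charrefs=False)
        self.out = []
        self.dropped = []  # one flag per open <a>: was its start tag dropped?

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            drop = href is not None and is_internal(href)
            self.dropped.append(drop)
            if drop:
                return  # skip the opening tag, keep whatever follows
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags are passed through untouched
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Drop </a> only if the matching opening tag was dropped
        if tag == "a" and self.dropped and self.dropped.pop():
            return
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def strip_internal_links(html):
    """Return html with internal links unwrapped, in a single pass."""
    parser = InternalLinkStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(strip_internal_links('<a href="a.html">A</a> <a href="http://example.com">B</a>'))
```

To process a batch of files, one could read each file, pass its text through strip_internal_links, and write the result to a new file, as the question's script does with its "@"-prefixed copies. Because each closing tag is paired with its own opening tag, the </a> of an external link is left in place.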

score 0 · Accepted Answer

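import re
import requests as req  # aliases inferred from the names used below
from bs4 import BeautifulSoup as BS

query = input('Enter the word to be searched:')
url = 'https://google.com/search?q=' + query
request_result = req.get(url).text
soup = BS(request_result, 'lxml')
# Keep only <a> tags whose href contains "https://"
for link in soup.find_all('a', href=re.compile("https://")):
    print(link['href'].replace("/url?q=", ""))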

I used the code above with Beautiful Soup and succeeded in returning only the https links.
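
To make the filter in that snippet concrete: Beautiful Soup tests a compiled pattern against the attribute value with its search() method, so "https://" may appear anywhere in the href, not only at the start. The sketch below reproduces just that selection step on a hand-written list of hrefs, using only the re module. The sample hrefs and the helper name https_links are assumptions for illustration, not output from a real search page.

```python
import re

# Same pattern as in the snippet above
HTTPS_PATTERN = re.compile("https://")


def https_links(hrefs):
    """Keep hrefs that contain 'https://' anywhere, then strip the '/url?q=' prefix."""
    return [h.replace("/url?q=", "") for h in hrefs if HTTPS_PATTERN.search(h)]


# Hypothetical href values, chosen to show which ones pass the filter
sample = [
    "/url?q=https://example.com/page",  # redirect-style link: kept, prefix stripped
    "https://example.org/",             # plain https link: kept
    "/search?q=python",                 # relative (internal) link: skipped
    "http://example.net/",              # http without the "s": skipped
]

print(https_links(sample))
```

Note that this selects links for printing and leaves the document itself unchanged, so it does not remove internal links from a file.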

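I tried the solution posted above, but it did not work for me; in fact, the number of links I got back was greatly reduced after using that code.

Hope this helps!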
