
So, I have a list of web pages that I want to save as PDFs. They look like http://nptel.ac.in/courses/115103028/module1/lec1/3.html. The list is long, which is why I am using Python to automate the process. Here is my code:

import pdfkit
import urllib2

# Download the page
page = urllib2.urlopen('http://nptel.ac.in/courses/115103028/module1/lec1/3.html')
page_content = page.read()

# Save it to a temporary HTML file
with open('page_content.html', 'w') as fid:
    fid.write(page_content)

# Drop every line containing ".html" to get rid of the
# page-number links at the top and bottom of the page
txt = open("page_content.html").read().split("\n")

txt1 = ""
for i in txt:
    if ".html" not in i:
        txt1 += i + "\n"

with open('page_content.html', "w") as f:
    f.write(txt1)

# Render the cleaned file to PDF with wkhtmltopdf
config = pdfkit.configuration(wkhtmltopdf=r"C:\Program Files (x86)\wkhtmltopdf\bin\wkhtmltopdf.exe")
pdfkit.from_file('page_content.html', 'out.pdf', configuration=config)

But the PDF I get contains only the text and none of the equation images. How can I fix this? Also, I open the file a second time just to remove the page numbers at the top and bottom of the page; I would appreciate help improving that part as well.

Edit:

Here is the code I am using now:

import os.path, pdfkit, bs4, urllib2, sys
reload(sys)
sys.setdefaultencoding('utf8')

url = 'http://nptel.ac.in/courses/115103028/module1/lec1/3.html'
directory, filename = os.path.split(url)

# Download the page and turn relative src/href paths into absolute URLs
html_text = urllib2.urlopen(url).read()
html_text = html_text.replace('src="', 'src="' + directory + "/") \
                     .replace('href="', 'href="' + directory + "/")

# Remove the pagination links (the numbers at the top and bottom of the page)
page = bs4.BeautifulSoup(html_text, "html5lib")
for ul in page.findAll("ul", {"id": "pagin"}):
    ul.extract()  # Deletes the tag and everything inside it

html_text = str(page)

config = pdfkit.configuration(wkhtmltopdf=r"C:\Program Files (x86)\wkhtmltopdf\bin\wkhtmltopdf.exe")
pdfkit.from_string(html_text, "out.pdf", configuration=config)

It still shows those errors (part of the message is below), and the output PDF has no images:

Loading pages (1/6)
Warning: Failed to load http://nptel.ac.in/courses/115103028/css/style.css (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image041.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image042.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image043.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image045.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image046.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image048.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image049.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image050.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image051.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image052.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image053.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image054.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image055.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image056.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image057.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image064.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image065.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image067.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image068.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image069.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image070.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image071.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image072.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image073.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image074.png (ignore)
Warning: Failed to load file:///C:/Users/KOUSHI~1/AppData/images/1h.jpg (ignore)
Warning: Failed to load file:///C:/Users/KOUSHI~1/AppData/images/2h.jpg (ignore)
Warning: Failed to load file:///C:/Users/KOUSHI~1/AppData/images/3h.jpg (ignore)
Warning: Failed to load file:///C:/Users/KOUSHI~1/AppData/images/4h.jpg (ignore)
Warning: Failed to load file:///C:/Users/KOUSHI~1/AppData/images/5h.jpg (ignore)
Warning: Failed to load file:///C:/Users/KOUSHI~1/AppData/images/6h.jpg (ignore)
Warning: Failed to load file:///C:/Users/KOUSHI~1/AppData/images/7h.jpg (ignore)
Warning: Failed to load file:///C:/Users/KOUSHI~1/AppData/images/8h.jpg (ignore)
Warning: Failed to load file:///C:/Users/KOUSHI~1/AppData/images/9h.jpg (ignore)
Warning: Failed to load file:///C:/Users/KOUSHI~1/AppData/images/10h.jpg (ignore)
Counting pages (2/6)
Resolving links (4/6)
Loading headers and footers (5/6)
Printing pages (6/6)
Done

1 Answer


When I run your code, pdfkit prints a lot of warnings like this one:

Warning: Failed to load file:///C:/Users/.../images/image041.png (ignore)

pdfkit is trying to find the site's images on my computer, and since I have not downloaded them, it cannot find them. A small trick to fix that is to turn the relative paths in the HTML source into absolute ones:

import os.path
import urllib2

url = 'http://nptel.ac.in/courses/115103028/module1/lec1/3.html'

# Split the URL into the directory part and the file name
directory, filename = os.path.split(url)

html_text = urllib2.urlopen(url).read()

# Prefix relative src/href paths with the directory of the page
html_text = html_text.replace('src="', 'src="' + directory + "/") \
                     .replace('href="', 'href="' + directory + "/")

directory is the directory where the page lives, in this example http://nptel.ac.in/courses/115103028/module1/lec1, so that

<img src="images/image041.png" width="63" height="21">

becomes

<img src="http://nptel.ac.in/courses/115103028/module1/lec1/images/image041.png" width="63" height="21">

Now you can use pdfkit.from_string instead of pdfkit.from_file to create the PDF without writing a temporary file:

pdfkit.from_string(html_text, "out.pdf", configuration=config)

To remove the links to the other pages (shown as numbers) from the top and bottom of the site, you have many options. My favourite is to use BeautifulSoup to find the ul tags with id="pagin". These tags contain the links to the other pages, and you can delete them:

import bs4

page = bs4.BeautifulSoup(html_text)
for ul in page.findAll("ul", {"id":"pagin"}):
    ul.extract() # Deletes the tag and everything inside it

html_text = unicode(page)

Now html_text no longer contains those unwanted links. To install BeautifulSoup, just use pip: python -m pip install bs4

This solution obviously only works if all of your pages are structured this way. If they are not, you could also remove all a tags to get rid of the links, as sketched below, but be careful not to delete information you want to keep.
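For reference, a minimal sketch of that coarser approach could look like the following; it assumes none of the link text on your pages is worth keeping, which you would have to verify yourself:

import bs4

page = bs4.BeautifulSoup(html_text, "html5lib")

# Remove every <a> tag together with its contents.
# Use a.unwrap() instead of a.extract() if you want to keep the link text.
for a in page.findAll("a"):
    a.extract()

html_text = unicode(page)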

answered 2017-04-01T15:24:52.773