c# - 如何使用 iTextSharp 将 HTML 转换为 PDF

Question

我想使用 iTextSharp 将以下 HTML 转换为 PDF，但不知道从哪里开始：

<style>
.headline{font-size:200%}
</style>
<p>
  This <em>is </em>
  <span class="headline" style="text-decoration: underline;">some</span>
  <strong>sample<em> text</em></strong>
  <span style="color: red;">!!!</span>
</p>

score 157 · Accepted Answer

首先，虽然 HTML 和 PDF 是在同一时间创建的，但它们并不相关。HTML 旨在传达更高级别的信息，例如段落和表格。虽然有一些方法可以控制它，但最终要由浏览器来绘制这些更高级别的概念。PDF 旨在传达文档，并且无论呈现在何处，文档都必须“看起来”相同。

在 HTML 文档中，您可能有一个 100% 宽的段落，根据显示器的宽度，可能需要 2 行或 10 行，打印时可能需要 7 行，在手机上查看时可能走20行。然而，PDF 文件必须独立于渲染设备，因此无论您的屏幕大小如何，它都必须始终以完全相同的方式渲染。

由于上述要求，PDF 不支持“表格”或“段落”之类的抽象内容。PDF 支持三种基本的东西：文本、线条/形状和图像。（还有其他的东西，比如注释和电影，但我在这里尽量保持简单。）在 PDF 中，你不会说“这是一段，浏览器做你的事！”。相反，您会说，“使用这个确切的字体在这个确切的 X、Y 位置绘制这个文本，不用担心，我之前已经计算了文本的宽度，所以我知道它都适合这条线”。您也不会说“这是一张桌子”，而是说“在这个确切位置绘制此文本，然后在我的另一个确切位置绘制一个矩形”

其次，iText 和 iTextSharp 解析 HTML 和 CSS。而已。ASP.Net、MVC、Razor、Struts、Spring 等都是 HTML 框架，但 iText/iTextSharp 100% 不知道它们。与 DataGridViews、Repeater、模板、视图等相同，它们都是特定于框架的抽象。从您选择的框架中获取 HTML是您的责任，iText 不会帮助您。如果你得到一个例外，The document has no pages或者你认为“iText 没有解析我的 HTML”，那么几乎可以肯定你实际上没有 HTML，你只是认为你有。

第三，已经存在多年的内置类HTMLWorker被替换为XMLWorker（Java / .Net）。零工作正在进行HTMLWorker，不支持 CSS 文件，仅对最基本的 CSS 属性提供有限支持，实际上在某些标签上中断。如果您在此文件中没有看到HTML 属性或 CSS 属性和值，那么它可能不受HTMLWorker. XMLWorker有时可能会更复杂，但这些复杂性也使其更具可扩展性。

下面是 C# 代码，展示了如何将 HTML 标记解析为 iText 抽象，这些抽象会自动添加到您正在处理的文档中。C# 和 Java 非常相似，因此转换它应该相对容易。Example #1 使用内置的HTMLWorker来解析 HTML 字符串。由于仅支持内联样式，因此class="headline"会被忽略，但其他所有内容都应该可以正常工作。Example #2 与第一个相同，只是它使用了它XMLWorker。Example #3 还解析了简单的 CSS 示例。

//Create a byte array that will eventually hold our final PDF
Byte[] bytes;

//Boilerplate iTextSharp setup here
//Create a stream that we can write to, in this case a MemoryStream
using (var ms = new MemoryStream()) {

    //Create an iTextSharp Document which is an abstraction of a PDF but **NOT** a PDF
    using (var doc = new Document()) {

        //Create a writer that's bound to our PDF abstraction and our stream
        using (var writer = PdfWriter.GetInstance(doc, ms)) {

            //Open the document for writing
            doc.Open();

            //Our sample HTML and CSS
            var example_html = @"<p>This <em>is </em><span class=""headline"" style=""text-decoration: underline;"">some</span> <strong>sample <em> text</em></strong><span style=""color: red;"">!!!</span></p>";
            var example_css = @".headline{font-size:200%}";

            /**************************************************
             * Example #1                                     *
             *                                                *
             * Use the built-in HTMLWorker to parse the HTML. *
             * Only inline CSS is supported.                  *
             * ************************************************/

            //Create a new HTMLWorker bound to our document
            using (var htmlWorker = new iTextSharp.text.html.simpleparser.HTMLWorker(doc)) {

                //HTMLWorker doesn't read a string directly but instead needs a TextReader (which StringReader subclasses)
                using (var sr = new StringReader(example_html)) {

                    //Parse the HTML
                    htmlWorker.Parse(sr);
                }
            }

            /**************************************************
             * Example #2                                     *
             *                                                *
             * Use the XMLWorker to parse the HTML.           *
             * Only inline CSS and absolutely linked          *
             * CSS is supported                               *
             * ************************************************/

            //XMLWorker also reads from a TextReader and not directly from a string
            using (var srHtml = new StringReader(example_html)) {

                //Parse the HTML
                iTextSharp.tool.xml.XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, srHtml);
            }

            /**************************************************
             * Example #3                                     *
             *                                                *
             * Use the XMLWorker to parse HTML and CSS        *
             * ************************************************/

            //In order to read CSS as a string we need to switch to a different constructor
            //that takes Streams instead of TextReaders.
            //Below we convert the strings into UTF8 byte array and wrap those in MemoryStreams
            using (var msCss = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(example_css))) {
                using (var msHtml = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(example_html))) {

                    //Parse the HTML
                    iTextSharp.tool.xml.XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, msHtml, msCss);
                }
            }


            doc.Close();
        }
    }

    //After all of the PDF "stuff" above is done and closed but **before** we
    //close the MemoryStream, grab all of the active bytes from the stream
    bytes = ms.ToArray();
}

//Now we just need to do something with those bytes.
//Here I'm writing them to disk but if you were in ASP.Net you might Response.BinaryWrite() them.
//You could also write the bytes to a database in a varbinary() column (but please don't) or you
//could pass them to another function for further PDF processing.
var testFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf");
System.IO.File.WriteAllBytes(testFile, bytes);

2017年的更新

对于 HTML 到 PDF 的需求，有好消息。正如这个答案所示，W3C 标准css-break-3将解决这个问题......这是一个候选推荐，计划在今年经过测试后变成最终推荐。

正如print-css.rocks 所示，有一些解决方案不那么标准，带有 C# 插件。

score 17 · Accepted Answer

截至 2018 年，还有iText7（旧 iTextSharp 库的下一次迭代）及其 HTML 到 PDF 包可用：itext7.pdfhtml

用法很简单：

HtmlConverter.ConvertToPdf(
    new FileInfo(@"Path\to\Html\File.html"),
    new FileInfo(@"Path\to\Pdf\File.pdf")
);

方法有更多的重载。

更新： iText* 系列产品具有双重许可模式：开源免费，商业使用付费。

score 9 · Accepted Answer

@Chris Haas 很好地解释了如何使用itextSharp转换HTML为PDF，我的补充非常有帮助
：
通过使用HtmlTextWriter我将 html 标签放在HTML表格 + 内联 CSS 中，我得到了我想要的 PDF，而无需使用XMLWorker.
编辑：添加示例代码：
ASPX 页面：

<asp:Panel runat="server" ID="PendingOrdersPanel">
 <!-- to be shown on PDF-->
 <table style="border-spacing: 0;border-collapse: collapse;width:100%;display:none;" >
 <tr><td><img src="abc.com/webimages/logo1.png" style="display: none;" width="230" /></td></tr>
<tr style="line-height:10px;height:10px;"><td style="display:none;font-size:9px;color:#10466E;padding:0px;text-align:right;">blablabla.</td></tr>
 <tr style="line-height:10px;height:10px;"><td style="display:none;font-size:9px;color:#10466E;padding:0px;text-align:right;">blablabla.</td></tr>
 <tr style="line-height:10px;height:10px;"><td style="display:none;font-size:9px;color:#10466E;padding:0px;text-align:right;">blablabla</td></tr>
<tr style="line-height:10px;height:10px;"><td style="display:none;font-size:9px;color:#10466E;padding:0px;text-align:right;">blablabla</td></tr>
<tr style="line-height:10px;height:10px;"><td style="display:none;font-size:11px;color:#10466E;padding:0px;text-align:center;"><i>blablabla</i> Pending orders report<br /></td></tr>
 </table>
<asp:GridView runat="server" ID="PendingOrdersGV" RowStyle-Wrap="false" AllowPaging="true" PageSize="10" Width="100%" CssClass="Grid" AlternatingRowStyle-CssClass="alt" AutoGenerateColumns="false"
   PagerStyle-CssClass="pgr" HeaderStyle-ForeColor="White" PagerStyle-HorizontalAlign="Center" HeaderStyle-HorizontalAlign="Center" RowStyle-HorizontalAlign="Center" DataKeyNames="Document#" 
      OnPageIndexChanging="PendingOrdersGV_PageIndexChanging" OnRowDataBound="PendingOrdersGV_RowDataBound" OnRowCommand="PendingOrdersGV_RowCommand">
   <EmptyDataTemplate><div style="text-align:center;">no records found</div></EmptyDataTemplate>
    <Columns>                                           
     <asp:ButtonField CommandName="PendingOrders_Details" DataTextField="Document#" HeaderText="Document #" SortExpression="Document#" ItemStyle-ForeColor="Black" ItemStyle-Font-Underline="true"/>
      <asp:BoundField DataField="Order#" HeaderText="order #" SortExpression="Order#"/>
     <asp:BoundField DataField="Order Date" HeaderText="Order Date" SortExpression="Order Date" DataFormatString="{0:d}"></asp:BoundField> 
    <asp:BoundField DataField="Status" HeaderText="Status" SortExpression="Status"></asp:BoundField>
    <asp:BoundField DataField="Amount" HeaderText="Amount" SortExpression="Amount" DataFormatString="{0:C2}"></asp:BoundField> 
   </Columns>
    </asp:GridView>
</asp:Panel>

C#代码：

protected void PendingOrdersPDF_Click(object sender, EventArgs e)
{
    if (PendingOrdersGV.Rows.Count > 0)
    {
        //to allow paging=false & change style.
        PendingOrdersGV.HeaderStyle.ForeColor = System.Drawing.Color.Black;
        PendingOrdersGV.BorderColor = Color.Gray;
        PendingOrdersGV.Font.Name = "Tahoma";
        PendingOrdersGV.DataSource = clsBP.get_PendingOrders(lbl_BP_Id.Text);
        PendingOrdersGV.AllowPaging = false;
        PendingOrdersGV.Columns[0].Visible = false; //export won't work if there's a link in the gridview
        PendingOrdersGV.DataBind();

        //to PDF code --Sam
        string attachment = "attachment; filename=report.pdf";
        Response.ClearContent();
        Response.AddHeader("content-disposition", attachment);
        Response.ContentType = "application/pdf";
        StringWriter stw = new StringWriter();
        HtmlTextWriter htextw = new HtmlTextWriter(stw);
        htextw.AddStyleAttribute("font-size", "8pt");
        htextw.AddStyleAttribute("color", "Grey");

        PendingOrdersPanel.RenderControl(htextw); //Name of the Panel
        Document document = new Document();
        document = new Document(PageSize.A4, 5, 5, 15, 5);
        FontFactory.GetFont("Tahoma", 50, iTextSharp.text.BaseColor.BLUE);
        PdfWriter.GetInstance(document, Response.OutputStream);
        document.Open();

        StringReader str = new StringReader(stw.ToString());
        HTMLWorker htmlworker = new HTMLWorker(document);
        htmlworker.Parse(str);

        document.Close();
        Response.Write(document);
    }
}

当然包括 iTextSharp 引用到 cs 文件

using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.html.simpleparser;
using iTextSharp.tool.xml;

希望这可以帮助！
谢谢

score 3 · Accepted Answer

我使用以下代码创建 PDF

protected void CreatePDF(Stream stream)
        {
            using (var document = new Document(PageSize.A4, 40, 40, 40, 30))
            {
                var writer = PdfWriter.GetInstance(document, stream);
                writer.PageEvent = new ITextEvents();
                document.Open();

                // instantiate custom tag processor and add to `HtmlPipelineContext`.
                var tagProcessorFactory = Tags.GetHtmlTagProcessorFactory();
                tagProcessorFactory.AddProcessor(
                    new TableProcessor(),
                    new string[] { HTML.Tag.TABLE }
                );

                //Register Fonts.
                XMLWorkerFontProvider fontProvider = new XMLWorkerFontProvider(XMLWorkerFontProvider.DONTLOOKFORFONTS);
                fontProvider.Register(HttpContext.Current.Server.MapPath("~/Content/Fonts/GothamRounded-Medium.ttf"), "Gotham Rounded Medium");
                CssAppliers cssAppliers = new CssAppliersImpl(fontProvider);

                var htmlPipelineContext = new HtmlPipelineContext(cssAppliers);
                htmlPipelineContext.SetTagFactory(tagProcessorFactory);

                var pdfWriterPipeline = new PdfWriterPipeline(document, writer);
                var htmlPipeline = new HtmlPipeline(htmlPipelineContext, pdfWriterPipeline);

                // get an ICssResolver and add the custom CSS
                var cssResolver = XMLWorkerHelper.GetInstance().GetDefaultCssResolver(true);
                cssResolver.AddCss(CSSSource, "utf-8", true);
                var cssResolverPipeline = new CssResolverPipeline(
                    cssResolver, htmlPipeline
                );

                var worker = new XMLWorker(cssResolverPipeline, true);
                var parser = new XMLParser(worker);
                using (var stringReader = new StringReader(HTMLSource))
                {
                    parser.Parse(stringReader);
                    document.Close();
                    HttpContext.Current.Response.ContentType = "application /pdf";
                    if (base.View)
                        HttpContext.Current.Response.AddHeader("content-disposition", "inline;filename=\"" + OutputFileName + ".pdf\"");
                    else
                        HttpContext.Current.Response.AddHeader("content-disposition", "attachment;filename=\"" + OutputFileName + ".pdf\"");
                    HttpContext.Current.Response.Cache.SetCacheability(HttpCacheability.NoCache);
                    HttpContext.Current.Response.WriteFile(OutputPath);
                    HttpContext.Current.Response.End();
                }
            }
        }

score -5 · Accepted Answer

这是我用作指南的链接。希望这可以帮助！

使用 ITextSharp 将 HTML 转换为 PDF

protected void Page_Load(object sender, EventArgs e)
    {
        try
        {
            string strHtml = string.Empty;
            //HTML File path -http://aspnettutorialonline.blogspot.com/
            string htmlFileName = Server.MapPath("~") + "\\files\\" + "ConvertHTMLToPDF.htm";
            //pdf file path. -http://aspnettutorialonline.blogspot.com/
            string pdfFileName = Request.PhysicalApplicationPath + "\\files\\" + "ConvertHTMLToPDF.pdf";

            //reading html code from html file
            FileStream fsHTMLDocument = new FileStream(htmlFileName, FileMode.Open, FileAccess.Read);
            StreamReader srHTMLDocument = new StreamReader(fsHTMLDocument);
            strHtml = srHTMLDocument.ReadToEnd();
            srHTMLDocument.Close();

            strHtml = strHtml.Replace("\r\n", "");
            strHtml = strHtml.Replace("\0", "");

            CreatePDFFromHTMLFile(strHtml, pdfFileName);

            Response.Write("pdf creation successfully with password -http://aspnettutorialonline.blogspot.com/");
        }
        catch (Exception ex)
        {
            Response.Write(ex.Message);
        }
    }
    public void CreatePDFFromHTMLFile(string HtmlStream, string FileName)
    {
        try
        {
            object TargetFile = FileName;
            string ModifiedFileName = string.Empty;
            string FinalFileName = string.Empty;

            /* To add a Password to PDF -http://aspnettutorialonline.blogspot.com/ */
            TestPDF.HtmlToPdfBuilder builder = new TestPDF.HtmlToPdfBuilder(iTextSharp.text.PageSize.A4);
            TestPDF.HtmlPdfPage first = builder.AddPage();
            first.AppendHtml(HtmlStream);
            byte[] file = builder.RenderPdf();
            File.WriteAllBytes(TargetFile.ToString(), file);

            iTextSharp.text.pdf.PdfReader reader = new iTextSharp.text.pdf.PdfReader(TargetFile.ToString());
            ModifiedFileName = TargetFile.ToString();
            ModifiedFileName = ModifiedFileName.Insert(ModifiedFileName.Length - 4, "1");

            string password = "password";
            iTextSharp.text.pdf.PdfEncryptor.Encrypt(reader, new FileStream(ModifiedFileName, FileMode.Append), iTextSharp.text.pdf.PdfWriter.STRENGTH128BITS, password, "", iTextSharp.text.pdf.PdfWriter.AllowPrinting);
            //http://aspnettutorialonline.blogspot.com/
            reader.Close();
            if (File.Exists(TargetFile.ToString()))
                File.Delete(TargetFile.ToString());
            FinalFileName = ModifiedFileName.Remove(ModifiedFileName.Length - 5, 1);
            File.Copy(ModifiedFileName, FinalFileName);
            if (File.Exists(ModifiedFileName))
                File.Delete(ModifiedFileName);

        }
        catch (Exception ex)
        {
            throw ex;
        }
    }

您可以下载示例文件。只需将html要转换的文件放在files文件夹中并运行。它将自动生成pdf文件并将其放在同一文件夹中。htmlFileName但在您的情况下，您可以在变量中指定您的 html 路径。

c# - 如何使用 iTextSharp 将 HTML 转换为 PDF

5 回答 5

2017年的更新

Related

Reference