0

我有一个将图像转换为 word 文档的 OCR 程序。word文档包含所有图像的文本,我想将其拆分为单独的文件。

有没有办法在 c# 中做到这一点?

谢谢

4

3 回答 3

5

其他答案相同,但具有 IEnumerator 和文档的扩展方法。

static class PagesExtension {
    public static IEnumerable<Range> Pages(this Document doc) {
        int pageCount = doc.Range().Information[WdInformation.wdNumberOfPagesInDocument];
        int pageStart = 0;
        for (int currentPageIndex = 1; currentPageIndex <= pageCount; currentPageIndex++) {
            var page = doc.Range(
                pageStart
            );
            if (currentPageIndex < pageCount) {
                //page.GoTo returns a new Range object, leaving the page object unaffected
                page.End = page.GoTo(
                    What: WdGoToItem.wdGoToPage,
                    Which: WdGoToDirection.wdGoToAbsolute,
                    Count: currentPageIndex+1
                ).Start-1;
            } else {
                page.End = doc.Range().End;
            }
            pageStart = page.End + 1;
            yield return page;
        }
        yield break;
    }
}

主要代码最终是这样的:

static void Main(string[] args) {
    var app = new Application();
    app.Visible = true;
    var doc = app.Documents.Open(@"path\to\source\document");
    foreach (var page in doc.Pages()) {
        page.Copy();
        var doc2 = app.Documents.Add();
        doc2.Range().Paste();
    }
}
于 2012-08-02T07:01:59.053 回答
4

如果您安装了 Word,则可以使用 Word 对象模型从 C# 操作 Word 文档。

首先,添加对 Word 对象模型的引用。右键单击项目,然后Add Reference... -> COM -> Microsoft Word 14.0 Object Model(或类似的东西,取决于您的 Word 版本)。

然后,您可以使用以下代码:

using Microsoft.Office.Interop.Word;
//for older versions of Word use:
//using Word;

namespace WordSplitter {
    class Program {
        static void Main(string[] args) {
            //Create a new instance of Word
            var app = new Application();

            //Show the Word instance.
            //If the code runs too slowly, you can show the application at the end of the program
            //Make sure it works properly first; otherwise, you'll get an error in a hidden window
            //(If it still runs too slowly, there are a few other ways to reduce screen updating)
            app.Visible = true;

            //We need a reference to the source document
            //It should be possible to get a reference to an open Word document, but I haven't tried it
            var doc = app.Documents.Open(@"path\to\file.doc");
            //(Can also use .docx)

            int pageCount = doc.Range().Information[WdInformation.wdNumberOfPagesInDocument];

            //We'll hold the start position of each page here
            int pageStart = 0;

            for (int currentPageIndex = 1; currentPageIndex <= pageCount; currentPageIndex++) {
                //This Range object will contain each page.
                var page = doc.Range(pageStart);

                //Generally, the end of the current page is 1 character before the start of the next.
                //However, we need to handle the last page -- since there is no next page, the 
                //GoTo method will move to the *start* of the last page.
                if (currentPageIndex < pageCount) {
                    //page.GoTo returns a new Range object, leaving the page object unaffected
                    page.End = page.GoTo(
                        What: WdGoToItem.wdGoToPage,
                        Which: WdGoToDirection.wdGoToAbsolute,
                        Count: currentPageIndex + 1
                    ).Start - 1;
                } else {
                    page.End = doc.Range().End;
                }
                pageStart = page.End + 1;

                //Copy and paste the contents of the Range into a new document
                page.Copy();
                var doc2 = app.Documents.Add();
                doc2.Range().Paste();
            }
        }
    }
}

参考:MSDN 上的 Word 对象模型概述

于 2012-08-02T06:56:03.267 回答
0

虽然 Word 使用 w:lastRenderedPageBreak 创建文档,但在 Word 文档末尾并不容易。

最好让您的 OCR 程序在每个转换文本块之间的文档中插入一些标记。

然后,根据它是哪种 Word 文档,使用适当的工具处理该文件。

于 2012-08-01T22:18:48.300 回答