vba - How to programmatically iterate over subscripts, superscripts, and equations in a Word document

Question

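I have several Word documents, each containing hundreds of pages of scientific data, including:

Chemical formulae (H2SO4, with all the proper subscripts and superscripts)
Scientific numbers (exponents formatted using superscripts)
Lots of mathematical equations, written using the mathematical equation editor in Word.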

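The problem is that storing this data in Word form is not efficient for us, so we want to store all of this information in a database (MySQL). We want to convert the formatting to LaTeX.

Is there any way to iterate over all the subscripts, superscripts, and equations using VBA?

What about iterating over the mathematical equations?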

score 10 · Accepted Answer

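Based on your comment on Michael's answer:

No! I just want to replace the content of a subscript with _{ subscriptcontent } and, similarly, the content of a superscript with ^{ superscriptcontent }. That would be the TeX equivalent. Then I would just copy everything to a text file, which removes the formatting but keeps these characters. Problem solved. But for that I need access to the subscript and superscript objects of the document.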

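Sub sampler()
    Selection.HomeKey wdStory
    With Selection.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .Text = ""                      'match on formatting only

        'superscript runs: ^^ is a literal caret, ^& is the matched text
        .Font.Superscript = True
        .Replacement.Text = "^^{^&}"
        .Execute Replace:=wdReplaceAll

        'subscript runs: reset the search formatting first
        .ClearFormatting
        .Font.Subscript = True
        .Replacement.Text = "_{^&}"
        .Execute Replace:=wdReplaceAll
    End With
End Sub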
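
Since the question asks about iterating and not only replacing, here is a minimal sketch of the same format-only search driven from a Range, so that each superscript run can be visited one at a time. The name ListSuperscriptRuns is hypothetical and not part of the answer above; it only prints each run to the Immediate window and changes nothing in the document.

```vba
Sub ListSuperscriptRuns()
    Dim rng As Range
    Dim docEnd As Long

    Set rng = ActiveDocument.Content
    docEnd = rng.End

    With rng.Find
        .ClearFormatting
        .Text = ""                  ' match on formatting only
        .Font.Superscript = True    ' use .Font.Subscript for subscripts
        .Format = True
        .Forward = True
        .Wrap = wdFindStop          ' stop at the end instead of wrapping around
        Do While .Execute
            ' rng now covers one contiguous superscript run
            Debug.Print rng.Start; rng.End; rng.Text
            If rng.End >= docEnd Then Exit Do
            rng.Collapse wdCollapseEnd   ' continue searching after this run
        Loop
    End With
End Sub
```

Working on a Range instead of the Selection leaves the cursor where it was, which is usually more convenient when the loop body does more than print.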

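Edit

Alternatively, if you also want to convert the OMaths to TeX / LaTeX, proceed as follows:

Iterate over the OMaths > convert each one to MathML > [save the MathML to disk] + [put a marker in the document, in place of the OMath, that references the MathML file] > convert the Word file to text
Now prepare a converter such as MathParser and convert the MathML files to LaTeX.
Parse the text file > search and replace with the LaTeX code accordingly.

For a completely different idea, have a look at David Carlisle's blog, which may interest you.

Update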

The module

Option Explicit

'This module requires the following references:
'Microsoft Scripting Runtime
'Microsoft XML, v6.0

Private fso As New Scripting.FileSystemObject
Private omml2mml$, mml2Tex$

Public Function ProcessFile(fpath$) As Boolean
    'path to OMML2MML.XSL on my system (may differ on yours):
    omml2mml = "c:\program files\microsoft office\office14\omml2mml.xsl"
    'download: http://prdownloads.sourceforge.net/xsltml/xsltml_2.0.zip
    'unzip at «c:\xsltml_2.0»
    mml2Tex = "c:\xsltml_2.0\mmltex.xsl"

    Documents.Open fpath

    'Emphasis, superscript and subscript
    Selection.HomeKey wdStory
    With Selection.Find
        .ClearFormatting
        .Replacement.ClearFormatting

        'make sure no paragraph mark carries any emphasis
        .Text = "^p"
        .Replacement.Text = "^&"
        .Replacement.Font.Italic = False
        .Replacement.Font.Bold = False
        .Replacement.Font.Superscript = False
        .Replacement.Font.Subscript = False
        .Replacement.Font.SmallCaps = False
        .Execute Replace:=wdReplaceAll

        'from here on match on formatting only and leave the formatting of the replaced text alone
        .Text = ""
        .Replacement.ClearFormatting

        'clear the search formatting before each pass so the criteria do not accumulate
        .ClearFormatting
        .Font.Italic = True
        .Replacement.Text = "\textit{^&}"
        .Execute Replace:=wdReplaceAll

        .ClearFormatting
        .Font.Bold = True
        .Replacement.Text = "\textbf{^&}"
        .Execute Replace:=wdReplaceAll

        .ClearFormatting
        .Font.SmallCaps = True
        .Replacement.Text = "\textsc{^&}"
        .Execute Replace:=wdReplaceAll

        '^^ is a literal caret, ^& is the matched text
        .ClearFormatting
        .Font.Superscript = True
        .Replacement.Text = "^^{^&}"
        .Execute Replace:=wdReplaceAll

        .ClearFormatting
        .Font.Subscript = True
        .Replacement.Text = "_{^&}"
        .Execute Replace:=wdReplaceAll
    End With

    Dim dict As New Scripting.Dictionary
    Dim om As OMath, t, counter&, key$
    key = Replace(LCase(Dir(fpath)), " ", "_omath_")
    counter = 0

    For Each om In ActiveDocument.OMaths
        DoEvents
        counter = counter + 1
        Dim tKey$, texCode$
        tKey = "<" & key & "_" & counter & ">"
        t = om.Range.WordOpenXML

        texCode = TransformString(TransformString(CStr(t), omml2mml), mml2Tex)
        om.Range.Select
        Selection.Delete
        Selection.Text = tKey

        dict.Add tKey, texCode

    Next om

    Dim latexDoc$, oPath$
    latexDoc = "\documentclass[10pt]{article}" & vbCrLf & _
                "\usepackage[utf8]{inputenc} % set input encoding" & vbCrLf & _
                "\usepackage{amsmath,amssymb}" & vbCrLf & _
                "\begin{document}" & vbCrLf & _
                "###" & vbCrLf & _
                "\end{document}"

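    oPath = StrReverse(Mid(StrReverse(fpath), InStr(StrReverse(fpath), "."))) & "tex"
    'Save as plain text so that oPath exists when it is read back below.
    'Assumption: UTF-8 (65001) to match \usepackage[utf8]{inputenc}; use Encoding:=1200 for UTF-16.
    'Note: FileSystemObject reads and writes the file as ANSI by default, so non-ASCII characters may need separate handling.
    ActiveDocument.SaveAs FileName:=oPath, FileFormat:=wdFormatText, Encoding:=65001
    ActiveDocument.Close SaveChanges:=wdDoNotSaveChanges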

    Dim c$, i
    c = fso.OpenTextFile(oPath).ReadAll()

    counter = 0

    For Each i In dict
        counter = counter + 1
        Dim findText$, replaceWith$
        findText = CStr(i)
        replaceWith = dict.item(i)
        c = Replace(c, findText, replaceWith, 1, 1, vbTextCompare)
    Next i

    latexDoc = Replace(latexDoc, "###", c)

    Dim ost As TextStream
    Set ost = fso.CreateTextFile(oPath, True)   'True: overwrite the text file saved above
    ost.Write latexDoc
    ost.Close

    ProcessFile = True


End Function

Private Function CreateDOM()
    Dim dom As New DOMDocument60
    With dom
        .async = False
        .validateOnParse = False
        .resolveExternals = False
    End With
    Set CreateDOM = dom
End Function

Private Function TransformString(xmlString$, xslPath$) As String
    Dim xml, xsl, out
    Set xml = CreateDOM
    xml.LoadXML xmlString
    Set xsl = CreateDOM
    xsl.Load xslPath
    out = xml.transformNode(xsl)
    TransformString = out
End Function
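
TransformString silently returns whatever the transform produces, even when the input XML or the stylesheet failed to load. A minimal sketch of a variant that reports such failures is shown below. The name TransformStringChecked and the error numbers are hypothetical; it assumes it is pasted into the same module, so that CreateDOM is visible, and it relies on the documented Boolean results of LoadXML and Load and on the parseError object of MSXML.

```vba
'Hypothetical variant of TransformString that raises an error when a load fails.
Private Function TransformStringChecked(xmlString As String, xslPath As String) As String
    Dim xml As DOMDocument60, xsl As DOMDocument60

    Set xml = CreateDOM
    If Not xml.LoadXML(xmlString) Then
        Err.Raise vbObjectError + 1, "TransformStringChecked", _
                  "Input XML did not parse: " & xml.parseError.reason
    End If

    Set xsl = CreateDOM
    If Not xsl.Load(xslPath) Then
        Err.Raise vbObjectError + 2, "TransformStringChecked", _
                  "Could not load " & xslPath & ": " & xsl.parseError.reason
    End If

    TransformStringChecked = xml.transformNode(xsl)
End Function
```

If an equation comes out empty, swapping this in for TransformString should at least show whether the stylesheet path or the XML is at fault.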

The call (from the Immediate window):

?ProcessFile("c:\test.doc")

The result will be written as test.tex in c:\.

The module may need fixing in some places. If so, let me know.
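
Because the question mentions several documents, here is a minimal sketch of running ProcessFile over a whole folder. The name ProcessFolder and the folder path are hypothetical. The file paths are collected before any file is processed, because ProcessFile itself calls Dir, which would reset an enumeration that was still in progress.

```vba
Sub ProcessFolder()
    Const SRC_FOLDER As String = "c:\docs\"   ' hypothetical folder; must end with a backslash
    Dim paths As New Collection
    Dim fileName As String
    Dim p As Variant

    ' Collect the paths first (see the note about Dir above)
    fileName = Dir(SRC_FOLDER & "*.doc")
    Do While Len(fileName) > 0
        paths.Add SRC_FOLDER & fileName
        fileName = Dir
    Loop

    ' Convert each document and print the result flag next to its path
    For Each p In paths
        Debug.Print p, ProcessFile(CStr(p))
    Next p
End Sub
```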

score 2 · Accepted Answer

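The Document object in Word has an OMaths collection, which represents all the OMath objects in the document. An OMath object has a Functions property, which returns the collection of functions inside that OMath object. So the equations should not be that big of an issue.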
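
A minimal sketch of that traversal, assuming the active document contains equations written with the built-in equation editor. The name ListEquations is hypothetical; the sketch only prints how many top-level functions each equation has, together with the numeric type of each one.

```vba
Sub ListEquations()
    Dim om As OMath
    Dim i As Long, n As Long

    For Each om In ActiveDocument.OMaths
        n = n + 1
        Debug.Print "Equation " & n & ": " & om.Functions.Count & " top-level function(s)"
        For i = 1 To om.Functions.Count
            ' Type is a WdOMathFunctionType value (fraction, radical, plain text, ...)
            Debug.Print "    function " & i & ", type " & om.Functions(i).Type
        Next i
    Next om
End Sub
```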

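However, I suppose you want to capture not just the subscripts and superscripts but the whole equation that contains them. That could be more challenging, because you have to define a start point and an end point. If you use the .Find method to locate a subscript and then select everything between the first space character before it and the first space character after it, that might work, but only if your equations contain no spaces.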
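
A minimal sketch of that idea: find each subscript run, then grow a copy of the range outwards until a delimiter is reached on either side. The name PrintTokensAroundSubscripts and the choice of delimiters (space, tab, paragraph mark) are assumptions made for illustration.

```vba
Sub PrintTokensAroundSubscripts()
    Dim hit As Range, token As Range
    Dim delims As String
    Dim docEnd As Long

    delims = " " & vbTab & vbCr     ' assumed token boundaries
    Set hit = ActiveDocument.Content
    docEnd = hit.End

    With hit.Find
        .ClearFormatting
        .Text = ""                  ' match on formatting only
        .Font.Subscript = True
        .Format = True
        .Forward = True
        .Wrap = wdFindStop
        Do While .Execute
            Set token = hit.Duplicate
            ' Grow the range outwards until a delimiter is reached on each side
            token.MoveStartUntil Cset:=delims, Count:=wdBackward
            token.MoveEndUntil Cset:=delims, Count:=wdForward
            Debug.Print token.Text
            If hit.End >= docEnd Then Exit Do
            hit.Collapse wdCollapseEnd
        Loop
    End With
End Sub
```

A formula with several subscript runs, such as H2SO4, is printed once per run, so the output may need de-duplicating.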

score 1 · Accepted Answer

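This VBA sub should go through every text character in the document, removing the superscript and subscript formatting while inserting the LaTeX notation.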

Public Sub LatexConversion()

Dim myRange As Word.Range, myChr
For Each myRange In ActiveDocument.StoryRanges
  Do
    For Each myChr In myRange.Characters

        If myChr.Font.Superscript = True Then
            myChr.Font.Superscript = False
            myChr.InsertBefore "^"
        End If

        If myChr.Font.Subscript = True Then
            myChr.Font.Subscript = False
            myChr.InsertBefore "_"
        End If

    Next
    Set myRange = myRange.NextStoryRange
  Loop Until myRange Is Nothing
Next
End Sub

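The above will not work for equations that were created with Word's built-in equation editor or through building blocks (Word 2010/2007) and that live inside content controls. Those equations will need separate VBA conversion code, or manual conversion to plain-text equations, before the above is run.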

score 1 · Accepted Answer

A C# implementation, using the Open XML SDK, that converts Office Math (OMath) to LaTeX. Download the MMLTEX XSL files from here: http://sourceforge.net/projects/xsltml/

    public void OMathTolaTeX()
    {
        string OMath = "";
        string MathML = "";
        string LaTex = "";
        XslCompiledTransform xslTransform = new XslCompiledTransform();
        // The OMML2MML.XSL file is located under
        // %ProgramFiles%\Microsoft Office\Office12\
        // Copy it to a local folder
        xslTransform.Load(@"D:\OMML2MML.XSL");
        using (WordprocessingDocument wordDoc =
                  WordprocessingDocument.Open("test.docx", true))
        {
            OpenXmlElement doc = wordDoc.MainDocumentPart.Document.Body;

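            foreach (var par in doc.Descendants<Paragraph>())
            {
               // Take the first math paragraph inside this paragraph, if there is one
               var math = par.Descendants<DocumentFormat.OpenXml.Math.Paragraph>().FirstOrDefault();
               if (math == null) continue;
               File.WriteAllText("D:\\openmath.xml", math.OuterXml);
               // Note: each match overwrites the previous one, so only the last equation is converted
               OMath = math.OuterXml;

           }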
        }
        //Load OMath string into stream
        using (XmlReader reader = XmlReader.Create(new StringReader(OMath)))
        {
            using (MemoryStream ms = new MemoryStream())
            {
                XmlWriterSettings settings = xslTransform.OutputSettings.Clone();

                // Configure xml writer to omit xml declaration.
                settings.ConformanceLevel = ConformanceLevel.Fragment;
                settings.OmitXmlDeclaration = true;

                XmlWriter xw = XmlWriter.Create(ms, settings);

                // Transform the Office Math markup (OMML) to MathML
                xslTransform.Transform(reader, xw);
                xw.Flush();
                ms.Seek(0, SeekOrigin.Begin);

                StreamReader sr = new StreamReader(ms, Encoding.UTF8);

                MathML = sr.ReadToEnd();

                Console.Out.WriteLine(MathML);
                File.WriteAllText("d:\\MATHML.xml", MathML);
                sr.Close();
                reader.Close();
                ms.Close();
            }
        }
        var xmlResolver = new XmlUrlResolver();
        xslTransform = new XslCompiledTransform();
        XsltSettings xsltt = new XsltSettings(true, true);
        // The mmltex.xsl file converts MathML to TeX
        xslTransform.Load("mmltex.xsl", xsltt, xmlResolver);

        using (XmlReader reader = XmlReader.Create(new StringReader(MathML)))
        {
            using (MemoryStream ms = new MemoryStream())
            {
                XmlWriterSettings settings = xslTransform.OutputSettings.Clone();

                // Configure xml writer to omit xml declaration.
                settings.ConformanceLevel = ConformanceLevel.Fragment;
                settings.OmitXmlDeclaration = true;

                XmlWriter xw = XmlWriter.Create(ms, settings);

                // Transform the MathML to LaTeX
                xslTransform.Transform(reader, xw);
                xw.Flush();
                ms.Seek(0, SeekOrigin.Begin);

                StreamReader sr = new StreamReader(ms, Encoding.UTF8);

                LaTex = sr.ReadToEnd();
                sr.Close();
                reader.Close();
                ms.Close();
                Console.Out.WriteLine(LaTex);
                File.WriteAllText("d:\\Latex.txt", LaTex);

            }
        }
    }

Hope this helps C# developers.
