3

我有大约。excel 中包含 RTF 的 12000 个单元格(包括格式标记)。我需要解析它们以获取未格式化的文本。

这是带有文本的单元格之一的示例:

{\rtf1\ansi\deflang1060\ftnbj\uc1
{\fonttbl{\f0 \froman \fcharset0 Times New Roman;}{\f1 \fswiss \fcharset238
Arial;}}
{\colortbl ;\red255\green255\blue255 ;\red0\green0\blue0 ;}
{\stylesheet{\fs24\cf2\cb1 Normal;}{\cs1\cf2\cb1 Default Paragraph Font;}}
\paperw11908\paperh16833\margl1800\margr1800\margt1440\margb1440\headery720\footery720
\deftab720\formshade\aendnotes\aftnnrlc\pgbrdrhead\pgbrdrfoot
\sectd\pgwsxn11908\pghsxn16833\marglsxn1800\margrsxn1800\margtsxn1440\margbsxn1440
\headery720\footery720\sbkpage\pgncont\pgndec
\plain\plain\f1\fs24\pard TPR 0160 000\par IPR 0160 000\par OB-R-02-28\par}

我真正需要的是:

TPR 0160 000
IPR 0160 000
OB-R-02-28

简单循环单元格并删除不必要的格式的问题在于,并不是这 12000 个单元格中的所有内容都像这样简单。所以我需要手动检查许多不同的版本并编写几个变体;最后仍然会有很多手工工作要做。

但是,如果我将一个单元格的内容复制到空文本文档并将其另存为 RTF,然后用 MS Word 打开它,它会立即解析文本并得到我想要的。不幸的是,对于 12000 个单元,这样做非常不方便。

所以我在考虑 VBA 宏,将单元格内容移动到 Word,强制解析,然后将结果复制回原始单元格。不幸的是,我不确定该怎么做。

有人知道吗?还是另一种方法?我将非常感谢解决方案或朝着正确方向的推动。

天呐!

4

4 回答 4

7

如果您确实想走使用 Word 解析文本的路线,这个功能应该可以帮助您。正如评论所暗示的,您需要参考 MS Word 对象库。

Function ParseRTF(strRTF As String) As String
Dim wdDoc As Word.Document 'Ref: Microsoft Word 11.0 Object Library'
Dim f     As Integer       'Variable to store the file I/O number'

'File path for a temporary .rtf file'
Const strFileTemp = "C:\TempFile_ParseRTF.rtf"

'Obtain the next valid file I/O number'
f = FreeFile

'Open the temp file and save the RTF string in it'
Open strFileTemp For Output As #f
    Print #f, strRTF
Close #f

'Open the .rtf file as a Word.Document'
Set wdDoc = GetObject(strFileTemp)

'Read the now parsed text from the Word.Document'
ParseRTF = wdDoc.Range.Text

'Delete the temporary .rtf file'
Kill strFileTemp

'Close the Word connection'
wdDoc.Close False
Set wdDoc = Nothing
End Function

您可以使用类似于以下内容的方式为 12,000 个单元中的每一个调用它:

Sub ParseAllRange()
Dim rngCell As Range
Dim strRTF  As String

For Each rngCell In Range("A1:A12000")

    'Parse the cell contents'
    strRTF = ParseRTF(CStr(rngCell))

    'Output to the cell one column over'
    rngCell.Offset(0, 1) = strRTF
Next
End Sub

ParseRTF 函数运行大约需要一秒钟(至少在我的机器上),因此对于 12,000 个单元,这将在大约三个半小时内运行。


在周末考虑过这个问题后,我确信有一个更好(更快)的解决方案。

我记得剪贴板的 RTF 功能,并意识到可以创建一个类,将 RTF 数据复制到剪贴板,粘贴到 word doc,然后输出结果纯文本。这种解决方案的好处是不必为每个 rtf 字符串打开和关闭 word doc object;它可以在循环之前打开并在之后关闭。

下面是实现这一点的代码。它是一个名为 clsRTFParser 的类模块。

Private Declare Function GlobalAlloc Lib "kernel32" _
                (ByVal wFlags&, ByVal dwBytes As Long) As Long
Private Declare Function GlobalLock Lib "kernel32" _
                (ByVal hMem As Long) As Long
Private Declare Function GlobalUnlock Lib "kernel32" _
                (ByVal hMem As Long) As Long
Private Declare Function lstrcpy Lib "kernel32" _
                (ByVal lpString1 As Any, ByVal lpString2 As Any) As Long

Private Declare Function OpenClipboard Lib "user32" _
                (ByVal Hwnd As Long) As Long
Private Declare Function EmptyClipboard Lib "user32" () As Long
Private Declare Function RegisterClipboardFormat Lib "user32" Alias _
                "RegisterClipboardFormatA" (ByVal lpString As String) As Long
Private Declare Function SetClipboardData Lib "user32" _
                (ByVal wFormat As Long, ByVal hMem As Long) As Long
Private Declare Function CloseClipboard Lib "user32" () As Long

'---'

Dim wdDoc As Word.Document 'Ref: Microsoft Word 11.0 Object Library'

Private Sub Class_Initialize()
Set wdDoc = New Word.Document
End Sub

Private Sub Class_Terminate()
wdDoc.Close False
Set wdDoc = Nothing
End Sub

'---'

Private Function CopyRTF(strCopyString As String) As Boolean
Dim hGlobalMemory  As Long
Dim lpGlobalMemory As Long
Dim hClipMemory    As Long
Dim lngFormatRTF   As Long

'Allocate and copy string to memory'
hGlobalMemory = GlobalAlloc(&H42, Len(strCopyString) + 1)
lpGlobalMemory = GlobalLock(hGlobalMemory)
lpGlobalMemory = lstrcpy(lpGlobalMemory, strCopyString)

'Unlock the memory and then copy to the clipboard'
If GlobalUnlock(hGlobalMemory) = 0 Then
    If OpenClipboard(0&) <> 0 Then
        Call EmptyClipboard

        'Save the data as Rich Text Format'
        lngFormatRTF = RegisterClipboardFormat("Rich Text Format")
        hClipMemory = SetClipboardData(lngFormatRTF, hGlobalMemory)

        CopyRTF = CBool(CloseClipboard)
    End If
End If
End Function

'---'

Private Function PasteRTF() As String
Dim strOutput As String

'Paste the clipboard data to the wdDoc and read the plain text result'
wdDoc.Range.Paste
strOutput = wdDoc.Range.Text

'Get rid of the new lines at the beginning and end of the document'
strOutput = Left(strOutput, Len(strOutput) - 2)
strOutput = Right(strOutput, Len(strOutput) - 2)

PasteRTF = strOutput
End Function

'---'

Public Function ParseRTF(strRTF As String) As String
If CopyRTF(strRTF) Then
    ParseRTF = PasteRTF
Else
    ParseRTF = "Error in copying to clipboard"
End If
End Function

您可以使用类似于以下内容的方式为 12,000 个单元中的每一个调用它:

Sub CopyParseAllRange()
Dim rngCell As Range
Dim strRTF  As String

'Create new instance of clsRTFParser'
Dim RTFParser As clsRTFParser
Set RTFParser = New clsRTFParser

For Each rngCell In Range("A1:A12000")

    'Parse the cell contents'
    strRTF = RTFParser.ParseRTF(CStr(rngCell))

    'Output to the cell one column over'
    rngCell.Offset(0, 1) = strRTF
Next
End Sub

我在我的机器上使用示例 RTF 字符串对此进行了模拟。对于 12,000 个单元,需要两分半钟,这是一个更合理的时间框架!

于 2009-11-17T10:22:18.743 回答
2

您可以尝试使用正则表达式解析每个单元格,只留下您需要的内容。

每个 RTF 控制代码都以“\”开头并以空格结尾,中间没有任何额外的空格。“{}”用于分组。如果您的文本不包含任何内容,您可以删除它们(“;”也是如此)。所以现在你保留你的文本和一些不必要的词,如“Arial”、“Normal”等。你也可以构建字典来删除它们。经过一些调整后,您将只保留所需的文本。

查看http://www.regular-expressions.info/了解更多信息和编写 RegExp 的好工具(RegexBuddy - 不幸的是它不是免费的,但物有所值。AFAIR 还有试用版)。

更新:当然,我不鼓励您为每个单元手动执行此操作。只需遍历活动范围:请参阅此线程: SO:关于遍历 VBA 中的单元格

就个人而言,我会尝试这个想法:

Sub Iterate()
   For Each Cell in ActiveSheet.UsedRange.Cells
      'Do something
   Next
End Sub

以及如何在 VBA (Excel) 中使用 RegExp?

参考: Excel 中的 Regex 函数VBA中的 Regex

基本上你必须通过 COM 使用 VBScript.RegExp 对象。

于 2009-11-04T11:19:23.353 回答
1

这里的一些解决方案需要参考 MS Word 对象库。玩弄我的牌,我找到了一个不依赖它的解决方案。它在 VBA 中去除 RTF 标记和其他绒毛,如字体表和样式表。它可能对你有帮助。我在您的数据中运行它,除了空格之外,我得到的输出与您预期的相同。

这是代码。

首先,检查字符串是否为字母数字。给它一个长度为一个字符的字符串。该函数用于在这里和那里进行定界。

Public Function Alphanumeric(Character As String) As Boolean
   If InStr("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-", Character) Then
       Alphanumeric = True
   Else
       Alphanumeric = False
   End If
End Function

接下来是删除整个组。我用它来删除字体表和其他垃圾。

Public Function RemoveGroup(RTFString As String, GroupName As String) As String
    Dim I As Integer
    Dim J As Integer
    Dim Count As Integer

    I = InStr(RTFString, "{\" & GroupName)

    ' If the group was not found in the RTF string, then just return that string unchanged.
    If I = 0 Then
        RemoveGroup = RTFString
        Exit Function
    End If

    ' Otherwise, we will need to scan along, from the start of the group, until we find the end of the group.
    ' The group is delimited by { and }. Groups may be nested, so we need to count up if we encounter { and
    ' down if we encounter }. When that count reaches zero, then the end of the group has been found.
    J = I
    Do
        If Mid(RTFString, J, 1) = "{" Then Count = Count + 1
        If Mid(RTFString, J, 1) = "}" Then Count = Count - 1
        J = J + 1
    Loop While Count > 0

    RemoveGroup = Replace(RTFString, Mid(RTFString, I, J - I), "")

End Function

好的,这个函数会删除所有标签。

Public Function RemoveTags(RTFString As String) As String
    Dim L As Long
    Dim R As Long
    L = 1
    ' Search to the end of the string.
    While L < Len(RTFString)
        ' Append anything that's not a tag to the return value.
        While Mid(RTFString, L, 1) <> "\" And L < Len(RTFString)
            RemoveTags = RemoveTags & Mid(RTFString, L, 1)
            L = L + 1
        Wend
    
        'Search to the end of the tag.
        R = L + 1
        While Alphanumeric(Mid(RTFString, R, 1)) And R < Len(RTFString)
            R = R + 1
        Wend
        L = R
    Wend
End Function

我们可以用明显的方式删除花括号:

Public Function RemoveBraces(RTFString As String) As String
    RemoveBraces = Replace(RTFString, "{", "")
    RemoveBraces = Replace(RemoveBraces, "}", "")
End Function

将上述函数复制粘贴到模块中后,您可以创建一个函数,使用它们删除您不需要或不想要的任何东西。以下在我的情况下非常有效。

Public Function RemoveTheFluff(RTFString As String) As String
    RemoveTheFluff = Replace(RTFString, vbCrLf, "")
    RemoveTheFluff = RemoveGroup(RemoveTheFluff, "fonttbl")
    RemoveTheFluff = RemoveGroup(RemoveTheFluff, "colortbl")
    RemoveTheFluff = RemoveGroup(RemoveTheFluff, "stylesheet")
    RemoveTheFluff = RemoveTags(RemoveBraces(RemoveTheFluff))
End Function

我希望这有帮助。我不会在文字处理器或任何东西中使用它,但如果你正在这样做,它可能会用于抓取数据。

于 2016-07-12T11:26:49.610 回答
0

您的帖子听起来好像每个 RTF 文档都存储在一个 Excell 单元格中。如果是这样,那么

使用 .Net Framework RichTextBox 控件的解决方案

将在 2 行代码中将每个单元格中的 RTF 转换为纯文本(经过一些系统配置以获得正确的 .tlb 文件以允许引用 .Net Framework)。将单元格值放入rtfsample

Set miracle = New System_Windows_Forms.RichTextBox
With miracle
    .RTF = rtfText
    PlainText = .TEXT
End With
于 2018-02-04T07:19:46.403 回答