1

我有 RTF 格式的大表,大小为 20-150 Mb。我首先尝试导出 RTF -> HTML -> 导入到 excel。一个 60 Mb 的文件大约需要 35 分钟。接下来,我尝试直接从 Word -> excel 复制表格。它总是在中途失败(所有内容都被粘贴,因为数据不在正确的单元格中)。

在确定方法之前,我尝试了更多方法(在转移到 excel 之前将所有单元格导入内存,以及其他排列,以及在和其他来源中详述的.ConvertToText方法)。

此方法相对较快,对于相同的 60 Mb 文件大约需要 25 分钟(这是不显示 Word、设置重新分页、事件、dispayupdate 和 tableautofit 为 false)。

考虑到这些文件可以在不到 10 秒的时间内完全加载到 RAM 内存中,我想知道为什么从 60 Mb 文件中读取数据需要 25 分钟。我知道由于更改为 HTML 格式,Word 中的表格引擎很慢,但是逐个单元格读取表格的速度非常慢。前几个单元格超级快,最后一个单元格速度较慢 - 我相信手动阅读比这更快。它违背了自动化的全部目的。但是,我别无选择。

代码是:

Dim oWord As Word.Application
Dim RTF As Word.Document

Set oWord = CreateObject("Word.Application")

Set RTF = oWord.Documents.Open(filename:=Fname, ConfirmConversions:=False, ReadOnly:=False)  ', ReadOnly:=True)
Application.StatusBar = vbNullString
Application.EnableEvents = False
Application.ScreenUpdating = False
Application.DisplayAlerts = False

With oWord
    Options.Pagination = False
    Options.AllowReadingMode = False
    Application.AutoRecover.Enabled = False
    Options.SaveInterval = 0
    Options.CheckGrammarAsYouType = False
    Options.CheckGrammarWithSpelling = False
End With

With RTF
    Options.Pagination = False
    Options.AllowReadingMode = False
    Application.AutoRecover.Enabled = False
    Options.SaveInterval = 0
    Options.CheckGrammarAsYouType = False
    Options.CheckGrammarWithSpelling = False
End With

Dim AAF As Table
For Each AAF In RTF.Tables
    AAF.AllowAutoFit = False
Next

oWord.Visible = False
Dim rng As Word.Range
Dim sData As String
Dim aData1() As String
Dim aData2() As String
Dim aDataAll() As String
Dim nrRecs As Long
Dim nrFields As Long
Dim lRecs As Long
Dim lFields As Long
Dim CTbl As Table 'Data Table
Dim oCell As Cell

'I'm not displaying the code which replaces all ^p with a spl character to maintain the table structure - it is staright forward, and does the job

Set rng = CTbl.ConvertToText(Separator:="$", NestedTables:=False)
sData = rng.Text    'This contains the entire table, delimited by vbCr and $...
Application.StatusBar = "Closing open files..."
RTF.Close (wdDoNotSaveChanges) 'All data has been extracted, hence quit word
oWord.Quit
Set oWord = Nothing

sData = Mid(sData, 1, Len(sData) - 1)
aData1() = Split(sData, vbCr)
nrRecs = UBound(aData1())

If Dbg Then MsgBox "The table contained " & nrRecs + 1 & " rows"

For lRecs = LBound(aData1()) To nrRecs  'Cycle through all rows
        aData2() = Split(aData1(lRecs), "$")    'Split rows into arrays
        Debug.Print aData1(lRecs)
        nrFields = UBound(aData2())         'Find out the number of columns
        If lRecs = LBound(aData1()) Then    'If this is the first row/cycle,
            ReDim Preserve aDataAll(nrRecs, 9)  'nrFields)   'Resize the array - currently I'm using a fixed size for the column since the first row of my table contains merged rows
        End If
        For lFields = LBound(aData2()) To nrFields  'Cycle through all columns
            aDataAll(lRecs, lFields) = aData2(lFields)  'Collate the data in a single array
            'If MsgBox(aDataAll(lRecs, lFields), vbYesNo, "Continue?") = vbNo Then Exit For
        Next
Next 'All of this was slapped together from MS code samples and stackoverflow examples

有什么提高性能的建议吗?

4

2 回答 2

1

如果您首先将表格(我假设有一个非常大的表格)拆分为较小的表格,然后将每个表格转换为文本,则转换会更快。

我在一个有 10000 行和 10 列的表上尝试了这个。转换为文本的时间从约 280 秒变为约 70 秒(即快 4 倍)。

为了简单起见,我直接从带有 10000 行表的文档中运行下面的代码(而不是从 Excel 中运行)。

Splt 然后转换:

Sub SplitThenConvert()
Dim t As Table
Set t = ActiveDocument.Tables.Item(1)
Dim rowCount As Integer
Dim index As Integer
Dim numSteps As Integer
Dim splitRow As Integer

Dim increment As Integer
Dim start_time, end_time

start_time = Now()

Application.ScreenUpdating = False

rowCount = t.Rows.Count
numSteps = 10

increment = rowCount / numSteps
splitRow = rowCount - increment

For index = 1 To numSteps
    Debug.Print "Split #" + CStr(index)
    ActiveDocument.Tables(1).Rows(splitRow).Select
    Selection.SplitTable
    splitRow = splitRow - increment
    If splitRow < increment Then
        Exit For
    End If
Next index
index = ActiveDocument.Tables.Count
While index > 0
    Debug.Print "Convert #" + CStr(index)
    ActiveDocument.Tables(index).ConvertToText ","
    index = index - 1
Wend

end_time = Now()

Application.ScreenUpdating = True

MsgBox (DateDiff("s", start_time, end_time))

End Sub

转换整个表而不拆分:

Sub ConvertAll()
Dim start_time, end_time

Application.ScreenUpdating = False
start_time = Now()

ActiveDocument.Tables(1).ConvertToText ","

end_time = Now()

Application.ScreenUpdating = True

MsgBox (DateDiff("s", start_time, end_time))

End Sub
于 2013-06-17T09:33:46.803 回答
0

我同意@KazJaw:读/写MS Office程序(包括.rtf,因为被视为Word)计算成本非常高,最好尽可能依赖其他方式(只需将.rtf文件读取转换为简单的 .txt 文件读取会大大提高速度)。我最近回答了关于这些行的帖子。我的另一个建议是尽可能减少“实时 Office 变量”的数量。而是同时创建RTFoDoc变量,最好一个接一个地进行(Excel 也是如此)。只有在特殊情况下(因为计算成本太高)才应该做的是在两个不同的实例(例如,两个不同的 Word 文档)之间实时复制/粘贴。

因此,使用与 Office 程序的连接来达到预期目的,即以非常复杂的方式对存储信息的文件进行顶级访问:填充值、更改格式、执行复杂的操作(例如,搜索整个文档) ; 但打算尽可能地减少迭代行为(例如,从一个单元格复制并一遍又一遍地粘贴到另一个单元格中)。以这种方式查看:在 .txt 文件中复制/粘贴仅涉及检查输入值/目标位置并执行操作;在 Word 中执行此操作涉及与在 .txt 文件中相同的内容,另外还要考虑在考虑每条记录时分析的大量变量(格式、对其他元素的引用、特殊操作等)。

于 2013-06-16T10:20:57.137 回答