I have a huge text file in which a lot of duplication occurs. The duplicates look like this.

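Total Posts 16

Pin Code = GFDHG
TITLE = Shop Signs/Projecting Signs/Industrial Signs/Restaurant Signs/Menu Boards & Boxes London
DATE = 12-09-2012
Tracking Key # 85265E712050-15207427406854753

Total Posts 16

Pin Code = GFDHG
TITLE = Shop Signs/Projecting Signs/Industrial Signs/Restaurant Signs/Menu Boards & Boxes London
DATE = 12-09-2012
Tracking Key # 85265E712050-15207427406854753

Total Posts 2894

Pin Code = GFDHG
TITLE = Shop Signs/Projecting Signs/Industrial Signs/Restaurant Signs/Menu Boards & Boxes London
DATE = 15-09-2012
Tracking Key # 85265E712050-152797637654753

Total Posts 2894

Pin Code = GFDHG
TITLE = Shop Signs/Projecting Signs/Industrial Signs/Restaurant Signs/Menu Boards & Boxes London
DATE = 15-09-2012
Tracking Key # 85265E712050-152797637654753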

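And so on; this text file contains as many as 4000 posts. I want my program to match Total Posts 6 against every Total Posts entry that occurs in the file and, wherever a duplicate is found, delete that duplicate programmatically along with the next 7 lines of that duplicate. Thanks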

1 Answer

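Assuming the format is consistent (i.e. every record in the file occupies exactly 6 lines of text), removing the duplicates from the file only takes the following: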

    Sub DupClean(ByVal fpath As String) 'fpath is the FULL file path, e.g. C:\Users\username\Documents\filename.txt
        Const LinesPerRecord As Integer = 6 'each record is assumed to occupy exactly 6 lines
        Dim TxtLines As String() = New String() {}
        Dim Seen As New HashSet(Of String) 'records already kept; gives exact-match lookups
        Dim CleanRecords As New List(Of String)
        Dim CText As String = ""
        Dim i As Integer = 0
        'Output goes next to the input as <name>_clean.txt; to overwrite the original file, use fpath instead
        Dim OutPath As String = System.IO.Path.ChangeExtension(fpath, Nothing) & "_clean.txt"

        Try
            'Read in the text; ReadAllLines splits on LF as well as CRLF line endings
            TxtLines = System.IO.File.ReadAllLines(fpath, System.Text.Encoding.UTF8)
        Catch ex As Exception
            MsgBox("Encountered an error while reading in the text file contents. Details: " & ex.Message, vbOKOnly, "Read Error")
            Return
        End Try

        Try
            'Now we iterate through blocks of 6 lines
            Do While i < TxtLines.Length
                'Set CText to the next 6 lines of text (fewer if the file ends mid-record)
                Dim Count As Integer = Math.Min(LinesPerRecord, TxtLines.Length - i)
                CText = String.Join(Environment.NewLine, TxtLines, i, Count)

                'Keep CText only if an identical record has not been seen before
                If Seen.Add(CText) Then
                    CleanRecords.Add(CText)
                End If 'else the record is a duplicate and we don't need to do anything

                i = i + LinesPerRecord
            Loop
        Catch ex As Exception
            MsgBox("Encountered an error while removing duplicates from the text that was read in. The application was on line " & i & " when the following error was thrown: " & ex.Message, _
                   vbOKOnly, "Comparison Error")
            Return
        End Try

        Try
            'Write out the clean text; WriteAllText also closes the file, so nothing is left unflushed
            System.IO.File.WriteAllText(OutPath, String.Join(Environment.NewLine, CleanRecords.ToArray()))
        Catch ex As Exception
            MsgBox("Encountered an error writing the cleaned text. Details: " & ex.Message, vbOKOnly, "Write Error")
        End Try
    End Sub

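If the format is not consistent, you will need something more sophisticated: rules that determine which lines get added to CText on each pass through the loop. Without more context, though, I can't suggest what those rules might be.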
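
As one possible rule set, here is a minimal sketch that assumes every record starts with a header line beginning with a known prefix ("Total Posts" in the question's sample) and treats everything up to the next header as part of that record. The names RecordDedup, DedupeRecords and FlushRecord are hypothetical, and blank lines are ignored when records are compared, which may or may not suit your file.

```vbnet
Imports System
Imports System.Collections.Generic

Module RecordDedup
    ' Hypothetical helper: groups lines into records that each begin at a line
    ' starting with headerPrefix, and keeps only the first copy of each record.
    Function DedupeRecords(ByVal lines As IEnumerable(Of String), ByVal headerPrefix As String) As List(Of String)
        Dim seen As New HashSet(Of String)
        Dim output As New List(Of String)
        Dim current As New List(Of String)

        For Each rawLine As String In lines
            Dim line As String = rawLine.Trim()
            If line.Length = 0 Then Continue For ' blank lines are not part of a record

            ' Assumption: a header line marks the start of a new record.
            If line.StartsWith(headerPrefix, StringComparison.Ordinal) Then
                FlushRecord(current, seen, output)
            End If
            current.Add(line)
        Next
        FlushRecord(current, seen, output) ' emit the final record

        Return output
    End Function

    ' Appends the pending record to output unless an identical one was already kept.
    Private Sub FlushRecord(ByVal current As List(Of String), ByVal seen As HashSet(Of String), ByVal output As List(Of String))
        If current.Count = 0 Then Return
        ' The joined text of the record serves as its identity.
        Dim key As String = String.Join(ChrW(10), current.ToArray())
        If seen.Add(key) Then
            output.AddRange(current)
            output.Add("") ' blank separator line after each kept record
        End If
        current.Clear()
    End Sub
End Module
```

It could be fed with System.IO.File.ReadAllLines(fpath) and the result written back with System.IO.File.WriteAllLines. Because records are delimited by their header rather than by a fixed line count, a record with a missing or extra line does not shift every record after it.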

Answered 2015-03-02T18:45:14.303