0

I'm trying to add a search box to my application that searches a shared drive based on the criteria entered. The code I have so far is:

Public Sub searchProcedure()

    Dim startFolder As String = "C:\Documents and Settings\Practice Search"

    Dim dir As New System.IO.DirectoryInfo(startFolder)
    Dim fileList = dir.GetFiles("*.*", System.IO.SearchOption.AllDirectories)

    Dim searchTerm = "test string"

    Dim queryMatchingFiles = From file In fileList _
                             Let fileText = GetFileText(file.FullName) _
                             Where fileText.Contains(searchTerm) _
                             Select file.FullName

    'Where file.Extension = "." _ (removed so searches all files)

    For Each filename In queryMatchingFiles
        ListBox1.Items.Add(filename)
    Next

End Sub


Function GetFileText(ByRef Name As String) As String

    Dim fileContents = String.Empty

    If System.IO.File.Exists(Name) Then

        fileContents = System.IO.File.ReadAllText(Name)

    End If

    Return fileContents

End Function

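The problem I'm having is with Microsoft Office documents. The content is read into my fileContents string, but it comes through as XML (?).

Any ideas on how to get the actual text content into the string I'm searching?

Thanks!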

4 Answers

0

When the content is XML or HTML, you can strip the tags out entirely using a Regex:

Regex.Replace(text, "<.*?>", "")

Like this:

' Requires: Imports System.Text.RegularExpressions
Dim fileContents = String.Empty

If System.IO.File.Exists(Name) Then

    fileContents = System.IO.File.ReadAllText(Name)
    fileContents = Regex.Replace(fileContents, "<.*?>", "")
End If

Return fileContents
answered 2013-03-19T23:52:17.467
0

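.docx files are actually ZIP archives containing XML files. Two solutions come to mind, neither of them easy:

  1. If you have MS Word installed, use the Word object model to open the docx file programmatically and extract the text. This is easier with the MS Office Primary Interop Assemblies (PIA), but that ties you to a specific version of Office. I prefer to develop against the PIA and then switch to late binding at the end (i.e. change everything to "Object" and get rid of the PIA reference).

  2. Use #ZipLib to open the .docx file, then use the System.Xml namespace to pull the XML apart.

I think option 1 will be easier for you.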
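For comparison, option 2 can be sketched without a third-party ZIP library. This is a minimal sketch rather than the approach described above: it assumes .NET Framework 4.5 or later with references to System.IO.Compression and System.IO.Compression.FileSystem, and it swaps #ZipLib for the built-in ZipFile class and System.Xml for LINQ to XML. The names DocxTextSketch and GetDocxText are hypothetical. It reads only word/document.xml, so text in headers, footers and footnotes is not included.

```vbnet
Imports System.IO
Imports System.IO.Compression
Imports System.Linq
Imports System.Xml.Linq

Module DocxTextSketch

    ' Main WordprocessingML namespace; text runs are stored in <w:t> elements.
    Private ReadOnly W As XNamespace = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

    ' Hypothetical helper: returns the concatenated run text of a .docx body,
    ' or an empty string when the archive has no main document part.
    Public Function GetDocxText(ByVal path As String) As String

        Using archive As ZipArchive = ZipFile.OpenRead(path)

            Dim entry As ZipArchiveEntry = archive.GetEntry("word/document.xml")
            If entry Is Nothing Then Return String.Empty

            Using entryStream As Stream = entry.Open()
                Dim doc As XDocument = XDocument.Load(entryStream)
                Dim runs = doc.Descendants(W + "t").Select(Function(t) t.Value)
                Return String.Concat(runs)
            End Using

        End Using

    End Function

End Module
```

A function like this could stand in for ReadAllText whenever the file extension is .docx. Paragraphs are joined without separators, which is enough for a Contains-style search inside a paragraph but not for reproducing the layout.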

answered 2013-03-20T06:50:54.550
0

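I've come to the conclusion that there is no "out of the box" solution; I'm handling each document type separately. Using the OpenXML SDK, the code to extract text from Word is: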

Imports System.Xml.XmlReader
Imports System.IO
Imports DocumentFormat.OpenXml.Packaging
Imports DocumentFormat.OpenXml.Wordprocessing
Imports DocumentFormat.OpenXml.Spreadsheet
Imports DocumentFormat.OpenXml
Imports System.Linq

Public Sub WordProcessing()

    Dim strDoc As String = "C:\Documents and Settings\Practice.docx"
    Dim txt As String = String.Empty

    ' Open read-only; extracting text never needs to modify the file.
    Using stream As Stream = File.Open(strDoc, FileMode.Open, FileAccess.Read)
        OpenAndAddtoWordProcessingStream(stream, txt)
    End Using

    MessageBox.Show(txt)

End Sub

Public Sub OpenAndAddtoWordProcessingStream(ByVal stream As Stream, ByRef txt As String)

    ' False = open the package as non-editable. Using closes it when done.
    Using wordDoc As WordprocessingDocument = WordprocessingDocument.Open(stream, False)

        Dim body As Body = wordDoc.MainDocumentPart.Document.Body

        txt = body.InnerText

    End Using

End Sub

The code to extract text from Excel is:

    Dim strDoc As String = "C:\Documents and Settings\Practice.xlsx"

    Using spreadsheetDoc As SpreadsheetDocument = SpreadsheetDocument.Open(strDoc, False)

        Dim workbookPart As WorkbookPart = spreadsheetDoc.WorkbookPart
        Dim shareStringPart As SharedStringTablePart = workbookPart.SharedStringTablePart

        ' The shared string table part can be missing, e.g. when no cell holds text.
        If shareStringPart IsNot Nothing Then

            For Each Item As SharedStringItem In shareStringPart.SharedStringTable.Elements(Of SharedStringItem)()

                MessageBox.Show(Item.InnerText)

            Next

        End If

    End Using

Next I'll look at .PDF, Access and Powerpoint.

answered 2013-03-22T12:15:34.177
0

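I'm adding this so that the question is fully answered, following the guidance from SSS. This is the complete code for searching Office docs, Office docs(x), PDFs and other generic file formats for a text string.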

Imports System.IO
Imports System.Xml.XmlReader
Imports DocumentFormat.OpenXml.Packaging
Imports DocumentFormat.OpenXml.Wordprocessing
Imports DocumentFormat.OpenXml.Spreadsheet
Imports DocumentFormat.OpenXml
Imports System.Linq
Imports System
Imports System.Collections.Generic
Imports A = DocumentFormat.OpenXml.Drawing
Imports DocumentFormat.OpenXml.Presentation
Imports System.Text
Imports iTextSharp.text
Imports iTextSharp.text.pdf

Module searchFiles

Public readAllText As String

Public Sub startSearch(ByVal searchText As String)

    MainForm.marketIntelligencelboxsearch.Items.Clear()

    Dim dir_info As New DirectoryInfo("\\Max1\dept\")

    ListFiles(MainForm.marketIntelligencelboxsearch, dir_info, searchText)

End Sub


Private Sub ListFiles(ByVal lst As ListView, ByVal dir_info As DirectoryInfo, ByVal target As String)
    ' Get the files in this directory.
    Dim fs_infos() As FileInfo = dir_info.GetFiles("*.*")
    For Each fs_info As FileInfo In fs_infos
        If target = "ALL" Or fs_info.ToString().IndexOf(target, StringComparison.OrdinalIgnoreCase) >= 0 Then
            MainForm.marketIntelligencelboxsearch.Items.Add(System.IO.Path.GetFileName(fs_info.FullName), MainForm.sourceFileImageIndex(fs_info.FullName))
        Else

            readAllText = File.ReadAllText(fs_info.FullName)

            If fileExtention(fs_info.FullName, target) <> 0 Then
                MainForm.marketIntelligencelboxsearch.Items.Add(System.IO.Path.GetFileName(fs_info.FullName), MainForm.sourceFileImageIndex(fs_info.FullName))
            End If
        End If
    Next fs_info
    fs_infos = Nothing

    ' Search subdirectories.
    Dim subdirs() As DirectoryInfo = dir_info.GetDirectories()
    For Each subdir As DirectoryInfo In subdirs
        ListFiles(lst, subdir, target)
    Next subdir
End Sub


Public Function fileExtention(ByVal sourcePath As String, ByVal target As String) As Integer

    Dim searchResult As Integer

    Select Case True

        Case InStr(LCase(sourcePath), ".docx") <> 0 Or InStr(LCase(sourcePath), ".docm") <> 0
            searchResult = WordProcessing(sourcePath, target)
            Return searchResult

        Case InStr(LCase(sourcePath), ".xlsx") <> 0 Or InStr(LCase(sourcePath), ".xlsm") <> 0
            searchResult = ExcelProcessing(sourcePath, target)
            Return searchResult

        Case InStr(LCase(sourcePath), ".pptx") <> 0 Or InStr(LCase(sourcePath), ".pptm") <> 0
            'will read slide text and notes
            searchResult = PowerpointProcessing(sourcePath, target)
            Return searchResult

        Case InStr(LCase(sourcePath), ".pdf") <> 0
            'will search text in pdf
            searchResult = pdfProcesssing(sourcePath, target)
            Return searchResult

        Case Else
            'looks at office docs before 2007 and all other generic  extensions, includes Access 2007 and lower
            searchResult = catchallProcessing(readAllText, target)
            Return searchResult
    End Select


End Function

#Region "Search Index"

Public Function catchallProcessing(ByVal strDoc As String, ByVal target As String) As Integer

    If Not (strDoc) Is Nothing Then
        If strDoc.IndexOf(target, StringComparison.OrdinalIgnoreCase) >= 0 Then 'means it ignores the case, no indexof = searching inside
            Return 1

        Else

            Return 0

        End If
    Else

        Return 0
    End If

End Function

#End Region

#Region "Word 2007 Processing"

Public Function WordProcessing(ByVal strDoc As String, ByVal target As String) As Integer  ' Word 2007 and Higher

    Dim txt As String = String.Empty

    ' Using blocks make sure the document and the stream are closed before returning.
    Using stream As Stream = File.Open(strDoc, FileMode.Open, FileAccess.Read)
        Using wordDoc As WordprocessingDocument = WordprocessingDocument.Open(stream, False)

            Dim body As Body = wordDoc.MainDocumentPart.Document.Body

            txt = body.InnerText

        End Using
    End Using

    Return catchallProcessing(txt, target) 'should return 0 or 1

End Function

#End Region

#Region "Excel 2007 Processing"

Public Function ExcelProcessing(ByVal strDoc As String, ByVal target As String) As Integer 'Excel 2007 and Higher

    Dim paragraphText As New StringBuilder()

    Using spreadsheetDoc As SpreadsheetDocument = SpreadsheetDocument.Open(strDoc, False)

        Dim workbookPart As WorkbookPart = spreadsheetDoc.WorkbookPart
        Dim shareStringPart As SharedStringTablePart = workbookPart.SharedStringTablePart

        ' The shared string table part can be missing, e.g. when no cell holds text.
        If shareStringPart IsNot Nothing Then

            For Each Item As SharedStringItem In shareStringPart.SharedStringTable.Elements(Of SharedStringItem)()

                paragraphText.Append(Item.InnerText) 'should read all strings

            Next

        End If

    End Using

    Return catchallProcessing(paragraphText.ToString(), target)

End Function

#End Region

#Region "Powerpoint 2007 Processing"

Public Function PowerpointProcessing(ByVal file As String, ByVal target As String) As Integer

    Dim numberOfSlides As Integer = CountSlides(file)

    Dim slideText As String = Nothing
    Dim totalText As String = Nothing

    For i As Integer = 0 To numberOfSlides - 1
        GetSlideIdandText(slideText, file, i)
        totalText = totalText & slideText
        'System.Console.WriteLine("Slide #{0} contains: {1}", i + 1, slideText)
    Next

    Return catchallProcessing(totalText, target)

End Function

Public Function CountSlides(ByVal presentationFile As String) As Integer

    Using powerpointDocument As PresentationDocument = PresentationDocument.Open(presentationFile, False)

        Return CountSlides(powerpointDocument)

    End Using


End Function

Public Function CountSlides(ByVal powerpointDocument As PresentationDocument) As Integer


    If powerpointDocument Is Nothing Then

        Throw New ArgumentNullException("presentationDocument")

    End If

    Dim slidesCount As Integer = 0

    Dim presentationPart As PresentationPart = powerpointDocument.PresentationPart

    If presentationPart IsNot Nothing Then

        slidesCount = presentationPart.SlideParts.Count()

    End If

    Return slidesCount

End Function

Public Sub GetSlideIdandText(ByRef sldText As String, ByVal docName As String, ByVal index As Integer)

    Using ppt As PresentationDocument = PresentationDocument.Open(docName, False)

        Dim part As PresentationPart = ppt.PresentationPart
        Dim slideIDs As OpenXmlElementList = part.Presentation.SlideIdList.ChildElements
        Dim relID As String = TryCast(slideIDs(index), SlideId).RelationshipId

        Dim slide As SlidePart = DirectCast(part.GetPartById(relID), SlidePart)

        ' Not every slide has a notes page, so guard against a missing notes part.
        Dim notesText As New StringBuilder()
        Dim notesSlide As NotesSlidePart = slide.NotesSlidePart

        If notesSlide IsNot Nothing Then

            For Each text As A.Text In notesSlide.NotesSlide.Descendants(Of A.Text)()
                notesText.Append(text.Text)
            Next

        End If

        Dim paragraphText As New StringBuilder()

        For Each text As A.Text In slide.Slide.Descendants(Of A.Text)()
            paragraphText.Append(text.Text)
        Next

        sldText = paragraphText.ToString() & notesText.ToString() 'concatenates the notes and slide text for searching

    End Using

End Sub

#End Region

#Region "PDF Processing"

Public Function pdfProcesssing(ByVal strDoc As String, ByVal target As String) As Integer

    Dim stringOut As StringBuilder = New StringBuilder()

    ' Check that the file exists before constructing the reader.
    If File.Exists(strDoc) Then

        Dim oReader As New iTextSharp.text.pdf.PdfReader(strDoc)

        Try

            For i = 1 To oReader.NumberOfPages

                Dim itsText As New iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy
                stringOut.Append(iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(oReader, i, itsText))

            Next

        Finally

            oReader.Close() ' release the file once the text has been read

        End Try

    End If

    Return catchallProcessing(stringOut.ToString(), target)

End Function

#End Region

End Module
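
One caveat about fileExtention above: InStr tests the whole path, so a folder name that happens to contain ".pdf" or ".docx" would also match. A possible alternative, shown here only as a sketch, is to compare the value returned by Path.GetExtension. ExtensionSketch, ClassifyExtension and the returned labels are hypothetical names used for illustration; the real module would call its own processing functions in each branch.

```vbnet
Imports System.IO

Module ExtensionSketch

    ' Hypothetical helper: maps a path to the kind of processing it needs.
    Public Function ClassifyExtension(ByVal sourcePath As String) As String

        ' GetExtension looks only at the file name at the end of the path
        ' and returns an empty string when there is no extension.
        Select Case Path.GetExtension(sourcePath).ToLowerInvariant()
            Case ".docx", ".docm"
                Return "word"
            Case ".xlsx", ".xlsm"
                Return "excel"
            Case ".pptx", ".pptm"
                Return "powerpoint"
            Case ".pdf"
                Return "pdf"
            Case Else
                Return "other"
        End Select

    End Function

End Module
```

Because the comparison is on the lower-cased extension only, "REPORT.PDF" and "report.pdf" are treated the same, and a file such as "notes.pdf.txt" falls through to the catch-all branch.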
answered 2013-03-29T18:43:42.987