html - ms访问解析html文件获取数据

Question

我已经编写了必要的代码来解析一个目录并找到所有的 html 文件。但是，我需要解析每个文件以获取必要的信息。我需要提取每个故事的故事标题、作者、类别、章节数、来源和摘要，并将它们添加到数据库中的正确字段中。每个 html 文件都以相同的方式设置。如果可能的话，我也想为每个故事计算字数。字数将是每个 CHAPTER TEXT 区域中所有单词的总和。下面是每个 html 文件的编写方式的概要。请让我知道实现这一目标的最佳方法。

<html>
<head>
    <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
    <meta name="author" content="AUTHOR">
    <title>AUTHOR: TITLE</title>
</head>
<body>
    <br/><br/>
    <div style="text-align:center">
        <h1>TITLE</h1>
    </div>
    <b>Story:</b> TITLE<br>
    <b>Storylink:</b> <a href="URL">URL
    <b>Category:</b> CATEGORY<br>
    <b>Author:</b> AUTHOR<br/>
    <b>Last updated:</b> 10/16/2011<br/>
    <b>Status:</b> STATUS<br/>
    <b>Content:</b> Chapter 1 to 16 of 16 chapters<br/>
    <b>Source:</b> SOURCE<br><br>
    <b>Summary:</b> SUMMARY

    <!--CHAPTERAREA START-->
        <h2 class=chapterffdl>*Chapter 1*: Chapter 1</h2>
        CHAPTER TEXT CHAPTER TEXT CHAPTER TEXT

        <h2 class=chapterffdl>*Chapter 2*: Chapter 2</h2>
        CHAPTER TEXT CHAPTER TEXT CHAPTER TEXT

        ...

    <!--CHAPTERAREA STOP-->

</body>
</html>

score 0 · Accepted Answer

一个开始：

Sub ParseHTML()
''Requires that you add a reference to the
''Windows Script Host Object Model
''Use Tools->References

Dim fs As New FileSystemObject
Dim fl As Folder
Dim f As File
Dim ts As TextStream
Dim sList As Variant
Dim rs As Recordset

''s="create table stories (story text,author text,category text,content text,source text,summary text)
''CurrentDb.Execute S

Set rs = CurrentDb.OpenRecordset("stories")

sList = Split("story,author,category,content,source,summary", ",")

Set fl = fs.GetFolder("Z:\docs\")
For Each f In fl.Files
    If f.Type Like "*HTML*" Then
        Debug.Print f.Name
        Set ts = fs.OpenTextFile(f.Path, ForReading)
        Do While Not ts.AtEndOfStream
            a = ts.ReadLine
            For i = 0 To UBound(sList)
                If Left(Trim(a), Len(sList(i)) + 4) = "<b>" & sList(i) & ":"
                    If Trim(a) Like "<b>Story:*" Then
                        rs.AddNew
                    End If
                    rs(sList(i)) = Trim(a)
                    If Trim(a) Like "<b>Summary:*" Then
                        rs.Update
                        Exit Do
                    End If
                End If
            Next
        Loop
    End If
Next

End Sub

html - ms访问解析html文件获取数据

1 回答 1

Related

Reference