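I have a lot of HTML files that I need to extract text from. If the text is all on one line I can do this easily, but if the tags wrap or span multiple lines I don't know how to do it. Here's what I mean: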

<section id="MySection">
Some text here
another line here <br>
last line of text.
</section>

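I don't care about the <br> tag unless it helps with wrapping the text. The region I want always starts with "MySection" and ends with </section>. What I'd like to end up with is something like this: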

Some text here  another line here  last line of text.

I'd prefer something like VBScript or a command-line option (sed?), but I'm not sure where to start. Any help?

2 Answers

Normally you would use the Internet Explorer COM object for this:

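root = "C:\base\dir"

' The loop needs a FileSystemObject to enumerate the files.
Set fso = CreateObject("Scripting.FileSystemObject")
Set ie = CreateObject("InternetExplorer.Application")

For Each f In fso.GetFolder(root).Files
  ie.Navigate "file:///" & f.Path
  ' Wait until the page has finished loading.
  While ie.Busy : WScript.Sleep 100 : Wend

  text = ie.document.getElementById("MySection").innerText

  ' Join the lines with spaces so the text ends up on one line.
  WScript.Echo Replace(text, vbNewLine, " ")
Next

ie.Quit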

However, the <section> tag isn't supported before IE 9, and even in IE 9 the COM object doesn't seem to handle it correctly, because getElementById("MySection") returns only the opening tag:

>>> wsh.echo ie.document.getelementbyid("MySection").outerhtml
<SECTION id=MySection>

You could use a regular expression instead, though:

root = "C:\base\dir"

Set fso = CreateObject("Scripting.FileSystemObject")

' Captures everything between the opening tag and the first </section>.
Set re1 = New RegExp
re1.Pattern = "<section id=""MySection"">([\s\S]*?)</section>"
re1.Global  = False
re1.IgnoreCase = True

' Matches runs of <br> tags and whitespace so they can be collapsed.
Set re2 = New RegExp
re2.Pattern = "(<br>|\s)+"
re2.Global  = True
re2.IgnoreCase = True

For Each f In fso.GetFolder(root).Files
  html = fso.OpenTextFile(f.Path).ReadAll

  Set m = re1.Execute(html)
  If m.Count > 0 Then
    ' SubMatches(0) is the captured group of the first match.
    text = Trim(re2.Replace(m(0).SubMatches(0), " "))
    WScript.Echo text
  End If
Next
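
For comparison, the same two-step idea (capture the section body, then collapse <br> tags and whitespace into single spaces) can be sketched in Python. This is only an illustration and not part of the original answer; the function name `extract_section` and the `sample` string are hypothetical, and it assumes the opening tag is written exactly as in the question.

```python
import re

# Capture everything between the opening MySection tag and the first </section>.
SECTION_RE = re.compile(r'<section id="MySection">([\s\S]*?)</section>', re.IGNORECASE)
# Runs of <br> tags and/or whitespace, to be collapsed into one space.
FILLER_RE = re.compile(r'(?:<br>|\s)+', re.IGNORECASE)


def extract_section(html):
    """Return the section text on one line, or None if the section is absent."""
    match = SECTION_RE.search(html)
    if match is None:
        return None
    return FILLER_RE.sub(" ", match.group(1)).strip()


sample = """<section id="MySection">
Some text here
another line here <br>
last line of text.
</section>"""

print(extract_section(sample))
```

Like the VBScript version, this breaks as soon as the opening tag carries extra attributes or different quoting, so it is only reasonable when the files are known to be uniform.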
answered 2013-05-18T23:53:55.363

Here is a one-liner solution using perl and an HTML parser from the Mojolicious framework:

perl -MMojo::DOM -E '
    say Mojo::DOM->new( do { undef $/; <> } )->at( q|#MySection| )->text
' index.html

Assuming index.html has the following content:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
</head>
<body id="portada">
<section id="MySection">
Some text here
another line here <br>
last line of text.
</section>
</body>
</html>

It yields:

Some text here another line here last line of text.
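
If Perl isn't available, the same parser-based approach can be sketched with Python's standard-library `html.parser`. This is a rough illustration rather than an equivalent of Mojo::DOM; the class name `SectionText` and the helper `section_text` are hypothetical, and it only collects text inside the element with the given id.

```python
from html.parser import HTMLParser


class SectionText(HTMLParser):
    """Collect the text found inside the element whose id is target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.tag = None    # tag name of the target element once found
        self.depth = 0     # > 0 while inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Only count nested tags of the same name, so void tags
            # such as <br> do not upset the depth tracking.
            if tag == self.tag:
                self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.tag = tag
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def section_text(html, target_id="MySection"):
    """Return the target element's text with all whitespace runs collapsed."""
    parser = SectionText(target_id)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


page = """<html><body id="portada">
<section id="MySection">
Some text here
another line here <br>
last line of text.
</section>
</body></html>"""

print(section_text(page))
```

Because the parser tokenizes the tags itself, extra attributes or line breaks inside the opening tag don't matter here, which is the main advantage over the regex approach.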
answered 2013-05-18T22:16:44.580