html - Retrieving text from HTML with PowerShell

Question

In this HTML:

<div id="ajaxWarningRegion" class="infoFont"></div>
  <span id="ajaxStatusRegion"></span>
  <form enctype="multipart/form-data" method="post" name="confIPBackupForm" action="/cgi-bin/utilserv/confIPBackup/w_confIPBackup" id="confIPBackupForm" >
    <pre>
      Creating a new ZIP of IP Phone files from HTTP/PhoneBackup 
      and HTTPS/PhoneBackup
    </pre>
    <pre> /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip</pre>
    <pre>Reports Success</pre>
    <pre></pre>
    <a href =  /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip>
      Download the new ZIP of IP Phone files
    </a>
  </div>

I want to retrieve the text IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip, or just the date and time between IP_PHONE_BACKUP- and .zip.

How can I do that?

score 10 · Accepted Answer

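What makes this question interesting is that HTML looks and smells like XML, and XML is far easier to work with programmatically because of its well-behaved, orderly structure. In an ideal world HTML would be a subset of XML, but real-world HTML is emphatically not XML. If you feed the example in the question to any XML parser, it will balk at a variety of infractions. That said, the desired result can be achieved with a single line of PowerShell. This one returns the whole text of the href: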

Select-NodeContent $doc.DocumentNode "//a/@href"

And this one extracts the desired substring:

Select-NodeContent $doc.DocumentNode "//a/@href" "IP_PHONE_BACKUP-(.*)\.zip"

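The catch, however, is the overhead and setup needed before you can run that one line of code. You need to:

Install HtmlAgilityPack to make HTML parsing look just like XML parsing.
Install PowerShell Community Extensions (PSCX) if you want to parse live web pages.
Understand XPath so you can construct a navigable path to your target node.
Understand regular expressions so you can extract a substring from your target node.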

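With those requirements satisfied, you can add the HtmlAgilityPack type to your environment and define the Select-NodeContent function, as shown below. The very end of the code shows how to assign a value to the $doc variable used in the one-liners above. I show how to load HTML from either a file or the web, depending on your needs.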

Set-StrictMode -Version Latest
$HtmlAgilityPackPath = [System.IO.Path]::Combine((Get-Item $PROFILE).DirectoryName, "bin\HtmlAgilityPack.dll")
Add-Type -Path $HtmlAgilityPackPath

function Select-NodeContent(
    [HtmlAgilityPack.HtmlNode]$node,
    [string] $xpath,
    [string] $regex,
    [Object] $default = "")
{
    if ($xpath -match "(.*)/@(\w+)$") {
        # If standard XPath to retrieve an attribute is given,
        # map to supported operations to retrieve the attribute's text.
        ($xpath, $attribute) = $matches[1], $matches[2]
        $resultNode = $node.SelectSingleNode($xpath)
        # "?:" is the Invoke-Ternary alias from PSCX: {condition} {value if true} {value if false}
        $text = ?: { $resultNode } { $resultNode.Attributes[$attribute].Value } { $default }
    }
    else { # retrieve an element's text
        $resultNode = $node.SelectSingleNode($xpath)
        $text = ?: { $resultNode } { $resultNode.InnerText } { $default }
    }
    # If a regex is given, use it to extract a substring from the text
    if ($regex) {
        if ($text -match $regex) { $text = $matches[1] }
        else { $text = $default }
    }
    return $text
}

$doc = New-Object HtmlAgilityPack.HtmlDocument
$result = $doc.Load("tmp\temp.html") # Use this to load a file
#$result = $doc.LoadHtml((Get-HttpResource $url)) # Use this PSCX cmdlet to load a live web page
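
The regex step at the end of Select-NodeContent can be tried on its own, without HtmlAgilityPack installed. The following is a minimal sketch, assuming the XPath step has already returned the href value as a plain string; the variable names $href, $pattern and $stamp are illustrative and not part of the answer's code.

```powershell
# Assumed input: the href text that "//a/@href" would yield for the question's HTML.
$href    = '/tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip'
$pattern = 'IP_PHONE_BACKUP-(.*)\.zip'

# -match fills the automatic $Matches table; index 1 holds the first capture group.
if ($href -match $pattern) {
    $stamp = $Matches[1]
}
else {
    $stamp = ''   # plays the same role as the function's $default parameter
}

$stamp   # 2012-Jul-25_15:47:47
```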

score 1 · Accepted Answer

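Actually, the HTML surrounding the file name is irrelevant here. You can extract the date just fine with the following regular expression (it does not even care whether you are pulling it from an email, an HTML page, or a CSV file):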

(?<=/tmp/IP_PHONE_BACKUP-)[^.]+(?=\.zip)

Quick test:

PS> [regex]::Match($html, '(?<=/tmp/IP_PHONE_BACKUP-)[^.]+(?=\.zip)')

Groups   : {2012-Jul-25_15:47:47}
Success  : True
Captures : {2012-Jul-25_15:47:47}
Index    : 391
Length   : 20
Value    : 2012-Jul-25_15:47:47
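
If only the matched text is needed, the Value property of the match object is enough, and the timestamp can then be turned into a DateTime. The sketch below is one way to do that, assuming the timestamp always follows the yyyy-MMM-dd_HH:mm:ss layout seen in the question and uses English month abbreviations; the shortened $html string and the variable names are illustrative.

```powershell
# Assumed input: a shortened stand-in for the page text from the question.
$html = '<pre> /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip</pre>'

$m = [regex]::Match($html, '(?<=/tmp/IP_PHONE_BACKUP-)[^.]+(?=\.zip)')
if ($m.Success) {
    $stamp = $m.Value   # just the text between the prefix and ".zip"

    # Parse with the invariant culture so "Jul" is recognized regardless of the local language.
    $when = [datetime]::ParseExact(
        $stamp,
        'yyyy-MMM-dd_HH:mm:ss',
        [System.Globalization.CultureInfo]::InvariantCulture)
}
```

Checking $m.Success first avoids passing an empty string to ParseExact when the pattern is not found.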

score 0 · Accepted Answer

Without a regular expression:

$a = '<div id="ajaxWarningRegion" class="infoFont"></div><span id="ajaxStatusRegion"></span><form enctype="multipart/form-data" method="post" name="confIPBackupForm" action="/cgi-bin/utilserv/confIPBackup/w_confIPBackup" id="confIPBackupForm" ><pre>Creating a new ZIP of IP Phone files from HTTP/PhoneBackup and HTTPS/PhoneBackup</pre><pre> /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip</pre><pre>Reports Success</pre><pre></pre><a href =  /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip>Download the new ZIP of IP Phone files</a></div>'
$a.Substring($a.IndexOf("IP_PHONE_BACKUP")+"IP_PHONE_BACKUP".length+1, $a.IndexOf(".zip")-$a.IndexOf("IP_PHONE_BACKUP")-"IP_PHONE_BACKUP".length-1)

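Substring gives you part of the original string. The first parameter is the starting position of the substring and the second is the length of the desired substring. So all you have to do is use a little magic with IndexOf and Length to calculate the start and the length.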
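
The same arithmetic is easier to follow when the start and end positions get names. This is a sketch of that idea rather than the answer's exact code: it includes the hyphen in the prefix instead of adding 1, searches for ".zip" only after the prefix, and uses a shortened input string; $text, $prefix, $start, $end and $stamp are illustrative names.

```powershell
# Assumed input: a shortened stand-in for the HTML string in $a.
$text   = '<pre> /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip</pre>'
$prefix = 'IP_PHONE_BACKUP-'

$stamp = ''
$prefixAt = $text.IndexOf($prefix)
if ($prefixAt -ge 0) {
    # The first character after the prefix is where the timestamp starts.
    $start = $prefixAt + $prefix.Length
    # Look for ".zip" only from $start onward, so earlier text cannot interfere.
    $end = $text.IndexOf('.zip', $start)
    if ($end -ge 0) {
        $stamp = $text.Substring($start, $end - $start)
    }
}

$stamp   # 2012-Jul-25_15:47:47
```

IndexOf returns -1 when the text is not found, so the two checks keep Substring from being called with invalid positions.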

score 0 · Accepted Answer

Group 2 and group 3 of the following regular expression contain the date and the time, respectively:

/IP_PHONE_BACKUP-((.*)_(.*))\.zip/

Here is a link on extracting values from a regular expression in PowerShell:

Is there a shorter way to pull groups out of a PowerShell regex?
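
In PowerShell itself, the groups can be read from the automatic $Matches table after the -match operator. The following sketch uses this answer's pattern without the surrounding slashes and with the dot before zip escaped so it matches only a literal period. It assumes the pattern is applied to a single path string rather than the whole page, because the greedy .* could span both occurrences of the file name if they sat on one line.

```powershell
# Assumed input: one file path taken from the question's HTML.
$path = '/tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip'

if ($path -match 'IP_PHONE_BACKUP-((.*)_(.*))\.zip') {
    $stamp = $Matches[1]   # outer group: date and time together
    $date  = $Matches[2]   # text before the underscore
    $time  = $Matches[3]   # text after the underscore
}

"$date / $time"   # 2012-Jul-25 / 15:47:47
```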

HIH
