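What makes this question so interesting is that HTML looks and smells just like XML, and XML is far easier to work with programmatically thanks to its well-behaved, orderly structure. In an ideal world HTML would be a subset of XML, but real-world HTML is emphatically not XML. If you feed the example in the question to any XML parser, it will balk at a variety of violations. That said, the desired result can be achieved with a single line of PowerShell. This one returns the entire text of the href: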
Select-NodeContent $doc.DocumentNode "//a/@href"
And this one extracts the desired substring:
Select-NodeContent $doc.DocumentNode "//a/@href" "IP_PHONE_BACKUP-(.*)\.zip"
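To see what the regex argument does on its own, here is a minimal sketch that uses only PowerShell's built-in -match operator, with no HTML parsing involved. The href value is a made-up sample for illustration (it is not taken from the question), and $stamp is just a name I picked.

```powershell
# Hypothetical href value, for illustration only
$href = "http://example.com/files/IP_PHONE_BACKUP-2012-01-01.zip"

$stamp = ""
# On a successful match, -match fills the automatic $matches table;
# index 1 holds the text captured by the first (.*) group.
if ($href -match "IP_PHONE_BACKUP-(.*)\.zip") { $stamp = $matches[1] }

$stamp   # 2012-01-01
```

This is the same pattern passed as the third argument in the one-liner above; the function simply applies it to whatever text the XPath selected.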
The catch, however, is the overhead/setup required before you can run that one line of code. You need:
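With those requirements satisfied, you can add the HtmlAgilityPack type to your environment and define the Select-NodeContent function, shown below. The very end of the code shows how to assign a value to the $doc variable used in the one-liners above. I show how to load HTML from either a file or the web, depending on your needs.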
Set-StrictMode -Version Latest

# Load HtmlAgilityPack from a "bin" folder next to your PowerShell profile
$HtmlAgilityPackPath = [System.IO.Path]::Combine((Get-Item $PROFILE).DirectoryName, "bin\HtmlAgilityPack.dll")
Add-Type -Path $HtmlAgilityPackPath

function Select-NodeContent(
    [HtmlAgilityPack.HtmlNode] $node,
    [string] $xpath,
    [string] $regex,
    [Object] $default = "")
{
    # Note: ?: is the PSCX alias for Invoke-Ternary (condition, if-true, if-false)
    if ($xpath -match "(.*)/@(\w+)$") {
        # If standard XPath to retrieve an attribute is given,
        # map to supported operations to retrieve the attribute's text.
        ($xpath, $attribute) = $matches[1], $matches[2]
        $resultNode = $node.SelectSingleNode($xpath)
        $text = ?: { $resultNode } { $resultNode.Attributes[$attribute].Value } { $default }
    }
    else { # retrieve an element's text
        $resultNode = $node.SelectSingleNode($xpath)
        $text = ?: { $resultNode } { $resultNode.InnerText } { $default }
    }

    # If a regex is given, use it to extract a substring from the text
    if ($regex) {
        if ($text -match $regex) { $text = $matches[1] }
        else { $text = $default }
    }
    return $text
}

$doc = New-Object HtmlAgilityPack.HtmlDocument
$result = $doc.Load("tmp\temp.html") # Use this to load a file
#$result = $doc.LoadHtml((Get-HttpResource $url)) # Use this PSCX cmdlet to load a live web page
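One thing to watch for: the ?: used inside Select-NodeContent is not built into PowerShell. It is the alias PSCX provides for its Invoke-Ternary cmdlet. If you do not have PSCX installed, a rough stand-in could look like the sketch below. This is my own minimal approximation, not the PSCX implementation, and the parameter names are hypothetical.

```powershell
# Hypothetical stand-in for PSCX's Invoke-Ternary (not the PSCX source).
# Runs $condition; if its result is truthy, runs $ifTrue, otherwise $ifFalse.
function Invoke-Ternary(
    [scriptblock] $condition,
    [scriptblock] $ifTrue,
    [scriptblock] $ifFalse)
{
    if (& $condition) { & $ifTrue } else { & $ifFalse }
}
Set-Alias -Name '?:' -Value Invoke-Ternary

# Usage mirroring the function above: fall back to a default when the node is missing
$node = $null
$text = ?: { $node } { $node.InnerText } { "(none)" }
$text   # (none)
```

Because only the chosen script block is executed, $node.InnerText is never evaluated when $node is null, which is why the function can safely use this form under strict mode.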