html - PowerShell `-replace` regex not matching newlines

Question

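I'm trying to clean up some HTML files with regular expressions (yes, I've seen the post; I don't want to parse HTML in general), and I want to remove all lines that contain no tags. My script is as follows: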

Remove-Item $args[1]
$text = (Get-Content -Path $args[0] -Raw)
$text = $text -replace "^\s*\r?\n"
New-Item -Path $args[1] -ItemType File -Force -Value $text

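There is a lot more I want to replace, but for now I'm mainly trying to get this one replacement working.

I can verify that the regex itself works: VSCode (which uses JS regex rather than PowerShell's .NET regex) correctly matches (and replaces) the offending lines with the regex given.

I know PowerShell is Special, so I converted the output of Get-Content to a raw string with embedded newlines. That didn't help.

I can verify that the other commands (namely Remove-Item and New-Item) work correctly, and that other regexes work: changing the regex arguments from "^\s*\r?\n" to "p", "abc" turns all the p tags into abc tags.

Furthermore, the regex \s*\r?\n works, so it isn't that the regex fails to find newlines.

The regex \A\s*\r?\n doesn't work either, which suggests the problem has to do with how PowerShell finds the start/end of a string.

What is going on here?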

<p>This is some text</p>

(the next line has a bunch of spaces)
               

<p>this is some more text</p>

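For reference, with VSCode's JS regex engine (which I believe is similar to PCRE), my regex should (and does) match the second, fourth, and fifth lines of the example above.

Finally, the regex broken down: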

^         from the start of the string
 \s*      match any number of whitespaces
    \r?   possibly followed by a carriage return
       \n then a newline

score 4 · Accepted Answer

When you do

$text = (Get-Content -Path $args[0] -Raw)

you have line endings inside $text, and your regex can match them.

The ^ anchor can also match at the start of any line, but that requires a special flag:

$text = $text -replace '(?m)^\s*\n'

The \s pattern already covers carriage returns, so there is no need to handle them separately with \r?.

Explanation

--------------------------------------------------------------------------------
  (?m)                     set flags for this block (with ^ and $
                           matching start and end of line) 
--------------------------------------------------------------------------------
  ^                        the beginning of a "line"
--------------------------------------------------------------------------------
  \s*                      whitespace (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  \n                       '\n' (newline)
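
To see the effect of the flag in isolation, here is a minimal sketch on an in-memory string. The variable names and the sample text are made up for illustration (they mirror the question's example, with LF line endings); nothing is read from or written to disk.

```powershell
# Hypothetical sample text: two <p> lines separated by empty and whitespace-only lines.
$sample = "<p>This is some text</p>`n`n   `n`n<p>this is some more text</p>`n"

# Without (?m), ^ matches only at the very start of the string,
# where no blank line begins, so nothing is replaced.
$withoutFlag = $sample -replace '^\s*\n'

# With (?m), ^ also matches right after each newline,
# so the run of blank lines is matched and removed.
$withFlag = $sample -replace '(?m)^\s*\n'

$withoutFlag -eq $sample                                                     # True
$withFlag -eq "<p>This is some text</p>`n<p>this is some more text</p>`n"   # True
```

Because \s* is greedy and also matches newlines, one match swallows the whole run of consecutive blank lines rather than one line at a time.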

score 4 · Accepted Answer

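Ryszard Czech's helpful answer explains the problem with your approach well and offers an effective solution.

In essence, you want to eliminate empty or blank (all-whitespace) lines from a file.

A simpler (though slower) solution is to take advantage of Get-Content's default line-by-line streaming, combined with the ability of many PowerShell operators to operate on an input array, in which case they act as filters.

In this case, you can use the -match operator (adjust -Encoding as needed):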

@(Get-Content -Path $args[0]) -match '\S' | Set-Content -Encoding UTF8 $args[1]

The above passes all lines from file $args[0] that contain at least one non-whitespace character (\S) to Set-Content, which saves the filtered lines to the target file $args[1].
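
The filtering behavior can be tried without any files. The following is a minimal sketch in which a hypothetical $lines array stands in for the output of Get-Content:

```powershell
# Hypothetical in-memory lines standing in for what Get-Content would return.
$lines = @('<p>This is some text</p>', '', '      ', '', '<p>this is some more text</p>')

# With an array on the left-hand side, -match returns the matching elements
# rather than a Boolean, so empty and whitespace-only lines are dropped.
$kept = @($lines) -match '\S'

$kept.Count   # 2
$kept         # the two <p> lines
```

The @(...) wrapper matters when the input could be a single line: with a scalar string on the left, -match returns a Boolean instead of filtering.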

score 0 · Accepted Answer

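The trick is that you don't actually have more than one line to match against.

When you read the file into a string with -Raw, you get one single string rather than an array of lines. ^ therefore matches only at the start of the file, because by default that is the only start-of-string position the regex engine recognizes.

The workaround is to match either the newline at the end of the previous line or the start of the file, and carry it over into the replacement, like so: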

# Single quotes keep PowerShell from expanding $1 as a variable before the regex engine sees it.
$text = $text -replace '(^|\n)\s*\r?\n', '$1'
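
As a quick check of the capture-group approach, here is a minimal sketch on a hypothetical sample string with CRLF line endings. It uses single-quoted strings, since inside double quotes PowerShell would try to expand $1 as a variable before the regex engine ever sees it.

```powershell
# Hypothetical sample with CRLF line endings and two blank lines in the middle.
$sample = "<p>one</p>`r`n`r`n   `r`n<p>two</p>`r`n"

# Group 1 captures either the start of the string or the newline that ends
# the previous line; '$1' puts that captured text back in the replacement.
$result = $sample -replace '(^|\n)\s*\r?\n', '$1'

$result -eq "<p>one</p>`r`n<p>two</p>`r`n"   # True
```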
