powershell - Counting sentences in a file with PowerShell

Question

My PowerShell program has trouble counting the number of sentences in the file I am working with. I am using the following code:

foreach ($Sentence in (Get-Content file))
{
    $i = $Sentence.Split("?")
    $n = $Sentence.Split(".")
    $Sentences += $i.Length
    $Sentences += $n.Length
}

The total number of sentences should be 61, but I get 71. Can anyone help me with this? I have also initialized $Sentences to zero.

Thanks

score 0 · Accepted Answer

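When counting sentences, what you are looking for is where each sentence ends. Splitting, however, returns a collection of sentence fragments around those terminating characters, with the terminators themselves represented by the gaps between elements. The number of sentences therefore equals the number of gaps, which is the number of fragments in the split result minus one.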
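To make the off-by-one concrete, here is a minimal sketch on a single hypothetical line of text (the variable names are illustrative and not taken from the question). It also suggests why the code in the question overcounts: each line goes through two separate splits, and each split returns one fragment more than the number of terminators it found.

```powershell
# Hypothetical sample line containing two sentences.
$Line = 'Is this a sentence? This is a sentence.'

# Splitting on a single character yields (occurrences + 1) fragments.
$QuestionFragments = $Line.Split([char]'?')   # one '?' gives 2 fragments
$PeriodFragments   = $Line.Split([char]'.')   # one '.' gives 2 fragments

# Adding raw fragment counts, as the question does, overcounts by 2 per line.
$Overcount = $QuestionFragments.Length + $PeriodFragments.Length

# Subtracting 1 from each split counts the gaps, i.e. the terminators.
$Sentences = ($QuestionFragments.Length - 1) + ($PeriodFragments.Length - 1)

"Fragments added up: $Overcount; sentences: $Sentences"
```

A surplus of 2 per line would be consistent with getting 71 instead of 61 on a five-line file, although the question does not show the file itself.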

Of course, as Keith Hill pointed out in a comment above, the actual split is unnecessary when you can simply count the terminators directly.

$SplitSentences = 0
$CountedSentences = 0

foreach( $Sentence in (Get-Content test.txt) ) {
  # Split at every occurrence of '.' and '?', and count the gaps.
  # The [char[]] cast makes sure each character is used as a separator.
  $Split = $Sentence.Split( [char[]]'.?' )
  $SplitSentences += $Split.Count - 1

  # Count every occurrence of '.' and '?'.
  $Ends = [char[]]$Sentence -match '[.?]'
  $CountedSentences += $Ends.Count
}

Contents of test.txt:

Is this a sentence? This is a 
sentence. Is this a sentence? 
This is a sentence. Is this a
very long sentence that spans
multiple lines?
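
As a quick check, the sample above can be counted without a file on disk. This sketch assumes the five lines are held in an array (standing in for what Get-Content test.txt would return) and that only '.' and '?' end a sentence; $Lines and $Count are illustrative names.

```powershell
# The five lines of test.txt, inlined so the sketch runs on its own.
$Lines = @(
    'Is this a sentence? This is a '
    'sentence. Is this a sentence? '
    'This is a sentence. Is this a'
    'very long sentence that spans'
    'multiple lines?'
)

$Count = 0
foreach ($Line in $Lines) {
    # Keep only the characters that are '.' or '?' and count them.
    $Count += @([char[]]$Line -match '[.?]').Count
}

$Count   # 5 for this sample
```

The sentence that spans three lines is counted once, because only its terminating '?' contributes to the total.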

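Also, to clarify the comments on Vasili's answer: the PowerShell -split operator interprets its string as a regular expression by default, whereas the .NET Split method works only with literal values.

For example:

'Unclosed [bracket?' -split '[?]' treats [?] as a regex character class and matches the ? character, returning two strings: 'Unclosed [bracket' and ''
'Unclosed [bracket?'.Split( '[?]' ) calls the Split(char[]) overload and matches each of the [, ?, and ] characters, returning three strings: 'Unclosed ', 'bracket', and ''. This is the Windows PowerShell behavior; PowerShell 6 and later may bind a string argument to a Split(string) overload instead, so an explicit [char[]] cast is the safer way to get this result.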
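
The difference can be checked directly. In the sketch below, the explicit [char[]] cast is an assumption added here rather than part of the original example: it pins down the Split(char[]) overload regardless of which other overloads the installed .NET version offers. The sketch also shows the SimpleMatch option of -split, which turns off regex interpretation.

```powershell
$Text = 'Unclosed [bracket?'

# -split: '[?]' is a regex character class that matches only '?'.
$RegexParts = $Text -split '[?]'

# String.Split with an explicit char[]: each of '[', '?' and ']' is a separator.
$CharParts = $Text.Split([char[]]'[?]')

# -split with SimpleMatch: the pattern is the literal three-character string '[?]',
# which does not occur in $Text, so the input comes back as a single element.
$LiteralParts = $Text -split '[?]', 0, 'SimpleMatch'

"{0} / {1} / {2}" -f $RegexParts.Count, $CharParts.Count, $LiteralParts.Count
```

The three counts should come out as 2, 3, and 1 respectively.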

score 0 · Accepted Answer

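$Sentences = 0
foreach ($Sentence in (Get-Content file))
{
    # -split treats its pattern as a regex; String.Split() would not.
    $i = $Sentence -split "[?\.]"
    # N terminators produce N + 1 fragments, so subtract one per line.
    $Sentences += $i.Length - 1
}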

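I edited your code slightly.

The . you used needs to be escaped when it is part of a regular expression; otherwise it is read as the regex dot metacharacter, which means "any character". Note that this applies to the regex-based -split operator; the .NET Split method takes its separators literally.

So you should split the string on "[?\.]" or similar.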
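
Here is a short sketch of why the escape matters once a regular expression is involved, using the -split operator on a hypothetical sample string. Inside a character class such as [?.] the dot is already literal, so the backslash there is optional.

```powershell
$Sample = 'One. Two? Three.'

# Unescaped: '.' matches every character, so each character acts as a separator
# and the result consists of empty strings only.
$AnyChar = $Sample -split '.'

# Escaped: only literal periods are separators.
$Escaped = $Sample -split '\.'

# Character class: '.' and '?' are both literal inside the brackets.
$Class = $Sample -split '[?.]'

# [regex]::Escape builds an escaped pattern from arbitrary literal text.
$Built = $Sample -split [regex]::Escape('.')

"{0} / {1} / {2} / {3}" -f $AnyChar.Count, $Escaped.Count, $Class.Count, $Built.Count
```

With three terminators in the sample, the character-class split returns four fragments, which again is one more than the number of sentences.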
