
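I am working with strings that appear to come from MS Office documents. Note that in this example there are two BOM "characters": one at the start of the string and one in the body. Sometimes there are several, sometimes none. In the PowerShell console they print as ?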

<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta http-equiv=Content-Type content="text/html; charset=unicode"><meta name=Generator content="Microsoft Word 14 (filtered medium)"><style><!--
/* Font Definitions */
@font-face
    {font-family:Calibri;
    panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
    <snip - bunch of style defs>
--></style></head><body lang=EN-US link=blue vlink=purple><div class=WordSection1>
<p class=MsoNormal style='text-autospace:none'>
 <span style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'></span>
 <span style='font-size:12.0pt;font-family:"Times New Roman","serif"'>Testing <o:p></o:p></span>
</p></div></body></html>

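The strings come from an object, so I can't simply use Get-Content to force UTF8 encoding. How else can I strip them? I'm not worried about this being lossy: the output is only piped to the display, so dropping the extra characters is what I want. I will also be stripping the HTML.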


3 Answers


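If the strings may contain other genuine UTF8 characters, an alternative is to go this route. It assumes the byte order mark characters sit at the start of each string: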

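# Turn each character back into a byte. The cast only succeeds when every
# character is U+00FF or below, i.e. when each original byte became one character.
$bytes = @()
$strs | ForEach-Object { $bytes += [byte[]][char[]]$_ }

$memStream = New-Object System.IO.MemoryStream
$memStream.Write($bytes, 0, $bytes.Length)
$memStream.Position = 0

# StreamReader decodes the bytes as UTF-8 and skips a BOM at the start of the stream
$reader = New-Object System.IO.StreamReader($memStream, [System.Text.Encoding]::UTF8)
$reader.ReadToEnd()
$reader.Dispose()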
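
A per-string variant of the same idea can be sketched as follows. It is a minimal sketch, not the answer's own code: the function name `Remove-MisdecodedBom` is hypothetical, and it assumes each original byte ended up as one character in the range U+0000 to U+00FF (as with an ISO-8859-1 style decode). Characters above U+00FF would be lost. Unlike the stream approach, it also removes a BOM that sits in the body of the string.

```powershell
# Hypothetical helper; the name and the Latin-1 assumption are illustrative only.
function Remove-MisdecodedBom {
    param([string]$Text)

    # ISO-8859-1 (code page 28591) maps U+0000..U+00FF one-to-one onto bytes 0x00..0xFF.
    $latin1 = [System.Text.Encoding]::GetEncoding(28591)
    $bytes  = $latin1.GetBytes($Text)

    # Decode the recovered bytes as UTF-8; each EF BB BF sequence becomes U+FEFF.
    $decoded = [System.Text.Encoding]::UTF8.GetString($bytes)

    # Remove every U+FEFF, wherever it appears in the string.
    $decoded.Replace([string][char]0xFEFF, '')
}

# Sample input: the three BOM bytes as separate characters, at the start and in the body.
$bom    = "$([char]0xEF)$([char]0xBB)$([char]0xBF)"
$sample = $bom + 'Testing ' + $bom + 'done'
Remove-MisdecodedBom $sample
```

For the sample above the function returns the text with both BOM sequences removed.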
answered 2013-02-14T19:49:19.513

You should include the code you use to get your output when you ask for help. Does this work?

$s = #your code that gets the output#
# Assumption: the BOM is present in the string as the single character U+FEFF
$s -replace "\uFEFF"  #returns output without the characters

Or

( code that creates output ) -replace "\uFEFF"
answered 2013-02-14T17:57:55.730

Here is the PowerShell script I use to remove embedded UTF-8 BOM characters from source files:

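# Collect every .h and .cpp file under the current directory
$files = Get-ChildItem -Path . -Include @("*.h", "*.cpp") -Recurse
foreach ($f in $files)
{
    # The parentheses read the whole file before Set-Content overwrites it
    (Get-Content $f.PSPath) |
        ForEach-Object { $_ -replace "\xEF\xBB\xBF", "" } |
        Set-Content $f.PSPath
}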
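
The same replacement can also be applied to a string that is already in memory, which is closer to the situation in the question. The following is a rough sketch under stated assumptions: the object and its `Body` property are made up for illustration, and the pattern covers both forms a BOM might take in a .NET string (the three characters U+00EF U+00BB U+00BF, or the single character U+FEFF). It uses `-creplace` because the default `-replace` is case-insensitive and would then also accept U+00CF in place of U+00EF.

```powershell
# Illustrative sample object; the Body property name is an assumption.
$bomChars = "$([char]0xEF)$([char]0xBB)$([char]0xBF)"
$item = [pscustomobject]@{
    Body = $bomChars + 'Testing ' + [char]0xFEFF + 'done'
}

# Match the BOM either as three separate characters or as U+FEFF.
$pattern = '\xEF\xBB\xBF|\uFEFF'

# -creplace is the case-sensitive form of -replace.
$clean = $item.Body -creplace $pattern, ''
$clean
```

Because the pattern is an alternation, it can be left in place even when only one of the two forms ever shows up in the data.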
answered 2015-02-25T18:20:14.023