c# - How to fix encoding when I get an ASCII value of 63 where a regular space should be

Question

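In my C# code I extract text from a PDF, but the returned text contains some strange characters. If I search for "CLE action", which I know is in the PDF document, the search returns false. After extracting the text, I found that the space between the two words has an ASCII byte value of 63...

Is there a quick way to fix the text encoding?

At the moment I am using this method, but I think it is slow and it only works for that one character. Is there a fast method that works for all characters?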

    public static string fix_encoding(string src)
    {
        StringWriter return_str = new StringWriter();
        byte[] byte_array = Encoding.ASCII.GetBytes(src.Substring(0, src.Length));
        int len = byte_array.Length;
        byte byt;
        for(var i=0; i<len; i+=1)
        {
            byt = byte_array[i];
            if (byt == 63)
            {
                return_str.Write(" ");
            }
            else
            {
                return_str.Write(Encoding.ASCII.GetString(byte_array, i, 1));
            }
        }
        return return_str.ToString();
    }

This is how I call the method:

                StringWriter output = new StringWriter();
                output.WriteLine(PdfTextExtractor.GetTextFromPage(reader, page, new SimpleTextExtractionStrategy()));
                currentText = fix_encoding(output.ToString());

score 2 · Accepted Answer

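The spaces you extract from the PDF file are probably not real spaces (" ") but some other kind of space defined in Unicode, for example an "em space" or a "no-break space". Unicode defines a number of such whitespace characters.

If the extracted text contains such a space and you search the text for a normal space, you will not find it, because the two characters are not the same.

Your fix_encoding function converts the string to ASCII. None of these unusual spaces exist in ASCII, and by default non-ASCII characters are converted to a question mark (byte value 63). So inside your fix_encoding function you see a question mark, even though the original text contains a different character.

This means that your fix_encoding function should not convert to ASCII at all; instead it should replace the unusual spaces with normal spaces. The following function replaces all non-ASCII characters with a space, but you could also use Char.IsWhiteSpace to decide which characters to replace with a normal space.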
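To see the effect for yourself, here is a minimal sketch. It assumes the character in your PDF is a no-break space (U+00A0); it could just as well be another Unicode space. The class name AsciiFallbackDemo and the helper AsciiByteOf are made up for this illustration.

```csharp
using System;
using System.Text;

public static class AsciiFallbackDemo
{
    // Returns the single byte that Encoding.ASCII produces for one character.
    // Characters outside ASCII fall back to '?' (byte value 63).
    public static byte AsciiByteOf(char c)
    {
        return Encoding.ASCII.GetBytes(c.ToString())[0];
    }

    public static void Main()
    {
        // Assumption: the extracted text contains a no-break space.
        string extracted = "CLE\u00A0action";

        Console.WriteLine(extracted.Contains("CLE action")); // False
        Console.WriteLine((int)extracted[3]);                // 160, the real character
        Console.WriteLine(AsciiByteOf(extracted[3]));        // 63, after the ASCII conversion
        Console.WriteLine(AsciiByteOf(' '));                 // 32, a regular space
    }
}
```

So the 63 is produced by the conversion to ASCII; it is not what the PDF library actually returned.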

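    using System.Text.RegularExpressions;

    // Replaces every character outside the ASCII range (U+0000 to U+007F) with a regular space.
    public static string remove_non_ascii(string src)
    {
        return Regex.Replace(src, @"[^\u0000-\u007F]", " ");
    }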
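If you would rather keep other non-ASCII characters (accented letters, for example), a variant based on Char.IsWhiteSpace could look roughly like this. It is only a sketch: the names WhitespaceNormalizer and NormalizeSpaces are made up here, and it assumes you want to leave ASCII whitespace such as tabs and line breaks alone.

```csharp
using System;
using System.Text;

public static class WhitespaceNormalizer
{
    // Hypothetical helper: replaces only non-ASCII whitespace
    // (no-break space, em space, ...) with a regular space.
    // ASCII whitespace such as '\t' and '\n' is left unchanged.
    public static string NormalizeSpaces(string src)
    {
        var sb = new StringBuilder(src.Length);
        foreach (char c in src)
        {
            bool unusualSpace = c > '\u007F' && char.IsWhiteSpace(c);
            sb.Append(unusualSpace ? ' ' : c);
        }
        return sb.ToString();
    }
}
```

With this, NormalizeSpaces("CLE\u00A0action").Contains("CLE action") should return true, and a word like "café" stays intact. Note that characters Char.IsWhiteSpace does not classify as whitespace, such as the zero-width space, are not touched.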
