c++ - Handling non-ASCII characters in C++

Question

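I'm having trouble with non-ASCII characters in C++. I have a file that contains non-ASCII characters, and I read it in C++ using file streams. After reading the file (say 1.txt), I store the data in a stringstream and write it to another file (say 2.txt).

Suppose 1.txt contains: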

ação

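I should get the same output in 2.txt, but the non-ASCII characters are printed there as hex values.

Also, I'm fairly sure C++ only treats ASCII characters as ASCII.

Please help me print these characters correctly in 2.txt.

Edit:

First, pseudocode for the whole process:

1. A shell script reads one value from the DB and stores it in 11.txt
2. C++ code (a.cpp) reads 11.txt and writes to f.txt

Data in the database being read: Instalação

File 11.txt contains: InstalaÃ§Ã£o

File f.txt contains: InstalaÃ§Ã£o

Output of a.cpp on screen: Instalação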

a.cpp

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

using namespace std;

int main()
{
    ifstream myReadFile;
    ofstream f2;
    myReadFile.open("11.txt");
    f2.open("f2.txt");
    string output;
    if (myReadFile.is_open())
    {
        // Test the extraction itself; looping on !eof() can process the last token twice.
        while (myReadFile >> output)
        {
            cout << "\n";

            // Pass each whitespace-delimited token through a stringstream,
            // then echo it to the screen and write it to the output file.
            std::stringstream tempDummyLineItem;
            tempDummyLineItem << output;
            cout << tempDummyLineItem.str();
            f2 << tempDummyLineItem.str();
        }
    }
    myReadFile.close();
    return 0;
}

The locale settings are:

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

score 3 · Accepted Answer

At least if I understand what you're after, I'd do something like this:

#include <iterator>
#include <iostream>
#include <algorithm>
#include <sstream>
#include <string>
#include <iomanip>

std::string to_hex(char ch) {
    std::ostringstream b;
    b << "\\x" << std::setfill('0') << std::setw(2) << std::setprecision(2)
        << std::hex << static_cast<unsigned int>(ch & 0xff);
    return b.str();
}

int main(){
    // for test purposes, we'll use a stringstream for input
    std::stringstream infile("normal stuff. weird stuff:\x01\xee:back to normal");

    infile << std::noskipws;

    // copy input to output, converting non-ASCII to hex:
    std::transform(std::istream_iterator<char>(infile),
        std::istream_iterator<char>(),
        std::ostream_iterator<std::string>(std::cout),
        [](char ch) {
            return (ch >= ' ') && (ch < 127) ?
                std::string(1, ch) :
                to_hex(ch);
    });
}

score 0 · Accepted Answer

This sounds like a UTF-8 problem to me. Since you didn't tag your question with c++11, here is an excellent article on Unicode and C++ streams.

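Based on your updated code, let me explain what is happening. You create a file stream to read your file. Internally, the file stream only deals in chars unless you tell it otherwise. A char, on most machines, holds only 8 bits of data, but the characters in your file need more than 8 bits. To read your file correctly, you need to know how it is encoded. The most common encoding is UTF-8, which uses 1 to 4 chars per character.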
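To make the point about multi-byte characters concrete, here is a minimal sketch that counts code points in a UTF-8 string. The helper name `count_code_points` is made up for illustration, and the sketch assumes the input is valid UTF-8; it does no validation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a string assumed to hold valid UTF-8.
// Continuation bytes have the bit pattern 10xxxxxx, so every byte that does
// not match that pattern starts a new character.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (std::string::size_type i = 0; i < utf8.size(); ++i) {
        const unsigned char byte = static_cast<unsigned char>(utf8[i]);
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For `ação` encoded as UTF-8 (bytes 61 C3 A7 C3 A3 6F), `std::string::size()` reports 6 while `count_code_points` returns 4, because `ç` and `ã` each occupy two chars.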

Once you know your encoding, you can use wifstream (for UTF-16) or imbue() a locale for other encodings.

Update: If your file is ISO-8859-1 (per your comment above), try this.

wifstream myReadFile;
myReadFile.imbue(std::locale("en_US.iso88591"));
myReadFile.open("11.txt");
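If the named locale is not installed on the system, the `std::locale` constructor throws `std::runtime_error`, so a locale-independent route can be handy. The sketch below converts ISO-8859-1 bytes to UTF-8 by hand. `latin1_to_utf8` is a hypothetical helper, not part of the code above; it relies on the fact that every ISO-8859-1 byte value equals its Unicode code point, so any byte at or above 0x80 becomes a two-byte UTF-8 sequence.

```cpp
#include <string>

// Hypothetical helper: re-encodes ISO-8859-1 (Latin-1) bytes as UTF-8.
std::string latin1_to_utf8(const std::string& latin1) {
    std::string out;
    out.reserve(latin1.size() * 2);  // worst case: every byte expands to two
    for (std::string::size_type i = 0; i < latin1.size(); ++i) {
        const unsigned char byte = static_cast<unsigned char>(latin1[i]);
        if (byte < 0x80) {
            // ASCII range is identical in both encodings.
            out.push_back(static_cast<char>(byte));
        } else {
            // Code points U+0080..U+00FF encode as 110000xx 10xxxxxx.
            out.push_back(static_cast<char>(0xC0 | (byte >> 6)));
            out.push_back(static_cast<char>(0x80 | (byte & 0x3F)));
        }
    }
    return out;
}
```

Applied to the ISO-8859-1 bytes of `ação` (61 E7 E3 6F), this yields the UTF-8 bytes 61 C3 A7 C3 A3 6F. Applied to text that is already UTF-8, it produces output of the `InstalaÃ§Ã£o` kind shown in the question, which suggests (as a guess, not something the question confirms) that some step before a.cpp treated UTF-8 data as ISO-8859-1.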
