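When fetching a UTF-8-encoded file over the network with NSURLConnection, the delegate's connection:didReceiveData: message is quite likely to be sent with an NSData that cuts the UTF-8 file off mid-character, because UTF-8 is a multi-byte encoding scheme and a single character can be split across two separate NSData objects.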

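In other words, if I concatenate all the data I get from connection:didReceiveData:, I will have a valid UTF-8 file, but each individual piece of data is not necessarily valid UTF-8 on its own.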

I don't want to hold the whole downloaded file in memory.

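What I want is: given an NSData, decode whatever can be decoded into an NSString. If the last few bytes of the NSData are an incomplete multi-byte sequence, tell me so that I can save them for the next NSData.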

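One obvious solution is to repeatedly try decoding with initWithData:encoding:, chopping off the last byte each time, until it succeeds. Unfortunately, this can be very wasteful.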

3 Answers

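If you want to make sure you don't stop in the middle of a UTF-8 multi-byte sequence, you need to look at the end of the byte array and check the top 2 bits of the last byte.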

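  1. If the top bit is 0, it is one of the ASCII-style unescaped UTF-8 codes, and you're done.
  2. If the top bit is 1 and the second-highest bit is 0, it is a continuation byte of a multi-byte sequence and may be the last byte of that sequence, so you need to buffer that byte for later and then look at the preceding byte.
  3. If the top bit is 1 and the second-highest bit is also 1, it is the start of a multi-byte sequence, and you need to determine how many bytes the sequence has by looking for the first 0 bit.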

See the multi-byte table in the Wikipedia entry: http://en.wikipedia.org/wiki/UTF-8
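The three cases above can be sketched as a small byte classifier. This is a minimal illustration in plain C (which also compiles inside an Objective-C file); the names `utf8_byte_kind`, `utf8_classify`, and `utf8_sequence_length` are hypothetical and not part of any Apple API.

```c
#include <stddef.h>

/* Hypothetical helper types and names, for illustration only. */
typedef enum {
    UTF8_ASCII,        /* 0xxxxxxx: a complete one-byte character     */
    UTF8_CONTINUATION, /* 10xxxxxx: second or later byte of a sequence */
    UTF8_LEAD          /* 11xxxxxx: first byte of a multi-byte sequence */
} utf8_byte_kind;

/* Classify a byte by its top two bits. */
static utf8_byte_kind utf8_classify(unsigned char b)
{
    if ((b & 0x80) == 0x00) return UTF8_ASCII;
    if ((b & 0xC0) == 0x80) return UTF8_CONTINUATION;
    return UTF8_LEAD;
}

/* Total length of the sequence started by `lead`, found from the position
 * of the first 0 bit. Returns 0 if `lead` cannot start a sequence. */
static size_t utf8_sequence_length(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1; /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx */
    return 0;                            /* continuation or illegal byte */
}
```

Note the parentheses around each mask: in C, `==` binds more tightly than `&`, so `b & 0x80 == 0` would not test the top bit.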

// Assumes receivedData holds the leftover bytes from the previous call
// followed by the newly received data. The function name is illustrative.
static NSString *DecodeCompleteUTF8Prefix(NSMutableData *receivedData)
{
    const unsigned char *data = [receivedData bytes];
    NSUInteger byteCount = [receivedData length];

    if (byteCount < 1)
        return nil;  // or @""

    // Step back over the trailing continuation bytes (10xxxxxx).
    NSUInteger backCount = 0;
    while (byteCount > 0 && (data[byteCount - 1] & 0xC0) == 0x80) {
        backCount++;
        byteCount--;
    }

    // At this point either byteCount is exhausted or data[byteCount - 1] is
    // the first byte of the last sequence. Exhausting byteCount means an
    // illegal sequence, because the initial byte should always be in
    // receivedData.
    if (byteCount < 1) {
        // error!
        return nil;
    }

    // Compute how many continuation bytes the last sequence requires.
    // Note the parentheses: == binds more tightly than & in C.
    unsigned char lastByte = data[byteCount - 1];
    NSUInteger requiredBytes = 0;
    if ((lastByte & 0x80) == 0x00) {          // 0xxxxxxx
        requiredBytes = 0;                    // single byte (ASCII)
    } else if ((lastByte & 0xE0) == 0xC0) {   // 110xxxxx
        requiredBytes = 1;                    // 2-byte sequence
    } else if ((lastByte & 0xF0) == 0xE0) {   // 1110xxxx
        requiredBytes = 2;                    // 3-byte sequence
    } else if ((lastByte & 0xF8) == 0xF0) {   // 11110xxx
        requiredBytes = 3;                    // 4-byte sequence
    } else {
        // Illegal lead byte. The 5- and 6-byte forms (111110xx, 1111110x)
        // are not valid UTF-8 since RFC 3629.
        return nil;
    }

    // Now we know how many continuation bytes we need and how many
    // (backCount) we have.
    if (backCount > requiredBytes) {
        // more continuation bytes than the lead byte allows: illegal
        return nil;
    } else if (backCount == requiredBytes) {
        // the last sequence is complete, so decode everything
        byteCount += backCount;
    } else {
        // the last sequence is incomplete, so leave its lead byte (and the
        // continuation bytes after it) for the next call
        byteCount -= 1;
    }

    NSString *newString = [[NSString alloc] initWithBytes:data
                                                   length:byteCount
                                                 encoding:NSUTF8StringEncoding];
    if (newString == nil)
        return nil;  // invalid UTF-8 earlier in the buffer

    // Remove the decoded bytes; the remainder is the overflow for next time.
    [receivedData replaceBytesInRange:NSMakeRange(0, byteCount)
                            withBytes:NULL
                               length:0];
    return newString;
}
answered 2012-06-06T11:40:32.477

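UTF-8 is a very simple encoding to parse. It was designed so that incomplete sequences are easy to detect and so that, if you start in the middle of a sequence, you can easily find its beginning.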

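Search backwards from the end for a byte that is <= 0x7f or >= 0xc0. If it is <= 0x7f, it is complete on its own. If it is between 0xc0 and 0xdf (inclusive), it needs one following byte to be complete. If it is between 0xe0 and 0xef, it needs two following bytes. If it is >= 0xf0, it needs three following bytes.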
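The backward search described here can be sketched as a function that reports how many bytes of a chunk are safe to decode right now. This is a minimal sketch in plain C under the assumption that the stream is otherwise well-formed UTF-8; the name `utf8_complete_prefix` is hypothetical, and malformed input is simply passed through for the real decoder to reject.

```c
#include <stddef.h>

/* Returns the number of leading bytes of buf[0..len) that end on a
 * character boundary. The remaining len - result bytes belong to an
 * incomplete trailing sequence and should be carried over. */
static size_t utf8_complete_prefix(const unsigned char *buf, size_t len)
{
    size_t i = len;   /* buf[i - 1] is the byte under inspection */
    size_t back = 0;  /* continuation bytes skipped so far       */

    /* A sequence is at most 4 bytes long, so at most 3 continuation
     * bytes can follow a lead byte. */
    while (i > 0 && back < 3 && (buf[i - 1] & 0xC0) == 0x80) {
        i--;
        back++;
    }
    if (i == 0)
        return len;   /* no lead byte in view: malformed, do not hold back */

    unsigned char lead = buf[i - 1];
    size_t need;      /* continuation bytes the lead byte requires */
    if (lead <= 0x7F)      need = 0;
    else if (lead < 0xC0)  return len;  /* a 4th continuation byte: malformed */
    else if (lead <= 0xDF) need = 1;
    else if (lead <= 0xEF) need = 2;
    else                   need = 3;

    /* Incomplete: cut just before the lead byte. Otherwise keep it all. */
    return (back < need) ? i - 1 : len;
}
```

A caller would decode the first `n` bytes returned by this function and prepend the remaining `len - n` bytes (never more than three) to the next chunk before calling it again.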

answered 2012-06-06T11:21:13.217

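I had a similar problem: a UTF-8 string was only partially decoded. Before:

    NSString *adsTopic = [components[2] stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
    // Bug: length counts UTF-16 code units, not UTF-8 bytes, so the buffer
    // is too small for non-ASCII text and the copy is cut short.
    adsInfo->adsTopic = malloc(sizeof(char) * adsTopic.length + 1);
    strncpy(adsInfo->adsTopic, [adsTopic UTF8String], adsTopic.length + 1);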

[Solved] After:

    NSString *adsTopic = [components[2] stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
    // Size the buffer by the UTF-8 byte count rather than by adsTopic.length.
    NSUInteger byteCount = [adsTopic lengthOfBytesUsingEncoding:NSUTF8StringEncoding];
    NSLog(@"number of UTF-8 bytes in the string topic == %lu", (unsigned long)byteCount);

    adsInfo->adsTopic = malloc(byteCount + 1);
    strncpy(adsInfo->adsTopic, [adsTopic UTF8String], byteCount + 1);

    NSString *text = [NSString stringWithCString:adsInfo->adsTopic encoding:NSUTF8StringEncoding];
    NSLog(@"=== %@", text);
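The underlying point is that a string's character count and its UTF-8 byte count differ as soon as the text contains non-ASCII characters, so a C buffer must be sized from the byte count. A minimal sketch in plain C, with the hypothetical helper `utf8_code_point_count`, makes the difference visible:

```c
#include <stddef.h>
#include <string.h>

/* Counts code points in a NUL-terminated UTF-8 string by counting every
 * byte that is not a continuation byte (10xxxxxx). Assumes valid UTF-8. */
static size_t utf8_code_point_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the string "h\xC3\xA9llo" (the word "héllo"), `utf8_code_point_count` gives 5 while `strlen` gives 6, so a buffer holding a copy needs `strlen(s) + 1` bytes, not 5 + 1.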
answered 2016-01-26T15:04:17.490