c# - How to get data off of a character

Question

I am working on a project in Unity which uses C#. I am trying to get special characters such as é, but the console just displays a blank character: "". For instance, translating "How are you?" should return "Cómo Estás?", but it returns "Cmo Ests". I put the returned string "Cmo Ests" in a character array and found that the missing character is a non-null blank character. I am using Encoding.UTF8, and when I do:

char ch = '\u00e9';
print (ch);

it prints "é". I have tried getting the bytes of a given string using:

byte[] utf8bytes = System.Text.Encoding.UTF8.GetBytes(temp);

When translating "How are you?", this returns a byte array, but for special characters such as é I get the byte sequence 239, 191, 189, which is the replacement character.

What information do I need to retrieve from the characters in order to accurately determine which character each one is? Do I need to do something with the data that Google gives me, or is it something else? I need a general solution that I can place in my program and that will work for any input string. If anyone can help, it would be greatly appreciated.

Here is the code that is referenced:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using UnityEngine;
using System.Collections;
using System.Net;
using HtmlAgilityPack;


public class Dictionary{
    string[] formatParams;
    HtmlDocument doc;
    string returnString;
    char[] letters;
    public char[] charString;

    public Dictionary(){
        formatParams = new string[2];
        doc = new HtmlDocument();
        returnString = "";
    }

    public string Translate(String input, String languagePair, Encoding encoding)
    {
        formatParams[0]= input;
        formatParams[1]= languagePair;
        string url = String.Format("http://www.google.com/translate_t?hl=en&ie=UTF8&text={0}&langpair={1}", formatParams);

        string result = String.Empty;

        using (WebClient webClient = new WebClient())
        {
            webClient.Encoding = encoding;
            result = webClient.DownloadString(url);
        }
        doc.LoadHtml(result);
        input = alter (input);
        string temp = doc.DocumentNode.SelectSingleNode("//span[@title='"+input+"']").InnerText;
        charString = temp.ToCharArray();
        return temp;
    }

    // Use this for initialization
    void Start () {

    }

    string alter(string inputString){
        returnString = "";
        letters = inputString.ToCharArray();
        for(int i=0; i<inputString.Length;i++){
            if(letters[i]=='\''){
                returnString = returnString + "&#39;";
            }else{
                returnString = returnString + letters[i];
            }
        }
        return returnString;
    }
}

score 1 · Accepted Answer

Maybe you should use another API/URL. The function below uses a different URL, which returns JSON data and seems to work better:

    // Needs: using System; using System.Collections.Generic; using System.Net;
    // using System.Web.Script.Serialization; (reference System.Web.Extensions)
    public static string Translate(string input, string fromLanguage, string toLanguage)
    {
        using (WebClient webClient = new WebClient())
        {
            // EscapeDataString also escapes reserved characters such as '&' and '?' in the text
            string url = string.Format("http://translate.google.com/translate_a/t?client=j&text={0}&sl={1}&tl={2}", Uri.EscapeDataString(input), fromLanguage, toLanguage);
            string result = webClient.DownloadString(url);

            // I used JavaScriptSerializer but another JSON parser would work
            JavaScriptSerializer serializer = new JavaScriptSerializer();
            Dictionary<string, object> dic = (Dictionary<string, object>)serializer.DeserializeObject(result);
            // The first element of "sentences" holds the translated text under "trans"
            Dictionary<string, object> sentences = (Dictionary<string, object>)((object[])dic["sentences"])[0];
            return (string)sentences["trans"];
        }
    }

If I run it in a console application:

    Console.WriteLine(Translate("How are you?", "en", "es"));

it displays

¿Cómo estás?

score 0 · Accepted Answer

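There are a few issues with your approach. First, UTF-8 is a multi-byte encoding. Any non-ASCII character (character code > 127) is encoded as a sequence of two or more bytes, each with its high bit set, which tells the decoder that the bytes belong to a single Unicode character. So your sequence 239, 191, 189 represents a single non-ASCII character. Specifically, it is the UTF-8 encoding of U+FFFD, the replacement character, which means the original character had already been lost before you called GetBytes. UTF-16, by contrast, maps each character in the Basic Multilingual Plane to a single unsigned 16-bit value (0-65535); characters outside that range take two 16-bit values (a surrogate pair).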

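The char type in C# is a two-byte type, so it is effectively an unsigned short holding one UTF-16 code unit. This is in contrast with other languages such as C/C++, where char is a 1-byte type.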
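To make the byte values concrete, here is a minimal sketch. The class and method names (EncodingDemo, Utf8Bytes, DecodeAsUtf8) are made up for illustration. It also shows one way the replacement character can appear: decoding bytes that are not valid UTF-8. Whether that is what happens with the Google response in the question is only an assumption.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Returns the UTF-8 bytes of a string as a comma-separated list of decimal values.
    public static string Utf8Bytes(string s)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(s);
        return string.Join(", ", Array.ConvertAll(bytes, b => b.ToString()));
    }

    // Decodes raw bytes as UTF-8; invalid sequences become U+FFFD.
    public static string DecodeAsUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    public static void Main()
    {
        Console.WriteLine(Utf8Bytes("\u00e9")); // 195, 169
        Console.WriteLine(Utf8Bytes("\uFFFD")); // 239, 191, 189

        // "Cómo" encoded as ISO-8859-1 (assumed input): 0xF3 alone is not valid UTF-8.
        byte[] latin1 = { 0x43, 0xF3, 0x6D, 0x6F };
        string decoded = DecodeAsUtf8(latin1);
        Console.WriteLine(decoded.Contains("\uFFFD")); // the ó has been replaced
    }
}
```

If the bytes you get back already contain 239, 191, 189, no later conversion can recover the original letter; the fix has to happen where the response is decoded.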

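So in your case, unless you really need byte[] arrays, you should work with char[] arrays. Or, if you want to encode the characters so that they can be used in HTML, you can simply iterate over the characters, check whether the character code is > 127, and replace each such character with its numeric character reference in the form &#xHEX;.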
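A rough sketch of that last suggestion, assuming hexadecimal numeric character references are what you want. HtmlEscapeDemo and EscapeNonAscii are hypothetical names, not part of any library. Surrogate pairs are combined first so that characters outside the Basic Multilingual Plane produce one reference instead of two.

```csharp
using System;
using System.Text;

public static class HtmlEscapeDemo
{
    // Replaces every non-ASCII character with a numeric character reference (&#xHEX;).
    public static string EscapeNonAscii(string text)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < text.Length; i++)
        {
            int codePoint;
            if (char.IsSurrogatePair(text, i))
            {
                // Two UTF-16 code units form one code point; consume both.
                codePoint = char.ConvertToUtf32(text, i);
                i++;
            }
            else
            {
                codePoint = text[i];
            }

            if (codePoint > 127)
                sb.Append("&#x").Append(codePoint.ToString("X")).Append(';');
            else
                sb.Append((char)codePoint);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(EscapeNonAscii("C\u00f3mo Est\u00e1s?")); // C&#xF3;mo Est&#xE1;s?
    }
}
```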

score 0 · Accepted Answer

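I don't know much about the GoogleTranslate API, but my first thought is that you have a Unicode normalization issue.

Take a look at System.String.Normalize() and its friends.

Unicode is very complicated, so I'm going to oversimplify! Many symbols can be represented in different ways in Unicode. For example, "é" can be represented as "é" (one character), or as "e" + a combining accent character (two characters), or, depending on what comes back from the API, as something else altogether.

The Normalize function converts your string into one with the same textual meaning but potentially a different binary value, which may fix your output problem.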
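As a small illustration of the two representations, here is a sketch using String.Normalize with NormalizationForm.FormC (composed) and FormD (decomposed). The wrapper names NormalizeDemo, Compose and Decompose are only for this example.

```csharp
using System;
using System.Text;

public static class NormalizeDemo
{
    // Combines base letters and combining marks into single characters where possible.
    public static string Compose(string s)
    {
        return s.Normalize(NormalizationForm.FormC);
    }

    // Splits precomposed characters into base letter + combining marks.
    public static string Decompose(string s)
    {
        return s.Normalize(NormalizationForm.FormD);
    }

    public static void Main()
    {
        string twoChars = "e\u0301"; // 'e' followed by COMBINING ACUTE ACCENT
        string oneChar = "\u00e9";   // precomposed 'é'

        Console.WriteLine(twoChars == oneChar);          // False: different code units
        Console.WriteLine(Compose(twoChars) == oneChar); // True after composing
        Console.WriteLine(Decompose(oneChar).Length);    // 2
    }
}
```

Note that normalization only helps if the accent is still present in some form; it cannot restore a character that has already been turned into the replacement character.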

score 0 · Accepted Answer

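I ran into the same problem in one of my projects [language resource localization translation].

I was doing the same thing, using System.Text.Encoding.UTF8.GetBytes(), and because of the UTF-8 encoding I was receiving special characters such as 239, 191, 189 in the result string.

Please take a look at my solution... I hope it helps.

Don't use an encoding at all. Google Translate will return the correct á as-is in the string. Do some string manipulation and read the string as it is...

Generic solution [works for every language translation Google supports]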

try
{
    //Don't use UtF Encoding 
    // use default webclient encoding

    var url = String.Format("http://www.google.com/translate_t?hl=en&text={0}&langpair={1}", "►" + txtNewResourceValue.Text.Trim() + "◄", "en|" + item.Text.Substring(0, 2));                    

     var webClient = new WebClient();
     string result = webClient.DownloadString(url); //get all data from google translate in UTF8 coding..

      int start = result.IndexOf("id=result_box");
      int end = result.IndexOf("id=spell-place-holder");
      int length = end - start;
      result = result.Substring(start, length);
      result = reverseString(result);

      start = result.IndexOf(";8669#&");//◄
      end = result.IndexOf(";8569#&");  //►
      length = end - start;

      result = result.Substring(start +7 , length - 8);
      objDic2.Text =  reverseString(result);

       //hard code substring; finding the correct translation within the string.
        dictList.Add(objDic2);
}
catch (Exception ex)
 {
  lblMessages.InnerHtml = "<strong>Google translate exception occurred, no resource saved..." + ex.Message + "</strong>";
                error = true;
}

public static string reverseString(string s)
{
    char[] arr = s.ToCharArray();
    Array.Reverse(arr);
    return new string(arr);

}

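As you can see from the code, no encoding is applied. I send two special marker characters, as "►" + txtNewResourceValue.Text.Trim() + "◄", to determine the start and end of the translation returned by Google.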
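The marker idea can also be sketched without reversing the string. This is only an illustration under the assumption, implied by the code above, that the page returns the markers as the numeric references &#9658; (►) and &#9668; (◄). MarkerDemo and ExtractBetweenMarkers are hypothetical names, and the sample HTML in Main is made up.

```csharp
using System;

public static class MarkerDemo
{
    const string StartMarker = "&#9658;"; // ► as a numeric character reference
    const string EndMarker = "&#9668;";   // ◄ as a numeric character reference

    // Returns the text between the last start/end marker pair, or null if not found.
    public static string ExtractBetweenMarkers(string html)
    {
        int end = html.LastIndexOf(EndMarker, StringComparison.Ordinal);
        if (end < 0) return null;

        // Look for the closest start marker that precedes the end marker.
        int start = html.Substring(0, end).LastIndexOf(StartMarker, StringComparison.Ordinal);
        if (start < 0) return null;

        start += StartMarker.Length;
        return html.Substring(start, end - start).Trim();
    }

    public static void Main()
    {
        // Made-up fragment standing in for the downloaded page.
        string html = "<span id=result_box>&#9658; C\u00f3mo est\u00e1s? &#9668;</span>";
        Console.WriteLine(ExtractBetweenMarkers(html)); // Cómo estás?
    }
}
```

Searching from the end mirrors what the reversed-string code does: it picks the last occurrence of the markers in the page.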

I also checked with my language utility tool, and I get "Cómo Estás?" when sending "How are you" to Google Translate... :)

Best regards [Shaz]

--------------------------Edited-------------

public string Translate(string input, string languagePair) {

    try
    {


        //Don't use UtF Encoding 
        // use default webclient encoding
        //input        [string to translate]
        //languagePair [e.g. en|es]

        var url = String.Format("http://www.google.com/translate_t?hl=en&text={0}&langpair={1}", "►" + input.Trim() + "◄", languagePair);

        var webClient = new WebClient();
        string result = webClient.DownloadString(url); //get all data from google translate 

        int start = result.IndexOf("id=result_box");
        int end = result.IndexOf("id=spell-place-holder");
        int length = end - start;
        result = result.Substring(start, length);
        result = reverseString(result);

        start = result.IndexOf(";8669#&");//◄
        end = result.IndexOf(";8569#&");  //►
        length = end - start;

        result = result.Substring(start + 7, length - 8);

        //return transalted string
        return reverseString(result); 


    }
    catch (Exception ex)
    {
        return "Google translate exception occurred, no resource saved..." + ex.Message;

    }
}

score 0 · Accepted Answer

You actually almost have it. Just insert the encoded letters using \u escapes.

string mystr = "C\u00f3mo Est\u00e1s?";
