java - How to tell whether a string is in a different language (not ASCII)

Question

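I am using JSoup to scrape some information from various sites. The information is in a different language and uses Arabic characters, for example کور. I am not 100% sure, but I believe those are not ASCII characters. How can I determine that a string is not ASCII (assuming I am right that it isn't) and then grab that string?

Edit: After using the Guava library and the code below, I get the following output.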

Home New 215

Add Word

Statistics

About Us

Feedback

Reply

Reply

خونه

سرای

سرپناه

Take away

The problem is that, although it prints non-ASCII strings such as "کور", it also prints ASCII strings such as "Feedback".

Here is the code I am using.

import java.io.IOException;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import com.google.common.base.CharMatcher;

public class GrabLinks {

public static void main(String[] args) {

    Document doc;
    PrintStream out = null;
    try {
        out = new PrintStream(System.out, true, "UTF-8");
    } catch (UnsupportedEncodingException e1) {
        // TODO Auto-generated catch block
        e1.printStackTrace();
    }
    
    try {
        // need http protocol
        doc = Jsoup.connect("http://thepashto.com/word.php?pashto=&english=house").get();

        // get page title
        String title = doc.title();
        //System.out.println("title : " + title);

        // get all links
        Elements links = doc.select("a[href]");
        for (Element link : links) {

            // get the value from href attribute
            //System.out.println("\nlink : " + link.attr("href"));
            //System.out.println("text : " + link.text());

            if (!CharMatcher.ASCII.matchesAllOf(link.text())) {
                
                out.println(link.text());
            }
        }

    } catch (IOException e) { e.printStackTrace(); }
    
}
}

score 0 · Accepted Answer

If you use Google's Guava library, you can check whether a String is ASCII with CharMatcher.ASCII.

Here is an example of how to use it:

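import com.google.common.base.CharMatcher;

public class AsciiExample {

    public static void main(String[] args) {
        System.out.println(isASCIIString("کور")); // false
        System.out.println(isASCIIString("Hi")); // true
    }

    // Returns true only if every character of pString is in the ASCII range.
    // Note: later Guava releases deprecate the CharMatcher.ASCII constant in favor of CharMatcher.ascii().
    public static boolean isASCIIString(String pString) {
        return CharMatcher.ASCII.matchesAllOf(pString);
    }
}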
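
If adding Guava is not an option, roughly the same check can be sketched with the JDK alone. The class and method names below (AsciiCheck, isAscii, containsArabicScript) are made up for illustration and are not part of any library. The second helper is an assumption about what the question is ultimately after: it looks for Arabic-script code points directly instead of only ruling out ASCII.

```java
final class AsciiCheck {

    // True when every char is in the 7-bit ASCII range (0..127).
    static boolean isAscii(String s) {
        return s.chars().allMatch(c -> c < 128);
    }

    // True when at least one code point belongs to the Arabic Unicode script.
    // Punctuation shared between scripts is classified as COMMON and is ignored here.
    static boolean containsArabicScript(String s) {
        return s.codePoints()
                .anyMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.ARABIC);
    }

    public static void main(String[] args) {
        // The Pashto word from the question, written with Unicode escapes
        // so the source compiles regardless of the file encoding.
        String pashto = "\u06A9\u0648\u0631";

        System.out.println(isAscii("Hi"));                    // true
        System.out.println(isAscii(pashto));                  // false
        System.out.println(containsArabicScript(pashto));     // true
        System.out.println(containsArabicScript("Feedback")); // false
    }
}
```

Checking for the script you actually want may be more robust than checking for "not ASCII", since a link text that is mostly English but contains a single non-ASCII character (a non-breaking space, for instance) would still fail an all-ASCII test.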

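Edit:

With this code you can only check whether the string is ASCII. What appears in the terminal does not depend on that check, because the default output stream does not handle these characters: System.out encodes text with the platform default charset (MacRoman here) rather than UTF-8. To print your characters, this may help: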

PrintStream out = new PrintStream(System.out, true, "UTF-8");
out.println(yourString);
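
The two lines above do not compile on their own, because this PrintStream constructor declares the checked UnsupportedEncodingException. A minimal, self-contained sketch of the same idea follows; the class and helper names (Utf8Print, utf8) are hypothetical and exist only for this example.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

final class Utf8Print {

    // Wraps any OutputStream in an auto-flushing PrintStream that encodes text as UTF-8,
    // independent of the platform default charset.
    static PrintStream utf8(OutputStream target) {
        try {
            return new PrintStream(target, true, StandardCharsets.UTF_8.name());
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8(System.out);
        // The Pashto word from the question, written with Unicode escapes.
        out.println("\u06A9\u0648\u0631");
    }
}
```

This only controls the bytes Java writes. The terminal or IDE console still has to be configured to interpret its input as UTF-8 for the characters to render correctly.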
