python - Reading web pages that contain multiple languages such as Russian and Korean

Question

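Hi everyone,

I have collected a number of web pages for my research project.

For example: http://git.gnome.org/browse/anjuta/commit/?id=d17caca8f81bb0f0ba4d341d6d6132ff51d186e3

As you can see on that page, the committer's name is not written in English.

Other pages also have committer names written in various languages other than English.

I use the following code to extract the committer's name.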

import csv
import re
import urllib

def get_page (link):
    k = 1
    while k == 1:
        try:
            f = urllib.urlopen (link)
            htmlSource = f.read()
            return htmlSource
        except EnvironmentError:
            print ('Error occurred:', link)
        else:
            k = 2
    f.close()

def get_commit_info (commit_page):
    commit_page_string = str (commit_page)


    author_pattern = re.compile (r'<tr><th>author</th><td>(.*?)</td><td class=', re.DOTALL)
    t_author = author_pattern.findall (commit_page_string)

    t_author_string = str (t_author)
    author_point = re.search (" &lt;", t_author_string)
    author = t_author_string[:author_point.start()]

    print author

git_url = "http://git.gnome.org/browse/anjuta/commit/?id=d17caca8f81bb0f0ba4d341d6d6132ff51d186e3"
commit_page = get_page (git_url)
get_commit_info (commit_page)

The result of 'print author' is as follows:

\xd0\x9c\xd0\xb8\xd1\x80\xd0\xbe\xd1\x81\xd0\xbb\xd0\xb0\xd0\xb2 \xd0\x9d\xd0\xb8\xd0\xba\xd0\xbe\xd0\xbb\xd0\xb8\xd1\x9b

How can I print the name correctly?

score 0 · Accepted Answer

Hmm... this will do what you want:

author = 'Мирослав Николић'
print author.decode('utf8') # Мирослав Николић

But if the encoding is not UTF-8, this won't work either...

Most things use UTF-8. Mostly.

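Unicode is a complicated thing to wrap your head around. 'author' is a string object that contains bytes. Nothing in those bytes tells you what they represent. Absolutely nothing. You have to tell Python that this byte string is UTF-8 encoded text. Decoding then maps each sequence of one to four bytes to the Unicode code point it encodes; in the output above, for example, the two bytes \xd0\x9c decode to the letter М.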
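To make that concrete, here is a minimal sketch of the extraction step, written for Python 3 (the question's code is Python 2). It assumes the page is UTF-8 encoded. SAMPLE_PAGE is a made-up stand-in for the downloaded page with a placeholder e-mail address, and extract_author is a hypothetical helper, not part of any library. It takes the regex match directly instead of calling str() on the list returned by findall; str() of a list shows the repr of each element, which is probably where the \x escapes in the question came from.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical stand-in for the downloaded page: raw bytes, UTF-8 encoded.
# The e-mail address is a placeholder, not taken from the real page.
SAMPLE_PAGE = (
    u"<tr><th>author</th><td>Мирослав Николић &lt;someone@example.org&gt;</td>"
    u"<td class='right'>...</td></tr>"
).encode("utf-8")

# Same pattern as in the question, but compiled as a bytes pattern
# so it can be applied to the undecoded page.
AUTHOR_PATTERN = re.compile(
    br"<tr><th>author</th><td>(.*?)</td><td class=", re.DOTALL
)


def extract_author(page_bytes, encoding="utf-8"):
    """Return the author name as text, or None if the author row is missing."""
    match = AUTHOR_PATTERN.search(page_bytes)
    if match is None:
        return None
    # Keep only the part before the escaped "<" that starts the e-mail address.
    name_bytes = match.group(1).split(b" &lt;", 1)[0]
    # The bytes carry no encoding information; we have to supply it here.
    return name_bytes.decode(encoding)


if __name__ == "__main__":
    print(extract_author(SAMPLE_PAGE))
```

The name stays as bytes until the very last step, so the only place an encoding has to be chosen is the single decode call.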

You can detect the encoding of each page by looking at its meta tags. In HTML5 they look like this:

<meta charset="utf-8">.
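As a rough sketch of that idea (Python 3 again), the snippet below looks for a charset declaration in the first few kilobytes of the raw page and decodes with it. sniff_charset and decode_page are hypothetical helper names, the 4096-byte window is an arbitrary choice, and the regular expression is a simplification, not a full HTML parser. A server may also declare the charset in the HTTP Content-Type header, which this sketch does not look at.

```python
import re

# Matches <meta charset="..."> as well as the older
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
META_CHARSET = re.compile(
    br"""<meta[^>]*?charset\s*=\s*["']?([\w.:-]+)""", re.IGNORECASE
)


def sniff_charset(page_bytes, default="utf-8"):
    """Return the charset named in a meta tag, or `default` if none is found."""
    head = page_bytes[:4096]  # the declaration is expected near the top
    match = META_CHARSET.search(head)
    if match is None:
        return default
    return match.group(1).decode("ascii")


def decode_page(page_bytes):
    """Decode a page with its declared charset, falling back to UTF-8."""
    encoding = sniff_charset(page_bytes)
    try:
        return page_bytes.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset label, or bytes that do not fit the declared charset:
        # fall back to UTF-8 and replace anything that cannot be decoded.
        return page_bytes.decode("utf-8", errors="replace")
```

Calling decode_page(commit_page) once, right after downloading, would give a text string that the rest of the code can search and print without further decoding.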
