
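I just installed the Data Science Toolkit (DSTK) and ran text2people from the command line on a file and a few web pages.

As output, I get something like this: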

Peter Williams,Peter,Williams,,m,151431,151445,stdin
David Philippaerts,David,Philippaerts,,m,152500,152518,stdin
Da Ryse,Da,Ryse,,m,158551,158558,stdin

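I can guess that the first fields are the name, surname and gender, but I don't understand how to get the other information shown on the website, such as ethnicity. Am I supposed to use it through Python, JavaScript, etc.? The help and documentation are really sparse...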
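To make the guess about the fields concrete, here is a minimal sketch that reads the command-line output as CSV. The column names are assumptions inferred from the sample lines (the two numbers differ by exactly the length of the matched name, so they look like start and end offsets); they are not documented names.

```python
import csv
import io

# Column names are guesses based on the sample output, not documented names.
COLUMNS = ["matched_string", "first_name", "surnames", "title",
           "gender", "start_index", "end_index", "source"]

def parse_text2people_output(text):
    """Turn the command-line CSV output into a list of dicts."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if not fields:
            continue  # skip blank lines
        row = dict(zip(COLUMNS, fields))
        # The two numeric columns appear to be character offsets in the input
        row["start_index"] = int(row["start_index"])
        row["end_index"] = int(row["end_index"])
        rows.append(row)
    return rows

sample = """Peter Williams,Peter,Williams,,m,151431,151445,stdin
David Philippaerts,David,Philippaerts,,m,152500,152518,stdin
Da Ryse,Da,Ryse,,m,158551,158558,stdin
"""

for row in parse_text2people_output(sample):
    print(row["first_name"], row["surnames"], row["gender"])
```

None of these columns carries ethnicity, which is why the command-line output alone does not answer the question.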


1 Answer


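Download and unzip python_tools.zip. If you install the library system-wide, you can create your program wherever you like; otherwise, write the test program in the same directory as dstk.py.

Here is a simple test program. It has a list of people whose information is fetched from the service. It then checks their ethnicity information and prints each person's most likely ethnicity along with its percentage.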

import dstk
from pprint import pprint

# Use a separate name so the client object does not shadow the dstk module
client = dstk.DSTK()

# List of people you want to search for
people_names = ["Samuel L. Jackson", "Michelle Yeoh", "Danny Trejo", "Vanessa Minnillo", "Naomi Campbell", "Chuck Norris"]

# Query information for each person in the list
people = client.text2people(",".join(people_names))

# Print the structure of the received information
# print(people)

# Print the structure of the people in a more readable way
# pprint(people)

# Ethnicity fields to compare for each person
ethnics = ['percentage_american_indian_or_alaska_native', 'percentage_asian_or_pacific_islander', 'percentage_black', 'percentage_hispanic', 'percentage_two_or_more', 'percentage_white']

# Print the name and most likely ethnicity of each person
for person in people:
    name = person['first_name'] + " " + person['surnames']

    if person['ethnicity'] is None:
        print(name.ljust(26) + " Unknown ethnicity")
    else:
        # Find the field with the highest percentage
        highest_index = max(ethnics, key=lambda field: person['ethnicity'][field])
        print(name.ljust(20) + " " + str(person['ethnicity'][highest_index]).ljust(5) + " " + highest_index)

The code above prints the following:

Samuel L Jackson     53.02 percentage_black
Michelle Yeoh        87.74 percentage_asian_or_pacific_islander
Danny Trejo          94.15 percentage_hispanic
Vanessa Minnillo           Unknown ethnicity
Naomi Campbell       76.47 percentage_white
Chuck Norris         82.01 percentage_white

You can see the names of the fields by printing the structure received from the server (pprint(people)); the names are fairly self-explanatory.
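If you want to try the selection logic without contacting the server, the following is a rough sketch that works on a hand-made record. The helper name most_likely_ethnicity and the record fake_person are hypothetical, and the numbers in it are placeholders, not real service output; only the field names come from the program above.

```python
# Field names as used in the test program above
ETHNICITY_FIELDS = [
    "percentage_american_indian_or_alaska_native",
    "percentage_asian_or_pacific_islander",
    "percentage_black",
    "percentage_hispanic",
    "percentage_two_or_more",
    "percentage_white",
]

def most_likely_ethnicity(person):
    """Return (field_name, percentage) for the highest field, or None."""
    ethnicity = person.get("ethnicity")
    if ethnicity is None:
        return None  # no ethnicity data for this person
    field = max(ETHNICITY_FIELDS, key=lambda name: ethnicity[name])
    return field, ethnicity[field]

# Hand-made record with placeholder numbers, NOT real service output
fake_person = {
    "first_name": "Jane",
    "surnames": "Doe",
    "ethnicity": {
        "percentage_american_indian_or_alaska_native": 1.0,
        "percentage_asian_or_pacific_islander": 2.0,
        "percentage_black": 10.0,
        "percentage_hispanic": 5.0,
        "percentage_two_or_more": 2.0,
        "percentage_white": 80.0,
    },
}

print(most_likely_ethnicity(fake_person))
print(most_likely_ethnicity({"first_name": "No", "surnames": "Data", "ethnicity": None}))
```

Returning None for a missing ethnicity keeps the "Unknown ethnicity" case separate from the formatting, so the caller decides how to print it.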

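I had a hard time finding anyone who would be classified as multiracial or American Indian. The database seems to insist that they are white.

answered 2013-06-11T12:57:58.867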