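I tried a simple demo to check whether geograpy can do what I am looking for: finding the country name and ISO code in non-normalized addresses (which is basically what geograpy is for!).
The problem is that, in the tests I ran, geograpy finds several countries for each address, in most cases including the correct one, but I cannot find any kind of parameter to decide which country is the most "correct".
The list of fake addresses I used, which should reflect what might realistically be analyzed, is this: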

  • John Doe 115 Huntington Terrace Newark, New York 07112 Stati Uniti
  • John Doe 160 Huntington Terrace Newark, New York 07112 United States of America
  • John Doe 30 Huntington Terrace Newark, New York 07112 USA
  • John Doe 22 Huntington Terrace Newark, New York 07112 US
  • Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italia
  • Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italy

This is the simple code I wrote:

import geograpy

ind = ["John Doe 115 Huntington Terrace Newark, New York 07112 Stati Uniti",
"John Doe 160 Huntington Terrace Newark, New York 07112 United States of America",
"John Doe 30 Huntington Terrace Newark, New York 07112 USA",
"John Doe 22 Huntington Terrace Newark, New York 07112 US",
"Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italia",
"Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italy"]

locator = geograpy.locator.Locator()
for address in ind:
    places = geograpy.get_place_context(text=address)
    print(address)
    #print(places)
    for country in places.countries:
        print("Country:" + country + ", IsoCode:" + locator.getCountry(name=country).iso)
    print()

This is the output:

John Doe 115 Huntington Terrace Newark, New York 07112 Stati Uniti
Country:United Kingdom, IsoCode:GB
Country:Jamaica, IsoCode:JM
Country:United States, IsoCode:US

John Doe 160 Huntington Terrace Newark, New York 07112 United States of America
Country:United States, IsoCode:US
Country:United Kingdom, IsoCode:GB
Country:Netherlands, IsoCode:NL
Country:Jamaica, IsoCode:JM
Country:Argentina, IsoCode:AR

John Doe 30 Huntington Terrace Newark, New York 07112 USA
Country:United Kingdom, IsoCode:GB
Country:Jamaica, IsoCode:JM
Country:United States, IsoCode:US

John Doe 22 Huntington Terrace Newark, New York 07112 US
Country:United Kingdom, IsoCode:GB
Country:Jamaica, IsoCode:JM
Country:United States, IsoCode:US

Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italia
Country:Australia, IsoCode:AU
Country:Sweden, IsoCode:SE
Country:United States, IsoCode:US

Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italy
Country:Italy, IsoCode:IT
Country:Australia, IsoCode:AU
Country:Sweden, IsoCode:SE
Country:United States, IsoCode:US

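First of all, the biggest problem is that for the Italian address written with "Italia" (no. 4) the correct country (Italy/Italia) is not found at all, and I don't know where the three countries that were found come from.
In most cases it finds wrong countries in addition to the correct one, and I have no indicator such as a confidence percentage or a distance that would tell me whether a country can be considered an acceptable answer and, among multiple results, which one is likely to be the "best".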

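I apologize in advance: I have not had time to study geograpy3 in depth and I don't know whether this is a silly question, but I found nothing in the documentation about confidence/probability/distance.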

1 Answer

I am answering as a committer of geograpy3.

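It looks like you are using the legacy interface of geograpy version 1 in the first step and only then the locator. For your use case the improved locator interface might be more appropriate. That interface can use extra information such as population or GDP per capita to find the "most likely" country for disambiguation.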
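
To illustrate what disambiguation by population could look like, here is a minimal sketch. It does not use geograpy's own API: the `Candidate` class, the `pick_most_likely` function, and the population figures are assumptions made up for this example, and the real locator may weigh candidates differently.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """A hypothetical candidate match for an ambiguous place name."""
    name: str
    country_iso: str
    population: int  # rough illustrative figure, not authoritative data


def pick_most_likely(candidates: Sequence[Candidate]) -> Optional[Candidate]:
    """Return the candidate with the largest population, or None if empty."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: c.population)


# Two places called "Newark"; the populations are approximate placeholders.
newark_candidates = [
    Candidate("Newark", "GB", 30_000),
    Candidate("Newark", "US", 300_000),
]

best = pick_most_likely(newark_candidates)
print(best.country_iso if best else None)  # prints "US"
```

Ranking by a single attribute is the simplest possible choice; it gives you an ordering of the candidates, not a calibrated confidence value.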

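The Stati Uniti/United States and Italia/Italy problem is a language issue - see the long-standing open issue of geograpy version 1 at https://github.com/ushahidi/geograpy/issues/23. As of today there seems to be no corresponding issue for geograpy3 - feel free to file one if you need this improvement.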
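
Until such an improvement exists, one possible workaround is to translate localized country names to English before the text is analyzed. The following is only a sketch under that assumption: `COUNTRY_ALIASES` and `normalize_country_names` are hypothetical names, not part of geograpy, and the alias table would have to be maintained by hand.

```python
import re

# Hypothetical, hand-maintained alias table: localized name -> English name.
COUNTRY_ALIASES = {
    "Stati Uniti": "United States",
    "Italia": "Italy",
}


def normalize_country_names(text: str) -> str:
    """Replace known localized country names with their English form."""
    for alias, english in COUNTRY_ALIASES.items():
        # \b keeps "Italia" from matching inside a longer word such as "Italian".
        pattern = r"\b" + re.escape(alias) + r"\b"
        text = re.sub(pattern, english, text, flags=re.IGNORECASE)
    return text


address = "Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italia"
print(normalize_country_names(address))
# prints: Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italy
```

The normalized string could then be passed to `geograpy.get_place_context(text=...)` as in the question; whether that is enough to get Italy reported for this address has not been verified here.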

I added your example to test_locator.py in the geograpy3 project to show the difference in concept:

def testStackOverflow64379688(self):
        '''
        compare old and new geograpy interface
        '''
        examples=['John Doe 160 Huntington Terrace Newark, New York 07112 United States of America',
                  'John Doe 30 Huntington Terrace Newark, New York 07112 USA',
                  'John Doe 22 Huntington Terrace Newark, New York 07112 US',
                  'Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italia',
                  'Mario Bianchi, Via Nazionale 256, 00148 Roma (RM) Italy',
                  'Newark','Rome']
        for example in examples:
            city=geograpy.locateCity(example,debug=False)
            print(city)

Result:

None
None
None
None
None
Newark (US-NJ(New Jersey) - US(United States))
Rome (IT-62(Latium) - IT(Italy))
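
Since the full addresses yield None while the bare city names resolve, one conceivable workaround is to split an address into word fragments and try each one in turn. The sketch below assumes exactly that; `candidate_fragments`, `locate_first`, and the stub lookup are hypothetical helpers, and the stub stands in for a real locator so that the example runs without geograpy installed.

```python
import re
from typing import Callable, List, Optional


def candidate_fragments(address: str) -> List[str]:
    """Split an address into runs of letters, dropping digits and punctuation."""
    return re.findall(r"[^\W\d_]+", address)


def locate_first(address: str,
                 locate: Callable[[str], Optional[str]]) -> Optional[str]:
    """Return the first non-None result of locate() over the fragments."""
    for fragment in candidate_fragments(address):
        result = locate(fragment)
        if result is not None:
            return result
    return None


# Stub lookup standing in for a real city locator (assumption for the demo).
KNOWN_CITIES = {"Newark": "Newark (US-NJ)", "Rome": "Rome (IT-62)"}


def stub_locate(name: str) -> Optional[str]:
    return KNOWN_CITIES.get(name)


example = "John Doe 30 Huntington Terrace Newark, New York 07112 USA"
print(locate_first(example, stub_locate))  # prints "Newark (US-NJ)"
```

With a real locator in place of the stub, unrelated words such as personal or street names might also match a city, so the first hit is not necessarily the right one.
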
Answered 2020-10-21T08:38:27.993