database - Are there any good surname databases?

Question

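I am looking to generate some database test data, specifically for table columns that contain people's names. To get a good picture of how well indexing works for name-based searches, I want to get as close as possible to real-world names and their true frequency distribution, i.e. many distinct names whose frequencies follow some power-law distribution.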

Ideally, I am looking for a freely available data file of names, each followed by a single frequency value (or, equivalently, a probability).
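A minimal Java sketch of how such a file could be loaded, assuming one `NAME FREQUENCY` pair per line separated by whitespace. The class name `NameFrequencyParser` and the line format are illustrative assumptions, not the layout of any particular published data set.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NameFrequencyParser {
    // Parses lines of the ASSUMED form "NAME <whitespace> FREQUENCY".
    // Blank lines, header rows, and malformed counts are skipped.
    // Repeated names have their counts added together.
    static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> frequencies = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length < 2) {
                continue; // blank line or a name without a count
            }
            try {
                frequencies.merge(parts[0], Long.parseLong(parts[1]), Long::sum);
            } catch (NumberFormatException e) {
                // Second column is not an integer (e.g. a header row): skip it.
            }
        }
        return frequencies;
    }
}
```

For a real file, the lines could come from `java.nio.file.Files.readAllLines(path)`. If the data set uses a different delimiter or column order, the `split` call and the column indexes would need to change accordingly.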

Anglo-Saxon names would be fine, although names from other cultures would also be useful.

score 5 · Accepted Answer

I found some U.S. Census data that fits the requirement. The only caveat is that it only lists names that occur at least 100 times...

I found it via this blog entry, which also shows the power-law distribution curve:

Power-law curve in surnames (blog entry)

Beyond that, you can sample from the list using roulette wheel selection, e.g. (untested):

struct NameEntry
{
    public string _name;
    public int _frequency;
}

int _frequencyTotal; // Precalculate this: the sum of all _frequency values.


public string SampleName(NameEntry[] nameEntryArr, Random rng)
{
    // Throw the roulette ball: a uniform value in [0, _frequencyTotal).
    double throwValue = rng.NextDouble() * _frequencyTotal;
    double accumulator = 0.0;

    for(int i=0; i<nameEntryArr.Length; i++)
    {
        accumulator += nameEntryArr[i]._frequency;
        // Strict comparison so that zero-frequency entries are never selected.
        if(throwValue < accumulator) {
            return nameEntryArr[i]._name;
        }
    }

    // If we get here then we have an array of zero frequencies.
    throw new ApplicationException("Invalid operation. No non-zero frequencies to select.");
}
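The linear scan above costs O(n) per sample. When generating many rows from a long name list, one possible variation is to precompute cumulative frequencies once and binary-search them for each sample. The following Java sketch illustrates that idea; it is a swapped-in variant, not the answer's code, and the class name `CumulativeNameSampler` is hypothetical.

```java
import java.util.Random;

class CumulativeNameSampler {
    private final String[] names;
    private final long[] cumulative; // cumulative[i] = sum of frequencies[0..i]
    private final long total;

    CumulativeNameSampler(String[] names, int[] frequencies) {
        if (names.length != frequencies.length) {
            throw new IllegalArgumentException("names and frequencies differ in length");
        }
        this.names = names.clone();
        this.cumulative = new long[frequencies.length];
        long running = 0;
        for (int i = 0; i < frequencies.length; i++) {
            if (frequencies[i] < 0) {
                throw new IllegalArgumentException("negative frequency at index " + i);
            }
            running += frequencies[i];
            cumulative[i] = running;
        }
        if (running == 0) {
            throw new IllegalArgumentException("No non-zero frequencies to select.");
        }
        this.total = running;
    }

    String sample(Random rng) {
        // Throw the roulette ball: an integer in [0, total).
        long throwValue = (long) (rng.nextDouble() * total);
        // Binary search for the first index whose cumulative sum exceeds the throw.
        int lo = 0;
        int hi = cumulative.length - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (cumulative[mid] > throwValue) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        return names[lo];
    }
}
```

Construction is O(n) and each call to `sample` is O(log n). Entries with a frequency of zero share a cumulative value with their predecessor, so the search never lands on them.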

score 4 · Accepted Answer

Oxford University provides word lists as compressed .gz files on its public FTP site at ftp://ftp.ox.ac.uk/pub/wordlists/names/.
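Once one of these .gz files has been downloaded, it could be read without unpacking it first. The Java sketch below assumes one name per line and ISO-8859-1 text; both are assumptions about the file contents, and the class name `GzWordListReader` is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

class GzWordListReader {
    // Reads a gzip-compressed word list, ASSUMING one name per line
    // and ISO-8859-1 encoding. Blank lines are dropped.
    static List<String> readNames(InputStream gzipped) throws IOException {
        List<String> names = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(gzipped), StandardCharsets.ISO_8859_1))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String name = line.trim();
                if (!name.isEmpty()) {
                    names.add(name);
                }
            }
        }
        return names;
    }
}
```

A local file can be passed in with `java.nio.file.Files.newInputStream(path)`. A plain word list of this kind may carry no frequency values, in which case the frequencies would have to come from another source such as the census data mentioned above.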

score 3 · Accepted Answer

You could also take a look at the jFairy project. It is written in Java and produces fake data (names, for example). http://codearte.github.io/jfairy/

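// Package names assumed from the io.codearte releases of jFairy; adjust for your version.
import io.codearte.jfairy.Fairy;
import io.codearte.jfairy.producer.person.Person;

Fairy fairy = Fairy.create();
Person person = fairy.person();
// Output is random; the values in the comments are only examples.
System.out.println(person.firstName());           // e.g. Chloe
System.out.println(person.lastName());            // e.g. Barker
System.out.println(person.fullName());            // e.g. Chloe Barker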
