python - Soundex algorithm in Python (homework help request)

Question

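The U.S. Census Bureau uses a special encoding called "soundex" to locate information about individuals. Soundex is an encoding of surnames (last names) based on how a surname sounds rather than how it is spelled. Surnames that sound the same but are spelled differently, such as SMITH and SMYTH, have the same code and are filed together. The soundex coding system was developed so that you can find a surname even though it may have been recorded under various spellings.

In this lab you will design, code, and document a program that produces the soundex code for a surname. The user will be prompted for a surname, and the program should output the corresponding code.

Basic Soundex Coding Rules

Every soundex encoding of a surname consists of one letter and three digits. The letter is always the first letter of the surname. The digits are assigned to the remaining letters of the surname according to the soundex guide shown below. If necessary, zeros are appended so that the code always has four characters. Any additional letters are disregarded.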
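
The padding and truncation described above can be sketched in a few lines. The helper name `pad_code` and its two parameters are illustrative and not part of the assignment:

```python
def pad_code(initial, digits):
    """Build a four-character code: the initial letter plus three digits.

    `digits` is whatever digit string the coding rules produced; it may be
    shorter or longer than three characters.
    """
    # Append zeros first, then cut to length, so both cases are handled.
    return (initial.upper() + digits + "000")[:4]


print(pad_code("j", "25"))    # J250 (one zero added)
print(pad_code("a", "2613"))  # A261 (extra digit dropped)
```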

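Soundex Coding Guide

Soundex assigns a number to various consonants. Consonants that sound alike are assigned the same number:

Number  Consonants
1       B, F, P, V
2       C, G, J, K, Q, S, X, Z
3       D, T
4       L
5       M, N
6       R

Soundex disregards the letters A, E, I, O, U, H, W, and Y.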
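
The coding guide maps naturally onto a lookup dictionary. A minimal sketch, in which the names `SOUNDEX_GROUPS`, `CODES`, and `code_for` are hypothetical and only the letter groups come from the guide:

```python
# Hypothetical names; the letter groups are copied from the coding guide.
SOUNDEX_GROUPS = {
    "1": "BFPV",
    "2": "CGJKQSXZ",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}

# Invert the guide into one entry per consonant, e.g. CODES["B"] == "1".
CODES = {letter: digit
         for digit, letters in SOUNDEX_GROUPS.items()
         for letter in letters}


def code_for(letter):
    """Return the soundex digit for a letter, or "" if it is disregarded."""
    return CODES.get(letter.upper(), "")


print(code_for("s"), code_for("T"), repr(code_for("a")))
```

Returning an empty string for disregarded letters keeps the caller free to decide what "disregarded" means for the additional rules.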

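Three additional soundex coding rules apply. A good program design will implement them as one or more separate functions.

Rule 1. Names with Double Letters

If the surname has any double letters, they should be treated as one letter. For example:

Gutierrez is coded G362 (G, 3 for the T, 6 for the first R, second R ignored, 2 for the Z).

Rule 2. Names with Letters Side by Side That Have the Same Soundex Code Number

If the surname has different letters side by side that have the same number in the soundex coding guide, they should be treated as one letter. Examples:

Pfister is coded P236 (P, F ignored because it is considered the same as P, 2 for the S, 3 for the T, 6 for the R).

Jackson is coded J250 (J, 2 for the C, K ignored because it is the same as C, S ignored because it is the same as C, 5 for the N, 0 added).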
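
Rules 1 and 2 both amount to counting a run of side-by-side letters with the same number only once. One possible sketch uses `itertools.groupby` on the per-letter codes. The function name `digits_after_initial` is illustrative, and the sketch covers only these two rules:

```python
from itertools import groupby

# Letter groups from the coding guide; every other letter gets "".
CODES = {letter: digit
         for digit, letters in (("1", "BFPV"), ("2", "CGJKQSXZ"), ("3", "DT"),
                                ("4", "L"), ("5", "MN"), ("6", "R"))
         for letter in letters}


def digits_after_initial(name):
    """Digits for the letters after the first, with side-by-side letters
    that share a code counted once (rules 1 and 2 only)."""
    codes = [CODES.get(ch, "") for ch in name.upper()]
    # groupby merges each run of equal neighbouring codes into one group.
    runs = [code for code, _ in groupby(codes)]
    # The first run belongs to the initial, which is kept as a letter.
    return "".join(runs[1:])


for name in ("Gutierrez", "Pfister", "Jackson"):
    print(name, name[0] + digits_after_initial(name))
```

For these three names the sketch yields G362, P236, and J25; the trailing zero for Jackson comes from the separate padding step.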

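Rule 3. Consonant Separators

3.a. If a vowel (A, E, I, O, U) separates two consonants that have the same soundex code, the consonant to the right of the vowel is coded. Example:

Tymczak is coded T522 (T, 5 for the M, 2 for the C, Z ignored (see the "side by side" rule above), 2 for the K). Since the vowel "A" separates the Z and the K, the K is coded.

3.b. If "H" or "W" separates two consonants that have the same soundex code, the consonant to the right is not coded. Example:

Ashcraft is coded A261 (A, 2 for the S, C ignored because it has the same code as S with only an H in between, 6 for the R, 1 for the F). It is not coded A226.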
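
The difference between the two kinds of separator can be made explicit by remembering the code of the last letter seen. The following is a sketch rather than a reference implementation. The names `soundex`, `last`, and `current` are illustrative, and treating Y like H and W is an assumption, since the rules above only say that Y is disregarded:

```python
# Letter groups from the coding guide; every other letter gets "".
CODES = {letter: digit
         for digit, letters in (("1", "BFPV"), ("2", "CGJKQSXZ"), ("3", "DT"),
                                ("4", "L"), ("5", "MN"), ("6", "R"))
         for letter in letters}


def soundex(name):
    """Return the four-character soundex code for a non-empty name."""
    name = name.upper()
    result = name[0]
    last = CODES.get(name[0], "")  # code of the most recent letter seen
    for ch in name[1:]:
        if ch in "HWY":
            # Rule 3.b: skipped without touching `last`, so a same-coded
            # consonant on the right still counts as a repeat.
            # (Handling Y this way is an assumption.)
            continue
        current = CODES.get(ch, "")
        if current and current != last:
            result += current
        # A vowel sets `last` to "", so the next consonant is coded (rule 3.a).
        last = current
    return (result + "000")[:4]


for name in ("Tymczak", "Ashcraft", "Pfister", "Jackson", "Gutierrez"):
    print(name, soundex(name))
```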

Here is my code so far:

surname = raw_input("Please enter surname:")
outstring = ""

outstring = outstring + surname[0]
for i in range (1, len(surname)):
        nextletter = surname[i]
        if nextletter in ['B','F','P','V']:
            outstring = outstring + '1'

        elif nextletter in ['C','G','J','K','Q','S','X','Z']:
            outstring = outstring + '2'

        elif nextletter in ['D','T']:
            outstring = outstring + '3'

        elif nextletter in ['L']:
            outstring = outstring + '4'

        elif nextletter in ['M','N']:
            outstring = outstring + '5'

        elif nextletter in ['R']:
            outstring = outstring + '6'

print outstring

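This covers the basic requirements; I'm just not sure how to code the three additional rules. That is where I need help, so any help is appreciated.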

score 1 · Accepted Answer

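I'd suggest you try the following.

Store CurrentCoded and LastCoded variables to work with before appending to your output
Break the system down into useful functions, such as
1. Boolean IsVowel(Char)
2. Int Encode(Char)
3. Boolean IsRule1(Char, Char)

Once you have broken it down nicely, it should become easier to manage.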
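
One way to read that decomposition in Python, as a rough sketch. The names `is_vowel`, `encode`, and `same_sound` are illustrative stand-ins for IsVowel, Encode, and IsRule1, and their exact signatures are an assumption:

```python
VOWELS = set("AEIOU")

# Letter groups from the coding guide in the question.
CODES = {letter: digit
         for digit, letters in (("1", "BFPV"), ("2", "CGJKQSXZ"), ("3", "DT"),
                                ("4", "L"), ("5", "MN"), ("6", "R"))
         for letter in letters}


def is_vowel(ch):
    """True for A, E, I, O, U, the separators of rule 3.a."""
    return ch.upper() in VOWELS


def encode(ch):
    """Soundex digit for a consonant, or "" for a disregarded letter."""
    return CODES.get(ch.upper(), "")


def same_sound(a, b):
    """True when two letters share a digit (covers rules 1 and 2)."""
    return encode(a) != "" and encode(a) == encode(b)


# Each helper can be checked on its own before the main loop is written.
print(is_vowel("a"), is_vowel("h"))
print(encode("k"), repr(encode("w")))
print(same_sound("R", "R"), same_sound("P", "F"), same_sound("S", "T"))
```

The main loop would then call `encode` on each letter, compare the current code with the last one, and use `is_vowel` to decide when the last code should be reset.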

score 0 · Accepted Answer

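This is hardly perfect (for example, it produces the wrong result if the input doesn't start with a letter), and it doesn't implement the rules as independently testable functions, so it won't really serve as an answer to the homework question. But this is how I'd implement it: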

import re

def soundex_prepare(s):
    """Prepare string for Soundex encoding.

    Remove non-alpha characters (and the not-of-interest W/H/Y),
    convert to upper case, and remove all runs of repeated letters."""
    p = re.compile("[^a-gi-vxz]", re.IGNORECASE)
    s = re.sub(p, "", s).upper()
    for c in set(s):
        s = re.sub(c + "{2,}", c, s)
    return s

def soundex_encode(s):
    """Encode a name string using the Soundex algorithm."""
    result = s[0].upper()
    s = soundex_prepare(s[1:])
    letters = 'ABCDEFGIJKLMNOPQRSTUVXZ'
    codes   = '.123.12.22455.12623.122'
    d = dict(zip(letters, codes))
    # Start from the first letter's own code so that a same-coded
    # neighbour (the F in Pfister) is not coded a second time.
    prev_code = d.get(result, "")
    for c in s:
        code = d[c]
        if code != "." and code != prev_code:
            result += code
            if len(result) >= 4: break
        prev_code = code
    return (result + "0000")[:4]

score 0 · Accepted Answer

import itertools #importing itertools module to access the groupby function
import re #importing re module to access the sub function

surname = input("Enter surname of the author: ") #asks user to input the author's surname

while surname != "": #loops as long as the input is not an empty line

    str_ini = surname[0].upper() #denotes the initial letter of the surname
    mod_str1 = str_ini + re.sub(r'[HWY]', '', surname[1:].upper())
                #dropping H, W and Y after the initial so that they do not separate consonants (rule 3.b)
                #treating Y like H and W is an assumption; the rules only say that Y is ignored

    mod_str2 = re.sub(r'[AEIOUHWY]', '.', mod_str1)
                #marking vowels (and an initial H, W or Y) with '.' instead of deleting them,
                #so that they still separate consonants with the same code (rule 3.a)

    mod_str21 = re.sub(r'[BFPV]', '1', mod_str2)
    mod_str22 = re.sub(r'[CGJKQSXZ]', '2', mod_str21)
    mod_str23 = re.sub(r'[DT]', '3', mod_str22)
    mod_str24 = re.sub(r'[L]', '4', mod_str23)
    mod_str25 = re.sub(r'[MN]', '5', mod_str24)
    mod_str26 = re.sub(r'[R]', '6', mod_str25)
                #substituting given letters with specific numbers as required by the soundex algorithm

    mod_str3 = ''.join(char for char, rep in itertools.groupby(mod_str26))
                #replacing sequences of identical characters with a single character

    mod_str4 = str_ini + mod_str3[1:].replace('.', '')
                #restoring the initial in place of its code and dropping the '.' markers

    mod_str5 = (mod_str4 + "000")[:4] #padding with trailing zeros and limiting the code to four characters

    print(mod_str5 + "\n")

    print("Press enter to exit") #tells the user that an empty line ends the while loop
    surname = input("Enter surname of the author: ") #asks for the next surname if the user wants to carry on
