I'm working with data in the following form:

name    phone   email   website
Diane Grant Albrecht M.S.           
Lannister G. Cersei M.A.T., CEP 111-222-3333    cersei@got.com  www.got.com
Argle D. Bargle Ed.M.           
Sam D. Man Ed.M.    000-000-1111    dman123@gmail.com   www.daManWithThePlan.com
Sam D. Man Ed.M.            
Sam D. Man Ed.M.    111-222-333     dman123@gmail.com   www.daManWithThePlan.com
D G Bamf M.S.           
Amy Tramy Lamy Ph.D.            

and I'd like it to look like this:

name    phone   email   website area    degree
Diane Grant Albrecht                    M.S.
Lannister G. Cersei 111-222-3333    cersei@got.com  www.got.com CEP M.A.T.
Argle D. Bargle                 Ed.M.
Sam D. Man  000-000-1111, 111-222-3333  dman123@gmail.com       dman123@gmail.com       Ed.M.
D G Bamf                    M.S.
Amy Tramy Lamy                  Ph.D.

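You'll notice that the 'name' field can contain a person's name, degree, and area of practice.

(You may also notice that the last two 'Sam D. Man...' entries are gone. That doesn't matter for this question; I remove duplicates in a later stage.)

So I start by going through the 'name' column and trying to parse it, to separate out the area of practice (e.g. CEP) and the degree (e.g. Ph.D.). I try to write these to the newly created fields 'area' and 'degree', and save the modified, shortened name back to the 'name' field. By the end of this section, each 'name' field should ideally contain only the person's name.

However, when I run the script, it has no effect on the 'name' field. How can I adjust the script so that it changes the names?

Thanks!

Here is my script, commented to make it easier to digest: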

import csv
import re

# Stores a list of dictionaries, each dictionary containing a person's entry with keys corresponding to variable names (ex: [{'name':'Sam', 'phone':'111-111-1111'...},{}])
myjson = []
# Add fields 'flag', 'area' and 'degree'; 'area' and 'degree' store the area of practice and degree earned, which will be parsed from the 'name' field
with open("ieca_first_col_fake_text.txt", "rU") as f:
    sheet = csv.DictReader(f, delimiter="\t")
    sheet.fieldnames.append('flag')
    sheet.fieldnames.append('area')
    sheet.fieldnames.append('degree')
    for row in sheet:
        myjson.append(row)

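At this point I have a list of dictionaries called 'myjson'. Each dictionary represents one entry in the database. I go on to look at the 'name' field: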

degrees = ['M.A.T.','Ph.D.','MA','J.D.','Ed.M.', 'M.A.', 'M.B.A.', 'Ed.S.', 'M.Div.', 'M.Ed.', 'RN', 'B.S.Ed.', 'M.D.', 'M.S.']

# Parse name element
for row in myjson:

    # check whether the name string has an area of practice by checking if there's a comma separator
    if ',' in row['name']:

        # separate area of practice from name and degree and bind this to var 'area'. If error, area is an empty list
        split_area_nmdeg = row['name'].split(',')
        try:
            row['area'].append(split_area_nmdeg.pop())
        except AttributeError:
            row['area'] = []

        # Split the name and deg by spaces. If there's a deg, it will match with one of elements and will be stored deg list. The deg is removed name_deg list and all that's left is the name.
        split_name_deg = re.split('\s',split_area_nmdeg[0])
        for word in split_name_deg:
            for deg in degrees:
                if deg == word:
                    try:
                        row['degree'].append(split_name_deg.pop())
                    except AttributeError:
                        row['degree'] = []
                row['name'] = ' '.join(split_name_deg)
                print row['name']

    # if the name string does not contain a comma and therefore does not contain an area of practice
    else:
        row['area'] = []
        split_name_deg = re.split('\s',row['name'])
        for word in split_name_deg:
            for deg in degrees:
                try:
                    if deg == word:
                        row['degree'].append(split_name_deg.pop())
                except AttributeError:
                    row['degree'] = []
                row['name'] = ' '.join(split_name_deg)
                print row['name']

Checking the output:

for row in myjson:
    print row

it looks like this:

{'website': '', 'name': 'Diane Grant Albrecht M.S.', 'degree': [], 'area': [], 'phone': '', 'flag': None, 'email': ''}
{'website': 'www.got.com', 'name': 'Lannister G. Cersei M.A.T.', 'degree': [], 'area': [], 'phone': '111-222-3333', 'flag': None, 'email': 'cersei@got.com'}
{'website': '', 'name': 'Argle D. Bargle Ed.M.', 'degree': [], 'area': [], 'phone': '', 'flag': None, 'email': ''}
{'website': 'www.daManWithThePlan.com', 'name': 'Sam D. Man Ed.M.', 'degree': [], 'area': [], 'phone': '000-000-1111', 'flag': None, 'email': 'dman123@gmail.com'}
{'website': '', 'name': 'Sam D. Man Ed.M.', 'degree': [], 'area': [], 'phone': '', 'flag': None, 'email': ''}
{'website': 'www.daManWithThePlan.com', 'name': 'Sam D. Man Ed.M.', 'degree': [], 'area': [], 'phone': '111-222-333', 'flag': None, 'email': '    dman123@gmail.com'}
{'website': '', 'name': 'D G Bamf M.S.', 'degree': [], 'area': [], 'phone': '', 'flag': None, 'email': ''}
{'website': '', 'name': 'Amy Tramy Lamy Ph.D.', 'degree': [], 'area': [], 'phone': '', 'flag': None, 'email': ''}
first_row {'website': '', 'name': 'Diane Grant Albrecht M.S.', 'degree': [], 'area': [], 'phone': '', 'email': ''}

1 Answer

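I don't think your method for determining whether a degree is in the name works. Unfortunately I couldn't run a full test: when I pasted your sample data into a text file, I don't think the tabs were preserved, so reading the data into dictionaries didn't work. However, using the output shown in your printed rows above, I built the dictionaries by hand, and the code below seems to find the degrees and split them into a separate field: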

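for row in myjson:
    for d in degrees:
        # substring test: finds the degree anywhere in the name string
        if d in row['name']:
            row['degree'] = d
            # keep only the text before the degree, without the trailing space
            row['name'] = row['name'][:row['name'].find(d)].rstrip()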
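
One thing that may be contributing to the original behaviour: fields appended to `fieldnames` come back from `csv.DictReader` as `None` (its default `restval`), so `row['degree'].append(split_name_deg.pop())` raises `AttributeError` only after `pop()` has already removed an item, and that item is then thrown away. `pop()` also removes the last word of the list, which is not necessarily the word that matched. I can't confirm that this is the whole story without the real tab-separated file. As a rough sketch of a whole-word alternative that also pulls out the area of practice, something like the following might work. `split_name` and `DEGREES` are names I made up for illustration, and treating everything after the first comma as the area is an assumption based on your sample data:

```python
# Hypothetical helper, not part of the original script.
DEGREES = ['M.A.T.', 'Ph.D.', 'MA', 'J.D.', 'Ed.M.', 'M.A.', 'M.B.A.',
           'Ed.S.', 'M.Div.', 'M.Ed.', 'RN', 'B.S.Ed.', 'M.D.', 'M.S.']


def split_name(raw, degrees=DEGREES):
    """Return (name, degree, area) parsed from a raw 'name' value."""
    # Assumption: everything after the first comma is the area of practice.
    head, _, area = raw.partition(',')
    words = head.split()  # split on any whitespace, ignoring leading/trailing runs
    # Compare whole words, so 'M.A.' cannot match inside 'M.A.T.'.
    found = [w for w in words if w in degrees]
    name = ' '.join(w for w in words if w not in degrees)
    return name, ', '.join(found), area.strip()


# Two rows shaped like the sample data, built by hand.
rows = [{'name': 'Lannister G. Cersei M.A.T., CEP'},
        {'name': 'Diane Grant Albrecht M.S. '}]
for row in rows:
    row['name'], row['degree'], row['area'] = split_name(row['name'])
```

This stores the degree and area as plain strings rather than lists. If you need lists for the later de-duplication step, returning `found` directly instead of joining it would be the obvious change.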
answered 2013-07-09T16:56:20.957