I am working with data in the following form:
name phone email website
Diane Grant Albrecht M.S.
Lannister G. Cersei M.A.T., CEP 111-222-3333 cersei@got.com www.got.com
Argle D. Bargle Ed.M.
Sam D. Man Ed.M. 000-000-1111 dman123@gmail.com www.daManWithThePlan.com
Sam D. Man Ed.M.
Sam D. Man Ed.M. 111-222-333 dman123@gmail.com www.daManWithThePlan.com
D G Bamf M.S.
Amy Tramy Lamy Ph.D.
and would like it to look like this:
name phone email website area degree
Diane Grant Albrecht M.S.
Lannister G. Cersei 111-222-3333 cersei@got.com www.got.com CEP M.A.T.
Argle D. Bargle Ed.M.
Sam D. Man 000-000-1111, 111-222-3333 dman123@gmail.com dman123@gmail.com Ed.M.
D G Bamf M.S.
Amy Tramy Lamy Ph.D.
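You'll notice that the 'name' field can contain a person's name, degree, and area of practice.
(You may also notice that the latter two 'Sam D. Man...' entries are missing. That doesn't matter for this question; I remove duplicates in the next stage.)
So I first go through the 'name' column and try to parse it, separating out the area of practice (e.g. CEP) and the degree (e.g. Ph.D.). I try to write these to the newly created fields 'area' and 'degree', and save the modified/shortened name back to the 'name' field. By the end of this step, each 'name' field should ideally contain only the person's name.
However, when I run the script, it has no effect on the person's 'name' field. How can I adjust the script so that it actually changes the names?
Thanks!
Here is my script, commented to make it easier to digest: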
import csv
import re

# Stores a list of dictionaries, each dictionary containing a person's entry with keys corresponding to variable names (ex: [{'name':'Sam', 'phone':'111-111-1111'...},{}])
myjson = []
# Add fields 'area' and 'degree' to store area of practice and degree earned, which will be parsed from the 'name' field
with open("ieca_first_col_fake_text.txt", "rU") as f:
    sheet = csv.DictReader(f, delimiter="\t")
    sheet.fieldnames.append('flag')
    sheet.fieldnames.append('area')
    sheet.fieldnames.append('degree')
    for row in sheet:
        myjson.append(row)
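A small sketch of what this loading step produces for the appended columns, assuming the file contains only the four original columns. The in-memory `sample` text is a hypothetical stand-in for `ieca_first_col_fake_text.txt`, and the snippet is written for Python 3. `csv.DictReader` fills any field name that has no value in a row with its `restval` argument, which defaults to `None`.

```python
import csv
import io

# Hypothetical stand-in for the tab-separated input file (header + one row).
sample = "name\tphone\temail\twebsite\nDiane Grant Albrecht M.S.\t\t\t\n"

reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
# Same pattern as the script: append extra column names before iterating.
reader.fieldnames.append("area")
reader.fieldnames.append("degree")

rows = [dict(row) for row in reader]
first = rows[0]
# The appended fields hold None, not empty lists.
print(first["area"], first["degree"])
```

So any code that later calls a list method on `row['area']` or `row['degree']` would be calling it on `None` unless the field is assigned a list first.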
At this point I have a list of dictionaries called 'myjson'. Each dictionary represents one entry in the database. I then go through the 'name' field:
degrees = ['M.A.T.','Ph.D.','MA','J.D.','Ed.M.', 'M.A.', 'M.B.A.', 'Ed.S.', 'M.Div.', 'M.Ed.', 'RN', 'B.S.Ed.', 'M.D.', 'M.S.']
# Parse name element
for row in myjson:
    # check whether the name string has an area of practice by checking if there's a comma separator
    if ',' in row['name']:
        # separate area of practice from name and degree and bind this to var 'area'. If error, area is an empty list
        split_area_nmdeg = row['name'].split(',')
        try:
            row['area'].append(split_area_nmdeg.pop())
        except AttributeError:
            row['area'] = []
        # Split the name and degree by whitespace. If there's a degree, it will match one of the elements of 'degrees' and be stored in the degree list. The degree is removed from the split_name_deg list and all that's left is the name.
        split_name_deg = re.split('\s',split_area_nmdeg[0])
        for word in split_name_deg:
            for deg in degrees:
                if deg == word:
                    try:
                        row['degree'].append(split_name_deg.pop())
                    except AttributeError:
                        row['degree'] = []
        row['name'] = ' '.join(split_name_deg)
        print row['name']
    # if the name string does not contain a comma and therefore does not contain an area of practice
    else:
        row['area'] = []
        split_name_deg = re.split('\s',row['name'])
        for word in split_name_deg:
            for deg in degrees:
                try:
                    if deg == word:
                        row['degree'].append(split_name_deg.pop())
                except AttributeError:
                    row['degree'] = []
        row['name'] = ' '.join(split_name_deg)
        print row['name']
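A minimal illustration of how the `try`/`except AttributeError` pattern in this loop behaves, assuming the appended field starts out as `None` (the `csv.DictReader` default for columns missing from the file). The variable names are hypothetical and the snippet is written for Python 3. Python looks up `.append` on the object before it evaluates the argument, so when the object is `None` the `pop()` call is never reached.

```python
degree = None  # assumed initial value of the appended 'degree' field
words = ["Sam", "D.", "Man", "Ed.M."]

try:
    # Looking up .append on None raises first, so words.pop() never runs.
    degree.append(words.pop())
except AttributeError:
    degree = []

print(words)   # the degree token is still in the list
print(degree)  # the list was created, but nothing was stored in it
```

Under that assumption, the first matching degree only initializes the list, and the name that gets joined back together still contains the degree.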
Checking the output:
for row in myjson:
    print row
It looks like this:
{'website': '', 'name': 'Diane Grant Albrecht M.S.', 'degree': [], 'area': [], 'phone': '', 'flag': None, 'email': ''}
{'website': 'www.got.com', 'name': 'Lannister G. Cersei M.A.T.', 'degree': [], 'area': [], 'phone': '111-222-3333', 'flag': None, 'email': 'cersei@got.com'}
{'website': '', 'name': 'Argle D. Bargle Ed.M.', 'degree': [], 'area': [], 'phone': '', 'flag': None, 'email': ''}
{'website': 'www.daManWithThePlan.com', 'name': 'Sam D. Man Ed.M.', 'degree': [], 'area': [], 'phone': '000-000-1111', 'flag': None, 'email': 'dman123@gmail.com'}
{'website': '', 'name': 'Sam D. Man Ed.M.', 'degree': [], 'area': [], 'phone': '', 'flag': None, 'email': ''}
{'website': 'www.daManWithThePlan.com', 'name': 'Sam D. Man Ed.M.', 'degree': [], 'area': [], 'phone': '111-222-333', 'flag': None, 'email': ' dman123@gmail.com'}
{'website': '', 'name': 'D G Bamf M.S.', 'degree': [], 'area': [], 'phone': '', 'flag': None, 'email': ''}
{'website': '', 'name': 'Amy Tramy Lamy Ph.D.', 'degree': [], 'area': [], 'phone': '', 'flag': None, 'email': ''}
first_row {'website': '', 'name': 'Diane Grant Albrecht M.S.', 'degree': [], 'area': [], 'phone': '', 'email': ''}
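In the output above, every 'degree' and 'area' is an empty list and every name still carries its degree. One possible way to restructure the parsing is sketched below; it is an illustration under stated assumptions, not a drop-in replacement for the script. `split_name` and `DEGREES` are hypothetical names, everything after the first comma is assumed to be an area of practice, and the two sample rows are taken from the data shown earlier. It builds fresh lists instead of appending to fields that may still be `None`.

```python
DEGREES = {'M.A.T.', 'Ph.D.', 'MA', 'J.D.', 'Ed.M.', 'M.A.', 'M.B.A.',
           'Ed.S.', 'M.Div.', 'M.Ed.', 'RN', 'B.S.Ed.', 'M.D.', 'M.S.'}


def split_name(raw):
    """Hypothetical helper: return (name, areas, degrees) from a raw name."""
    # Assumption: anything after the first comma is an area of practice.
    head, _, tail = raw.partition(',')
    areas = [part.strip() for part in tail.split(',') if part.strip()]
    words = head.split()
    # Separate known degree tokens from the words of the person's name.
    degrees = [w for w in words if w in DEGREES]
    name = ' '.join(w for w in words if w not in DEGREES)
    return name, areas, degrees


rows = [{'name': 'Lannister G. Cersei M.A.T., CEP'},
        {'name': 'Amy Tramy Lamy Ph.D.'}]
for row in rows:
    # Assign new values rather than mutating fields that may be None.
    row['name'], row['area'], row['degree'] = split_name(row['name'])
    print(row)
```

Returning new values from a helper avoids both the `None.append` problem and removing items from a list while iterating over it.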