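I have a data frame of tweets, with one tweet per row. I can get key phrases using AWS Comprehend's batch_detect_key_phrases(), which returns a ResultList and an ErrorList in its payload. To merge the key phrase results back into the data frame, they have to line up with the original tweets, so I need to keep the ResultList and ErrorList aligned.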
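For context, here is a minimal sketch of how such a call might be wrapped. The helper name `fetch_key_phrases` is hypothetical, and `client` is assumed to be a Comprehend client created with `boto3.client("comprehend")`; it is passed in as a parameter so the function can be exercised without AWS credentials.

```python
def fetch_key_phrases(client, tweets, language_code="en"):
    """Send one batch of tweets to Comprehend and return (ResultList, ErrorList)."""
    # Assumption: `client` is boto3.client("comprehend"). Comprehend limits
    # how many documents one batch call may contain, so `tweets` is assumed
    # to already be a single batch within that limit.
    response = client.batch_detect_key_phrases(
        TextList=list(tweets),
        LanguageCode=language_code,
    )
    # Treat a missing key as an empty list so callers can always iterate.
    return response.get("ResultList", []), response.get("ErrorList", [])
```

Both returned lists carry an `Index` field that refers to the position of the document in `TextList`, which is what makes realignment with the data frame possible.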
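The code at line 267 processes the ErrorList and ResultList separately.
According to the Python Boto documentation, "ErrorList (list) - A list containing one object for each document that contained an error. The results are sorted in ascending order by the Index field and match the order of the documents in the input list. .."
The code I wrote below uses the ResultList and ErrorList index numbers to make sure they are merged correctly into a single keyPhrases list, which will then be merged back into the original data frame. Essentially, keyPhrases[0] holds the key phrases associated with row 0 of the data frame. If there was an error processing a tweet, a placeholder error message is added for that row in the data frame.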
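The only other way I can think of to keep the ResultList and ErrorList aligned is to combine the two lists into one larger list, sorted in ascending order by their respective indexes, and then process that one larger list.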
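The merge described above could be sketched roughly as follows. This is only an illustration: `merge_by_index` is a hypothetical helper name, and it assumes every entry in both lists carries an integer `Index` key and that error entries can be recognized by an `ErrorCode` key, as in the sample response.

```python
def merge_by_index(result_list, error_list):
    """Return the entries of both lists as one list ordered by 'Index'."""
    # sorted() does not rely on either input already being in order.
    return sorted(result_list + error_list, key=lambda entry: entry["Index"])


results = [{"Index": 0, "KeyPhrases": []}, {"Index": 2, "KeyPhrases": []}]
errors = [{"Index": 1, "ErrorCode": "123", "ErrorMessage": "example"}]

merged = merge_by_index(results, errors)
# Assumption: an entry is an error when it has an 'ErrorCode' key.
kinds = ["error" if "ErrorCode" in entry else "result" for entry in merged]
print(kinds)  # ['result', 'error', 'result']
```

Each entry of the merged list can then be handled in a single pass, with its position in the list matching the row it belongs to as long as every input document appears in exactly one of the two lists.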
Is there an easier way to process the ResultList and ErrorList so that they stay aligned?
keyphraseResults = {'ResultList': [
        {'Index': 0, 'KeyPhrases': [{'Score': 0.9999997615814209, 'Text': 'financial status', 'BeginOffset': 26, 'EndOffset': 42}, {'Score': 1.0, 'Text': 'my job', 'BeginOffset': 58, 'EndOffset': 64}, {'Score': 1.0, 'Text': 'title', 'BeginOffset': 69, 'EndOffset': 71}, {'Score': 1.0, 'Text': 'a new job', 'BeginOffset': 77, 'EndOffset': 86}]},
        {'Index': 1, 'KeyPhrases': [{'Score': 0.9999849796295166, 'Text': 'Holy moley', 'BeginOffset': 0, 'EndOffset': 4}, {'Score': 1.0, 'Text': 'Batman', 'BeginOffset': 27, 'EndOffset': 29}, {'Score': 1.0, 'Text': 'has a jacket', 'BeginOffset': 47, 'EndOffset': 55}]},
        {'Index': 3, 'KeyPhrases': [{'Score': 0.9999970197677612, 'Text': 'USA', 'BeginOffset': 4, 'EndOffset': 7}]},
        {'Index': 5, 'KeyPhrases': [{'Score': 0.9999970197677612, 'Text': 'home town', 'BeginOffset': 6, 'EndOffset': 15}]}],
    'ErrorList': [{"ErrorCode": "123", "ErrorMessage": "First error goes here", "Index": 2},
                  {"ErrorCode": "456", "ErrorMessage": "Second error goes here", "Index": 4}],
    'ResponseMetadata': {'RequestId': '123b6c73-45e0-4595-b943-612accdef41b', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '123b6c73-e5f7-4b95-b52s-612acc71341d', 'content-type': 'application/x-amz-json-1.1', 'content-length': '1125', 'date': 'Sat, 06 Jun 2020 20:38:04 GMT'}, 'RetryAttempts': 0}}

# Holds the ordered list of key phrases that correspond to the data frame.
keyPhrases = []

# Set it to an arbitrarily large number so that, if ErrorList below is empty,
# we still have a number for comparison.
errIndexlist = [9999]

# This will be inserted for the rows corresponding to the ErrorList.
ErrorMessage = "* Error processing keyphrases"

# Since the rows of the response need to be kept in alignment with the rows of the dataframe,
# get the error indices first, if any. These will be compared to the ResultList below.
if 'ErrorList' in keyphraseResults and len(keyphraseResults['ErrorList']) > 0:
    batchErroresults = keyphraseResults["ErrorList"]
    errIndexlist = []
    for entry in batchErroresults:
        errIndexlist.append(entry["Index"])
        print(entry)
    # Sort the indices to ensure they are in ascending order since that order is
    # important for the logic below.
    errIndexlist.sort(reverse = False)

if 'ResultList' in keyphraseResults:
    batchResults = keyphraseResults["ResultList"]
    for entry in batchResults:
        resultDict = entry["KeyPhrases"]
        if len(errIndexlist) > 0:
            if entry['Index'] < errIndexlist[0]:
                results = ""
                for textDict in resultDict:
                    results = results + ", " + textDict['Text']
                # Remove the leading comma.
                if len(results) > 1:
                    results = results[2:]
                keyPhrases.append(results)
            else:
                # Else we have an error to merge from the PRIOR result.
                keyPhrases.append(ErrorMessage)
                errIndexlist.remove(errIndexlist[0])
                # THEN add the key phrase for the current result.
                results = ""
                for textDict in resultDict:
                    results = results + ", " + textDict['Text']
                # Remove the leading comma.
                if len(results) > 1:
                    results = results[2:]
                keyPhrases.append(results)

print("\nFinal results are:")
for text in keyPhrases:
    print(text)
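
For comparison, one way the alignment might be simplified is to pre-fill a list with the placeholder and write each result directly at its `Index`. This is a sketch under stated assumptions, not a tested answer: `align_key_phrases` and `row_count` are hypothetical names, `row_count` is assumed to be the number of tweets sent in the batch, and every `Index` is assumed to refer to a position within that batch.

```python
ERROR_PLACEHOLDER = "* Error processing keyphrases"


def align_key_phrases(response, row_count, placeholder=ERROR_PLACEHOLDER):
    """Return one string per input row, positioned by each entry's 'Index'."""
    # Start with the placeholder everywhere, so rows listed in ErrorList
    # (or missing from both lists) simply keep it.
    aligned = [placeholder] * row_count
    for entry in response.get("ResultList", []):
        # Join the phrase texts for this document into one comma-separated string.
        aligned[entry["Index"]] = ", ".join(p["Text"] for p in entry["KeyPhrases"])
    return aligned


sample = {
    "ResultList": [
        {"Index": 0, "KeyPhrases": [{"Text": "financial status"}, {"Text": "my job"}]},
        {"Index": 2, "KeyPhrases": [{"Text": "USA"}]},
    ],
    "ErrorList": [{"Index": 1, "ErrorCode": "123", "ErrorMessage": "example"}],
}
print(align_key_phrases(sample, row_count=3))
# ['financial status, my job', '* Error processing keyphrases', 'USA']
```

Because each result is placed by its own index, this version does not depend on the order of either list and never needs to read the ErrorList at all, at the cost of treating every row without a result as an error.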