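I have a data frame of tweets, one tweet per row. I can get key phrases with AWS Comprehend's batch_detect_key_phrases(), which returns a ResultList and an ErrorList in its payload. To merge the key phrase results back into the data frame, they have to line up with the original tweets, so I need to keep ResultList and ErrorList aligned.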

The code at line 267 handles ErrorList and ResultList separately.

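According to the Python Boto documentation, "ErrorList (list) -- A list containing one object for each document that contained an error. The results are sorted in ascending order by the Index field and match the order of the documents in the input list. .."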

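The code I wrote below uses the Index values of ResultList and ErrorList to make sure both are merged correctly into a single keyPhrases list, which is then merged back into the original data frame. Essentially, keyPhrases[0] holds the key phrases for row 0 of the data frame. If a tweet could not be processed, a placeholder error message is added for that row instead.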

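The only other way I can think of to keep ResultList and ErrorList aligned is to combine the two lists into one larger list, ordered by their respective Index values in ascending order, and then process that single list.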

Is there an easier way to process ResultList and ErrorList so that they stay aligned?

keyphraseResults = {'ResultList': [
            {'Index': 0, 'KeyPhrases': [{'Score': 0.9999997615814209, 'Text': 'financial status', 'BeginOffset': 26, 'EndOffset': 42}, {'Score': 1.0, 'Text': 'my job', 'BeginOffset': 58, 'EndOffset': 64}, {'Score': 1.0, 'Text': 'title', 'BeginOffset': 69, 'EndOffset': 71}, {'Score': 1.0, 'Text': 'a new job', 'BeginOffset': 77, 'EndOffset': 86}]}, 
            {'Index': 1, 'KeyPhrases': [{'Score': 0.9999849796295166, 'Text': 'Holy moley', 'BeginOffset': 0, 'EndOffset': 4}, {'Score': 1.0, 'Text': 'Batman', 'BeginOffset': 27, 'EndOffset': 29}, {'Score': 1.0, 'Text': 'has a jacket', 'BeginOffset': 47, 'EndOffset': 55}]},                 
            {'Index': 3, 'KeyPhrases': [{'Score': 0.9999970197677612, 'Text': 'USA', 'BeginOffset': 4, 'EndOffset': 7}]}, 
            {'Index': 5, 'KeyPhrases': [{'Score': 0.9999970197677612, 'Text': 'home town', 'BeginOffset': 6, 'EndOffset': 15}]}], 
'ErrorList': [{"ErrorCode": "123", "ErrorMessage": "First error goes here", "Index": 2},
              {"ErrorCode": "456", "ErrorMessage": "Second error goes here", "Index": 4}], 
'ResponseMetadata': {'RequestId': '123b6c73-45e0-4595-b943-612accdef41b', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '123b6c73-e5f7-4b95-b52s-612acc71341d', 'content-type': 'application/x-amz-json-1.1', 'content-length': '1125', 'date': 'Sat, 06 Jun 2020 20:38:04 GMT'}, 'RetryAttempts': 0}}

# Holds the ordered list of key phrases that correspond to the data frame. 
keyPhrases = []

# Default to an arbitrarily large sentinel so that, if ErrorList below is empty,
# there is still a number to compare against.
errIndexlist = [9999]

# This will be inserted for the rows corresponding to the ErrorList. 
ErrorMessage = "* Error processing keyphrases"

# Since the rows of the response need to be kept in alignment with the rows of the dataframe, 
# get the error indices first, if any. These will be compared to the ResultList below.
if 'ErrorList' in keyphraseResults and len(keyphraseResults['ErrorList']) > 0:
    batchErroresults = keyphraseResults["ErrorList"]
    errIndexlist = []

    for entry in batchErroresults:
        errIndexlist.append(entry["Index"])
        print(entry)

# Sort the indices to ensure they are in ascending order, since that order is
# important for the logic below.
errIndexlist.sort(reverse = False)

if 'ResultList' in keyphraseResults:

    batchResults = keyphraseResults["ResultList"]

    for entry in batchResults:

        resultDict = entry["KeyPhrases"]

        if len(errIndexlist) > 0:

            if entry['Index'] < errIndexlist[0]:

                results = ""
                for textDict in resultDict: 
                    results = results + ", " + textDict['Text']

                # Remove the leading comma.
                if len(results) > 1:
                    results = results[2:]

                keyPhrases.append(results)

            else:
                # Else we have an error to merge from the PRIOR result.
                keyPhrases.append(ErrorMessage)
                errIndexlist.remove(errIndexlist[0])

                # THEN add the key phrase for the current result.
                results = ""
                for textDict in resultDict: 
                    results = results + ", " + textDict['Text']

                # Remove the leading comma.
                if len(results) > 1:
                    results = results[2:]

                keyPhrases.append(results)

print("\nFinal results are:")
for text in keyPhrases:
    print(text)

1 Answer

I figured it out based on this SO post.

In short: merge ResultList and ErrorList, sort the merged list by Index, then process the merged list in order.

from operator import itemgetter

keyphraseResults = {'ResultList': [
        {'Index': 0, 'KeyPhrases': [{'Score': 0.9999997615814209, 'Text': 'financial status', 'BeginOffset': 26, 'EndOffset': 42}, {'Score': 1.0, 'Text': 'my job', 'BeginOffset': 58, 'EndOffset': 64}, {'Score': 1.0, 'Text': 'title', 'BeginOffset': 69, 'EndOffset': 71}, {'Score': 1.0, 'Text': 'a new job', 'BeginOffset': 77, 'EndOffset': 86}]}, 
        {'Index': 1, 'KeyPhrases': [{'Score': 0.9999849796295166, 'Text': 'Holy moley', 'BeginOffset': 0, 'EndOffset': 4}, {'Score': 1.0, 'Text': 'Batman', 'BeginOffset': 27, 'EndOffset': 29}, {'Score': 1.0, 'Text': 'has a jacket', 'BeginOffset': 47, 'EndOffset': 55}]},                 
        {'Index': 3, 'KeyPhrases': [{'Score': 0.9999970197677612, 'Text': 'USA', 'BeginOffset': 4, 'EndOffset': 7}]}, 
        {'Index': 5, 'KeyPhrases': [{'Score': 0.9999970197677612, 'Text': 'home town', 'BeginOffset': 6, 'EndOffset': 15}]}], 
        'ErrorList': [{"ErrorCode": "123", "ErrorMessage": "First error goes here", "Index": 2},
          {"ErrorCode": "456", "ErrorMessage": "Second error goes here", "Index": 4}], 
        'ResponseMetadata': {'RequestId': '123b6c73-45e0-4595-b943-612accdef41b',   'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '123b6c73-e5f7-4b95-b52s-612acc71341d', 'content-type': 'application/x-amz-json-1.1', 'content-length': '1125', 'date': 'Sat, 06 Jun 2020 20:38:04 GMT'}, 'RetryAttempts': 0}}

# One entry per tweet, in the same order as the rows of the data frame.
keyPhrases = []

# Placeholder inserted for the rows that appear in ErrorList.
ErrorMessage = "* Error processing keyphrases"

# Concatenation works even when either list is empty; .get() covers a missing key.
processResults = keyphraseResults.get("ResultList", []) + keyphraseResults.get("ErrorList", [])

# Sort by Index so results and errors interleave in the original document order.
processResults = sorted(processResults, key=itemgetter('Index'))

for entry in processResults:

    if 'ErrorCode' in entry:
        keyPhrases.append(ErrorMessage)

    elif 'KeyPhrases' in entry:
        # Join the phrase texts with ", " so there is no leading comma to strip.
        keyPhrases.append(", ".join(textDict['Text'] for textDict in entry["KeyPhrases"]))

print("\nFinal results are:")
for text in keyPhrases:
    print(text)
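
A possible variation that skips the merge and sort, offered only as a sketch: since each Index is the zero-based position of the document in the input list, the output list can be pre-filled with the placeholder and the successful results written straight into their slots. This assumes you still know how many documents were sent in the batch; the names align_key_phrases, batch_size and error_text are my own and not part of Boto, and the sample data is a trimmed-down, made-up response in the same shape as above.

```python
def align_key_phrases(response, batch_size, error_text="* Error processing keyphrases"):
    """Return one string per input document, in input order."""
    # Start with the placeholder in every slot; successful results overwrite it,
    # so any document listed in ErrorList simply keeps the placeholder.
    aligned = [error_text] * batch_size
    for entry in response.get("ResultList", []):
        aligned[entry["Index"]] = ", ".join(p["Text"] for p in entry["KeyPhrases"])
    return aligned


# Hypothetical, shortened response with the same structure as the real payload.
sample = {
    "ResultList": [
        {"Index": 0, "KeyPhrases": [{"Text": "financial status"}, {"Text": "my job"}]},
        {"Index": 2, "KeyPhrases": [{"Text": "USA"}]},
    ],
    "ErrorList": [{"ErrorCode": "123", "ErrorMessage": "First error goes here", "Index": 1}],
}

aligned = align_key_phrases(sample, batch_size=3)
for row, text in enumerate(aligned):
    print(row, text)
```

Because the returned list always has batch_size entries, it should line up one-to-one with the rows of the data frame that were sent in that batch, which is the alignment the question asks for.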
answered 2020-06-07T20:17:29.330