python - How to filter documents by size before sending them to AWS Comprehend via boto3?

Question

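I'm currently trying to run batch sentiment analysis on a set of documents with AWS's Comprehend service, using the boto3 library. The service has some limits on document size (a document cannot exceed 5000 bytes), so I tried to pre-filter the documents before calling the boto3 API. See the code snippet below: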

...
batch = []
for doc in docs:
    if isinstance(doc, str) and len(doc) > 0 and sys.getsizeof(doc) < 5000:
        batch.append(doc)

data = self.client.batch_detect_sentiment(TextList=batch, LanguageCode=language)
...

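My assumption was that filtering the documents with sys.getsizeof would drop any string that exceeds the service's 5000-byte limit. However, even with this filter in place I still get the following exception: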

botocore.errorfactory.TextSizeLimitExceededException: An error occurred (TextSizeLimitExceededException) when calling the BatchDetectSentiment operation: Input text size exceeds limit. Max length of request text allowed is 5000 bytes while in this request the text size is 5523 bytes

Is there a more effective way to compute the size of the documents sent to Comprehend, so that I avoid hitting the maximum document size limit?

score 0 · Accepted Answer

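There are two approaches here:

As Daniel mentioned, you can use len(doc.encode('utf-8')) to determine the final size of the string, because it accounts for the encoding rather than just how much memory the Python string object occupies.
You can handle the exception when it occurs. Like this: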

try:
    data = self.client.batch_detect_sentiment(TextList=batch, LanguageCode=language)
except self.client.exceptions.TextSizeLimitExceededException:
    print('The batch was too long')
else:
    print(data)
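
For the first approach, a minimal sketch of the pre-filter could look like the following. The helper names utf8_size and filter_by_size and the sample documents are made up for illustration, and the 5000-byte ceiling is taken from the error message in the question. Whether a document of exactly 5000 bytes is accepted is an assumption based on that message's wording, so switch to a strict comparison if you want a safety margin.

```python
MAX_BYTES = 5000  # limit quoted in the TextSizeLimitExceededException message


def utf8_size(doc):
    """Return how many bytes the string occupies once encoded as UTF-8."""
    return len(doc.encode('utf-8'))


def filter_by_size(docs, max_bytes=MAX_BYTES):
    """Keep non-empty strings whose UTF-8 encoding fits within max_bytes."""
    return [
        doc for doc in docs
        if isinstance(doc, str) and 0 < utf8_size(doc) <= max_bytes
    ]


# Hypothetical sample input: 'é' takes two bytes in UTF-8, so 3000 of them
# are 6000 bytes on the wire even though the string has only 3000 characters.
docs = ['short text', 'é' * 3000, 'a' * 5001, '', None]
batch = filter_by_size(docs)
```

The resulting batch can then be passed as TextList to batch_detect_sentiment exactly as in the question. Keeping the try/except from the second approach around the call is still worthwhile as a fallback.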
