python-3.x - I am using IBM Cloud Object Storage and want to read a PDF file from storage and store its text content as a string

Question

I am using ibm_boto3, as described in the IBM COS documentation. I have defined the resource as follows:

import ibm_boto3
from ibm_botocore.client import Config, ClientError

cos = ibm_boto3.resource("s3",
    ibm_api_key_id=COS_API_KEY_ID,
    ibm_service_instance_id=SERVICE_INSTANCE_ID,
    ibm_auth_endpoint=COS_AUTH_ENDPOINT,
    config=Config(signature_version="oauth"),
    endpoint_url=COS_ENDPOINT
)

Here is the code I use to fetch the contents of the PDF file:

def get_item(bucket_name, item_name):
    print("Retrieving item from bucket: {0}, key: {1}".format(bucket_name, item_name))
    try:
        file = cos.Object(bucket_name, item_name).get()
        file_content = file["Body"].read()  # returns the object's data as bytes
        # print("\nFILE:-------------------------\n", file)  # shows the metadata of the object
        return file_content
    except ClientError as be:
        print("CLIENT ERROR: {0}\n".format(be))
    except Exception as e:
        print("Unable to retrieve file contents: {0}\n".format(e))

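The returned body is of type ibm_botocore.response.StreamingBody. I cannot convert the bytes it yields into a string. I tried decoding with utf-8 and with base64, but neither works. Decoding with utf-8 produces the following error: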

Unable to retrieve file contents: 'utf-8' codec can't decode byte 0xb5 in position 11: invalid start byte

I also cannot figure out what encoding IBM COS uses.

Thanks in advance.
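
One possible reading of the error, offered as a sketch rather than a confirmed diagnosis: object storage hands back the bytes exactly as they were uploaded, and a PDF is a binary container, not encoded text, so no text codec will turn it into a string. PDF files usually carry a few high-bit bytes in a comment right after the `%PDF-` header, which would explain a UTF-8 failure near position 11. The snippet below checks for the PDF marker, reports where a UTF-8 decode fails, and then hands the bytes to a PDF parser. It assumes the third-party `pypdf` package is installed; the names `looks_like_pdf`, `explain_decode_failure`, `pdf_bytes_to_text`, and the `sample_header` bytes are hypothetical and chosen only for illustration.

```python
import io


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes begin with the PDF magic marker."""
    return data.startswith(b"%PDF-")


def explain_decode_failure(data: bytes) -> str:
    """Try a UTF-8 decode and describe the outcome without raising."""
    try:
        data.decode("utf-8")
        return "decodes as UTF-8"
    except UnicodeDecodeError as err:
        return "byte 0x{0:02x} at position {1} is not valid UTF-8".format(
            data[err.start], err.start
        )


def pdf_bytes_to_text(data: bytes) -> str:
    """Extract text with a PDF parser (assumption: pypdf is installed)."""
    from pypdf import PdfReader  # third-party; imported lazily on purpose

    reader = PdfReader(io.BytesIO(data))
    # extract_text() can come back empty for image-only (scanned) pages.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


# Hypothetical header bytes, chosen to be consistent with the error above.
sample_header = b"%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n"
print(looks_like_pdf(sample_header))
print(explain_decode_failure(sample_header))
```

With the `get_item` function from the question, usage would look like `text = pdf_bytes_to_text(get_item(bucket_name, item_name))`. If the PDF contains only scanned images, a parser of this kind will likely return little or no text, and OCR would be needed instead.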
