python - How to get more insight into why documents fail to be ingested in Watson Discovery Service

Question

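I am using the DiscoveryV1 module of the watson_developer_cloud python library to ingest 700+ documents into a WDS collection. Each time I attempt a bulk ingestion, many documents fail to be ingested. The failures are nondeterministic, and usually around 100 documents fail.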

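Each time I call discovery.add_document(env_id, col_id, file_info=file_info), the response contains a WDS document_id. After making this call for every document in my corpus, I use the corresponding document_ids to call discovery.get_document(env_id, col_id, doc_id) and check each document's status. Around 100 of these calls return the status Document failed to be ingested and indexed. There is no pattern among the failed files: they vary in size and include both msword (doc) and pdf file types.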
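For reference, the status check described above could be sketched roughly as follows. This is only a sketch: find_failed_documents and get_status are hypothetical names, and it assumes the status lookup returns a plain dict with a status field equal to "failed" and a status_description field, which may not match every SDK release.

```python
def find_failed_documents(get_status, doc_ids):
    """Return {document_id: status_description} for documents reported as failed.

    get_status is any callable that maps a document_id to a dict.
    The "status" and "status_description" keys are assumptions about
    the shape of the service response.
    """
    failed = {}
    for doc_id in doc_ids:
        info = get_status(doc_id)
        if info.get("status") == "failed":
            failed[doc_id] = info.get("status_description", "")
    return failed
```

With the SDK this might be called as find_failed_documents(lambda d: discovery.get_document(env_id, col_id, d), doc_ids), assuming get_document returns a dict.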

My code for ingesting documents is based on the WDS documentation and looks like this:

with open(f_path) as file_data:
    if f_path.endswith('.doc') or f_path.endswith('.docx'):
        re = discovery.add_document(env_id, col_id, file_info=file_data, mime_type='application/msword')
    else:
        re = discovery.add_document(env_id, col_id, file_info=file_data)

Because my corpus is relatively large, around 3 GB, I receive Service is busy processing... responses from the discovery.add_document(env_id, col_id, file_info=file_info) call, in which case I call sleep(5) and retry.
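The sleep-and-retry step could look something like the sketch below. Note that it swaps the fixed sleep(5) for exponential backoff, which is a variation and not what the code above does. call_with_retry and is_busy are hypothetical names, and how a "busy" response is detected (return value or exception) is left to the caller, since that depends on the SDK version.

```python
import time


def call_with_retry(func, is_busy, max_attempts=5, base_delay=5.0, sleep=time.sleep):
    """Call func() and retry with exponential backoff while the result looks busy.

    func     -- zero-argument callable, e.g. a wrapped add_document call
    is_busy  -- caller-supplied predicate that recognises a "busy" result
    sleep    -- injectable for testing; defaults to time.sleep
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        result = func()
        if not is_busy(result):
            return result
        if attempt < max_attempts:
            sleep(delay)
            delay *= 2  # double the wait before the next attempt
    raise RuntimeError("still busy after %d attempts" % max_attempts)
```

Capping the number of attempts keeps a persistently busy service from stalling the whole batch on a single file.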

I have exhausted the WDS documentation without any luck. How can I get more insight into why these files fail to be ingested?

score 2 · Accepted Answer

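You should be able to use the https://watson-api-explorer.mybluemix.net/apis/discovery-v1#!/Queries/queryNotices API to see the errors/warnings that occurred during ingestion, along with details that may provide more information about why the ingestion failed.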

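Unfortunately, at the time of this posting, the python SDK does not appear to have a method wrapping this API, so you can either use the Watson Discovery Tooling or query the API directly with curl (replacing the values in {} with the ones specific to your collection):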

curl -u "{username}:{password}" "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}/notices?version=2017-01-01"
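If you would rather stay in python, the same request can be assembled with the standard library. This is a minimal sketch mirroring the curl call above: build_notices_request is a hypothetical helper, and it assumes the service accepts HTTP Basic auth, which is what curl -u sends.

```python
import base64
import urllib.request
from urllib.parse import quote, urlencode

BASE_URL = "https://gateway.watsonplatform.net/discovery/api/v1"


def build_notices_request(username, password, environment_id, collection_id,
                          version="2017-01-01"):
    """Build (but do not send) a GET request for a collection's notices."""
    url = "%s/environments/%s/collections/%s/notices?%s" % (
        BASE_URL,
        quote(environment_id, safe=""),
        quote(collection_id, safe=""),
        urlencode({"version": version}),
    )
    # Same credentials header that curl -u "user:pass" would produce.
    credentials = ("%s:%s" % (username, password)).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": "Basic " + token})
```

Passing the returned request to urllib.request.urlopen and decoding the body with json.load should then give the notices as a dict.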

score 1 · Accepted Answer

The python-sdk now supports querying notices.

from watson_developer_cloud import DiscoveryV1

discovery = DiscoveryV1(
    version='2017-10-16',
    # url is optional, and defaults to the URL below. Use the correct URL for your region.
    url='https://gateway.watsonplatform.net/discovery/api',
    iam_api_key='your_api_key')
discovery.federated_query_notices('env_id', ['collection_id'])
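Once you have the notices response, grouping it per document makes the failures easier to scan. The sketch below is only illustrative: notices_by_document is a hypothetical helper, and the response shape (a results list whose items carry a notices list with document_id, severity and description) is an assumption that should be checked against what the service actually returns.

```python
from collections import defaultdict


def notices_by_document(response):
    """Group (severity, description) pairs by document id.

    Assumes response is a dict shaped like:
    {"results": [{"id": ..., "notices": [{"document_id": ..., ...}]}]}
    """
    grouped = defaultdict(list)
    for result in response.get("results", []):
        for notice in result.get("notices", []):
            # Fall back to the result id if the notice has no document_id.
            doc_id = notice.get("document_id", result.get("id"))
            grouped[doc_id].append((notice.get("severity"), notice.get("description")))
    return dict(grouped)
```

Comparing the keys of the returned dict against the document_ids that reported a failed status should show which failures came with an explanation.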
