ibm-cloud - Watson Retrieve and Rank/Discovery Service always returns the table of contents with the high(est) score

Question

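Background:

I am using the Watson Retrieve and Rank or Discovery Service to retrieve information from user manuals. I trained it with a sample washing machine manual in PDF format. My goal is to get the best passage from the document in which a given natural-language string occurs (e.g. "Positioning the drain hose"). This generally works.

My problem is that the table of contents is almost always the highest-scoring passage. As a result, the first result is just the table of contents rather than the relevant text passage (see the example results).

"Wrong" result (table of contents):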

Unpacking the washing machine ----------------------------------------------------2 Overview of the washing machine --------------------------------------------------2 Selecting a location -------------------------------------------------------------------- 3 Adjusting the leveling feet ------------------------------------------------------------3 Removing the shipping bolts --------------------------------------------------------3 Connecting the water supply hose ------------------------------------------------- 3 Positioning the drain hose ----------------------------------------------------------- 4 Plugging in the machine

"Correct" result:

Positioning the drain hose The end of the drain hose may be positioned in three ways: Over the edge of a sink The drain hose must be placed at a height of between 60 and 90 cm. To keep the drain hose spout bent, use the supplied plastic hose

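Possible solutions

Ignore the table of contents during training
Use an offset parameter, e.g. ignore the first 3 results
Detect whether a result is part of the table of contents and ignore it if so

These approaches are static and do not work across multiple documents with varying structures (table of contents at the beginning, at the end, no table of contents at all, ...).

Does anyone have an idea for handling this better?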

score 0 · Accepted Answer

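At this time, passage retrieval results are not affected by relevance training. Because passage retrieval always searches the entire corpus, the only reliable way to keep the table of contents out of the passage retrieval results is, unfortunately, to remove the table of contents from the documents.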
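If removing the table of contents at the source is not practical, one possible workaround is to filter the returned passages on the client side. The sketch below is a heuristic, not a feature of the service: it treats a passage as table-of-contents text when it contains several "leader" runs (dashes or dots followed by a page number), as in the "wrong" result quoted in the question. The function names, the `min_leaders` threshold, and the assumption that each passage is a dict with a `passage_text` key are illustrative and should be adapted to the actual response format.

```python
import re

# A "leader" is a run of dashes or dots followed by a page number,
# e.g. "-----------2" or "........ 14". Pattern and threshold are assumptions.
LEADER = re.compile(r"[-.]{4,}\s*\d+")


def looks_like_toc(passage_text, min_leaders=3):
    """Return True if the text contains at least `min_leaders` leader runs."""
    return len(LEADER.findall(passage_text)) >= min_leaders


def drop_toc_passages(passages, text_key="passage_text"):
    """Keep only passages whose text does not look like a table of contents.

    `passages` is assumed to be a list of dicts; `text_key` names the field
    holding the passage text (hypothetical default, adjust as needed).
    """
    return [p for p in passages if not looks_like_toc(p[text_key])]
```

Because the check looks at the passage content rather than its position, it does not depend on where the table of contents sits in a document. It will still miss tables of contents that use no leaders, so it complements rather than replaces removing them before ingestion.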
