rasa-nlu - How to handle spelling mistakes (typos) in Rasa NLU entity extraction?

Question

My training set (the nlu_data.md file) has a few intents, and each intent has a sufficient number of training examples. Here is one example:

##intent: SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai

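I have added several sentences like this. During testing, every sentence from the training file works fine. However, if an input query contains a spelling mistake, e.g. hotol/hetel/hotele for the keyword hotel, Rasa NLU fails to extract it as an entity.

I want to solve this problem. I am only allowed to change the training data, and I cannot write any custom component for this.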

score 1 · Accepted Answer

To handle spelling mistakes like this in entities, you should add such examples to your training data. Something like this:

##intent: SEARCH_HOTEL
 - find good [hotel](place) for me in Mumbai 
 - looking for a [hotol](place) in Chennai
 - [hetel](place) in Berlin please

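Once you have added enough examples, the model should be able to generalize from the sentence structure.

If you are not using it already, it also makes sense to use the character-level CountVectorsFeaturizer. It should already be part of the default pipeline described on this page.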
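
To see why character-level features help with typos, here is a small pure-Python sketch. It is not Rasa's featurizer, only an illustration of the idea: a misspelled word still shares most of its character n-grams with the correct word, while an unrelated word shares none. The function names and the n-gram range (2 to 4) are my own choices for the example.

```python
def char_ngrams(word, n_min=2, n_max=4):
    """Return the set of character n-grams of a word.

    The word is lowercased and padded with one space on each side so that
    n-grams at the word boundaries are distinguishable.
    """
    padded = f" {word.lower()} "
    return {
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    }


def jaccard(a, b):
    """Jaccard similarity between two non-empty sets."""
    return len(a & b) / len(a | b)


target = char_ngrams("hotel")
for word in ["hotol", "hetel", "hotele", "Mumbai"]:
    # Typos of "hotel" score well above zero; "Mumbai" shares no n-grams.
    print(word, round(jaccard(target, char_ngrams(word)), 2))
```

A model that sees these n-grams as features gets a similar signal for "hotol" and "hotel", which is what lets it generalize to typos it has never seen.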

score 0 · Accepted Answer

It's a strange request that they ask you not to change the code or write custom components.

The approach you would have to take would be to use entity synonyms. A slight edit on a previous answer:

 ##intent: SEARCH_HOTEL
 - find good [hotel](place) for me in Mumbai 
 - looking for a [hotol](place:hotel) in Chennai
 - [hetel](place:hotel) in Berlin please

This way even if the user enters a typo, the correct entity will be extracted. If you want this to be foolproof, I do not recommend hand-editing the intents. Use some kind of automated tool for generating the training data. E.g. Generate misspelled words (typos)
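
As a rough idea of what such automated generation could look like, here is a minimal sketch. It is not any particular tool: `make_typos` and `training_lines` are hypothetical helpers that apply one random edit (deletion, substitution, transposition or insertion) to the entity value and print lines in the synonym format shown above.

```python
import random

OPS = ["delete", "substitute", "transpose", "insert"]
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def make_typos(word, n=3, seed=0):
    """Return up to n distinct misspellings of `word`, each one edit away."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    typos = set()
    attempts = 0
    while len(typos) < n and attempts < 100:
        attempts += 1
        i = rng.randrange(len(word))
        op = rng.choice(OPS)
        if op == "transpose" and i == len(word) - 1:
            continue  # no right-hand neighbour to swap with
        if op == "delete":
            candidate = word[:i] + word[i + 1:]
        elif op == "substitute":
            candidate = word[:i] + rng.choice(LETTERS) + word[i + 1:]
        elif op == "transpose":
            candidate = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        else:  # insert a random letter before position i
            candidate = word[:i] + rng.choice(LETTERS) + word[i:]
        # Skip empty strings and edits that reproduce the original word.
        if candidate and candidate != word:
            typos.add(candidate)
    return sorted(typos)


def training_lines(template, value, entity, n=3):
    """Fill `template` (which contains "{}") with synonym-annotated typos."""
    return [
        "- " + template.format(f"[{typo}]({entity}:{value})")
        for typo in make_typos(value, n)
    ]


for line in training_lines("looking for a {} in Chennai", "hotel", "place"):
    print(line)
```

Random edits produce some unrealistic misspellings, so you would probably want to review the generated lines or weight the edits toward keyboard-adjacent letters before adding them to the training file.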

score 0 · Accepted Answer

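First, add examples of the most common typos for your entities, as suggested here.

Beyond that, you need a spell checker.

I am not sure whether there is a single library that can be used in the pipeline; if there is not, you need to create a custom component. Otherwise, working with the training data alone is not feasible: you cannot create an example for every possible typo. Using Fuzzywuzzy is one option, though it is generally slow and does not solve every problem. Universal Encoder is another solution. There should be more options for spelling correction, but either way you will need to write code.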
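
To give a feel for what the spell-correction step might look like, here is a minimal sketch that uses `difflib` from the Python standard library in place of Fuzzywuzzy. The vocabulary, the cutoff of 0.75 and the function names are assumptions for the example; in a real bot this logic would run on the user message before it reaches the NLU pipeline, or inside a custom component.

```python
import difflib

# Hypothetical vocabulary of entity keywords the bot should recognise.
KNOWN_WORDS = ["hotel", "restaurant", "airport"]


def correct_token(token, vocabulary=KNOWN_WORDS, cutoff=0.75):
    """Return the closest vocabulary word if it is similar enough,
    otherwise return the token unchanged."""
    matches = difflib.get_close_matches(
        token.lower(), vocabulary, n=1, cutoff=cutoff
    )
    return matches[0] if matches else token


def correct_text(text):
    """Correct every whitespace-separated token of a user message."""
    return " ".join(correct_token(tok) for tok in text.split())


print(correct_text("find good hotol for me in Mumbai"))
```

The cutoff is the trade-off to tune: set it too low and ordinary words get rewritten into entity keywords, set it too high and real typos slip through.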

score 0 · Accepted Answer

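One thing I strongly recommend is using lookup tables with fuzzy matching. If you have a limited number of entities (such as country names), lookup tables are very fast, and fuzzy matching catches typos whenever the entity exists in your lookup table (it searches for spelling variants of those entities). There is a whole blog post about it: on Rasa. There is a working implementation of Fuzzywuzzy as a custom component: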

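import json
import os

from fuzzywuzzy import process
from nltk.corpus import stopwords
# Assumption: Rasa 1.x import path (older releases used rasa_nlu.components)
from rasa.nlu.components import Component

# Assumption: the original only says STOP_WORDS comes from NLTK
STOP_WORDS = set(stopwords.words("english"))


class FuzzyExtractor(Component):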
    name = "FuzzyExtractor"
    provides = ["entities"]
    requires = ["tokens"]
    defaults = {}
    language_list = ["en"]
    threshold = 90

    def __init__(self, component_config=None, *args):
        super(FuzzyExtractor, self).__init__(component_config)

    def train(self, training_data, cfg, **kwargs):
        pass

    def process(self, message, **kwargs):

        entities = list(message.get('entities'))

        # Get file path of lookup table in json format
        cur_path = os.path.dirname(__file__)
        if os.name == 'nt':
            partial_lookup_file_path = '..\\data\\lookup_master.json'
        else:
            partial_lookup_file_path = '../data/lookup_master.json'
        lookup_file_path = os.path.join(cur_path, partial_lookup_file_path)

        with open(lookup_file_path, 'r') as file:
            lookup_data = json.load(file)['data']

            tokens = message.get('tokens')

            for token in tokens:

                # STOP_WORDS is just a dictionary of stop words from NLTK
                if token.text not in STOP_WORDS:

                    fuzzy_results = process.extract(
                                             token.text, 
                                             lookup_data, 
                                             processor=lambda a: a['value'] 
                                                 if isinstance(a, dict) else a, 
                                             limit=10)

                    for result, confidence in fuzzy_results:
                        if confidence >= self.threshold:
                            entities.append({
                                "start": token.offset,
                                "end": token.end,
                                "value": token.text,
                                "fuzzy_value": result["value"],
                                "confidence": confidence,
                                "entity": result["entity"]
                            })

        message.set("entities", entities, add_to_output=True)

I did not implement this myself, though; it was implemented and validated here: Rasa forum. After that, you just add it to the NLU pipeline in your config.yml file.
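
Judging only from the code above, the component expects `lookup_master.json` to hold a top-level `data` list whose items have `value` and `entity` keys. The sketch below writes and reads back such a file; the entries are made up, and it writes to a temporary directory instead of the `../data/` folder the component reads from.

```python
import json
import os
import tempfile

# Hypothetical entries; the "data", "value" and "entity" keys mirror what
# FuzzyExtractor.process() reads from the lookup file.
lookup = {
    "data": [
        {"value": "hotel", "entity": "place"},
        {"value": "restaurant", "entity": "place"},
    ]
}

path = os.path.join(tempfile.mkdtemp(), "lookup_master.json")
with open(path, "w") as f:
    json.dump(lookup, f, indent=2)

# Read it back the same way the component does.
with open(path) as f:
    loaded = json.load(f)["data"]

print(loaded[0]["value"], loaded[0]["entity"])
```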
