我强烈建议您使用的一件事是使用带有模糊匹配的查找表。如果您的实体数量有限(如国家名称),查找表会非常快,并且当您的查找表中存在该实体时,模糊匹配会捕获拼写错误(搜索这些实体的拼写变体)。这里有一整篇关于它的博文:在 Rasa 上。有一个模糊 wuzzy 作为自定义组件的工作实现:
class FuzzyExtractor(Component):
name = "FuzzyExtractor"
provides = ["entities"]
requires = ["tokens"]
defaults = {}
language_list ["en"]
threshold = 90
def __init__(self, component_config=None, *args):
super(FuzzyExtractor, self).__init__(component_config)
def train(self, training_data, cfg, **kwargs):
pass
def process(self, message, **kwargs):
entities = list(message.get('entities'))
# Get file path of lookup table in json format
cur_path = os.path.dirname(__file__)
if os.name == 'nt':
partial_lookup_file_path = '..\\data\\lookup_master.json'
else:
partial_lookup_file_path = '../data/lookup_master.json'
lookup_file_path = os.path.join(cur_path, partial_lookup_file_path)
with open(lookup_file_path, 'r') as file:
lookup_data = json.load(file)['data']
tokens = message.get('tokens')
for token in tokens:
# STOP_WORDS is just a dictionary of stop words from NLTK
if token.text not in STOP_WORDS:
fuzzy_results = process.extract(
token.text,
lookup_data,
processor=lambda a: a['value']
if isinstance(a, dict) else a,
limit=10)
for result, confidence in fuzzy_results:
if confidence >= self.threshold:
entities.append({
"start": token.offset,
"end": token.end,
"value": token.text,
"fuzzy_value": result["value"],
"confidence": confidence,
"entity": result["entity"]
})
file.close()
message.set("entities", entities, add_to_output=True)
但我没有实现它,它是在这里实现和验证的:Rasa forum
然后你只需将它传递到 config.yml 文件中的 NLU 管道。