Basically, suppose I have a glossary of phrases:
- University of Texas Dallas
- University of Tokyo
- University of Toronto
and suppose I have 3 documents:
- doc1: I study at University of Texas Dallas and its awsome
- doc2: I study at University of Tokyo and its awsome
- doc3: I study at University of Toronto and its awsome
With a whitespace tokenizer (plus lowercasing), the following tokens are recognized in the index:
- doc1: ["i", "study", "at", "university", "of", "texas", "dallas", "and", "its", "awsome"]
- doc2: ["i", "study", "at", "university", "of", "tokyo", "and", "its", "awsome"]
- doc3: ["i", "study", "at", "university", "of", "toronto", "and", "its", "awsome"]
However, since I have a known glossary, I want to tokenize by phrase and get the following instead:
- doc1: ["i", "study", "at", "university of texas dallas", "and", "its", "awsome"]
- doc2: ["i", "study", "at", "university of tokyo", "and", "its", "awsome"]
- doc3: ["i", "study", "at", "university of toronto", "and", "its", "awsome"]
Given a list of glossary phrases, how can I implement phrase tokenization?
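
To make the desired behavior concrete, here is a minimal sketch in plain Python of one possible approach: split on whitespace, then greedily merge the longest glossary phrase that starts at each position. It is not tied to any particular search engine or analyzer, and the names `build_phrase_index` and `phrase_tokenize` are hypothetical helpers made up for this illustration. It also assumes that matching should be case-insensitive and that punctuation is not an issue.

```python
def build_phrase_index(phrases):
    """Map the first token of each phrase to its token tuples, longest first."""
    index = {}
    for phrase in phrases:
        tokens = tuple(phrase.lower().split())
        if tokens:
            index.setdefault(tokens[0], []).append(tokens)
    for candidates in index.values():
        # Longest phrase first, so the greedy match prefers it.
        candidates.sort(key=len, reverse=True)
    return index


def phrase_tokenize(text, index):
    """Whitespace-tokenize text, merging known glossary phrases into one token."""
    words = text.lower().split()
    tokens = []
    i = 0
    while i < len(words):
        for candidate in index.get(words[i], []):
            if tuple(words[i:i + len(candidate)]) == candidate:
                tokens.append(" ".join(candidate))
                i += len(candidate)
                break
        else:
            # No glossary phrase starts here: keep the single word.
            tokens.append(words[i])
            i += 1
    return tokens


GLOSSARY = [
    "University of Texas Dallas",
    "University of Tokyo",
    "University of Toronto",
]
INDEX = build_phrase_index(GLOSSARY)

print(phrase_tokenize("I study at University of Texas Dallas and its awsome", INDEX))
```

Because candidates are tried longest first, a phrase that is a prefix of a longer glossary entry would only match when the longer one does not.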