Basically, suppose I have a vocabulary of phrases:

- University of Texas Dallas
- University of Tokyo
- University of Toronto

And suppose I have 3 documents:

- doc1: I study at University of Texas Dallas and its awsome 
- doc2: I study at University of Tokyo and its awsome            
- doc3: I study at University of Toronto and its awsome

With a whitespace tokenizer (plus lowercasing), the following tokens end up in the index:

- doc1: ["i", "study", "at", "university", "of", "texas", "dallas", "and", "its", "awsome"]
- doc2: ["i", "study", "at", "university", "of", "tokyo", "and", "its", "awsome"]
- doc3: ["i", "study", "at", "university", "of", "toronto", "and", "its", "awsome"]

However, since I have a known vocabulary, I would like to tokenize by phrase and get the following instead:

- doc1: ["i", "study", "at", "university of texas dallas", "and", "its", "awsome"]
- doc2: ["i", "study", "at", "university of tokyo", "and", "its", "awsome"]
- doc3: ["i", "study", "at", "university of toronto", "and", "its", "awsome"]

Given a list of phrases in a vocabulary, how can I implement phrase tokenization?
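
To make the desired behavior concrete, here is a minimal sketch in plain Python of what I mean. It assumes greedy longest-match over lowercased whitespace tokens, which is only one possible strategy and not tied to any particular search engine or library. The names `phrase_tokenize` and `PHRASES` are hypothetical, made up for this illustration.

```python
def phrase_tokenize(text, phrases):
    """Whitespace-tokenize `text`, merging token runs that match a vocabulary phrase."""
    # Index phrases as lowercase token tuples; skip blank entries.
    vocab = {tuple(p.lower().split()) for p in phrases if p.strip()}
    max_len = max((len(p) for p in vocab), default=1)

    words = text.lower().split()
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest candidate first so longer phrases win over shorter ones.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = tuple(words[i:i + n])
            # A single word is always accepted as a fallback.
            if n == 1 or candidate in vocab:
                tokens.append(" ".join(candidate))
                i += n
                break
    return tokens


# Hypothetical vocabulary, taken from the phrases listed in the question.
PHRASES = [
    "University of Texas Dallas",
    "University of Tokyo",
    "University of Toronto",
]

if __name__ == "__main__":
    print(phrase_tokenize("I study at University of Tokyo and its awsome", PHRASES))
```

This works for a small vocabulary, but it rescans up to `max_len` words at every position, so I am looking for the idiomatic way to do this at index time rather than a hand-rolled loop.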
