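Is there a way to keep the punctuation marks !, ?, " and ' from my text documents when using scikit-learn's CountVectorizer or TfidfVectorizer, for example through one of their parameters?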
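You should customize the token_pattern parameter when instantiating the vectorizer. The default pattern, (?u)\b\w\w+\b, only matches runs of two or more word characters, so punctuation is treated as a separator and dropped. Adding the punctuation marks as alternatives keeps them as tokens. For example: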
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(token_pattern=r"(?u)\b\w\w+\b|!|\?|\"|\'")
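To check what the customized pattern actually keeps, here is a minimal sketch that runs the vectorizer's analyzer on a couple of sample strings. The sample documents and the variable names (pattern, docs, analyze, tokens, vocab) are made up for illustration and are not from the original answer.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same pattern as in the answer: words of 2+ characters, or one of ! ? " '
pattern = r"(?u)\b\w\w+\b|!|\?|\"|\'"
vect = CountVectorizer(token_pattern=pattern)

# Hypothetical sample documents, for illustration only.
docs = ['Wow! Is this "real"?', "It's fine."]

# build_analyzer() returns the callable the vectorizer uses to
# preprocess (lowercase by default) and tokenize a single document.
analyze = vect.build_analyzer()
tokens = analyze(docs[0])
print(tokens)  # ['wow', '!', 'is', 'this', '"', 'real', '"', '?']

# Fitting builds the vocabulary; the punctuation marks become features.
X = vect.fit_transform(docs)
vocab = sorted(vect.vocabulary_)
print(vocab)
```

Note that the period in the second document is still dropped, because it is not listed in the pattern, and the lone "s" after the apostrophe is dropped because the word part of the pattern requires at least two characters. TfidfVectorizer accepts the same token_pattern parameter, so the same approach should carry over.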