machine-learning - How are the Vocab and integer (one-hot) representations stored, and what does the ('string', int) tuple mean in torchtext.vocab()?

Question

I am trying to train an RNN for binary classification. My vocabulary consists of 1000000 words; please see the output below...

text_field = torchtext.data.Field(tokenize=word_tokenize)

print(text_field.vocab.freqs.most_common(15))
>>
[('.', 516822), (',', 490533), ('the', 464796), ('to', 298670), ("''", 264416), ('of', 226307), ('I', 224927), ('and', 215722), ('a', 211773), ('is', 180965), ('you', 180359), ('``', 165889), ('that', 156425), ('in', 138038), (':', 132294)]

print(text_field.vocab.itos[:15])
>>
['<unk>', '<pad>', '.', ',', 'the', 'to', "''", 'of', 'I', 'and', 'a', 'is', 'you', '``', 'that']

text_field.vocab.stoi
>>
{'<unk>': 0,'<pad>': 1,'.': 2,',': 3,'the': 4,'to': 5,"''": 6,'of': 7,'I': 8,'and': 9,'a': 10, 'is': 11,'you': 12,'``': 13,'that': 14,'in': 15,....................

The documentation says:

freqs – A collections.Counter object holding the frequencies of tokens in the data used to build the Vocab.
stoi – A collections.defaultdict instance mapping token strings to numerical identifiers.
itos – A list of token strings indexed by their numerical identifiers.

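I cannot make sense of this.

Can someone explain what these are by giving the intuition behind each one?

For example, if "the" is represented by 4, does that mean that if a sentence contains the word "the",

position 4 will be 1? Or
position 464796 will be 1? Or
position 464796 will be 4?

What happens when "the" occurs more than once?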

score 1 · Accepted Answer

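If "the" is represented by 4, then this means:

itos[4] is "the"
stoi["the"] is 4
there is a tuple ('the', <count>) somewhere in freqs, where count is the number of times 'the' appears in the input text. That count has nothing to do with its numerical identifier 4.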
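
To make the three structures concrete, here is a minimal plain-Python sketch that mimics them on a toy corpus. It is not torchtext's actual implementation: the corpus, the ordering rule (most frequent first, ties broken alphabetically), and the use of a plain dict with .get() in place of the documented defaultdict are all assumptions made for illustration. The one-hot step at the end shows only one possible way to use the ids; whether a model uses one-hot vectors or an embedding lookup depends on the model.

```python
from collections import Counter

# Toy corpus (hypothetical); the real one would be the tokenized training data.
tokens = "the cat sat on the mat . the end .".split()

# freqs: token -> number of occurrences in the corpus.
freqs = Counter(tokens)

# itos: list of tokens, so that itos[i] is the token with id i.
# Special tokens come first, then tokens by descending frequency
# (ties broken alphabetically; an assumption for this sketch).
specials = ["<unk>", "<pad>"]
by_freq = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
itos = specials + [tok for tok, _ in by_freq]

# stoi: the inverse mapping, token -> id.
stoi = {tok: idx for idx, tok in enumerate(itos)}

# Numericalize a sentence: every token is replaced by its id.
# Unknown tokens fall back to the id of "<unk>".
sentence = "the dog sat on the mat".split()
ids = [stoi.get(tok, stoi["<unk>"]) for tok in sentence]

# Optional one-hot view: one vector per position in the sentence,
# each of length len(itos), with a single 1 at index ids[position].
one_hot = [[1 if i == idx else 0 for i in range(len(itos))] for idx in ids]

print(freqs.most_common(2))
print(itos)
print(ids)
```

In this sketch "the" gets id 2, and because it occurs twice in the sentence the id 2 appears twice in ids (and the same one-hot vector appears at both positions). The frequency stored in freqs is used only to decide the ordering of itos; it never appears as an index or as a value in the encoded sentence.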
