I am implementing the Skipgram model in both PyTorch and TensorFlow 2, and I have a question about the implementation of subsampling of frequent words. Taken verbatim from the paper, the probability of subsampling word w_i is computed as

P(w_i) = 1 - sqrt(t / f(w_i))

where t is a custom threshold (usually a small value such as 0.0001) and f is the frequency of the word in the document. Although the authors implemented it in a different, almost equivalent way, let's stick with this definition.

When computing P(w_i), we can get negative values: whenever f(w_i) < t, the square root is greater than 1. For example, suppose we have 100 words, and one of them occurs far more often than the others (as is the case in my dataset).
import numpy as np
import seaborn as sns
np.random.seed(12345)
# generate counts in [1, 20)
counts = np.random.randint(low=1, high=20, size=99)
# add one vastly larger count
counts = np.insert(counts, 0, 100000)
# compute frequencies
f = counts/counts.sum()
# define threshold as in paper
t = 0.0001
# compute probabilities as in paper
probs = 1 - np.sqrt(t/f)
sns.histplot(probs);
Question: what is the correct way to implement subsampling using these "probabilities"?
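One interpretation I can think of is to make a per-token keep/drop decision, clamping negative values to zero so that rare words are never dropped (the clamping is my assumption, not something the paper states; the subsample helper and the freq mapping below are names I made up for the sketch):

import numpy as np

rng = np.random.default_rng(12345)

def subsample(tokens, freq, t=1e-4):
    """Drop each token with probability P(w) = 1 - sqrt(t / f(w)),
    clamped at 0 so that rare words (f(w) < t) are always kept."""
    kept = []
    for w in tokens:
        # discard probability as in the paper, clamped to [0, 1)
        p_drop = max(0.0, 1.0 - np.sqrt(t / freq[w]))
        if rng.random() >= p_drop:
            kept.append(w)
    return kept

Is this the intended reading, or is something else standard?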
As additional information, I have seen that in Keras the function keras.preprocessing.sequence.make_sampling_table takes a different approach:
def make_sampling_table(size, sampling_factor=1e-5):
    """Generates a word rank-based probabilistic sampling table.

    Used for generating the `sampling_table` argument for `skipgrams`.
    `sampling_table[i]` is the probability of sampling
    the i-th most common word in a dataset
    (more common words should be sampled less frequently, for balance).

    The sampling probabilities are generated according
    to the sampling distribution used in word2vec:

    ```
    p(word) = (min(1, sqrt(word_frequency / sampling_factor) /
        (word_frequency / sampling_factor)))
    ```

    We assume that the word frequencies follow Zipf's law (s=1) to derive
    a numerical approximation of frequency(rank):

    `frequency(rank) ~ 1/(rank * (log(rank) + gamma) + 1/2 - 1/(12*rank))`

    where `gamma` is the Euler-Mascheroni constant.

    # Arguments
        size: Int, number of possible words to sample.
        sampling_factor: The sampling factor in the word2vec formula.

    # Returns
        A 1D Numpy array of length `size` where the ith entry
        is the probability that a word of rank i should be sampled.
    """
    gamma = 0.577
    rank = np.arange(size)
    rank[0] = 1
    inv_fq = rank * (np.log(rank) + gamma) + 0.5 - 1. / (12. * rank)
    f = sampling_factor * inv_fq
    return np.minimum(1., f / np.sqrt(f))
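For context, here is how I understand this table is meant to be used: it is passed to keras.preprocessing.sequence.skipgrams, which keeps a center word with probability sampling_table[word_index]. A usage sketch (the sequence below is made-up toy data):

from keras.preprocessing.sequence import make_sampling_table, skipgrams

vocab_size = 1000
# sampling_table[i] = keep-probability for the word of rank i
sampling_table = make_sampling_table(vocab_size)
# toy sequence of rank-ordered word indices (index 0 is reserved in Keras)
sequence = [1, 2, 1, 5, 20, 1, 3]
pairs, labels = skipgrams(sequence, vocab_size,
                          window_size=2,
                          sampling_table=sampling_table)

So Keras works with keep-probabilities estimated from rank via Zipf's law, while the paper's formula gives discard-probabilities from observed frequencies, which is part of what confuses me.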