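I am classifying texts based on word occurrence. One of the steps is to estimate, for each possible class, the probability of a particular text. To do this, I have NSAMPLES texts drawn from a vocabulary of NFEATURES words, each labeled with one of NLABELS class labels. From these I construct a binary occurrence matrix in which entry (sample, feature) is 1 if and only if the text "sample" contains the word encoded by "feature".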

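From the occurrence matrix we can construct a matrix of conditional probabilities and then smooth it so that no probability is exactly 0.0 or 1.0, using the following code (copied from a Coursera notebook):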

def laplace_smoothing(labels, binary_data, n_classes):
    # Compute the parameter estimates (adjusted fraction of documents in class that contain word)
    n_words = binary_data.shape[1]
    alpha = 1 # parameters for Laplace smoothing
    theta = np.zeros([n_classes, n_words]) # stores parameter values - prob. word given class
    for c_k in range(n_classes): # 0, 1, ..., n_classes - 1
        class_mask = (labels == c_k)
        N = class_mask.sum() # number of articles in class
        theta[c_k, :] = (binary_data[class_mask, :].sum(axis=0) + alpha)/(N + alpha*2)
    return theta
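
To make the smoothing formula concrete, here is a minimal sketch that applies the same expression to a single class by hand. The toy matrix `class_docs` is hypothetical and is not part of the original notebook.

```python
import numpy as np

alpha = 1  # same Laplace smoothing parameter as in laplace_smoothing

# Hypothetical toy input: two documents of one class, three vocabulary words.
class_docs = np.array([[1, 0, 1],
                       [1, 0, 0]])

N = class_docs.shape[0]  # number of documents in the class
# Same expression as inside the loop: (count + alpha) / (N + 2 * alpha)
theta_row = (class_docs.sum(axis=0) + alpha) / (N + 2 * alpha)
# Word counts are 2, 0, 1, so the smoothed values are 3/4, 1/4, 2/4.
```

A word seen in every document gets 0.75 rather than 1.0, and a word never seen gets 0.25 rather than 0.0, which is the intended effect of the smoothing.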

To show the problem, here is code that mocks the input and calls the function:

import numpy as np
import tensorflow_probability as tfp
tfd = tfp.distributions

NSAMPLES = 2000   # Size of corpus
NFEATURES = 10000 # Number of words in vocabulary
NLABELS = 10      # Number of classes
ONE_PROB = 0.02   # Probability that binary_datum will be 1

def mock_binary_data( nsamples, nfeatures, one_prob ):
    binary_data = ( np.random.uniform( 0, 1, ( nsamples, nfeatures ) ) < one_prob ).astype( 'int32' )
    return binary_data

def mock_labels( nsamples, nlabels ):
    labels = np.random.randint( 0, nlabels, nsamples )
    return labels

binary_data = mock_binary_data( NSAMPLES, NFEATURES, ONE_PROB )
labels = mock_labels( NSAMPLES, NLABELS )
smoothed_data = laplace_smoothing( labels, binary_data, NLABELS )

bernoulli = tfd.Independent( tfd.Bernoulli( probs = smoothed_data ), reinterpreted_batch_ndims = 1 )

test_random_data = mock_binary_data( 1, NFEATURES, ONE_PROB )[ 0 ]
bernoulli.prob( test_random_data )

When I run this, I get:

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)>

That is, all of the probabilities are zero. Some step here must be incorrect; can you help me find it?
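
One possible explanation, offered as a hedged sketch rather than a confirmed diagnosis: the joint probability of 10000 independent Bernoulli features is a product of 10000 numbers below 1, which may simply underflow float32 even when every step is correct. The NumPy-only snippet below imitates that situation. The constant 0.025 is an assumed stand-in for a typical smoothed value, not a number taken from the run above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 10000

# Assumption: every smoothed probability is about 0.025 (illustrative only).
theta = np.full(n_features, 0.025, dtype=np.float32)
# Mock test document: each word present with probability 0.02.
x = rng.uniform(size=n_features) < 0.02

# Per-feature Bernoulli likelihood: theta where the word occurs, 1 - theta otherwise.
per_feature = np.where(x, theta, 1 - theta).astype(np.float32)

# Direct float32 product of all per-feature likelihoods.
direct_product = np.prod(per_feature)
# The same quantity accumulated in log space.
log_prob = np.sum(np.log(per_feature.astype(np.float64)))
```

Under these assumptions `direct_product` collapses to zero while `log_prob` stays finite (a large negative number). If that is what happens in the original code, comparing classes with `bernoulli.log_prob(test_random_data)` instead of `bernoulli.prob(test_random_data)` should give usable values.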
