我将给出一个贝叶斯层次模型解决方案。它有一些必须手动设置的参数,但在这些参数方面它非常稳健,如下面的模拟所示。它不仅可以处理单词列表的评分系统,还可以处理输入单词的用户的可能分类。处理可能有点技术性,但最终我们将有一个例程来计算分数作为 3 个数字的函数:列表中的单词数,数据库中完全匹配的单词数,以及部分匹配的人数(如yyyy
)。该例程是在 R 中实现的,但如果您从未使用过它,只需下载解释器,将代码复制并粘贴到它的控制台中,您就会看到此处显示的结果。
顺便说一句,英语不是我的第一语言,所以请耐心等待... :-)
一、型号规格:
有 3 类用户,分别命名为 I、II、III。我们假设每个单词列表都是由单个用户生成的,并且该用户是从一组用户中随机抽取的。我们说这个宇宙是 70% 的 I 类、25% 的 II 类和 5% 的 III 类。当然,这些数字是可以更改的。我们到目前为止
概率[用户=我] = 70%
概率[用户=II] = 25%
概率[用户=III] = 5%
给定用户,我们假设条件独立,即用户不会查看之前的单词来决定他是否会输入有效或无效的单词。
用户I倾向于只给出有效词,用户II只给出无效词,用户III是混合的。所以我们设置
概率[字=确定 | 用户=我] = 99%
概率[字=确定 | 用户=II] = 0.001%
概率[字=确定 | 用户=III] = 50%
给定用户的类别,单词无效的概率是互补的。请注意,我们给出了一个非常小但非零的 II 类用户输入有效单词的概率,因为即使是打字机前的猴子也会最终输入一个有效单词。
模型规范的最后一步是关于数据库。我们假设,对于每个单词,查询可能有 3 个结果:完全匹配、部分匹配(如yyyy
)或不匹配。在概率方面,我们假设
概率[匹配 | 有效] = 98%(并非所有有效词都会被找到)
概率[部分 | 有效] = 0.2%(罕见事件)
概率[匹配 | INvalid] = 0(数据库可能不完整,但没有无效词)
概率[部分 | 无效] = 0.1%(罕见事件)
不必设置找不到单词的概率,因为它们是互补的。就是这样,我们的模型设置好了。
2.符号和目的
我们有一个离散随机变量 U,取 {1, 2, 3} 中的值和两个离散随机向量 W 和 F,每个向量的大小为 n(= 单词数),其中 W_i 如果单词有效则为 1,如果单词有效,则 W_i 为 2如果单词无效,如果在数据库中找到该单词,则 F_i 为 1,如果部分匹配,则为 2,如果未找到,则为 3。
只有向量 F 是可观察的,其他的都是潜在的。使用贝叶斯定理和我们在模型规范中设置的分布,我们可以计算
(a) 概率[用户=我 | F],
即,给定观察到的匹配,用户属于 I 类的后验概率;和
(b) Prob[W=所有有效| F],
即,给定观察到的匹配,所有单词都是有效的后验概率。
根据您的目标,您可以使用一种或另一种作为评分解决方案。例如,如果您有兴趣区分真实用户和计算机程序,则可以使用 (a)。如果您只关心单词列表是否有效,则应使用 (b)。
我将尝试在下一节中简要解释该理论,但这是贝叶斯层次模型中的常用设置。参考文献是 Gelman (2004),“贝叶斯数据分析”。
如果需要,您可以使用代码跳转到第 4 节。
3. 数学
在这种情况下,我将像往常一样使用轻微的符号滥用,写作
p(x|y) 用于 Prob[X=x|Y=y] 和 p(x,y) 用于 Prob[X=x,Y=y]。
目标 (a) 是计算 p(u|f),因为 u=1。使用贝叶斯定理:
p(u|f) = p(u,f)/p(f) = p(f|u)p(u)/p(f)。
p(u) 是给定的。p(f|u) 来自:
p(f|u) = \prod_{i=1}^{n} \sum_{w_i=1}^{2} (p(f_i|w_i)p(w_i|u))
p(f|u) = \prod_{i=1}^{n} p(f_i|u)
= p(f_i=1|u)^(m) p(f_i=2|u)^(p) p(f_i=3)^(nmp)
其中 m = 匹配数, p = 部分匹配数。
p(f) 计算如下:
\sum_{u=1}^{3} p(f|u)p(u)
所有这些都可以直接计算出来。
目标 (b) 由下式给出
p(w|f) = p(f|w)*p(w)/p(f)
在哪里
p(f|w) = \prod_{i=1}^{n} p(f_i|w_i)
p(f_i|w_i) 在模型规范中给出。
p(f) 是上面计算的,所以我们只需要
p(w) = \sum_{u=1}^{3} p(w|u)p(u)
在哪里
p(w|u) = \prod_{i=1}^{n} p(w_i|u)
所以一切都准备好实施了。
4. 守则
代码编写为 R 脚本,常量在开头设置,根据上面讨论的内容,输出由函数给出
(a) p.u_f(u, n, m, p)
和
(b) p.wOK_f(n, m, p)
计算选项(a)和(b)的概率,给定输入:
u = 所需的用户类别(设置为 u=1)
n = 词数
m = 匹配
数 p = 部分匹配数
代码本身:
### Constants:
# User:
# Prob[U=1], Prob[U=2], Prob[U=3]
Prob_user = c(0.70, 0.25, 0.05)
# Words:
# Prob[Wi=OK|U=1,2,3]
Prob_OK = c(0.99, 0.001, 0.5)
Prob_NotOK = 1 - Prob_OK
# Database:
# Prob[Fi=match|Wi=OK], Prob[Fi=match|Wi=NotOK]:
Prob_match = c(0.98, 0)
# Prob[Fi=partial|Wi=OK], Prob[Fi=partial|Wi=NotOK]:
Prob_partial = c(0.002, 0.001)
# Prob[Fi=NOmatch|Wi=OK], Prob[Fi=NOmatch|Wi=NotOK]:
Prob_NOmatch = 1 - Prob_match - Prob_partial
###### First Goal: Probability of being a user type I, given the numbers of matchings (m) and partial matchings (p).
# Prob[Fi=fi|U=u]
#
p.fi_u <- function(fi, u)
{
unname(rbind(Prob_match, Prob_partial, Prob_NOmatch) %*% rbind(Prob_OK, Prob_NotOK))[fi,u]
}
# Prob[F=f|U=u]
#
p.f_u <- function(n, m, p, u)
{
exp( log(p.fi_u(1, u))*m + log(p.fi_u(2, u))*p + log(p.fi_u(3, u))*(n-m-p) )
}
# Prob[F=f]
#
p.f <- function(n, m, p)
{
p.f_u(n, m, p, 1)*Prob_user[1] + p.f_u(n, m, p, 2)*Prob_user[2] + p.f_u(n, m, p, 3)*Prob_user[3]
}
# Prob[U=u|F=f]
#
p.u_f <- function(u, n, m, p)
{
p.f_u(n, m, p, u) * Prob_user[u] / p.f(n, m, p)
}
# Probability user type I for n=1,...,5:
for(n in 1:5) for(m in 0:n) for(p in 0:(n-m))
{
cat("n =", n, "| m =", m, "| p =", p, "| Prob type I =", p.u_f(1, n, m, p), "\n")
}
##################################################################################################
# Second Goal: Probability all words OK given matchings/partial matchings.
p.f_wOK <- function(n, m, p)
{
exp( log(Prob_match[1])*m + log(Prob_partial[1])*p + log(Prob_NOmatch[1])*(n-m-p) )
}
p.wOK <- function(n)
{
sum(exp( log(Prob_OK)*n + log(Prob_user) ))
}
p.wOK_f <- function(n, m, p)
{
p.f_wOK(n, m, p)*p.wOK(n)/p.f(n, m, p)
}
# Probability all words ok for n=1,...,5:
for(n in 1:5) for(m in 0:n) for(p in 0:(n-m))
{
cat("n =", n, "| m =", m, "| p =", p, "| Prob all OK =", p.wOK_f(n, m, p), "\n")
}
5. 结果
这是 n=1,...,5 以及 m 和 p 的所有可能性的结果。例如,如果您有 3 个单词、一个匹配、一个部分匹配和一个未找到,您可以 66.5% 确定它是 I 类用户。在相同的情况下,您可以将 42.8% 的分数归因于所有单词都是有效的。
请注意,选项 (a) 不会对所有匹配的情况给出 100% 的分数,但选项 (b) 会。这是意料之中的,因为我们假设数据库没有无效词,因此如果它们都被找到,那么它们都是有效的。OTOH,II 类或 III 类用户输入所有有效单词的可能性很小,但随着 n 的增加,这种可能性会迅速降低。
(一个)
n = 1 | m = 0 | p = 0 | Prob type I = 0.06612505
n = 1 | m = 0 | p = 1 | Prob type I = 0.8107086
n = 1 | m = 1 | p = 0 | Prob type I = 0.9648451
n = 2 | m = 0 | p = 0 | Prob type I = 0.002062543
n = 2 | m = 0 | p = 1 | Prob type I = 0.1186027
n = 2 | m = 0 | p = 2 | Prob type I = 0.884213
n = 2 | m = 1 | p = 0 | Prob type I = 0.597882
n = 2 | m = 1 | p = 1 | Prob type I = 0.9733557
n = 2 | m = 2 | p = 0 | Prob type I = 0.982106
n = 3 | m = 0 | p = 0 | Prob type I = 5.901733e-05
n = 3 | m = 0 | p = 1 | Prob type I = 0.003994149
n = 3 | m = 0 | p = 2 | Prob type I = 0.200601
n = 3 | m = 0 | p = 3 | Prob type I = 0.9293284
n = 3 | m = 1 | p = 0 | Prob type I = 0.07393334
n = 3 | m = 1 | p = 1 | Prob type I = 0.665019
n = 3 | m = 1 | p = 2 | Prob type I = 0.9798274
n = 3 | m = 2 | p = 0 | Prob type I = 0.7500993
n = 3 | m = 2 | p = 1 | Prob type I = 0.9864524
n = 3 | m = 3 | p = 0 | Prob type I = 0.990882
n = 4 | m = 0 | p = 0 | Prob type I = 1.66568e-06
n = 4 | m = 0 | p = 1 | Prob type I = 0.0001158324
n = 4 | m = 0 | p = 2 | Prob type I = 0.007636577
n = 4 | m = 0 | p = 3 | Prob type I = 0.3134207
n = 4 | m = 0 | p = 4 | Prob type I = 0.9560934
n = 4 | m = 1 | p = 0 | Prob type I = 0.004198015
n = 4 | m = 1 | p = 1 | Prob type I = 0.09685249
n = 4 | m = 1 | p = 2 | Prob type I = 0.7256616
n = 4 | m = 1 | p = 3 | Prob type I = 0.9847408
n = 4 | m = 2 | p = 0 | Prob type I = 0.1410053
n = 4 | m = 2 | p = 1 | Prob type I = 0.7992839
n = 4 | m = 2 | p = 2 | Prob type I = 0.9897541
n = 4 | m = 3 | p = 0 | Prob type I = 0.855978
n = 4 | m = 3 | p = 1 | Prob type I = 0.9931117
n = 4 | m = 4 | p = 0 | Prob type I = 0.9953741
n = 5 | m = 0 | p = 0 | Prob type I = 4.671933e-08
n = 5 | m = 0 | p = 1 | Prob type I = 3.289577e-06
n = 5 | m = 0 | p = 2 | Prob type I = 0.0002259559
n = 5 | m = 0 | p = 3 | Prob type I = 0.01433312
n = 5 | m = 0 | p = 4 | Prob type I = 0.4459982
n = 5 | m = 0 | p = 5 | Prob type I = 0.9719289
n = 5 | m = 1 | p = 0 | Prob type I = 0.0002158996
n = 5 | m = 1 | p = 1 | Prob type I = 0.005694145
n = 5 | m = 1 | p = 2 | Prob type I = 0.1254661
n = 5 | m = 1 | p = 3 | Prob type I = 0.7787294
n = 5 | m = 1 | p = 4 | Prob type I = 0.988466
n = 5 | m = 2 | p = 0 | Prob type I = 0.00889696
n = 5 | m = 2 | p = 1 | Prob type I = 0.1788336
n = 5 | m = 2 | p = 2 | Prob type I = 0.8408416
n = 5 | m = 2 | p = 3 | Prob type I = 0.9922575
n = 5 | m = 3 | p = 0 | Prob type I = 0.2453087
n = 5 | m = 3 | p = 1 | Prob type I = 0.8874493
n = 5 | m = 3 | p = 2 | Prob type I = 0.994799
n = 5 | m = 4 | p = 0 | Prob type I = 0.9216786
n = 5 | m = 4 | p = 1 | Prob type I = 0.9965092
n = 5 | m = 5 | p = 0 | Prob type I = 0.9976583
(二)
n = 1 | m = 0 | p = 0 | Prob all OK = 0.04391523
n = 1 | m = 0 | p = 1 | Prob all OK = 0.836025
n = 1 | m = 1 | p = 0 | Prob all OK = 1
n = 2 | m = 0 | p = 0 | Prob all OK = 0.0008622994
n = 2 | m = 0 | p = 1 | Prob all OK = 0.07699368
n = 2 | m = 0 | p = 2 | Prob all OK = 0.8912977
n = 2 | m = 1 | p = 0 | Prob all OK = 0.3900892
n = 2 | m = 1 | p = 1 | Prob all OK = 0.9861099
n = 2 | m = 2 | p = 0 | Prob all OK = 1
n = 3 | m = 0 | p = 0 | Prob all OK = 1.567032e-05
n = 3 | m = 0 | p = 1 | Prob all OK = 0.001646751
n = 3 | m = 0 | p = 2 | Prob all OK = 0.1284228
n = 3 | m = 0 | p = 3 | Prob all OK = 0.923812
n = 3 | m = 1 | p = 0 | Prob all OK = 0.03063598
n = 3 | m = 1 | p = 1 | Prob all OK = 0.4278888
n = 3 | m = 1 | p = 2 | Prob all OK = 0.9789305
n = 3 | m = 2 | p = 0 | Prob all OK = 0.485069
n = 3 | m = 2 | p = 1 | Prob all OK = 0.990527
n = 3 | m = 3 | p = 0 | Prob all OK = 1
n = 4 | m = 0 | p = 0 | Prob all OK = 2.821188e-07
n = 4 | m = 0 | p = 1 | Prob all OK = 3.046322e-05
n = 4 | m = 0 | p = 2 | Prob all OK = 0.003118531
n = 4 | m = 0 | p = 3 | Prob all OK = 0.1987396
n = 4 | m = 0 | p = 4 | Prob all OK = 0.9413746
n = 4 | m = 1 | p = 0 | Prob all OK = 0.001109629
n = 4 | m = 1 | p = 1 | Prob all OK = 0.03975118
n = 4 | m = 1 | p = 2 | Prob all OK = 0.4624648
n = 4 | m = 1 | p = 3 | Prob all OK = 0.9744778
n = 4 | m = 2 | p = 0 | Prob all OK = 0.05816511
n = 4 | m = 2 | p = 1 | Prob all OK = 0.5119571
n = 4 | m = 2 | p = 2 | Prob all OK = 0.9843855
n = 4 | m = 3 | p = 0 | Prob all OK = 0.5510398
n = 4 | m = 3 | p = 1 | Prob all OK = 0.9927134
n = 4 | m = 4 | p = 0 | Prob all OK = 1
n = 5 | m = 0 | p = 0 | Prob all OK = 5.05881e-09
n = 5 | m = 0 | p = 1 | Prob all OK = 5.530918e-07
n = 5 | m = 0 | p = 2 | Prob all OK = 5.899106e-05
n = 5 | m = 0 | p = 3 | Prob all OK = 0.005810434
n = 5 | m = 0 | p = 4 | Prob all OK = 0.2807414
n = 5 | m = 0 | p = 5 | Prob all OK = 0.9499773
n = 5 | m = 1 | p = 0 | Prob all OK = 3.648353e-05
n = 5 | m = 1 | p = 1 | Prob all OK = 0.001494098
n = 5 | m = 1 | p = 2 | Prob all OK = 0.051119
n = 5 | m = 1 | p = 3 | Prob all OK = 0.4926606
n = 5 | m = 1 | p = 4 | Prob all OK = 0.9710204
n = 5 | m = 2 | p = 0 | Prob all OK = 0.002346281
n = 5 | m = 2 | p = 1 | Prob all OK = 0.07323064
n = 5 | m = 2 | p = 2 | Prob all OK = 0.5346423
n = 5 | m = 2 | p = 3 | Prob all OK = 0.9796679
n = 5 | m = 3 | p = 0 | Prob all OK = 0.1009589
n = 5 | m = 3 | p = 1 | Prob all OK = 0.5671273
n = 5 | m = 3 | p = 2 | Prob all OK = 0.9871377
n = 5 | m = 4 | p = 0 | Prob all OK = 0.5919764
n = 5 | m = 4 | p = 1 | Prob all OK = 0.9938288
n = 5 | m = 5 | p = 0 | Prob all OK = 1