nlp - 用最重的词拆分句子

Question

我正在开发一款游戏，我需要找到特定句子的最大权重。

假设我有句子“the quick brown fox”，并假设只有单个单词具有定义的权重：“the”-> 10、“quick”-> 5、“brown”-> 3、“fox”-> 8

在这种情况下，问题是微不足道的，因为解决方案包括添加每个单词的权重。

现在假设我们还添加了双词，所以除了上面的词，我们还有 "the quick" -> 5, "quick brown" -> 10, "brown fox" -> 1

我想知道单字和双字的哪个组合提供了最大的权重，在这种情况下是“the”、“quick brown”、“fox”

我的问题是，除了明显的蛮力方法之外，还有其他可能的方法来获得解决方案吗？不用说，我正在寻找一些最佳方法来实现更大的句子。

谢谢你。

score 3 · Accepted Answer

You can look at Integer Linear Program libraries like lp_solve. In this case, you will want to maximize the scores, and your objective function will contain the weights. Then you can subject it to constraints, like you cannot have "quick brown" and "brown" at the same time.

For word alignment, this was used in this paper, but your problem is way simpler than that, but you can browse through the paper to get an idea on how ILP was used. There's probably other algorithms other than ILP that can be used to solve this optimally, but ILP can solve it optimally and efficiently for small problems.

score 1 · Accepted Answer

"the" -> 10, "quick" -> 5, "brown" -> 3, "fox" -> 8 对于上面的单个单词，我将取一个数组 [10,5,3,8] 来表示单词0,1,2,3 遍历列表，如果两个分数的组合小于组合分数，例如 10+5 >5 the + quick >the quick 5+3 < 10 quick brown > quick + brown 。标记这个等等

在标记组合解决方案时，沿连续范围标记它们。例如，如果单词分数是 words = [1,2,5,3,1,4,6,2,6,8] 和 [4,6,9,7,8,2,9,1,2] 标记范围（包括两端）是 [0,1],[2,5],[6,7]

伪代码如下

从 0 遍历到 word length - 1

if number not in range :
add word[number] to overall sum.
else:
if length of range = 1 :
    add combined_word_score [ lower_end_number]
 else if length of range = 2 :
    add combined_word_score [ lower_end_number+next number]
else if length of range > 2 and is odd number :
    add max (alternate_score_starting at lower_end_number , 
              word[lower_end]+word[higher_end]+alternate_score_starting at 
                next_number)
else if length of range > 2 and is even number :
     add max (alternate_score_starting at lower_end_number +word[higher_end],
              word[lower_end]+alternate_score_starting at 
                next_number).

score 0 · Accepted Answer

这感觉像是一个动态编程问题。

我可以想象句子的 k 个单词并排放置，每个单词之间都有一个灯泡（即总共 k-1 个灯泡）。如果一个灯泡打开，这意味着它旁边的单词是一个短语的一部分，如果它关闭，它们不是。因此，这些灯泡的任何配置都表示可能的权重组合。当然，许多配置甚至是不可能的，因为我们没有他们需要的短语的任何分数。所以 k-1 个灯泡意味着我们最多有 2^(k-1) 个可能的答案可供我们通过。

我们可以认识到每个计算的某些部分可以重用于其他计算，而不是蛮力强制它，所以对于 (The)(quick)(brown fox ... lazy dog) 和 (The quick)(brown fox 。 ..lazy dog)，我们可以只计算一次 (brown fox ...lazy dog) 的最高分，记住它并在下次看到它时重复使用它而无需做任何额外的工作。

在我们开始之前，我们应该首先去掉只能有 1 个可能值的灯泡（假设我们没有短语 'brown fox' 或任何包含该短语的更大的短语，然后是 'brown 之间的灯泡' 和 'fox' 总是必须关闭）.. 每个移除的灯泡都会将解决方案空间减半。

如果 w1、w2、w3 是单词，那么灯泡就是 w1w2、w2w3、w3w4 等。所以

Optimal(w1w2 w2w3 w3w4 ...) = max(Optimal(w2w3 w3w4 ...) given w1w2 is on, Optimal(w2w3 w3w4 ...) given w1w2 is off)

（警告如果我们遇到无法解决的问题，我们只需返回 MIN_INT 并且事情应该会解决）

我们可以像这样解决问题，但是如果我们对接近灯泡的顺序很聪明，我们可能会节省更多时间。也许先攻击中心灯泡可能会有所帮助。我不确定这部分。

nlp - 用最重的词拆分句子

3 回答 3

Related

Reference