algorithm - Implementing text justification with dynamic programming

Question

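I'm trying to understand the concept of dynamic programming via the MIT OCW course here. The explanation in the OCW video is great, but I feel I won't really understand it until I have implemented the explanation in code. While implementing, I referred to some notes from the lecture notes here, in particular page 3 of the notes.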

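The problem is that I don't know how to translate some of the mathematical notation into code. Here is the part of the solution I have implemented (and which I believe is implemented correctly):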

import math

paragraph = "Some long lorem ipsum text."
words = paragraph.split(" ")

# Count total length for all strings in a list of strings.
# This function will be used by the badness function below.
def total_length(str_arr):
    total = 0

    for string in str_arr:
        total = total + len(string)

    total = total + len(str_arr) # spaces
    return total

# Calculate the badness score for a line of words.
# str_arr is assumed to be passed as words[i:j], as in the notes;
# we don't take i and j as arguments, since that would require
# global variables.
def badness(str_arr, page_width):
    line_len = total_length(str_arr)
    if line_len > page_width:
        return float('nan') 
    else:
        return math.pow(page_width - line_len, 3)

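The part I don't understand is points 3 through 5 in the lecture notes. I really don't understand them and don't know where to start implementing them. So far, I have tried iterating over the list of words and computing the badness of each supposed end of line, like this: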

def justifier(str_arr, page_width):
    paragraph = str_arr
    par_len = len(paragraph)
    result = [] # stores each line as list of strings
    for i in range(0, par_len):
        if i == (par_len - 1):
            result.append(paragraph)
        else:
            dag = [badness(paragraph[i:j], page_width) + justifier(paragraph[j:], page_width) for j in range(i + 1, par_len + 1)] 
            # Should I take min(dag), get its index, and declare that the end of the line?

However, I don't know how to continue the function, and to be honest, I don't understand this line:

dag = [badness(paragraph[i:j], page_width) + justifier(paragraph[j:], page_width) for j in range(i + 1, par_len + 1)]

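and how I would return justifier as an int (since I have already decided to store the return value in result, which is a list). Should I create another function and recurse from there? Should there be any recursion at all?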

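Could you show me what to do next, and explain how this is dynamic programming? I really can't see where the recursion is, or what the subproblems are.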

Thanks in advance.

score 23 · Accepted Answer

In case you have trouble understanding the core idea of dynamic programming itself, here is my take on it:

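Dynamic programming essentially trades space complexity for time complexity (but the extra space you use is usually small compared to the time you save, which makes dynamic programming well worth it if implemented correctly). You store the value of each recursive call as you go (for example, in an array or a dictionary), so that when you run into the same recursive call in another branch of the recursion tree you can avoid computing it a second time.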
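A minimal sketch of the caching idea described above, using Fibonacci numbers as a stand-in problem (an illustrative choice, not part of the question). The names `fib_memo` and `calls` are hypothetical; `calls` only exists to show how few calls are made once results are stored.

```python
def fib_memo(n, memo=None, calls=None):
    """Return the n-th Fibonacci number, caching each result in `memo`."""
    if memo is None:
        memo = {}
    if calls is not None:
        calls.append(n)          # record every call, for illustration only
    if n in memo:
        return memo[n]           # already computed in another branch
    if n < 2:
        result = n
    else:
        result = fib_memo(n - 1, memo, calls) + fib_memo(n - 2, memo, calls)
    memo[n] = result             # store the value for later reuse
    return result


calls = []
print(fib_memo(30, calls=calls))   # value of the 30th Fibonacci number
print(len(calls))                  # number of calls actually made
```

Without the `memo` dictionary the same recursion would recompute the same subproblems exponentially often; with it, the number of calls grows linearly in `n`.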

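And no, you do not have to use recursion. Here is my implementation of the problem you are working on, using only loops. I followed the TextAlignment.pdf linked by AlexSilva very closely. I hope you find it helpful.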

def length(wordLengths, i, j):
    return sum(wordLengths[i- 1:j]) + j - i + 1


def breakLine(text, L):
    # wl = lengths of words
    wl = [len(word) for word in text.split()]

    # n = number of words in the text
    n = len(wl)    

    # total badness of a text l1 ... li
    m = dict()
    # initialization
    m[0] = 0    

    # auxiliary array
    s = dict()

    # the actual algorithm
    for i in range(1, n + 1):
        sums = dict()
        k = i
        while (length(wl, k, i) <= L and k > 0):
            sums[(L - length(wl, k, i))**3 + m[k - 1]] = k
            k -= 1
        m[i] = min(sums)
        s[i] = sums[min(sums)]

    # actually do the splitting by working backwards
    # (lines are printed last-to-first, so "line 1" is the final line)
    line = 1
    while n > 0:
        print("line " + str(line) + ": " + str(s[n]) + "->" + str(n))
        n = s[n] - 1
        line += 1

score 16 · Accepted Answer

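For anyone else who is still interested in this: the key is to move backwards from the end of the text (as mentioned here). If you do so, you only compare elements that have already been memoized.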

Say words is a list of strings to be wrapped according to textwidth. Then, in the notation of the lecture, the task reduces to three lines of code:

import numpy as np

textwidth = 80

DP = [0]*(len(words)+1)

for i in range(len(words)-1,-1,-1):
    DP[i] = np.min([DP[j] + badness(words[i:j],textwidth) for j in range(i+1,len(words)+1)])

and:

def badness(line,textwidth):

    # Number of gaps
    length_line = len(line) - 1

    for word in line:
        length_line += len(word)

    if length_line > textwidth: return float('inf')

    return ( textwidth - length_line )**3

He mentions that one can add a second list to keep track of the break positions. You can do so by changing the code to:

DP = [0]*(len(words)+1)
breaks = [0]*(len(words)+1)

for i in range(len(words)-1,-1,-1):
    temp = [DP[j] + badness(words[i:j],textwidth) for j in range(i+1,len(words)+1)]

    index = np.argmin(temp)

    # Index plus position in upper list
    breaks[i] = index + i + 1
    DP[i] = temp[index]

To recover the text, just use the list of break positions:

def reconstruct_text(words,breaks):

    lines = []
    linebreaks = []

    i = 0 
    while True:

        linebreaks.append(breaks[i])
        i = breaks[i]

        if i == len(words):
            linebreaks.append(0)
            break

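    # The trailing 0 lets linebreaks[-1] act as the start of the first line;
    # stop one short so no empty line is appended at the end.
    for i in range( len(linebreaks) - 1 ):
        lines.append( ' '.join( words[ linebreaks[i-1] : linebreaks[i] ] ).strip() )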

    return lines

Result (text = reconstruct_text(words,breaks)):

Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy
eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam
voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet
clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit
amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam
nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed
diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet
clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.

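One might be tempted to add some extra whitespace. This is pretty tricky (since one can come up with all sorts of aesthetic rules), but a naive attempt might be: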

import re

def spacing(text,textwidth,maxspace=4):

    for i in range(len(text)):

        length_line = len(text[i])

        if length_line < textwidth:

            status_length = length_line
            whitespaces_remain = textwidth - status_length
            Nwhitespaces = text[i].count(' ')

            # If the line has no gaps, or the whitespace (to add) per gap
            # exceeds maxspace, don't do anything.
            if Nwhitespaces == 0 or whitespaces_remain/Nwhitespaces > maxspace-1:pass
            else:
                text[i] = text[i].replace(' ',' '*( 1 + int(whitespaces_remain/Nwhitespaces)) )
                status_length = len(text[i])

                # Periods have highest priority for whitespace insertion
                periods = text[i].split('.')

                # Can we add a whitespace behind each period?
                if len(periods) - 1 + status_length <= textwidth:
                    text[i] = '. '.join(periods).strip()

                status_length = len(text[i])
                whitespaces_remain = textwidth - status_length
                Nwords = len(text[i].split())
                Ngaps = Nwords - 1

                if whitespaces_remain != 0:factor = Ngaps / whitespaces_remain

                # List of whitespaces in line i
                gaps = re.findall(r'\s+', text[i])

                temp = text[i].split()
                for k in range(Ngaps):
                    temp[k] = ''.join([temp[k],gaps[k]])

                for j in range(whitespaces_remain):
                    if status_length >= textwidth:pass
                    else:
                        replace = temp[int(factor*j)]
                        replace = ''.join([replace, " "])
                        temp[int(factor*j)] = replace

                text[i] = ''.join(temp)

    return text

Which gives you (text = spacing(text,textwidth)):

Lorem  ipsum  dolor  sit  amet, consetetur  sadipscing  elitr,  sed  diam nonumy
eirmod  tempor  invidunt  ut labore  et  dolore  magna aliquyam  erat,  sed diam
voluptua.   At  vero eos  et accusam  et justo  duo dolores  et ea  rebum.  Stet
clita  kasd  gubergren,  no  sea  takimata sanctus  est  Lorem  ipsum  dolor sit
amet.   Lorem  ipsum  dolor  sit amet,  consetetur  sadipscing  elitr,  sed diam
nonumy  eirmod  tempor invidunt  ut labore  et dolore  magna aliquyam  erat, sed
diam  voluptua.  At vero eos et accusam et  justo duo dolores et ea rebum.  Stet
clita  kasd gubergren, no sea  takimata sanctus est Lorem  ipsum dolor sit amet.

score 1 · Accepted Answer

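I just watched the lecture and thought I would put down here whatever I could understand. I have written the code in a format similar to the questioner's. I have used recursion here, as the lecture explained.
Point #3 defines the recurrence. This is basically a bottom-up approach, in which you first compute the value of the function for a higher input and then use it to compute the value for lower inputs.
The lecture explains it as:
DP(i) = min(DP(j) + badness(i, j))
for j varying from i+1 to n.
Here, i varies from n down to 0 (bottom to top!).
Since DP(n) = 0,
DP(n-1) = DP(n) + badness(n-1, n)
and then you compute DP(n-2) from DP(n-1) and DP(n) and take the minimum of the candidates.
This way you can go all the way down to i=0, and that is the final answer for the badness!
In point #4, as you can see, there are two loops going on here: one for i, and another inside it for j.
Hence, when i=0, j(max) = n; when i = 1, j(max) = n-1; ... when i = n, j(max) = 0.
Hence total time = the sum of these = n(n+1)/2.
Hence O(n^2).
Point #5 just identifies the solution, which is DP[0]!
Hope this helps!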
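The n(n+1)/2 count above can be checked directly. This is a small sketch, not part of the lecture: `count_inner_iterations` is a hypothetical helper that runs the same two nested loops and counts how many (i, j) pairs are visited. It counts loop iterations only; evaluating badness on a slice does extra work per pair.

```python
def count_inner_iterations(n):
    """Count the (i, j) pairs visited by the two nested DP loops."""
    count = 0
    for i in range(n - 1, -1, -1):      # i goes from n-1 down to 0
        for j in range(i + 1, n + 1):   # j goes from i+1 up to n
            count += 1
    return count


# Compare the measured count with the closed form n(n+1)/2.
for n in (1, 5, 10):
    print(n, count_inner_iterations(n), n * (n + 1) // 2)
```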

import math

justification_map = {}
min_map = {}

def total_length(str_arr):
    total = 0

    for string in str_arr:
        total = total + len(string)

    total = total + len(str_arr) - 1 # spaces
    return total

def badness(str_arr, page_width):
    line_len = total_length(str_arr)
    if line_len > page_width:
        return float('inf')  # inf rather than nan: comparisons with nan are always False, which confuses min()
    else:
        return math.pow(page_width - line_len, 3)

def justify(i, n, words, page_width):
    if i == n:

        return 0
    ans = []
    for j in range(i+1, n+1):
        #ans.append(justify(j, n, words, page_width)+ badness(words[i:j], page_width))
        ans.append(justification_map[j]+ badness(words[i:j], page_width))
    min_map[i] = ans.index(min(ans)) + 1
    return min(ans)

def main():
    print("Enter page width")
    page_width = int(input())  # input() returns a string in Python 3
    print("Enter text")
    paragraph = input()
    words = paragraph.split(' ')
    n = len(words)
    #justification_map[n] = 0 
    for i in reversed(range(n+1)):
        justification_map[i] = justify(i, n, words, page_width)

    print("Minimum badness achieved: ", justification_map[0])

    # Walk the stored offsets to print the index where each next line starts.
    key = 0
    while(key <n):
        key = key + min_map[key]
        print(key)

if __name__ == '__main__':
    main()

score 1 · Accepted Answer

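Java implementation

Given the maximum line width L, the idea for justifying the text T is to consider all suffixes of the text (to be precise, suffixes formed from words rather than characters). Dynamic programming is nothing but "careful brute force". If you consider the brute-force approach, you need to do the following.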

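1. Consider putting 1, 2, .. n words on the first line.
2. For each case described in case 1 (say i words are put on line 1), consider the cases of putting 1, 2, .. n - i words on the second line, then the remaining words on the third line, and so on.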
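The brute-force enumeration described in the two steps above can be sketched as follows. This is only an illustration in Python under stated assumptions: `line_cost` and `brute_force` are hypothetical names, and the cubic cost is borrowed from the question rather than from this answer. Every subset of the n - 1 gaps between words is tried as a set of line breaks, so the running time grows as 2^(n-1).

```python
from itertools import combinations


def line_cost(line, width):
    """Cubic badness of one line; infinite if the line does not fit."""
    length = sum(len(w) for w in line) + len(line) - 1
    if length > width:
        return float('inf')
    return (width - length) ** 3


def brute_force(words, width):
    """Try every set of break positions; return (best cost, best lines)."""
    n = len(words)
    best_cost, best_lines = float('inf'), None
    for k in range(n):                                  # k = number of breaks
        for breaks in combinations(range(1, n), k):
            bounds = (0,) + breaks + (n,)
            lines = [words[a:b] for a, b in zip(bounds, bounds[1:])]
            cost = sum(line_cost(line, width) for line in lines)
            if cost < best_cost:
                best_cost, best_lines = cost, lines
    return best_cost, best_lines


print(brute_force("aaa bb cc ddddd".split(), 6))
```

This is only practical for a handful of words, which is exactly why the answer moves on to reusing the cost of each suffix.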

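Instead, let's just consider the problem of finding the cost of putting a word at the beginning of a line. In general, we can define DP(i) as the cost of treating word i (0-indexed, that is words[i]) as the beginning of a line.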

How can we form a recurrence relation for DP(i)?

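If the jth word is the beginning of the next line, then the current line will contain words[i:j) (j exclusive), and the cost of the jth word being the beginning of the next line is DP(j). Hence DP(i) = DP(j) + the cost of putting words[i:j) on the current line. Since we want to minimise the total cost, DP(i) can be defined as follows.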

Recurrence relation:

DP(i) = min { DP(j) + cost of putting words[i:j) on the current line } for all j in [i+1, n]

Note that j = n means there are no words left to put on the next line.

Base case: DP(n) = 0 => at this point there are no words left to be written.
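The recurrence and base case above can be written almost literally as a memoized recursive function. The following is a sketch in Python, not the answer's own code: `min_total_cost` and its inner helpers are hypothetical names, and the per-line cost is assumed to be the squared slack (any cost that is infinite for overlong lines would work the same way).

```python
from functools import lru_cache


def min_total_cost(words, width):
    """Evaluate DP(0) for DP(i) = min over j of DP(j) + cost(i, j)."""
    n = len(words)

    def cost(i, j):
        # Squared slack of the line words[i:j]; infinite if it does not fit.
        length = sum(len(w) for w in words[i:j]) + (j - i - 1)
        return float('inf') if length > width else (width - length) ** 2

    @lru_cache(maxsize=None)
    def dp(i):
        if i == n:                       # base case: nothing left to place
            return 0
        return min(dp(j) + cost(i, j) for j in range(i + 1, n + 1))

    return dp(0)


print(min_total_cost("aaa bb cc ddddd".split(), 6))
```

`lru_cache` plays the role of the memo table here: each `dp(i)` is computed once, no matter how many values of j reach it.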

To summarize:

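Subproblems: suffixes, words[i:]
Guess: where to start the next line, # of choices n - i -> O(n)
Recurrence: DP(i) = min { DP(j) + cost of putting words[i:j) on the current line }. If we use memoization, the expression inside the braces should take O(1) time, and the loop runs O(n) times (the # of choices). i varies from n down to 0 => hence the total complexity comes to O(n^2).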

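Now, even though we have derived the minimum cost of justifying the text, we still need to solve the original problem by keeping track of the j value chosen as the minimum in the expression above, so that we can later use those values to print the justified text. The idea is to keep parent pointers.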
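As a rough illustration of the parent-pointer idea, here is a bottom-up sketch in Python. The names `justify_with_parents`, `dp`, and `parent` are hypothetical, and the squared-slack cost is an assumption made for the example. `parent[i]` stores the j that achieved the minimum for DP(i), and the lines are rebuilt by following those pointers from word 0.

```python
def justify_with_parents(words, width):
    """Bottom-up DP that also records, for each i, the best next-line start."""
    n = len(words)
    dp = [0] * (n + 1)          # dp[n] = 0 is the base case
    parent = [n] * (n + 1)      # parent[i] = index where the next line starts

    for i in range(n - 1, -1, -1):
        best, best_j = float('inf'), i + 1
        length = -1             # running length of words[i:j] with single spaces
        for j in range(i + 1, n + 1):
            length += len(words[j - 1]) + 1
            if length > width:
                break           # adding more words cannot make the line fit
            total = (width - length) ** 2 + dp[j]
            if total < best:
                best, best_j = total, j
        dp[i], parent[i] = best, best_j

    # Follow the parent pointers from word 0 to rebuild the lines.
    lines, i = [], 0
    while i < n:
        lines.append(' '.join(words[i:parent[i]]))
        i = parent[i]
    return dp[0], lines


print(justify_with_parents("aaa bb cc ddddd".split(), 6))
```

If a single word is wider than `width`, this sketch reports an infinite cost and still places that word on a line of its own.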

I hope this helps you understand the solution. Below is a simple implementation of the above idea.

 import java.util.ArrayList;
 import java.util.List;

 public class TextJustify {
    class IntPair {
        //The cost or badness
        final int x;

        //The index of word at the beginning of a line
        final int y;
        IntPair(int x, int y) {this.x=x;this.y=y;}
    }
    public List<String> fullJustify(String[] words, int L) {
        IntPair[] memo = new IntPair[words.length + 1];

        //Base case
        memo[words.length] = new IntPair(0, 0);


        for(int i = words.length - 1; i >= 0; i--) {
            int score = Integer.MAX_VALUE;
            int nextLineIndex = i + 1;
            for(int j = i + 1; j <= words.length; j++) {
                int badness = calcBadness(words, i, j, L);
                if(badness < 0 || badness == Integer.MAX_VALUE) break;
                int currScore = badness + memo[j].x;
                if(currScore < 0 || currScore == Integer.MAX_VALUE) break;
                if(score > currScore) {
                    score = currScore;
                    nextLineIndex = j;
                }
            }
            memo[i] = new IntPair(score, nextLineIndex);
        }

        List<String> result = new ArrayList<>();
        int i = 0;
        while(i < words.length) {
            String line = getLine(words, i, memo[i].y);
            result.add(line);
            i = memo[i].y;
        }
        return result;
    }

    private int calcBadness(String[] words, int start, int end, int width) {
        int length = 0;
        for(int i = start; i < end; i++) {
            length += words[i].length();
            if(length > width) return Integer.MAX_VALUE;
            length++;
        }
        length--;
        int temp = width - length;
        return temp * temp;
    }


    private String getLine(String[] words, int start, int end) {
        StringBuilder sb = new StringBuilder();
        for(int i = start; i < end - 1; i++) {
            sb.append(words[i] + " ");
        }
        sb.append(words[end - 1]);

        return sb.toString();
    }
  }

score 0 · Accepted Answer

This is what I came up with, based on your definitions.

import math

class Text(object):
    def __init__(self, words, width):
        self.words = words
        self.page_width = width
        self.str_arr = words
        self.memo = {}

    def total_length(self, str):
        total = 0
        for string in str:
            total = total + len(string)
        total = total + len(str) # spaces
        return total

    def badness(self, str):
        line_len = self.total_length(str)
        if line_len > self.page_width:
            return float('inf')  # inf rather than nan, so comparisons behave predictably
        else:
            return math.pow(self.page_width - line_len, 3)

    def dp(self):
        n = len(self.str_arr)
        # Base case: no words are left after the last one, so the cost is 0.
        self.memo[n] = 0

        return self.judge(0)

    def judge(self, i):
        if i in self.memo:
            return self.memo[i]

        self.memo[i] = float('inf') 
        # j is the start of the next line and may equal n (the last line).
        for j in range(i+1, len(self.str_arr) + 1):
            bad = self.judge(j) + self.badness(self.str_arr[i:j])
            if bad < self.memo[i]:
                self.memo[i] = bad

        return self.memo[i]
