Let's say I have the following paragraph:

"This is the first sentence. This is the second sentence? This is the third
 sentence!"

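I need to create a function that returns only as many whole sentences as fit within a given number of characters. If the limit is shorter than the first sentence, it should return the first sentence truncated to that many characters.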

For example:

>>> reduce_paragraph(100)
"This is the first sentence. This is the second sentence? This is the third
 sentence!"

>>> reduce_paragraph(80)
"This is the first sentence. This is the second sentence?"

>>> reduce_paragraph(50)
"This is the first sentence."

>>> reduce_paragraph(5)
"This "

I started with something like this, but I can't figure out how to finish it:

endsentence = ".?!"
sentences = itertools.groupby(text, lambda x: any(x.endswith(punct) for punct in endsentence))
for number,(truth, sentence) in enumerate(sentences):
    if truth:
        first_sentence = previous+''.join(sentence).replace('\n',' ')
    previous = ''.join(sentence)

4 Answers

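Because of the syntactic structure of English, splitting text into sentences is hard. As someone has already pointed out, things like abbreviations will cause endless headaches even for the best regular expression.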

You should consider the Natural Language Toolkit (NLTK), in particular its punkt module. It is a sentence tokenizer that will do the heavy lifting for you.
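To illustrate the abbreviation problem described above, here is a minimal sketch using only the standard library. The helper name `naive_split` and the sample text are made up for this illustration; the rule it applies (split after any `.`, `?` or `!` followed by whitespace) is just one plausible naive rule, not a recommended approach.

```python
import re

def naive_split(text):
    """Naive rule (illustrative only): split after '.', '?' or '!' plus whitespace."""
    return re.split(r"(?<=[.?!])\s+", text)

# The sample contains two real sentences, but the abbreviations
# "Dr." and "p.m." also end with a period followed by a space.
sample = "Dr. Smith arrived at 5 p.m. sharp. Was he late?"
parts = naive_split(sample)
print(parts)
# ['Dr.', 'Smith arrived at 5 p.m.', 'sharp.', 'Was he late?']
```

The naive rule produces four pieces instead of two, which is the kind of error a trained sentence tokenizer is designed to avoid.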

answered 2012-08-19T22:40:45.570

Here's how to truncate a paragraph using the punkt module that @BigHandsome mentioned:

from nltk.tokenize.punkt import PunktSentenceTokenizer

def truncate_paragraph(text, maxnchars,
                       tokenize=PunktSentenceTokenizer().span_tokenize):
    """Truncate the text to at most maxnchars number of characters.

    The result contains only full sentences unless maxnchars is less
    than the first sentence length.
    """
    sentence_boundaries = tokenize(text)
    last = None
    for start_unused, end in sentence_boundaries:
        if end > maxnchars:
            break
        last = end
    return text[:last] if last is not None else text[:maxnchars]

Example

text = ("This is the first sentence. This is the second sentence? "
        "This is the third\n sentence!")
for limit in [100, 80, 50, 5]:
    print(truncate_paragraph(text, limit))

Output

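This is the first sentence. This is the second sentence? This is the third
 sentence!
This is the first sentence. This is the second sentence?
This is the first sentence.
This 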
answered 2012-08-19T23:45:41.770

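If we ignore the natural-language issues (that is, we only want an algorithm that returns the complete chunks delimited by ".?!" whose total length stays within k), then the following basic approach will work: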

def sentences_upto(paragraph, k):
    """Return the complete sentences found within the first k characters."""
    sentences = []
    current_sentence = ""
    stop_chars = ".?!"
    # Only scan the first k characters; a trailing partial sentence is dropped.
    for c in paragraph[:k]:
        current_sentence += c
        if c in stop_chars:
            sentences.append(current_sentence)
            current_sentence = ""
    return sentences

Your itertools solution can be completed like this:

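import itertools

def sentences_upto_2(paragraph, size):
    """Yield complete sentences while they fit within size characters."""
    stop_chars = ".?!"
    # Group consecutive characters by whether they are sentence-ending punctuation.
    groups = itertools.groupby(paragraph, lambda c: c in stop_chars)
    pending = ""
    for is_stop, chars in groups:
        chunk = "".join(chars)
        size -= len(chunk)
        if size < 0:
            return
        if is_stop:
            # Punctuation closes the sentence: emit the text together with it.
            yield pending + chunk
            pending = ""
        else:
            pending = chunk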
answered 2012-08-19T22:43:47.477

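You can break this problem down into simpler steps:

  1. Given a paragraph, split it into sentences.
  2. Figure out how many sentences we can join together without exceeding the character limit.
  3. If we can fit at least one sentence, join those sentences together.
  4. If the first sentence is too long, truncate the first sentence.

Sample code (untested):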

    import re

    def reduce_paragraph(para, max_len):
        # Split into list of sentences
        # A sentence is a sequence of characters ending with ".", "?", or "!".
        # Note: splitting on a zero-width match requires Python 3.7 or later.
        sentences = re.split(r"(?<=[.?!])", para)

        # Figure out how many sentences we can have and stay under max_len
        num_sentences = 0
        total_len = 0
        for s in sentences:
            total_len += len(s)
            if total_len > max_len:
                break
            num_sentences += 1

        if num_sentences > 0:
            # We can fit at least one sentence, so return whole sentences
            return ''.join(sentences[:num_sentences])
        else:
            # Return a truncated first sentence
            return sentences[0][:max_len]
answered 2012-08-20T00:02:00.220