
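How can I reduce redundant text? For example, I have 2 inputs and need to reduce them to the outputs below.

Input 1: The Toyota is red. The Honda is red. The BMW is red. The Mercedes is green.

Output 1: The Toyota, Honda and BMW are red, but the Mercedes is green.

Input 2: The Toyota is red. The Honda is red. The BMW is red. The Mercedes is red.

Output 2: All cars are red.

I assume this is an NLP problem. Ideally I would like to do this in Python (any other language is fine too; ease of development is preferred).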


2 Answers


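As I said in my comment on your question: I think you should first define some kind of grammar, for example a simple assignment statement: The <variable> is <value>. I wrote a small example for your first line; I think you'll get the idea, so you can do the same for the next line: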

import re

def reducer(text):  # Catch statements and add them to a dictionary
    catched = dict()
    for v, k in re.findall(r'The\s(?P<variable>\w+)\sis\s(?P<value>\w+)', text):
        try:
            catched[k].append(v)
        except KeyError:
            catched[k] = [v]
    return catched

def comma_and(ls):  # Create human-like enumeration with words
    output = [ls[0]]
    for i in range(1, len(ls)):
        output.append('%s%s' % (' and ' if i == len(ls) - 1 else ', ', ls[i]))
    return ''.join(output)

def rephrase(text):  # Rephrase separated statements into one sentence
    stmnts = reducer(text)
    part1 = str()
    part2 = str()
    for key in stmnts:
        if len(stmnts[key]) <= 1:
            part2 = 'but the {variable} is {value}.'.format(
                variable=stmnts[key][0], value=key)
        else:
            part1 = 'The {variables} are {value}'.format(
                variables=comma_and(stmnts[key]), value=key)
    print(part1 + ' ' + part2)

Demo:

rephrase('The Toyota is red. The Honda is red. The BMW is red. The Mercedes is green.')

The output is:

# The Toyota, Honda and BMW are red but the Mercedes is green.

Your next line works similarly: check whether the dictionary catched has only one key and, if all the values under that key are cars, use All <type> are <value>.
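
The single-key check described above could be sketched roughly as follows. This is only a minimal Python 3 sketch under assumptions: `group_by_value`, `summarize_single_group`, the `type_name` and `members` parameters, and the `KNOWN_CARS` set are hypothetical names that do not appear in the answer, and the set of cars is hard-coded purely for illustration.

```python
import re

# Hypothetical lookup of what counts as a "car"; assumed for illustration only.
KNOWN_CARS = {'Toyota', 'Honda', 'BMW', 'Mercedes'}


def group_by_value(text):
    """Map each value (e.g. a colour) to the list of variables that share it."""
    groups = {}
    for variable, value in re.findall(r'The\s(\w+)\sis\s(\w+)', text):
        groups.setdefault(value, []).append(variable)
    return groups


def summarize_single_group(text, type_name='cars', members=KNOWN_CARS):
    """Return 'All <type> are <value>.' if every statement shares one value.

    Returns None when there are several values, only one statement,
    or a variable that is not a known member of the type.
    """
    groups = group_by_value(text)
    if len(groups) != 1:
        return None
    value, variables = next(iter(groups.items()))
    if len(variables) > 1 and all(v in members for v in variables):
        return 'All {type} are {value}.'.format(type=type_name, value=value)
    return None


print(summarize_single_group(
    'The Toyota is red. The Honda is red. The BMW is red. The Mercedes is red.'))
# prints: All cars are red.
```

When the function returns None, the input has more than one value (or an unknown variable), and the grouped phrasing from the example above would apply instead.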

Answered 2013-05-25T23:16:14.443

Interesting case! I did something like the following:

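import random
import re
import sys

# test_mode and sample_test are used by get_sentence() but were not defined
# in the post; the values below are assumed for illustration.
test_mode = False
sample_test = [
    'The Toyota is red. The Honda is red. The BMW is red. The Mercedes is green.',
    'The Toyota is red. The Honda is red. The BMW is red. The Mercedes is red.',
]
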
def main():
    sentence = get_sentence()
    print("Simplifying sentence:")
    print(""" \" %s \" """ % sentence )
    mapped_elements = map_colors(sentence)
    # 3 possible cases
    number_of_parts = len(mapped_elements)
    if number_of_parts == 1:
        color = mapped_elements[0][0]
        elements = mapped_elements[0][1]
        if len(elements) == 1:  # a single noun: keep the singular form
            simplified = "The %s is %s." % (elements[0], color)
        else:
            simplified = "All cars are %s." % (color)

    elif number_of_parts == 2:
        part1 = mapped_elements[0]
        part2 = mapped_elements[1]

        part1_sentence = "The %s %s %s" % (gather_elements(part1[1]) ,pluralize(len(part1[1])),part1[0])
        part2_sentence = "the %s %s %s" % (gather_elements(part2[1]), pluralize(len(part2[1])),part2[0])
        simplified = "%s but %s." % (part1_sentence, part2_sentence)
    else:
        all_parts = []
        for color, nouns in mapped_elements:
            part_sentence = "The %s %s %s. " % (gather_elements(nouns) , pluralize(len(nouns)), color)
            all_parts.append(part_sentence)
        simplified = ''.join(all_parts)

    print("Result: %s" % simplified)
    return simplified

def pluralize(n):
    if n == 1: return "is"
    return "are"

def get_sentence():
    if len(sys.argv) > 1:
        sentence = sys.argv[1]
    elif test_mode:
        sentence = random.choice(sample_test)
    else:
        sentence = input("Please enter a sentence to simplify: ")
    return sentence



def gather_elements(elements):
    if len(elements) == 1: 
        return elements[0]
    partial = ', '.join(elements[:-1])
    partial += " and %s" % elements[-1]  # no trailing space, avoids a double space before is/are
    return partial

def map_colors(sentence):
    colors = {}
    r = re.compile(r'The (\w+) is (\w+)\.')
    pairs = re.findall( r, sentence)
    for noun, color in pairs:
        if color in colors:
            colors[color].append(noun)
        else:
            colors[color] = [noun]
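    # Return a list so main() can index it (dict views are not indexable in Python 3)
    return list(colors.items())

if __name__ == '__main__':
    main()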
Answered 2013-05-25T23:52:42.840