I need to create a function that generates a list from a text:

text = '^to[by, from] all ^appearances[appearance]'

list = ['to all appearances', 'to all appearance', 'by all appearances', 
        'by all appearance', 'from all appearances', 'from all appearance']

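That is, each value inside the brackets should replace the preceding word, the one that comes right after the ^. I would like the function to take five parameters, as shown below...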

My code (it does not work):

import re

def addSubstitution(buf, substitutions, val1='[', val2=']', dsym=',', start_p="^"):
    for i in range(1, len(buf), 2):
        buff = []
        buff.extend(buf)
        if re.search('''[^{2}]+[{0}][^{1}{0}]+?[{1}]'''.format(val1, val2, start_p), buff[i]):
            substrs = re.split('['+val1+']'+'|'+'['+val2+']'+'|'+dsym, buff[i])
            for substr in substrs:
                if substr:
                    buff[i] = substr
                    addSubstitution(buff, substitutions, val1, val2, dsym, start_p)
        return
    substitutions.add(''.join(buf))
    pass

def getSubstitution(text, val1='[', val2=']', dsym=',', start_p="^"):
    pattern = '''[^{2}]+[{0}][^{1}{0}]+?[{1}]'''.format(val1, val2, start_p)
    texts = re.split(pattern,text)
    opttexts = re.findall(pattern,text)
    buff = []
    p = iter(texts)
    t = iter(opttexts)
    buf = []
    while True:
        try:
            buf.append(next(p))
            buf.append(next(t))
        except StopIteration:
            break
    substitutions = set()
    addSubstitution(buf, substitutions, val1, val2, dsym, start_p)
    substitutions = list(substitutions)
    substitutions.sort(key=len)
    return substitutions

1 Answer

One approach could be the following (I am skipping the string-manipulation code):

text = '^to[by, from] all ^appearances[appearance]'

Step 1: Tokenize text like this:

tokenizedText = ['^to[by, from]', 'all', '^appearances[appearance]']
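
The answer leaves the tokenization itself out. One possible way to do step 1 is a regular expression that treats a whole ^word[...] group as a single token and splits everything else on whitespace. This is only a sketch: the helper name tokenize and its parameters are illustrative and not part of the original answer, and it assumes spaces may appear inside the brackets but not inside the marked word.

```python
import re

def tokenize(text, val1='[', val2=']', start_p='^'):
    # Hypothetical helper (not from the original answer).
    # A marked group such as '^to[by, from]' is matched as one token, even though
    # it contains a space; any other run of non-whitespace is a plain token.
    group = '{s}[^\\s{o}{c}]+{o}[^{o}{c}]*{c}'.format(
        s=re.escape(start_p), o=re.escape(val1), c=re.escape(val2))
    return re.findall(group + r'|\S+', text)

tokenizedText = tokenize('^to[by, from] all ^appearances[appearance]')
```

Splitting on whitespace alone would not be enough here, because '^to[by, from]' contains a space after the comma.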

Step 2: Build a list of all the word groups that need a Cartesian product (the words that start with ^).

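combinationList = []
for word in tokenizedText:
    if word[0] == '^':
        # Split '^to[by, from]' into ['to', 'by', 'from'] and add it to `combinationList`.
        head, _, rest = word[1:].partition('[')
        combinationList.append([head] + [alt.strip() for alt in rest.rstrip(']').split(',')])

# combinationList is now [['to', 'by', 'from'], ['appearances', 'appearance']]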

Step 3: Compute the Cartesian product using itertools.product(...):

import itertools

for substitution in itertools.product(*combinationList):
    counter = 0
    sentence = []
    for word in tokenizedText:
        if word[0] == '^':
            # Take the next chosen alternative for this marked word.
            sentence.append(substitution[counter])
            counter += 1
        else:
            sentence.append(word)
    print(' '.join(sentence))  # Or append this to a list if you want to return all substitutions.
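
Putting the three steps together, a function with the five parameters from the question might look like the sketch below. The name get_substitutions is hypothetical, and the regular expression is my own assumption about the markup (a marked word contains no whitespace and the brackets are not nested), so treat it as one possible implementation and not as the definitive one.

```python
import itertools
import re

def get_substitutions(text, val1='[', val2=']', dsym=',', start_p='^'):
    # Hypothetical function; assumes marked words contain no whitespace
    # and that bracket groups are not nested.
    # Group 1 captures the marked word, group 2 the alternatives inside the brackets.
    pattern = re.compile('{s}([^\\s{o}{c}]+){o}([^{o}{c}]*){c}'.format(
        s=re.escape(start_p), o=re.escape(val1), c=re.escape(val2)))
    # With two capture groups, split() returns
    # [literal, word, alternatives, literal, word, alternatives, ..., literal].
    parts = pattern.split(text)
    literals = parts[0::3]
    choices = [[word] + [alt.strip() for alt in alts.split(dsym) if alt.strip()]
               for word, alts in zip(parts[1::3], parts[2::3])]
    results = []
    for combo in itertools.product(*choices):
        # Interleave the literal text with the alternatives chosen for this combination.
        pieces = [literals[0]]
        for chosen, literal in zip(combo, literals[1:]):
            pieces.append(chosen)
            pieces.append(literal)
        results.append(''.join(pieces))
    return results

substitutions = get_substitutions('^to[by, from] all ^appearances[appearance]')
```

Because the literal text between groups is kept as-is, the original spacing is preserved and no separate tokenization step is needed in this variant.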
answered 2013-03-08T08:18:08.433