python - Parsing a Moses config file

Question

Given a config file from the Moses Machine Translation Toolkit:

#########################
### MOSES CONFIG FILE ###
#########################

# input factors
[input-factors]
0

# mapping steps
[mapping]
0 T 0

[distortion-limit]
6

# feature functions
[feature]
UnknownWordPenalty
WordPenalty
PhrasePenalty
PhraseDictionaryMemory name=TranslationModel0 num-features=4 path=/home/gillin/jojomert/phrase-jojo/work.src-ref/training/model/phrase-table.gz input-factor=0 output-factor=0
LexicalReordering name=LexicalReordering0 num-features=6 type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0 path=/home/gillin/jojomert/phrase-jojo/work.src-ref/training/model/reordering-table.wbe-msd-bidirectional-fe.gz
Distortion
KENLM lazyken=0 name=LM0 factor=0 path=/home/gillin/jojomert/ru.kenlm order=5

# dense weights for feature functions
[weight]
UnknownWordPenalty0= 1
WordPenalty0= -1
PhrasePenalty0= 0.2
TranslationModel0= 0.2 0.2 0.2 0.2
LexicalReordering0= 0.3 0.3 0.3 0.3 0.3 0.3
Distortion0= 0.3
LM0= 0.5

I need to read the parameters from the [weight] section:

UnknownWordPenalty0= 1
WordPenalty0= -1
PhrasePenalty0= 0.2
TranslationModel0= 0.2 0.2 0.2 0.2
LexicalReordering0= 0.3 0.3 0.3 0.3 0.3 0.3
Distortion0= 0.3
LM0= 0.5

I have been doing it like this:

def read_params_from_moses_ini(mosesinifile):
    parameters_string = ""
    for line in reversed(open(mosesinifile, 'r').readlines()):
        if line.startswith('[weight]'):
            return parameters_string
        else:
            parameters_string+=line.strip() + ' '

to get this output:

LM0= 0.5 Distortion0= 0.3 LexicalReordering0= 0.3 0.3 0.3 0.3 0.3 0.3 TranslationModel0= 0.2 0.2 0.2 0.2 PhrasePenalty0= 0.2 WordPenalty0= -1 UnknownWordPenalty0= 1

and then parsing that output with

import re

moses_param_pattern = re.compile(r'''([^\s=]+)=\s*((?:[^\s=]+(?:\s|$))*)''')

def parse_parameters(parameters_string):
    return dict((k, list(map(float, v.split())))
                   for k, v in moses_param_pattern.findall(parameters_string))


mosesinifile = 'mertfiles/moses.ini'

print(parse_parameters(read_params_from_moses_ini(mosesinifile)))

to get:

{'UnknownWordPenalty0': [1.0], 'PhrasePenalty0': [0.2], 'WordPenalty0': [-1.0], 'Distortion0': [0.3], 'LexicalReordering0': [0.3, 0.3, 0.3, 0.3, 0.3, 0.3], 'TranslationModel0': [0.2, 0.2, 0.2, 0.2], 'LM0': [0.5]}

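The current solution involves some crazy reversed line reading of the config file, followed by a fairly complicated regex to extract the parameters.

Is there a simpler, less hacky or less verbose way to read the file and produce the desired parameter dictionary?

Is it possible to change configparser so that it reads a Moses config file? That is hard because the file has some irregular sections that are really parameters, e.g. [distortion-limit], where the value 6 has no key. In a valid configparser-readable file it would be distortion-limit = 6.

Note: the native Python configparser is unable to handle a moses.ini config file. The answers to "How to read and write INI file with Python3?" do not work.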

score 1 · Accepted Answer

You can do this quite simply.

x="""#########################
### MOSES CONFIG FILE ###
#########################

# input factors 
[input-factors]
0

# mapping steps
[mapping]
0 T 0

[distortion-limit]
6

# feature functions
[feature]
UnknownWordPenalty
WordPenalty
PhrasePenalty
PhraseDictionaryMemory name=TranslationModel0 num-features=4 path=/home/gillin/jojomert/phrase-jojo/work.src-ref/training/model/phrase-table.gz input-factor=0 output-factor=0
LexicalReordering name=LexicalReordering0 num-features=6 type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0 path=/home/gillin/jojomert/phrase-jojo/work.src-ref/training/model/reordering-table.wbe-msd-bidirectional-fe.gz
Distortion
KENLM lazyken=0 name=LM0 factor=0 path=/home/gillin/jojomert/ru.kenlm order=5

# dense weights for feature functions
[weight]
UnknownWordPenalty0= 1
WordPenalty0= -1
PhrasePenalty0= 0.2
TranslationModel0= 0.2 0.2 0.2 0.2
LexicalReordering0= 0.3 0.3 0.3 0.3 0.3 0.3
Distortion0= 0.3
LM0= 0.5"""

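import re

# The inner findall isolates the [weight] section; the outer one splits it into (name, values) pairs.
# "-" is part of the value character class so that negative weights such as WordPenalty0 are kept.
print([(i, j.split()) for i, j in re.findall(r"([^\s=]+)=\s*([-\d.\s]+(?<!\s))", re.findall(r"\[weight\]([\s\S]*?)(?:\n\[[^\]]*\]|$)", x)[0])])

Output: [('UnknownWordPenalty0', ['1']), ('WordPenalty0', ['-1']), ('PhrasePenalty0', ['0.2']), ('TranslationModel0', ['0.2', '0.2', '0.2', '0.2']), ('LexicalReordering0', ['0.3', '0.3', '0.3', '0.3', '0.3', '0.3']), ('Distortion0', ['0.3']), ('LM0', ['0.5'])]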
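
The list above holds the values as strings, while the question asks for a dictionary of floats. Below is a minimal sketch of that last conversion step. The helper name `weights_to_dict` and the shortened `SAMPLE` text are assumptions made for illustration; they are not part of the original answer.

```python
import re

# Hypothetical sample: only the tail of a moses.ini file.
SAMPLE = """[distortion-limit]
6

[weight]
WordPenalty0= -1
TranslationModel0= 0.2 0.2 0.2 0.2
LM0= 0.5"""


def weights_to_dict(text):
    """Return {name: [float, ...]} for the [weight] section of text."""
    # Step 1: grab everything after "[weight]" up to the next section header or the end.
    sections = re.findall(r"\[weight\]([\s\S]*?)(?:\n\[[^\]]*\]|$)", text)
    if not sections:
        return {}
    # Step 2: split each "name= v1 v2 ..." entry; "-" is allowed so negative weights survive.
    pairs = re.findall(r"([^\s=]+)=\s*([-\d.\s]+(?<!\s))", sections[0])
    return {name: [float(v) for v in values.split()] for name, values in pairs}


print(weights_to_dict(SAMPLE))
```

Returning an empty dictionary when no [weight] header is found is a design choice of this sketch; indexing `[0]` directly, as in the one-liner, would raise an IndexError instead.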

score 1 · Accepted Answer

Here is another short regex-based solution that returns a dictionary of values like the one in your desired output:

import re

text = "MOSES_INI_FILE_CONTENTS"  # placeholder: replace with the contents of moses.ini
dct = {}

# Get the [weight] section: from the header up to the first blank line or the end of the input.
# The pattern finds the same block as r"(?s)\[weight].*?(?:$|\n\n)".
match_weight = re.search(r"\[weight][^\n]*(?:\n(?!$|\n)[^\n]*)*", text)
if match_weight:
    weight = match_weight.group()  # the [weight] text
    # Collect the key/value pairs and parse each value list into floats.
    dct = {key: [float(v) for v in values.split()]
           for key, values in re.findall(r"(\w+)\s*=\s*(.*)\s*", weight)}

print(dct)

See the IDEONE demo.

The resulting dictionary contents:

{'UnknownWordPenalty0': [1.0], 'LexicalReordering0': [0.3, 0.3, 0.3, 0.3, 0.3, 0.3], 'LM0': [0.5], 'PhrasePenalty0': [0.2], 'TranslationModel0': [0.2, 0.2, 0.2, 0.2], 'Distortion0': [0.3], 'WordPenalty0': [-1.0]}

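Logic:

Get the [weight] block out of the file. This can be done with the regex r"\[weight][^\n]*(?:\n(?!$|\n)[^\n]*)*", which matches [weight] literally and then matches any characters, any number of times, up to a double \n (the regex uses the unroll-the-loop technique and works well on longer texts spanning several lines). The equivalent lazy-dot regex is r"(?s)\[weight].*?(?:$|\n\n)"; it is less efficient (the first regex needs 62 steps and the second needs 528 steps to find the match in the current MOSES.ini file), but it is definitely more readable.
After running the search, check for a match. If one is found, run re.findall(r"(\w+)\s*=\s*(.*)\s*", weight) to collect all key-value pairs. The regex (\w+)\s*=\s*(.*)\s* is simple: it matches and captures into group 1 one or more alphanumeric symbols ((\w+)), followed by optional whitespace, =, and optional whitespace again (\s*=\s*), then matches and captures into group 2 any characters other than a newline up to the end of the line ((.*)). The trailing newline with any subsequent spaces is trimmed by the final \s*.
When collecting the keys and values, the latter can be returned as lists of numbers parsed as floats using a comprehension.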
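
To make the comparison between the two patterns concrete, here is a small sketch that runs both on a shortened, hypothetical sample and checks that they lead to the same dictionary. The names `UNROLLED`, `LAZY`, `parse_weight_block` and the sample text are illustrative assumptions; the step counts quoted above are not measured here.

```python
import re

# Shortened, hypothetical sample; a blank line ends the [weight] block.
SAMPLE = (
    "[feature]\nDistortion\n\n"
    "[weight]\nDistortion0= 0.3\nLM0= 0.5\n\n"
    "[other]\nx= 1\n"
)

UNROLLED = r"\[weight][^\n]*(?:\n(?!$|\n)[^\n]*)*"  # unrolled-loop version
LAZY = r"(?s)\[weight].*?(?:$|\n\n)"                # lazy-dot version


def parse_weight_block(pattern, text):
    """Find the [weight] block with `pattern` and parse its key/value lines."""
    match = re.search(pattern, text)
    if not match:
        return {}
    pairs = re.findall(r"(\w+)\s*=\s*(.*)\s*", match.group())
    return {key: [float(v) for v in values.split()] for key, values in pairs}


unrolled_result = parse_weight_block(UNROLLED, SAMPLE)
lazy_result = parse_weight_block(LAZY, SAMPLE)
print(unrolled_result)
print(unrolled_result == lazy_result)
```

The matched text differs slightly (the lazy version also consumes the blank line), but the key/value pairs extracted from it are the same.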

score 1 · Accepted Answer

Without a regex, you can do something like this:

flag = False  # becomes True once the [weight] header has been seen
result = dict()

with open('moses.ini', 'r') as fh:  # text mode, so each line is a str
    for line in fh:
        if flag:
            parts = line.rstrip().split('= ')
            if len(parts) == 2:
                result[parts[0]] = [float(x) for x in parts[1].split()]
            else:
                break  # blank line: end of the [weight] block
        elif line.startswith('[weight]'):
            flag = True

print(result)

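The file is read line by line in a loop. When [weight] is reached, the flag is set to True and the key/value pairs are extracted from all the following lines, until a blank line or the end of the file.

This way, only the current line is held in memory, and once the end of the [weight] block is reached the program stops reading the file.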
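
The early stop described above can be observed with an in-memory file. The sketch below wraps the same loop in a function; `read_weights` and the `io.StringIO` stand-in for `open('moses.ini')` are assumptions made for illustration, not part of the original answer.

```python
import io


def read_weights(lines):
    """Collect 'name= v1 v2 ...' entries that follow the [weight] header."""
    in_weight = False
    result = {}
    for line in lines:
        if in_weight:
            parts = line.rstrip().split('= ')
            if len(parts) != 2:
                break  # a blank line (or any non key/value line) ends the block
            result[parts[0]] = [float(x) for x in parts[1].split()]
        elif line.startswith('[weight]'):
            in_weight = True
    return result


# Hypothetical in-memory stand-in for the real file.
fake_file = io.StringIO(
    "[distortion-limit]\n6\n\n"
    "[weight]\nWordPenalty0= -1\nLM0= 0.5\n\n"
    "[ignored]\nx= 1\n"
)
weights = read_weights(fake_file)
remaining = fake_file.read()  # whatever the loop never looked at

print(weights)
print(repr(remaining))
```

Everything after the blank line that closes the [weight] block is still unread when the function returns, which is the behavior the explanation relies on.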

Another way, using itertools:

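from itertools import dropwhile, takewhile

result = dict()

with open('moses.ini', 'r') as fh:  # text mode, so each line is a str
    # skip every line before the [weight] header
    a = dropwhile(lambda x: not x.startswith('[weight]'), fh)
    next(a)  # consume the [weight] header itself
    # take key/value lines until the first line that does not split in two
    for k, v in takewhile(lambda x: len(x) == 2, (y.rstrip().split('= ') for y in a)):
        result[k] = [float(x) for x in v.split()]

print(result)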
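
Coming back to the configparser part of the question: with its default settings, configparser rejects lines such as 6 that contain no delimiter, but the standard-library parser can be configured to tolerate them. The following is a sketch under that assumption, checked only against a shortened, hypothetical sample rather than a full moses.ini; value-less lines simply become keys whose value is None.

```python
import configparser

# Shortened, hypothetical sample in the moses.ini layout.
SAMPLE = """\
# mapping steps
[mapping]
0 T 0

[distortion-limit]
6

# dense weights for feature functions
[weight]
UnknownWordPenalty0= 1
WordPenalty0= -1
TranslationModel0= 0.2 0.2 0.2 0.2
"""

parser = configparser.ConfigParser(
    allow_no_value=True,       # accept lines like "6" that have no "=" at all
    delimiters=("=",),         # do not treat ":" as a key/value separator
    comment_prefixes=("#",),   # only "#" starts a comment line
    interpolation=None,        # keep values verbatim, no "%" substitution
)
parser.optionxform = str       # keep the original case of the keys
parser.read_string(SAMPLE)

weights = {key: [float(x) for x in value.split()]
           for key, value in parser["weight"].items()}
distortion_keys = list(parser["distortion-limit"])

print(weights)
print(distortion_keys)
```

In this setup the 6 under [distortion-limit] is exposed as a key rather than a value, so it has to be read from the section's keys. Lines in [feature] that contain several = signs would be split at the first one, which is probably not what you want for that section.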
