python - Replace an underscore-separated substring in the middle of a comma-separated string

Question

I have a file containing many lines like this:

 'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}

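I want to replace 1371078139195 (in this case) with another number. The value I want to replace is always in the first comma-separated word, and it is always the second-to-last underscore-separated value within that word. Here is how I did it. It works, but it seems unpythonic and clunky.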

>>> line="'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}"
>>> l1=",".join(line.split(",")[1:])
>>> print(l1)
 {'cf:rv': '0'}
>>> l2=line.split(",")[0]
>>> print(l2)
'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1371078139195_+14155186442'
>>> print("_".join(l2.split('_')[:-2]))
'AMS_Investigation|txtt.co_BigtittedBlondOtherNight
>>>
>>> print("_".join(l2.split('_')[:-2]) + "_1234567_" + l2.split('_')[-1])
'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1234567_+14155186442'
>>> print("_".join(l2.split('_')[:-2]) + "_1234567_" + l2.split('_')[-1] + "," + l1)
'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1234567_+14155186442', {'cf:rv': '0'}
>>>

Is there a simpler way (perhaps using a regex) to replace that value? I can't imagine this is the best way.

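I have received a few answers, so I must stress that it is the second-to-last underscore-separated value. The following are all valid strings: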

line = "'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}"
line = "'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1371078139195_14155186442', {'cf:rv': '0'}"
line = "'AMS_Investigation|txtt.co_1371078139195_BigtittedBlondOtherNight_1371078139195_1371078139195', {'cf:rv': '0'}"

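In the cases above, the string contains a run of digits that is not the second-to-last underscore-separated value. The last part may or may not be all digits (it could be +14155186442 or 14155186442). Sorry for not mentioning this earlier.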


score 4 · Accepted Answer

Using a regex:

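import re

# s is the input line; new_number() stands for whatever computes the replacement.
m = re.match(r"([^,]*_)([+]?[0-9]+)(_.*)", s)
if m:
    before = m.group(1)   # everything up to and including the underscore before the number
    number = m.group(2)   # the number to replace
    after = m.group(3)    # the underscore after the number and the rest of the line
    s = before + new_number(number) + after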

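which means:

[^,]*_ = as many characters as possible that are not commas, followed by an underscore
[+]?[0-9]+ = a number, optionally preceded by +
_.* = an underscore followed by anything

This works because regex matching is "greedy" by default: [^,]* first consumes everything up to the first comma, then backtracks one underscore at a time until the rest of the pattern can match. For the lines in the question, that happens at the underscore just before the second-to-last value.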
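
The backtracking can be observed directly on the sample lines from the question. This is a minimal sketch for Python 3; `replace_number` is a hypothetical helper name, and it takes a fixed replacement string instead of calling `new_number`:

```python
import re

# Same pattern as above, written as a raw string.
PATTERN = re.compile(r"([^,]*_)([+]?[0-9]+)(_.*)")

def replace_number(line, replacement):
    """Hypothetical helper: swap the number captured by group 2."""
    m = PATTERN.match(line)
    if m is None:
        return line  # leave lines that do not match untouched
    return m.group(1) + replacement + m.group(3)

samples = [
    "'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}",
    "'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1371078139195_14155186442', {'cf:rv': '0'}",
    "'AMS_Investigation|txtt.co_1371078139195_BigtittedBlondOtherNight_1371078139195_1371078139195', {'cf:rv': '0'}",
]

for s in samples:
    # Group 2 is the field that will be replaced.
    print(PATTERN.match(s).group(2), "->", replace_number(s, "1234567"))
```

In all three samples group 2 is the second-to-last field, even when an earlier or later field is also numeric.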

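If, for example, you need the third-to-last underscore-separated value instead of the second-to-last, the expression can be changed to

m = re.match(r"([^,]*_)([+]?[0-9]+)(_[^,]*_.*)", s)

which requires at least two underscores between the number and the comma.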
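
A short sketch of the difference between the two patterns, using a made-up input so the positions are easy to see (the string `a_111_222_333, rest` is an assumption for illustration, not data from the question). Both patterns only ever select a numeric field, so each picks the last number that has the required count of underscores after it:

```python
import re

second_to_last = re.compile(r"([^,]*_)([+]?[0-9]+)(_.*)")
third_to_last = re.compile(r"([^,]*_)([+]?[0-9]+)(_[^,]*_.*)")

s = "a_111_222_333, rest"  # hypothetical input

picked_second = second_to_last.match(s).group(2)
picked_third = third_to_last.match(s).group(2)

print(picked_second)  # 222
print(picked_third)   # 111
```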

score 3 · Accepted Answer

A non-regex solution:

>>> strs = " 'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}"
>>> first, sep, rest = strs.partition(',')
>>> lis = first.rsplit('_', 2)
>>> lis[1] = "1111111"
>>> "_".join(lis) + sep + rest
" 'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1111111_+14155186442', {'cf:rv': '0'}"

As a function:

>>> def solve(strs, rep):
...     first, sep, rest = strs.partition(',')
...     lis = first.rsplit('_', 2)
...     lis[1] = rep
...     return "_".join(lis) + sep + rest
...
>>> solve(" 'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}", "1111")
" 'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1111_+14155186442', {'cf:rv': '0'}"
>>> solve("'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1371078139195_14155186442', {'cf:rv': '0'}", "2222")
"'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_2222_14155186442', {'cf:rv': '0'}"
>>> solve("'AMS_Investigation|txtt.co_1371078139195_BigtittedBlondOtherNight_1371078139195_1371078139195', {'cf:rv': '0'}", "2222")
"'AMS_Investigation|txtt.co_1371078139195_BigtittedBlondOtherNight_2222_1371078139195', {'cf:rv': '0'}"

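Since the question is about a file with many such lines, the same function can be applied line by line. This is a sketch for Python 3 under a few assumptions: `io.StringIO` stands in for the real file, the two short lines are made-up data, and `replace_in_lines` is a hypothetical name.

```python
import io

def solve(strs, rep):
    # Split off everything after the first comma, then replace the
    # second-to-last underscore-separated field of the first part.
    first, sep, rest = strs.partition(',')
    lis = first.rsplit('_', 2)
    lis[1] = rep
    return "_".join(lis) + sep + rest

def replace_in_lines(lines, rep):
    """Hypothetical wrapper: yield each line with the field replaced."""
    for line in lines:
        yield solve(line, rep)

# Stand-in for an opened file; iterating yields lines with their newline.
fake_file = io.StringIO(
    "'A_x_111_+222', {'cf:rv': '0'}\n"
    "'A_5_x_111_222', {'cf:rv': '0'}\n"
)

result = list(replace_in_lines(fake_file, "999"))
print("".join(result), end="")
```

Because `partition` keeps the remainder of the line intact, the trailing newline of each line is preserved.
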
score 1 · Accepted Answer

Like this?

>>> import re
>>> line = "'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}"
>>> re.subn(r'_(\d+)_', '_mynewnumber_', line, count=1)
("'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_mynewnumber_+14155186442', {'cf:rv': '0'}",
1)
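
Note that `count=1` replaces the first `_digits_` run, which is not necessarily the second-to-last field (for example when an earlier field such as `_23456_` is numeric). One possible variant, offered as a sketch rather than as this answer's own method, uses a lookahead to require exactly one more underscore-separated field before the first comma. It assumes every line contains a comma.

```python
import re

# Match "_field" only when it is followed by "_lastfield," that is, when it
# is the second-to-last underscore-separated field before the first comma.
pattern = re.compile(r"_[^_,]*(?=_[^_,]*,)")

line = "'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}"

# First "_digits_" run: this hits _23456_, not the intended field.
first_digits = re.sub(r"_(\d+)_", "_mynewnumber_", line, count=1)

# Lookahead variant: this hits _1371078139195.
anchored = pattern.sub("_mynewnumber", line, count=1)

print(first_digits)
print(anchored)
```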

score 0 · Accepted Answer

import re

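# Group 2 is the number; the lookahead requires exactly one more
# underscore-separated field before the comma.
r = re.compile(r'([^,]*_)(\d+)(?=_[^_,]+,)(_.*)')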

for line in ("'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}",
             "'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}"):
    print(line)
    print(r.sub(r'\1ABCDEFG\3', line))
    print(r.sub(r'\g<1>1234567\3', line))

Result

'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}
'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_ABCDEFG_+14155186442', {'cf:rv': '0'}
'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1234567_+14155186442', {'cf:rv': '0'}

'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}
'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_ABCDEFG_+14155186442', {'cf:rv': '0'}
'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1234567_+14155186442', {'cf:rv': '0'}

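\g<1> means "group 1". See the documentation:

In addition to character escapes and backreferences as described above, \g<name> will use the substring matched by the group named name, as defined by the (?P<name>...) syntax. \g<number> uses the corresponding group number; \g<2> is therefore equivalent to \2, but isn't ambiguous in a replacement such as \g<2>0. \20 would be interpreted as a reference to group 20, not a reference to group 2 followed by the literal character '0'. The backreference \g<0> substitutes in the entire substring matched by the RE.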
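
The ambiguity described in the documentation is easy to reproduce. This is a small sketch for Python 3, with a made-up pattern and input chosen only for illustration:

```python
import re

pattern = re.compile(r"(\d+)-(\d+)")  # two groups only

# \g<1>0 is unambiguous: group 1 followed by a literal "0".
result = pattern.sub(r"\g<1>0", "12-34")
print(result)  # 120

# \10 is read as a reference to group 10, which this pattern does not have,
# so the re module rejects the replacement template.
try:
    pattern.sub(r"\10", "12-34")
except re.error as exc:
    print("rejected:", exc)
```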

score 0 · Accepted Answer

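Not as sophisticated as a regex, but relatively simple to code, understand, debug, and change in the future. Apart from the delimiters, it makes no assumptions about the characters that make up a "word".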

def replace_term(line, replacement):
    csep = line.split(',')
    usep = csep[0].split('_')
    return ','.join(['_'.join(usep[:-2] + [replacement] + usep[-1:])] + csep[1:])

lines = ["'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}",
         "'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1371078139195_14155186442', {'cf:rv': '0'}",
         "'AMS_Investigation|txtt.co_1371078139195_BigtittedBlondOtherNight_1371078139195_1371078139195', {'cf:rv': '0'}"]

for line in lines:
    print(replace_term(line, 'XXX'))

Output:

'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_XXX_+14155186442', {'cf:rv': '0'}
'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_XXX_14155186442', {'cf:rv': '0'}
'AMS_Investigation|txtt.co_1371078139195_BigtittedBlondOtherNight_XXX_1371078139195', {'cf:rv': '0'}
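
One edge case worth knowing: if the first comma-separated word contains no underscore at all, the slicing above does not fail but prepends the replacement instead of replacing anything. A hedged sketch of a stricter variant follows; `replace_term_strict` is a hypothetical name, the input line is made up, and raising `ValueError` is an assumption about the desired behavior.

```python
def replace_term_strict(line, replacement):
    csep = line.split(',')
    usep = csep[0].split('_')
    if len(usep) < 2:
        # Assumption: a word without a second-to-last field is an error.
        raise ValueError("no second-to-last field in: %r" % csep[0])
    usep[-2] = replacement
    return ','.join(['_'.join(usep)] + csep[1:])

replaced = replace_term_strict("'a_b_111_222', {'k': '0'}", "XXX")
print(replaced)
```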
