I have a file containing many lines like the following:

 'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}

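I want to replace 1371078139195 (in this case) with another number. The value to replace is always in the first comma-separated word, and it is always the second-to-last underscore-separated value within that word. Here is how I did it; it works, but it feels inelegant and clunky.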

>>> line="'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}"
>>> l1=",".join(line.split(",")[1:])
>>> print l1
 {'cf:rv': '0'}
>>> l2=line.split(",")[0]
>>> print l2
'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1371078139195_+14155186442'
>>> print "_".join(l2.split('_')[:-2])
'AMS_Investigation|txtt.co_BigtittedBlondOtherNight
>>>
>>> print "_".join(l2.split('_')[:-2])+ "_1234567_"+(l2.split('_')[-1])
'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1234567_+14155186442'
>>> print "_".join(l2.split('_')[:-2])+ "_1234567_"+(l2.split('_')[-1]) + "," + l1
'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1234567_+14155186442', {'cf:rv': '0'}
>>>

Is there a simpler way to replace this value, perhaps with a regular expression? I can't imagine this is the best approach.

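I've received a few answers, and I must emphasize that it is the second-to-last underscore-separated value. The following are all valid strings: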

line = "'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}"
line = "'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1371078139195_14155186442', {'cf:rv': '0'}"
line = "'AMS_Investigation|txtt.co_1371078139195_BigtittedBlondOtherNight_1371078139195_1371078139195', {'cf:rv': '0'}"

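In the cases above, the string contains a run of digits that does not come right after the second-to-last underscore. The last part may or may not be all digits (it could be +14155186442 or 14155186442). Sorry for not mentioning this earlier.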

5 Answers

Using a regular expression:

import re

m = re.match("([^,]*_)([+]?[0-9]+)(_.*)", s)
if m:
    before = m.group(1)
    number = m.group(2)
    after = m.group(3)
    s = before + new_number(number) + after

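The meaning is:

  • [^,]*_ = as many characters as you like, but no commas, followed by an underscore
  • [+]?[0-9]+ = digits, optionally preceded by a +
  • _.* = an underscore followed by anything

This works because regular expression matching is "greedy" by default: [^,]* first consumes as much of the first field as it can, underscores included, and then backs off only as far as needed, stopping before the second-to-last value so that the rest of the pattern can match.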
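The snippet above leaves `s` and `new_number` undefined. A minimal, self-contained sketch of the same idea, assuming Python 3; the helper name `replace_number`, passing the replacement as a plain string, and returning non-matching lines unchanged are all assumptions made for illustration:

```python
import re

# Same pattern as in the answer; re.match anchors it at the start of the line.
PATTERN = re.compile(r"([^,]*_)([+]?[0-9]+)(_.*)")

def replace_number(line, replacement):
    """Replace the second-to-last underscore-separated value of the first
    comma-separated field. Hypothetical helper, for illustration only."""
    m = PATTERN.match(line)
    if m is None:
        return line  # assumption: leave lines that do not match unchanged
    before, _old, after = m.groups()
    # line[m.end():] keeps whatever ".*" did not consume, e.g. a trailing newline.
    return before + replacement + after + line[m.end():]

line = "'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}"
print(replace_number(line, "1234567"))
```

Because the pattern requires digits in the middle group, a line whose second-to-last value is not numeric is returned untouched by this sketch.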

If, for example, you need the third-to-last underscore-separated value instead of the second-to-last, the expression can be changed to

m = re.match("([^,]*_)([+]?[0-9]+)(_[^,]*_.*)", s)

which requires at least two underscores after the number and before the comma.
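A quick way to check which value this variant picks, assuming Python 3. The sample line is made up for the illustration and is not from the question:

```python
import re

# Variant from the answer: the number must be followed by at least two
# more underscores before the first comma.
THIRD_FROM_LAST = re.compile(r"([^,]*_)([+]?[0-9]+)(_[^,]*_.*)")

sample = "'prefix_111_222_333', {'cf:rv': '0'}"  # made-up sample line
m = THIRD_FROM_LAST.match(sample)
picked = m.group(2) if m else None               # the value that would be replaced
result = m.group(1) + "NEW" + m.group(3) if m else sample
print(picked)
print(result)
```

Here the greedy prefix backs off past 333 and 222, since neither is followed by two further underscores before the comma, and settles on 111.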

answered 2013-09-16T14:43:39.243

A non-regex solution:

>>> strs = " 'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}"
>>> first, sep, rest = strs.partition(',')
>>> lis = first.rsplit('_', 2)
>>> lis[1] = "1111111"
>>> "_".join(lis) + sep + rest
" 'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1111111_+14155186442', {'cf:rv': '0'}"

As a function:

>>> def solve(strs, rep):
...     first, sep, rest = strs.partition(',')
...     lis = first.rsplit('_', 2)
...     lis[1] = rep
...     return "_".join(lis) + sep + rest
...
>>> solve(" 'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}", "1111")
" 'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1111_+14155186442', {'cf:rv': '0'}"
>>> solve("'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1371078139195_14155186442', {'cf:rv': '0'}", "2222")
"'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_2222_14155186442', {'cf:rv': '0'}"
>>> solve("'AMS_Investigation|txtt.co_1371078139195_BigtittedBlondOtherNight_1371078139195_1371078139195', {'cf:rv': '0'}", "2222")
"'AMS_Investigation|txtt.co_1371078139195_BigtittedBlondOtherNight_2222_1371078139195', {'cf:rv': '0'}"
answered 2013-09-16T14:28:49.247

Something like this?

>>> line = "'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}"
>>> re.subn('_(\d+)_', '_mynewnumber_', line, count=1) 
("'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_mynewnumber_+14155186442', {'cf:rv': '0'}",
1)
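Note that with count=1 this replaces the first `_digits_` run in the line. For the strings in the question's update, where an earlier part such as 23456 is also numeric, that run is not the second-to-last value. One possible adjustment, sketched here assuming Python 3, is a lookahead that allows exactly one more underscore-separated part before the comma; the name `SECOND_TO_LAST` is hypothetical, and the sketch assumes every line contains a comma:

```python
import re

line = "'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}"

# The answer's pattern: hits the first "_digits_" run, here "_23456_".
first_hit, n = re.subn(r'_(\d+)_', '_mynewnumber_', line, count=1)
print(first_hit)

# Hypothetical adjustment: after the digits, allow only "_<last part>" up to the comma.
SECOND_TO_LAST = re.compile(r'_\d+(?=_[^_,]*,)')
adjusted = SECOND_TO_LAST.sub('_mynewnumber', line, count=1)
print(adjusted)
```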
answered 2013-09-16T14:23:28.473

import re

r = re.compile('([^,]*_)(\d+)(?=_[^_,]+,)(_.*)')

for line in ("'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}",
             "'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}"):
    print line
    print r.sub('\\1ABCDEFG\\3',line)
    print r.sub('\g<1>1234567\\3',line)

Result

'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}
'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_ABCDEFG_+14155186442', {'cf:rv': '0'}
'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1234567_+14155186442', {'cf:rv': '0'}

'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}
'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_ABCDEFG_+14155186442', {'cf:rv': '0'}
'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1234567_+14155186442', {'cf:rv': '0'}

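\g<1> means "group 1". See the documentation:

In addition to character escapes and backreferences as described above, \g<name> will use the substring matched by the group named name, as defined by the (?P<name>...) syntax. \g<number> uses the corresponding group number; \g<2> is therefore equivalent to \2, but isn't ambiguous in a replacement such as \g<2>0. \20 would be interpreted as a reference to group 20, not a reference to group 2 followed by the literal character '0'. The backreference \g<0> substitutes in the entire substring matched by the RE.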
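The ambiguity described in the quoted passage can be seen in a small sketch, assuming Python 3. The pattern and input text are made up for the illustration:

```python
import re

pattern = re.compile(r'(\w+)_(\d+)')  # two groups: a name and a number
text = 'item_42'

# \g<1>0 means "group 1, then a literal 0".
unambiguous = pattern.sub(r'\g<1>0', text)
print(unambiguous)

# \10 is read as a reference to group 10, which this pattern does not define,
# so the substitution raises re.error instead of producing "item0".
try:
    pattern.sub(r'\10', text)
except re.error as exc:
    print('error:', exc)
```

This is why the answer writes \g<1>1234567 rather than \11234567 when the replacement text itself starts with a digit.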

answered 2013-09-16T15:25:07.853

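Not as sophisticated as a regular expression, but relatively simple to code, understand, debug, and change in the future. Apart from the delimiters, it makes no assumptions about the characters that make up a "word".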

def replace_term(line, replacement):
    csep = line.split(',')
    usep = csep[0].split('_')
    return ','.join(['_'.join(usep[:-2] + [replacement] + usep[-1:])] + csep[1:])

lines = ["'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}",
         "'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1371078139195_14155186442', {'cf:rv': '0'}",
         "'AMS_Investigation|txtt.co_1371078139195_BigtittedBlondOtherNight_1371078139195_1371078139195', {'cf:rv': '0'}"]

for line in lines:
    print replace_term(line, 'XXX')

Output:

'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_XXX_+14155186442', {'cf:rv': '0'}
'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_XXX_14155186442', {'cf:rv': '0'}
'AMS_Investigation|txtt.co_1371078139195_BigtittedBlondOtherNight_XXX_1371078139195', {'cf:rv': '0'}
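When the first field has fewer than two underscores, `replace_term` above still returns a string rather than failing. A hedged sketch of a stricter variant, assuming Python 3; the name `replace_term_checked` and the choice to raise `ValueError` are illustrative assumptions, not part of the answer:

```python
def replace_term_checked(line, replacement):
    """Replace the second-to-last '_'-separated part of the first ','-separated field."""
    first, sep, rest = line.partition(',')      # split at the first comma only
    parts = first.rsplit('_', 2)                # at most two splits, from the right
    if len(parts) < 3:
        # Assumption: treat a field with fewer than two underscores as malformed.
        raise ValueError("expected at least two underscores before the first comma")
    head, _old, tail = parts
    return '_'.join([head, replacement, tail]) + sep + rest

checked = replace_term_checked(
    "'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1371078139195_14155186442', {'cf:rv': '0'}",
    'XXX')
print(checked)
```

Splitting once from each side also avoids cutting the line at every comma and underscore only to join most of the pieces back together.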
answered 2013-09-16T16:09:54.820