python - Python: Regex outputs 12_34 - I need 1234

Question

So I have input coming in as follows: 12_34 5_6_8_2 4_____3 1234

and the output I need from it is: 1234, 5682, 43, 1234

I'm currently working with r'[0-9]+[0-9_]*'.replace('_',''), which, as far as I can tell, successfully rejects any input that is not a combination of digits and underscores, where an underscore cannot be the first character.

However, replacing the _ with the empty string causes 12_34 to come out as 12 and 34.

Is there a better method than 'replace' for this? Or could I adapt my regex to deal with this problem?

EDIT: While responding to questions in the comments below, I realised it might be better to specify this up here. The broad aim is to take a long input string (small example: "12_34 + 'Iamastring#' I_am_an_Ident") and return:

('NUMBER', 1234), ('PLUS', '+'), ('STRING', 'Iamastring#'), ('IDENT', 'I_am_an_Ident')

I didn't want to go through all that because I've got it all working as specified, except for number. The solution code looks something like:

tokens = ('PLUS', 'MINUS', 'TIMES', 'DIVIDE', 'IDENT', 'STRING', 'NUMBER')
t_PLUS = "+"
t_MINUS = '-'

and so on, down to:

t_NUMBER = ###code goes here

I'm not sure how to put multi-line processes into t_NUMBER.

score 2 · Accepted Answer

I'm not sure what you mean or why you need a regex, but this might help:

In [1]: ins = '12_34 5_6_8_2 4_____3 1234'

In [2]: for x in ins.split(): print(x.replace('_', ''))
1234
5682
43
1234
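
If the validation from the question matters as well, the split-and-replace idea above can be combined with the asker's own pattern. This is only a minimal sketch: the function name parse_numbers and the choice to raise ValueError on bad input are assumptions, not something the question specifies.

```python
import re

# The pattern from the question, unchanged; fullmatch() requires the
# whole piece to match, so a leading underscore is rejected.
NUMBER = re.compile(r'[0-9]+[0-9_]*')

def parse_numbers(text):
    """Split on whitespace, validate each piece, then strip underscores."""
    numbers = []
    for piece in text.split():
        if NUMBER.fullmatch(piece) is None:
            raise ValueError('not a number: %r' % piece)
        # The replace happens on the matched text, not on the pattern.
        numbers.append(int(piece.replace('_', '')))
    return numbers

print(parse_numbers('12_34 5_6_8_2 4_____3 1234'))
```

Note that re.fullmatch needs Python 3.4 or later; on older versions the pattern could be anchored with '$' instead.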

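Edit in response to the edited question:

I'm still not quite sure what you are doing with the tokens there, but I would do something like this (at least it makes sense to me):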

input_str = "12_34 + 'Iamastring#' I_am_an_Ident" 
tokens = ('NUMBER', 'SIGN', 'STRING', 'IDENT')
data = dict(zip(tokens, input_str.split()))

This gives you

{'IDENT': 'I_am_an_Ident',
 'NUMBER': '12_34',
 'SIGN': '+',
 'STRING': "'Iamastring#'"}

Then you can do

data['NUMBER'] = int(data['NUMBER'].replace('_', ''))

and anything else you like.

P.S. Sorry if this doesn't help, but I really don't see the point of having tokens = ('PLUS', 'MINUS', 'TIMES', 'DIVIDE', 'IDENT', 'STRING', 'NUMBER') and so on.
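
For the token list described in the edited question, a standard-library-only tokenizer might look roughly like the sketch below. The token names come from the question, and the NUMBER pattern is the asker's own; the IDENT, STRING and SKIP patterns and the tokenize helper are assumptions made for illustration.

```python
import re

# Token names follow the question; all patterns except NUMBER are assumed.
TOKEN_SPEC = [
    ('NUMBER', r'[0-9]+[0-9_]*'),
    ('IDENT',  r'[A-Za-z][A-Za-z0-9_]*'),
    ('STRING', r"'[^']*'"),
    ('PLUS',   r'\+'),
    ('SKIP',   r'\s+'),
]
MASTER = re.compile('|'.join('(?P<%s>%s)' % pair for pair in TOKEN_SPEC))

def tokenize(text):
    """Return (kind, value) pairs; unmatched characters are silently skipped."""
    tokens = []
    for match in MASTER.finditer(text):
        kind, value = match.lastgroup, match.group()
        if kind == 'SKIP':
            continue
        if kind == 'NUMBER':
            # Clean the matched text, then convert it.
            value = int(value.replace('_', ''))
        elif kind == 'STRING':
            value = value[1:-1]  # drop the surrounding quotes
        tokens.append((kind, value))
    return tokens

print(tokenize("12_34 + 'Iamastring#' I_am_an_Ident"))
```

Unlike dict(zip(...)), this does not depend on the tokens arriving in a fixed order or being separated by spaces.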

score 0 · Accepted Answer

You seem to be doing something like:

>>> import re
>>> data = '12_34 5_6_8_2 4_____3 1234'
>>> pattern = '[0-9]+[0-9_]*'
>>> re.findall(pattern, data)
['12_34', '5_6_8_2', '4_____3', '1234']
>>> re.findall(pattern.replace('_', ''), data)
['12', '34', '5', '6', '8', '2', '4', '3', '1234']

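The problem is that pattern.replace is not a signal to re to remove the _s from the matches; it changes your regex to '[0-9]+[0-9]*'. What you want is to do the replace on the results rather than on the pattern, for example: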

>>> [match.replace('_', '') for match in re.findall(pattern, data)]
['1234', '5682', '43', '1234']

Also note that your regex can be simplified slightly; since this is homework, I'll leave out the details.
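
For the t_NUMBER rule in the edited question, the same "match first, clean the result afterwards" idea might be written as a function rule. This sketch assumes the tokens and t_ names come from PLY (the question does not say so), where a rule can be a function whose docstring is the regex. The run_rule helper is a hypothetical stand-in for the lexer, there only so the rule can be exercised without PLY installed.

```python
import re
from types import SimpleNamespace

def t_NUMBER(t):
    r'[0-9]+[0-9_]*'
    # In PLY the docstring above is the rule's regex and t.value holds
    # the matched text, so the cleanup happens on the result.
    t.value = int(t.value.replace('_', ''))
    return t

def run_rule(rule, text):
    """Hypothetical stand-in for a lexer: apply one rule at the start of text."""
    match = re.match(rule.__doc__, text)
    if match is None:
        return None
    return rule(SimpleNamespace(type='NUMBER', value=match.group()))

print(run_rule(t_NUMBER, '12_34').value)
```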

score 0 · Accepted Answer

Well, if you really have to use re and only re, you could do this:

import re

def replacement(match):
    separator_dict = {
        '_': '',
        ' ': ',',
    }
    for sep, repl in separator_dict.items():
        if all(char == sep for char in match.group(2)):
            return match.group(1) + repl + match.group(3)
    # Mixed separators such as '_ ': leave the matched text unchanged.
    return match.group(0)

def rec_sub(s):
    """
    Recursive so it works with any number of numbers separated by underscores.
    """
    new_s = re.sub(r'(\d+)([_ ]+)(\d+)', replacement, s)
    if new_s == s:
        return new_s
    else:
        return rec_sub(new_s)

But this epitomizes the concept of overkill.
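
A non-recursive variant that still uses only re is possible with lookarounds, which match a separator without consuming the digits on either side. This is a sketch of a different technique, not the approach above. The name strip_and_join is an assumption, and it joins with ', ' (as in the question's desired output) rather than ','.

```python
import re

def strip_and_join(s):
    # Remove runs of underscores that sit between two digits.
    no_underscores = re.sub(r'(?<=\d)_+(?=\d)', '', s)
    # Turn whitespace that sits between two digits into ', '.
    return re.sub(r'(?<=\d)\s+(?=\d)', ', ', no_underscores)

print(strip_and_join('12_34 5_6_8_2 4_____3 1234'))
```

Because the neighbouring digits are never consumed, overlapping cases such as 5_6_8_2 are handled in one pass, and underscores inside identifiers are left alone.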

score 0 · Accepted Answer

>>> a = '12_34 5_6_8_2 4___3 1234'
>>> a.replace('_', '').replace(' ', ', ')
'1234, 5682, 43, 1234'

answered 2012-06-03T10:48:54.170

score 0 · Accepted Answer

The wording of your question is a bit unclear. If you don't care about input validation, the following should work:

import re
input = '12_34 5_6_8_2 4_____3 1234'  # note: this name shadows the built-in input()
re.sub(r'\s+', ', ', input.replace('_', ''))

If you need to actually strip out all characters that are not digits or whitespace and add commas between the numbers, then:

re.sub(r'\s+', ', ', re.sub(r'[^\d\s]', '', input))

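...should do the job. Of course, it would probably be more efficient to write a function that only needs to walk the string once rather than using multiple re.sub() calls.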
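
A single-pass function along those lines could be sketched as follows. The name normalize is an assumption, and the behaviour differs slightly from the re.sub() version: only ASCII digits are kept, and leading or trailing whitespace does not produce a stray ', '.

```python
def normalize(text):
    """Walk the string once: keep digits, join numbers with ', '."""
    out = []
    pending_sep = False  # True once whitespace has been seen since the last digit
    for ch in text:
        if '0' <= ch <= '9':
            # Emit the separator only between two numbers.
            if pending_sep and out:
                out.append(', ')
            pending_sep = False
            out.append(ch)
        elif ch.isspace():
            pending_sep = True
        # Any other character (such as '_') is dropped.
    return ''.join(out)

print(normalize('12_34 5_6_8_2 4_____3 1234'))
```

Whether this is actually faster than two re.sub() calls would need measuring; for short inputs the difference is unlikely to matter.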
