
I'm looking for a way to extract the lowercase values from strings that contain uppercase letters and possibly lowercase letters.

Here is an example:

sequences = ['CABCABCABdefgdefgdefgCABCAB','FEGFEGFEGwowhelloFEGFEGonemoreFEG','NONEARELOWERCASE'] #sequences with uppercase and potentially lowercase letters

And this is what I want to output:

upper_output = ['CABCABCABCABCAB','FEGFEGFEGFEGFEGFEG','NONEARELOWERCASE'] #the upper case letters joined together
lower_output = [['defgdefgdefg'],['wowhello','onemore'],[]] #the lower case letters in lists within lists
lower_indx = [[9],[9,23],[]] #where the lower case values occur in the original sequence

So I want lower_output to be a list of sublists. Each sublist would contain all of the lowercase strings from the corresponding sequence.

I was thinking about using regular expressions...

import re

lower_indx = []

for seq in sequences:
    lower_indx.append(re.findall("[a-z]", seq).start())

print lower_indx

For the lowercase list, I tried:

lower_output = []

for seq in sequences:
    temp = ''
    temp = re.findall("[a-z]", seq)
    lower_output.append(temp)

print lower_output

But the values aren't grouped into separate strings (I would still need to join them):

[['d', 'e', 'f', 'g', 'd', 'e', 'f', 'g', 'd', 'e', 'f', 'g'], ['w', 'o', 'w', 'h', 'e', 'l', 'l', 'o', 'o', 'n', 'e', 'm', 'o', 'r', 'e'], []]
2 Answers

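It sounds like (and I may be misunderstanding your question) you just need to capture runs of lowercase letters rather than each individual lowercase letter. That's easy: just add the + quantifier to your regular expression.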

for seq in sequences:
    lower_output.append(re.findall("[a-z]+", seq)) # add substrings

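The + quantifier says you want "at least one, and as many as can be found in a row" of the preceding expression (in this case '[a-z]'). So each full run of lowercase letters is captured as a single match, which makes the runs appear in the output list the way you want.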
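To see the difference the quantifier makes, here is a minimal sketch on one of the sequences from the question (the names `single` and `runs` are only for illustration):

```python
import re

seq = 'FEGFEGFEGwowhelloFEGFEGonemoreFEG'

# Without a quantifier, every lowercase letter is its own match.
single = re.findall("[a-z]", seq)

# With +, each consecutive run of lowercase letters is one match.
runs = re.findall("[a-z]+", seq)

print(single[:3])  # ['w', 'o', 'w']
print(runs)        # ['wowhello', 'onemore']
```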

If you want to keep your list structure and also get the indices, it gets a little uglier, but it's still very simple:

lower_output = []
lower_indx = []

for seq in sequences:
    # finditer returns a one-shot iterator, so turn it into a list
    # before looping over the matches twice.
    matches = list(re.finditer("[a-z]+", seq)) # List of Match objects.
    lower_output.append([match.group(0) for match in matches]) # add substrings
    lower_indx.append([match.start(0) for match in matches]) # add indices

print(lower_output)
# [['defgdefgdefg'], ['wowhello', 'onemore'], []]

print(lower_indx)
# [[9], [9, 23], []]
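The loop above doesn't build upper_output. One possible way to get it, assuming the uppercase part is simply everything left once the lowercase runs are removed, is to delete those runs with re.sub. This is a sketch rather than the only option:

```python
import re

sequences = ['CABCABCABdefgdefgdefgCABCAB',
             'FEGFEGFEGwowhelloFEGFEGonemoreFEG',
             'NONEARELOWERCASE']

# Remove every run of lowercase letters; what remains is the
# uppercase text joined together.
upper_output = [re.sub("[a-z]+", "", seq) for seq in sequences]

print(upper_output)
# ['CABCABCABCABCAB', 'FEGFEGFEGFEGFEGFEG', 'NONEARELOWERCASE']
```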
answered 2013-04-04T21:05:34.337

Besides regular expressions, you can also use itertools.groupby here:

In [38]: from itertools import groupby

In [39]: sequences = ['CABCABCABdefgdefgdefgCABCAB','FEGFEGFEGwowhelloFEGFEGonemoreFEG','NONEARELOWERCASE'] #sequences with uppercase and potentially lowercase letters

In [40]: lis=[["".join(v) for k,v in groupby(x,key=lambda z:z.islower())] for x in sequences]

In [41]: upper_output=["".join(x[::2]) for x in lis]

In [42]: lower_output=[x[1::2] for x in lis]

In [43]: upper_output
Out[43]: ['CABCABCABCABCAB', 'FEGFEGFEGFEGFEGFEG', 'NONEARELOWERCASE']

In [44]: lower_output
Out[44]: [['defgdefgdefg'], ['wowhello', 'onemore'], []]

In [45]: lower_indx=[[sequences[i].index(y) for y in x] for i,x in enumerate(lower_output)]

In [46]: lower_indx
Out[46]: [[9], [9, 23], []]
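Two caveats with the session above, as far as I can tell: the x[::2] / x[1::2] slicing assumes every sequence starts with an uppercase run, and str.index returns the first occurrence of a substring, so a lowercase run that appears twice would get the same index both times. Here is a sketch that avoids both by tracking a running offset while grouping (the helper name `split_case` is made up for this example):

```python
from itertools import groupby

def split_case(seq):
    """Return (uppercase text, lowercase runs, start index of each run)."""
    upper, lower, starts = [], [], []
    pos = 0  # offset of the current group within seq
    for is_lower, group in groupby(seq, key=str.islower):
        chunk = "".join(group)
        if is_lower:
            lower.append(chunk)
            starts.append(pos)
        else:
            upper.append(chunk)
        pos += len(chunk)
    return "".join(upper), lower, starts

# A sequence that starts with lowercase and repeats the same run.
print(split_case('abXXabYY'))
# ('XXYY', ['ab', 'ab'], [0, 4])
```

Calling it on each item of sequences and collecting the three parts gives the same upper_output, lower_output and lower_indx as in the question.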
answered 2013-04-04T21:04:15.690