python - Python regex challenge with indentation

Question

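I'm trying to solve a problem that I know I could solve by iterating through the string, but since this is Python I'm sure there is a regular expression that would solve it more elegantly... resorting to an iterative process feels like giving up!

Basically, I have a list of properties in a single cell, and I need to work out which properties are sub-properties and which are sub-sub-properties, and match each one to the property it sits under. For example: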

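ID=11669 Antam Nickel Laterite/Ferronickel Operation
     ID=19807 Gebe Nickel Laterite Mine
     ID=19808 Gee Island Nickel Laterite Mine
     ID=18923 Mornopo Nickel Laterite Mine
     ID=29411 Pomalaa Ferronickel Smelter
     ID=19806 Pomalaa Nickel Laterite Mine
          ID=29412 Maniang Nickel Laterite Project
     ID=11665 Southeast Sulawesi Nickel Laterite Project
          ID=27877 Bahubulu Nickel Laterite Deposit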

should produce:

MasterProp,    SubProp
11669,          19807
11669,          19808
11669,          18923
11669,          29411
11669,          19806
19806,          29412
11669,          11665
11665,          27877

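Getting 11669 and the second level is easy: just grab the first ID I find and pair it with all the rest. Getting the "third level" is much harder.

I tried the following: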

tags = re.compile('ID=(\d+).+(\&nbsp\;){8}')                        
for tag, space in tags.findall(str(cell)): 
    print tag

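But this gives me the first ID before the 8 spaces, not the last ID before the 8 spaces... so in the example above I get 11669 instead of 19806. I suspect I could write an expression that says "find an ID=(\d+) with no other ID=(\d+) between it and the 8 spaces", but that has proven beyond my (novice) abilities! Any help is welcome...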
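A minimal sketch of the expression described above, using a negative lookahead so the match cannot step over a second "ID=". The cell contents here are hypothetical, since the real markup is not shown in the question: it assumes 5 '&nbsp;' entities per indent level and '<br>' between entries.

```python
import re

# Hypothetical cell contents: the real markup is not shown in the question.
# Assumed here: 5 '&nbsp;' per indent level and '<br>' between entries.
NBSP = "&nbsp;"
cell = (
    "ID=11669 Antam<br>"
    + NBSP * 5 + "ID=19807 Gebe<br>"
    + NBSP * 5 + "ID=19806 Pomalaa<br>"
    + NBSP * 10 + "ID=29412 Maniang<br>"
    + NBSP * 5 + "ID=11665 Southeast Sulawesi<br>"
    + NBSP * 10 + "ID=27877 Bahubulu"
)

# (?:(?!ID=).)*? consumes characters lazily but refuses to step onto
# another "ID=", so the captured ID is the last one before the run of
# 8 '&nbsp;' entities rather than the first one in the string.
last_id_before_indent = re.compile(r"ID=(\d+)(?:(?!ID=).)*?(?:&nbsp;){8}", re.S)

parents = last_id_before_indent.findall(cell)
print(parents)  # ['19806', '11665']
```

Note the limitation: this only reports the ID directly in front of a deeply indented entry. If two third-level entries follow each other, the first of them is also followed by 8 '&nbsp;' entities and would be reported as if it were a parent, so tracking the indent level explicitly is likely more robust.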

score 1 · Accepted Answer

Once you have the tags from BS (BeautifulSoup), you need to do something like this:

>>> from urlparse import urlparse, parse_qs
>>> myurl = 'ShowProp.asp?LL=PS&ID=19807'
>>> parse_qs(urlparse(myurl).query)
{'LL': ['PS'], 'ID': ['19807']}
>>> parse_qs(urlparse(myurl).query)['ID']
['19807']
>>>
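The session above uses the Python 2 `urlparse` module. On Python 3 the same two functions live in `urllib.parse`; the following is a small sketch of the same lookup, where `prop_id_from_href` is a hypothetical helper name and not part of the original answer.

```python
from urllib.parse import urlparse, parse_qs


def prop_id_from_href(href):
    """Return the ID query parameter of a link as an int (hypothetical helper)."""
    query = parse_qs(urlparse(href).query)
    return int(query["ID"][0])  # parse_qs maps each key to a list of values


print(prop_id_from_href("ShowProp.asp?LL=PS&ID=19807"))  # 19807
```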

score 0 · Accepted Answer

I think sample code with the HTML would have made more sense: actual data, rather than hand-waving.

import BeautifulSoup  # BeautifulSoup 3 (Python 2)

bs = BeautifulSoup.BeautifulSoup(html)  # html: the HTML source of the cell

parent_stack = [None]  # parent_stack[k] is the latest id seen at depth k
res = []
for span in bs.findAll('span', {'style':'white-space:nowrap;display:inline-block'}):
    # 5 '&nbsp;' entities per indent level; // keeps this an integer
    indent = 1 + span.previousSibling.count('&nbsp;') // 5
    id = int(span.find('input')['value'])
    name = span.find('a').text.strip()

    # warning! this assumes that indent-level only ever
    #   increases by 1 level at a time!
    parent_stack = parent_stack[:indent] + [id]
    res.append(parent_stack[-2:])

The result is:

[[None, 11669],
 [11669, 19807],
 [11669, 19808],
 [11669, 18923],
 [11669, 29411],
 [11669, 19806],
 [19806, 29412],
 [11669, 11665],
 [11665, 27877],
 [11665, 50713],
 [11665, 27879],
 [11665, 27878],
 [11669, 11394]]
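The same stack idea can be tried on the plain-text listing from the question, without BeautifulSoup. This is only a sketch for Python 3: it assumes one indent level equals 5 leading spaces (as in the listing shown in the question), and `parent_child_pairs`, `INDENT_WIDTH` and `line_re` are illustrative names, not part of either answer.

```python
import re

# The listing from the question, transcribed with 5 spaces per indent level
# (names shortened; only the IDs and the indentation matter here).
listing = """\
ID=11669 Antam
     ID=19807 Gebe
     ID=19808 Gee Island
     ID=18923 Mornopo
     ID=29411 Pomalaa smelter
     ID=19806 Pomalaa mine
          ID=29412 Maniang
     ID=11665 Southeast Sulawesi
          ID=27877 Bahubulu
"""

INDENT_WIDTH = 5  # assumption: one level = 5 leading spaces
line_re = re.compile(r"^( *)ID=(\d+)", re.M)


def parent_child_pairs(text):
    """Return (parent_id, child_id) pairs; the root entry has no pair."""
    stack = []  # stack[d] is the latest ID seen at depth d
    pairs = []
    for spaces, digits in line_re.findall(text):
        depth = len(spaces) // INDENT_WIDTH
        prop_id = int(digits)
        # Drop anything at this depth or deeper, then push the new ID.
        stack = stack[:depth] + [prop_id]
        if depth > 0:
            pairs.append((stack[-2], prop_id))
    return pairs


for parent, child in parent_child_pairs(listing):
    print(parent, child)
```

For the listing above this prints the eight MasterProp/SubProp pairs from the question. Like the BeautifulSoup version, it assumes the indent never increases by more than one level at a time.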
