python - Bulk replace with regular expressions in Python

Question

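For a Django application, I need to turn every occurrence of a pattern in a string into a link, provided my database has a resource associated with the match.

Right now, the process is:
- I use re.sub to process a very long string of text
- When re.sub finds a pattern match, it runs a function that checks whether the match corresponds to an entry in the database
- If there is a matching entry, it wraps the matched text in a link.

The problem is that this sometimes means hundreds of hits on the database. What I'd like to be able to do is a single bulk query to the database.

So: can you do a bulk find and replace using regular expressions in Python?

For reference, here is the code (for the curious, the patterns I'm looking up are legal citations):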

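import re

def add_linked_citations(text):
    # The stray closing parenthesis at the end of the pattern has been removed; it made the regex invalid
    linked_text = re.sub(r'(?P<volume>[0-9]+[a-zA-Z]{0,3})\s+(?P<reporter>[A-Z][a-zA-Z0-9\.\s]{1,49}?)\s+(?P<page>[0-9]+[a-zA-Z]{0,3})', create_citation_link, text)
    return linked_text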

def create_citation_link(match_object):
    volume = None
    reporter = None
    page = None
    if match_object.group("volume") not in [None, '']:
        volume = match_object.group("volume")
    if match_object.group("reporter") not in [None, '']:
        reporter = match_object.group("reporter")
    if match_object.group("page") not in [None, '']:
        page = match_object.group("page")

    if volume and reporter and page: # These should all be here...
        # !!! Here's where I keep hitting the database
        citations = Citation.objects.filter(volume=volume, reporter=reporter, page=page)
        if citations.exists():
            citation = citations[0] 
            document = citation.document
            url = document.url()
            return '<a href="%s">%s %s %s</a>' % (url, volume, reporter, page)
        else:
            return '%s %s %s' % (volume, reporter, page)
    # Fall back to the matched text instead of implicitly returning None
    return match_object.group()

score 1 · Accepted Answer

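Sorry if this is obvious and wrong (that no one has suggested it in 4 hours is worrying!), but why not search for all the matches, do a bulk query for everything (easy once you have all the matches), and then call sub with the dictionary of results (so the function pulls its data from the dictionary)?

You have to run the regex twice, but it seems the database access is the expensive part anyway.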
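
A minimal sketch of that two-pass idea, under some loud assumptions: fetch_urls is a hypothetical helper standing in for the single bulk database query, and the in-memory dictionary and the URL inside it are made up purely for illustration. The pattern is the one from the question, with its unbalanced closing parenthesis removed.

```python
import re

CITATION_RE = re.compile(
    r'(?P<volume>[0-9]+[a-zA-Z]{0,3})\s+'
    r'(?P<reporter>[A-Z][a-zA-Z0-9\.\s]{1,49}?)\s+'
    r'(?P<page>[0-9]+[a-zA-Z]{0,3})'
)

def citation_key(match):
    """Return the (volume, reporter, page) tuple used as the dictionary key."""
    return (match.group('volume'), match.group('reporter'), match.group('page'))

def fetch_urls(keys):
    """Hypothetical stand-in for ONE bulk query; returns {key: url} for known keys."""
    known = {('410', 'U.S.', '113'): '/documents/1/'}  # illustrative data only
    return {key: known[key] for key in keys if key in known}

def add_linked_citations(text):
    # Pass 1: collect every distinct citation in the text.
    keys = {citation_key(m) for m in CITATION_RE.finditer(text)}
    # One lookup for all of them instead of one per match.
    urls = fetch_urls(keys)

    # Pass 2: substitute, reading from the dictionary rather than the database.
    def link(match):
        url = urls.get(citation_key(match))
        if url is None:
            return match.group()  # unknown citation: leave the text unchanged
        return '<a href="%s">%s</a>' % (url, match.group())

    return CITATION_RE.sub(link, text)
```

The regex runs twice over the text, but the lookup happens once no matter how many citations there are.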

score 1 · Accepted Answer

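You can do this in a single regular expression pass by using finditer, which returns match objects.

Match objects have:

a method that returns a dictionary of the named groups, groupdict()
the start and end positions of the match in the original text, span()
the original matched text, group()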
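
To make those three accessors concrete, here is a small illustration. The simplified pattern and the sample sentence are assumptions chosen for the example, not the pattern from the question.

```python
import re

# Simplified, illustrative pattern with the same three named groups.
pattern = re.compile(
    r'(?P<volume>[0-9]+)\s+(?P<reporter>[A-Z][A-Za-z.]*)\s+(?P<page>[0-9]+)'
)
text = 'See 410 U.S. 113.'

m = next(pattern.finditer(text))  # first match object in the text

groups = m.groupdict()  # named groups as a dictionary
span = m.span()         # (start, end) offsets into the original text
matched = m.group()     # the full matched text

print(groups)   # {'volume': '410', 'reporter': 'U.S.', 'page': '113'}
print(span)     # (4, 16)
print(matched)  # 410 U.S. 113
```

Slicing the original text with the span gives back the matched text, which is what makes it possible to rebuild the string around each match.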

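So I suggest that you:

Make a list of all the matches in the text using finditer
Make a list of all the unique (volume, reporter, page) triplets in the matches
Look up those triplets
Associate each match object with the result of its triplet lookup, if found
Process the original text, splitting it by the match spans and interpolating the lookup results.

I have implemented the database lookup by OR-ing together a list of Q objects, Q(volume=foo1,reporter=bar2,page=baz3)|Q(volume=foo1,reporter=bar2,page=baz3).... There may be a more efficient way.

Here is an untested implementation: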

import re
from collections import namedtuple
from functools import reduce

from django.db.models import Q
# Citation is the model from the question; import it from your own app

Triplet = namedtuple('Triplet',['volume','reporter','page'])

def lookup_references(matches):
  match_to_triplet = {}
  triplet_to_url = {}
  for m in matches:
    group_dict = m.groupdict()
    if any(not(x) for x in group_dict.values()): # Filter out matches we don't want to lookup
      continue
    match_to_triplet[m] = Triplet(**group_dict)
  # Build query
  unique_triplets = set(match_to_triplet.values())
  if not unique_triplets: # Nothing to look up; reduce() would fail on an empty list
    return {}
  # List of Q objects
  q_list = [Q(**trip._asdict()) for trip in unique_triplets]
  # Consolidated Q: OR all the per-triplet conditions into one query
  single_q = reduce(Q.__or__,q_list)
  # Assumes Citation has a 'url' field; in the question the URL comes from citation.document.url()
  for row in Citation.objects.filter(single_q).values('volume','reporter','page','url'):
    url = row.pop('url')
    triplet_to_url[Triplet(**row)] = url
  # Now pair original match objects with URL where found
  lookups = {}
  for match, triplet in match_to_triplet.items():
    if triplet in triplet_to_url:
      lookups[match] = triplet_to_url[triplet]
  return lookups

def interpolate_citation_matches(text,matches,lookups):
  result = []
  prev = 0
  for m in matches:
    m_start, m_end = m.span()
    # Copy the unmatched text between the previous match and this one
    if prev != m_start:
      result.append(text[prev:m_start])
    # Now check match
    if m in lookups:
      result.append('<a href="%s">%s</a>' % (lookups[m],m.group()))
    else:
      result.append(m.group())
    # Remember where this match ended so the next gap starts there
    prev = m_end
  # Copy whatever follows the last match (the whole text if there were no matches)
  result.append(text[prev:])
  return ''.join(result)

def process_citations(text):
  # The stray closing parenthesis at the end of the pattern has been removed; it made the regex invalid
  citation_regex = r'(?P<volume>[0-9]+[a-zA-Z]{0,3})\s+(?P<reporter>[A-Z][a-zA-Z0-9\.\s]{1,49}?)\s+(?P<page>[0-9]+[a-zA-Z]{0,3})'
  matches = list(re.finditer(citation_regex,text))
  lookups = lookup_references(matches)
  new_text = interpolate_citation_matches(text,matches,lookups)
  return new_text
