python - re.sub() problem with UTF-8 content from PGSQL

Question

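I have tried many different things and looked at a lot of SO answers (and material from other sites), but I can't seem to figure this out. There is a lot of conflicting information out there.

I have some content stored in PostgreSQL as UTF8 (SET client_encoding = 'UTF8';). I am pulling that content out of the database and then wrapping any "£" symbol in a span.

Relevant snippet: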

for i in range(0, len(results)):
  content = results[i][2].decode('utf8')
  pattern = re.compile(ur'(\u00A3[0-9]+)(\.[0-9]{1,2})?', re.UNICODE)
  content = re.sub(pattern, '<span class="price">\0\1</span>', content)
  app.logger.debug(test)

Sample output:

DEBUG in **** [****.py:143]:
Prices from only <span class="price"></span> for a framed picture.

Edit: And I know the regexp is probably terrible.

score 1 · Accepted Answer

Try using capture/named groups in the regex. First check that the regex works against a typical title, then wrap only what you need or remove what you need:

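import re

pattern = re.compile(r'[0-9]*(?P<todelete>\W)?')  # compile once, outside the loop
for row in results:
  text = row[2]
  # The named group is optional, so group() may return None
  todelete = pattern.match(text).group('todelete')
  content = text.replace(todelete, "") if todelete else text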

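By the way, for pulling data out of PostgreSQL I recommend psycopg2, which returns one or more results correctly as a simple list and respects the encoding in general. That may save you a lot of trouble.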
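
For the wrapping case in the question, the empty span in the sample output is most likely because '\0\1' in a normal (non-raw) string literal is two control characters (NUL and \x01), not backreferences. A minimal sketch of one way around that, assuming Python 3 and assuming the column arrives as UTF-8 bytes; the helper name wrap_prices and the sample text are made up for illustration:

```python
import re

# The pattern from the question, with the optional pence part made non-capturing.
PRICE_RE = re.compile(r'\u00A3[0-9]+(?:\.[0-9]{1,2})?')


def wrap_prices(content):
    """Wrap every price such as £25 or £25.99 in a span (hypothetical helper)."""
    # \g<0> refers to the whole match. The raw-string prefix keeps the
    # backslash intact so that re.sub sees it as a backreference.
    return PRICE_RE.sub(r'<span class="price">\g<0></span>', content)


# Assumed input: bytes as a driver might return them; decode before matching.
raw = 'Prices from only £25.99 for a framed picture.'.encode('utf8')
print(wrap_prices(raw.decode('utf8')))
```

If the driver already hands back str values, the decode step is unnecessary and wrap_prices can be called on the column value directly.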
