sql - PostgreSQL - Replacing HTML entities

Question

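I have just started on the task of removing HTML entities from our database. We do a lot of scraping, and some of our scrapers did not decode entities at input time :(

So I started writing a bunch of queries that look like this: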

UPDATE nodes SET name=regexp_replace(name, '&#xe0;', 'à', 'g') WHERE name LIKE '%#xe0%';
UPDATE nodes SET name=regexp_replace(name, '&#xe1;', 'á', 'g') WHERE name LIKE '%#xe1%';
UPDATE nodes SET name=regexp_replace(name, '&#xe2;', 'â', 'g') WHERE name LIKE '%#xe2%';

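This is obviously a very naive approach. I have been trying to figure out whether something clever can be done with a decode function; perhaps grabbing the HTML entity with a regex like /&#x(..);/, then passing only the captured part to an ASCII decoder and rebuilding the string... or something along those lines.

Should I just carry on with the queries? There are probably only about 40 of them.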
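The regex idea described above can be sketched outside the database. This is only an illustration, not something from the original post: the function name `decode_hex_entities` is hypothetical, and it handles only hexadecimal numeric references of the form `&#x..;`, not named entities such as `&amp;`.

```python
import re

# Matches hexadecimal numeric character references such as &#xe0;
HEX_ENTITY = re.compile(r"&#x([0-9a-fA-F]+);")

def decode_hex_entities(text):
    """Replace each &#x..; reference with the character it encodes.

    Hypothetical helper for illustration; named entities are left untouched.
    """
    # The captured hex digits are converted to a code point, then to a character.
    return HEX_ENTITY.sub(lambda m: chr(int(m.group(1), 16)), text)

print(decode_hex_entities("caf&#xe9; &#xe0; la carte"))
```

A single pass like this covers every hexadecimal reference at once, which is the main advantage over one UPDATE per entity.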

score 7 · Accepted Answer

Write a function in pl/perlu and use this module: https://metacpan.org/pod/HTML::Entities

Of course, you need Perl installed and pl/perl available.

1) First, create the procedural language pl/perlu:

CREATE EXTENSION plperlu;

2) Then create a function like this:

CREATE FUNCTION decode_html_entities(text) RETURNS TEXT AS $$
    use HTML::Entities;
    return decode_entities($_[0]);
$$ LANGUAGE plperlu;

3) Then you can use it like this:

select decode_html_entities('aaabbb&amp;.... asasdasdasd &hellip;');
   decode_html_entities    
---------------------------
 aaabbb&.... asasdasdasd …
(1 row)

score 5 · Accepted Answer

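You can use xpath (numeric character references and the five predefined entities such as &amp; are encoded the same way in HTML and XML):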

select 
  'AT&amp;T' as input ,
  (xpath('/z/text()', ('<z>' || 'AT&amp;T' || '</z>')::xml))[1] as output
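One limitation of going through an XML parser is that XML predefines only five named entities (&amp;, &lt;, &gt;, &quot;, &apos;), so HTML-only names such as &hellip; are rejected. The sketch below illustrates this with Python's standard XML parser rather than PostgreSQL's; the helper name `decode_via_xml` is hypothetical, and it assumes the fragment contains no markup of its own.

```python
import xml.etree.ElementTree as ET

def decode_via_xml(fragment):
    """Decode entities by parsing the fragment as XML text content.

    Hypothetical helper for illustration. Returns None when the
    fragment is not well-formed XML (e.g. an HTML-only entity name).
    """
    try:
        # Wrap the fragment in a dummy element, as the xpath query does.
        return ET.fromstring("<z>" + fragment + "</z>").text
    except ET.ParseError:
        return None

print(decode_via_xml("AT&amp;T"))   # predefined XML entity
print(decode_via_xml("&#xe0;"))     # numeric character reference
print(decode_via_xml("&hellip;"))   # HTML-only entity, not known to XML
```

If the data contains HTML-specific entity names, a full HTML entity decoder is the safer choice.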

score 1 · Accepted Answer

This is what I needed to get it working on Ubuntu 18.04 with PG10. Perl, for some reason, did not decode some entities, so I used Python 3.

From the command line:

sudo apt install postgresql-plpython3-10

From your SQL interface:

CREATE LANGUAGE plpython3u;

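CREATE OR REPLACE FUNCTION htmlchars(str TEXT) RETURNS TEXT AS $$
    # html.unescape is used here because HTMLParser().unescape
    # was deprecated and then removed in Python 3.9.
    import html
    # Pass SQL NULL through unchanged.
    if str is None:
        return None
    return html.unescape(str)
$$ LANGUAGE plpython3u;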
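To check what the Python decoder does before wiring it into the database, the same logic can be run as plain Python. This is a minimal sketch; the wrapper name `htmlchars_py` is hypothetical and simply mirrors the body of the function above using the standard library's `html.unescape`.

```python
import html

def htmlchars_py(value):
    """Plain-Python mirror of the PL/Python function body (hypothetical name)."""
    # None stands in for SQL NULL and is passed through unchanged.
    if value is None:
        return None
    # html.unescape handles named entities and numeric character references.
    return html.unescape(value)

print(htmlchars_py("aaabbb&amp;.... asasdasdasd &hellip;"))
print(htmlchars_py("&#xe0; &#xe1; &#xe2;"))
```

Inside PostgreSQL, the function could then be applied in one statement, for example UPDATE nodes SET name = htmlchars(name) WHERE name LIKE '%&%;%'; the WHERE clause is only an assumption meant to skip rows that contain no entity at all.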
