html - Bash script to convert from HTML entities to characters

Question

I'm looking for a way to turn this:

hello &lt; world

to this:

hello < world

I could use sed, but how can this be accomplished without using cryptic regex?

score 100 · Accepted Answer

Try recode (archived page; GitHub mirror; Debian page):

$ echo '&lt;' |recode html..ascii
<

Install on Linux and similar Unix-y systems:

$ sudo apt-get install recode

Install on Mac OS using:

$ brew install recode

score 61 · Accepted Answer

With perl:

cat foo.html | perl -MHTML::Entities -pe 'decode_entities($_);'

With php from the command line:

cat foo.html | php -r 'while(($line=fgets(STDIN)) !== FALSE) echo html_entity_decode($line, ENT_QUOTES|ENT_HTML401);'

score 22 · Accepted Answer

An alternative is to pipe through a web browser, for example:

echo '&#33;' | w3m -dump -T text/html

This worked great for me in cygwin, where downloading and installing distributions is difficult.

This answer was found here

score 19 · Accepted Answer

With xmlstarlet:

echo 'hello &lt; world' | xmlstarlet unesc

Answered 2011-05-09T10:47:43.787

score 14 · Accepted Answer

This answer is based on: Short way to escape HTML in Bash? It works for grabbing answers on Stack Exchange (using wget) and converting the HTML to regular ASCII characters:

sed 's/&nbsp;/ /g; s/&lt;/</g; s/&gt;/>/g; s/&quot;/"/g; s/&#39;/'"'"'/g; s/&ldquo;/"/g; s/&rdquo;/"/g; s/&amp;/\&/g'

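Edit 1: April 7, 2017 - Added left double quote and right double quote conversion. This is part of a bash script that web-scrapes SE answers and compares them to local code files: Ask Ubuntu - Code Version Control between local files and Ask Ubuntu answers

Edit June 26, 2017

Using sed took about 3 seconds to convert HTML to ASCII on a 1K-line file from Ask Ubuntu / Stack Exchange. As such I was forced to use Bash built-in search and replace, for a response time of about 1 second.

Here's the function: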

LineOut=""      # Global: holds the converted line
HTMLtoText () {
    LineOut=$1  # Parameter 1 = input line
    # Bash built-in substitutions replace the earlier external sed call,
    # which was noticeably slower on large files.
    LineOut="${LineOut//&nbsp;/ }"
    LineOut="${LineOut//&lt;/<}"
    LineOut="${LineOut//&gt;/>}"
    LineOut="${LineOut//&quot;/'"'}"
    LineOut="${LineOut//&#39;/"'"}"
    LineOut="${LineOut//&ldquo;/'"'}" # TODO: ASCII/ISO for opening quote
    LineOut="${LineOut//&rdquo;/'"'}" # TODO: ASCII/ISO for closing quote
    # Decode &amp; last so that "&amp;lt;" becomes "&lt;" rather than "<".
    # The & is quoted because Bash 5.2+ treats a bare & in the replacement
    # as "the matched text".
    LineOut="${LineOut//&amp;/'&'}"
} # HTMLtoText ()

score 14 · Accepted Answer

A Python 3.4+ version (html.unescape was added in Python 3.4):

cat foo.html | python3 -c 'import html, sys; [print(html.unescape(l), end="") for l in sys.stdin]'

score 1 · Accepted Answer

I like the Perl answer given in https://stackoverflow.com/a/13161719/1506477.

cat foo.html | perl -MHTML::Entities -pe 'decode_entities($_);'

However, it produced an unequal number of lines on plain text files. (And I don't know Perl well enough to debug it.)

I like the Python answer given in https://stackoverflow.com/a/42672936/1506477:

python3 -c 'import html, sys; [print(html.unescape(l), end="") for l in sys.stdin]'

But it creates a list, [ ... for l in sys.stdin], in memory, which is prohibitive for large files.
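For comparison, a plain loop avoids building that list. This is only a minimal sketch: the function name unescape_stream is made up for illustration, and the in-memory demo stream stands in for real input.

```python
import html
import io
import sys


def unescape_stream(lines, out):
    """Write each input line to `out` with HTML entities decoded."""
    for line in lines:  # only one line is held in memory at a time
        out.write(html.unescape(line))


# Small demonstration with an in-memory stream; in a pipeline one would
# call unescape_stream(sys.stdin, sys.stdout) instead.
demo_in = io.StringIO("hello &lt; world\nfish &amp; chips\n")
unescape_stream(demo_in, sys.stdout)
```

Because each line is written as soon as it is decoded, memory use stays roughly constant regardless of file size.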

Here is another easy Python way that does not buffer in memory: using awkg.

$ echo 'hello &lt; &#x3a; &quot; world' | \
   awkg -b 'from html import unescape' 'print(unescape(R0))'
hello < : " world

awkg is a Python-based, awk-like line processor. You can install it using pip (https://pypi.org/project/awkg/):

pip install awkg

-b is the equivalent of awk's BEGIN{} block, which runs once at the start.
Here we just did from html import unescape.

Each line record is in the R0 variable, for which we did print(unescape(R0)).

Disclaimer:
I am the maintainer of awkg.

score 1 · Accepted Answer

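Supporting the unescaping of all HTML entities with sed substitutions alone would require a list of commands too long to be practical, because every Unicode code point has at least two corresponding HTML entities (a decimal and a hexadecimal numeric reference).

But it can be done using only sed, grep, the Bourne shell and basic UNIX utilities (the GNU coreutils or equivalent):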

#!/bin/sh

htmlEscDec2Hex() {
    file=$1
    [ ! -r "$file" ] && file=$(mktemp) && cat >"$file"

    printf -- \
        "$(sed 's/\\/\\\\/g;s/%/%%/g;s/&#[0-9]\{1,10\};/\&#x%x;/g' "$file")\n" \
        $(grep -o '&#[0-9]\{1,10\};' "$file" | tr -d '&#;')

    [ x"$1" != x"$file" ] && rm -f -- "$file"
}

htmlHexUnescape() {
    printf -- "$(
        sed 's/\\/\\\\/g;s/%/%%/g
            ;s/&#x\([0-9a-fA-F]\{1,8\}\);/\&#x0000000\1;/g
            ;s/&#x0*\([0-9a-fA-F]\{4\}\);/\\u\1/g
            ;s/&#x0*\([0-9a-fA-F]\{8\}\);/\\U\1/g' )\n"
}

htmlEscDec2Hex "$1" | htmlHexUnescape \
    | sed -f named_entities.sed

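Note, however, that a printf implementation supporting the \uHHHH and \UHHHHHHHH sequences is required, such as the GNU utility's. To test, check for example that printf "\u00A7\n" prints §. To call the utility instead of the shell built-in, replace the occurrences of printf with env printf.

This script uses an additional file, named_entities.sed, in order to support the named entities. It can be generated from the specification using the following HTML page: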

<!DOCTYPE html>
<head><meta charset="utf-8" /></head>
<body>
<p id="sed-script"></p>
<script type="text/javascript">
  const referenceURL = 'https://html.spec.whatwg.org/entities.json';

  function writeln(element, text) {
    element.appendChild( document.createTextNode(text) );
    element.appendChild( document.createElement("br") );
  }

  (async function(container) {
    const json = await (await fetch(referenceURL)).json();
    container.innerHTML = "";
    writeln(container, "#!/usr/bin/sed -f");
    const addLast = [];
    for (const name in json) {
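      const characters = json[name].characters
        .replace(/[\\\/&]/g, "\\$&")  // escape \, / and & (special in a sed replacement)
        .replace(/\n/g, "\\n");       // keep each sed command on a single line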
      const command = "s/" + name + "/" + characters + "/g";
      if ( name.endsWith(";") ) {
        writeln(container, command);
      } else {
        addLast.push(command);
      }
    }
    for (const command of addLast) { writeln(container, command); }
  })( document.getElementById("sed-script") );
</script>
</body></html>

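Simply open it in a modern browser and save the resulting page as text, under the name named_entities.sed. This sed script can also be used on its own if only named entities are required; in that case it is convenient to give it executable permission so that it can be called directly.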
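The same generation step could presumably be done without a browser. The sketch below is an illustration only: the function names sed_escape and sed_commands are hypothetical, the sample table is a hand-written subset rather than the real entities.json, and it assumes each entry carries a "characters" field as the page above does. It also escapes & and newlines, which are special in a sed replacement.

```python
def sed_escape(replacement):
    """Escape characters that are special in the replacement part of s///."""
    out = replacement.replace("\\", "\\\\")  # backslash first
    out = out.replace("/", "\\/")            # the s/// delimiter
    out = out.replace("&", "\\&")            # & means "matched text" in sed
    return out.replace("\n", "\\n")          # keep one command per line (GNU sed)


def sed_commands(entities):
    """Yield sed commands, semicolon-terminated names before the others."""
    with_semicolon = [n for n in entities if n.endswith(";")]
    without_semicolon = [n for n in entities if not n.endswith(";")]
    for name in with_semicolon + without_semicolon:
        yield "s/{}/{}/g".format(name, sed_escape(entities[name]["characters"]))


# Hand-written sample in the assumed shape of entities.json (not the real file).
sample = {
    "&amp": {"characters": "&"},
    "&amp;": {"characters": "&"},
    "&sol;": {"characters": "/"},
}
for command in sed_commands(sample):
    print(command)
```

With the real file, the dictionary would come from json.load on a downloaded copy of entities.json, and the printed lines would be redirected into named_entities.sed.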

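Now the shell script above can be used as ./html_unescape.sh foo.html, or inside a pipeline reading from standard input.

For example, if for some reason the data needs to be processed in chunks (which may be the case if printf is not a shell built-in and the data to process is large), it can be used as: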

nLines=20
seq 1 $nLines $(grep -c $ "$inputFile") | while read n
    do sed -n "$n,$((n+nLines-1))p" "$inputFile" | ./html_unescape.sh
done

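An explanation of the script follows.

There are three types of escape sequences that need to be supported:

&#D; where D is the decimal value of the escaped character's Unicode code point;
&#xH; where H is the hexadecimal value of the escaped character's Unicode code point;
&N; where N is the name of one of the named entities for the escaped character.

The &N; escapes are supported by the generated named_entities.sed script, which simply performs the list of substitutions.

The central piece of this method for supporting the code point escapes is the printf utility, which is able to:

print numbers in hexadecimal format, and
print characters from their code point's hexadecimal value (using the escapes \uHHHH or \UHHHHHHHH).

The first feature, with some help from sed and grep, is used to reduce the &#D; escapes to &#xH; escapes. The shell function htmlEscDec2Hex does that.

The function htmlHexUnescape uses sed to transform the &#xH; escapes into printf's \u/\U escapes, then uses the second feature to print the unescaped characters.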
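The handling of the two numeric escape types can also be sketched in a few lines of Python, which may make the logic easier to follow than the sed/printf pipeline. This is not the script's method, only an illustration of the same idea; the names NUMERIC_REF and decode_numeric_refs are hypothetical, and the digit limits simply mirror the ones used in the sed expressions above.

```python
import re

# Matches &#xH; (1-8 hex digits) or &#D; (1-10 decimal digits).
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]{1,8})|([0-9]{1,10}));")


def decode_numeric_refs(text):
    """Replace numeric character references with the characters they name."""
    def repl(match):
        hex_digits, dec_digits = match.group(1), match.group(2)
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code)
        except (ValueError, OverflowError):
            # Not a valid Unicode code point: leave the text untouched.
            return match.group(0)
    return NUMERIC_REF.sub(repl, text)


print(decode_numeric_refs("&#167; and &#xA7; are the same character"))
```

Named entities (&N;) are deliberately left alone here, just as the shell script leaves them to named_entities.sed.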

score 1 · Accepted Answer

On macOS you can use the built-in command textutil (which is a handy utility in general):

echo '&#128075; hello &lt; world &#x1f310;' | textutil -convert txt -format html -stdin -stdout

Output:

👋 hello < world 🌐

score 0 · Accepted Answer

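My original answer got some comments saying that recode does not work for UTF-8 encoded HTML files. This is correct. recode supports only HTML 4. The encoding HTML is an alias for HTML_4.0: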

$ recode -l | grep -iw html
HTML-i18n 2070 RFC2070
HTML_4.0 h h4 HTML

The default encoding for HTML 4 is Latin-1. This changed in HTML 5, whose default encoding is UTF-8. This is the reason why recode does not work for HTML 5 files.

HTML 5 defines the list of entities here:

https://html.spec.whatwg.org/multipage/named-characters.html

The definition includes a machine-readable specification in JSON format:

https://html.spec.whatwg.org/entities.json

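The JSON file can be used to perform a simple text replacement. The following example is a self-modifying Perl script, which caches the JSON specification in its DATA block.

Note: For some obscure compatibility reasons, the specification allows entities without a terminating semicolon. Because of that, the entities are sorted by length in reverse order, to make sure that the correct entities are replaced first and do not get destroyed by entities without the ending semicolon.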

#! /usr/bin/perl
use utf8;
use strict;
use warnings;
use open qw(:std :utf8);
use LWP::Simple;
use JSON::Parse qw(parse_json);

my $entities;

INIT {
  if (eof DATA) {
    my $data = tell DATA;
    open DATA, '+<', $0;
    seek DATA, $data, 0;
    my $entities_json = get 'https://html.spec.whatwg.org/entities.json';
    print DATA $entities_json;
    truncate DATA, tell DATA;
    close DATA;
    $entities = parse_json ($entities_json);
  } else {
    local $/ = undef;
    $entities = parse_json (<DATA>);
  }
}

local $/ = undef;
my $html = <>;

for my $entity (sort { length $b <=> length $a } keys %$entities) {
  my $characters = $entities->{$entity}->{characters};
  $html =~ s/$entity/$characters/g;
}

print $html;

__DATA__

Example usage:

$ echo '&nbsp;&amp;&nbsp;ٱلْعَرَبِيَّة' | ./html5-to-utf8.pl
 & ٱلْعَرَبِيَّة
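The longest-first ordering mentioned in the note can be illustrated on its own with a small Python sketch. It is not a port of the Perl script: it swaps the sequential substitutions for a single regex pass, the names build_pattern and replace_entities are hypothetical, and SAMPLE_ENTITIES is a hand-written subset rather than the full specification.

```python
import re

# Hand-written subset; the real list lives in entities.json.
SAMPLE_ENTITIES = {
    "&not": "\u00ac",     # legacy form without a semicolon
    "&notin;": "\u2209",
    "&amp": "&",
    "&amp;": "&",
    "&lt": "<",
    "&lt;": "<",
}


def build_pattern(table):
    """Build one alternation with the longest names first."""
    names = sorted(table, key=len, reverse=True)
    return re.compile("|".join(re.escape(name) for name in names))


def replace_entities(text, table):
    """Decode entities in a single pass over the text."""
    pattern = build_pattern(table)
    return pattern.sub(lambda match: table[match.group(0)], text)


print(replace_entities("x &notin; S, &not p, a &lt; b", SAMPLE_ENTITIES))
```

If the shorter &not were tried before &notin;, the second would be cut down to the negation sign followed by a stray "in;", which is the breakage the ordering prevents. Doing everything in one pass also means the output of one replacement is never scanned again.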

score 0 · Accepted Answer

I have created a sed script based on the list of entities, so it should handle most entities.

sed -f htmlentities.sed < file.html

score -1 · Accepted Answer

With Xidel:

echo 'hello &lt; &#x3a; &quot; world' | xidel -s - -e 'parse-html($raw)'
hello < : " world
