html - Strip HTML characters and convert to plain text

Question

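OK, I have been searching for an answer for hours! Nothing I have found does what I am trying to do.

Our clients like to copy parts of an HTML website directly into a TinyMCE WYSIWYG editor and into plain-text textareas or input fields (for titles). The problem is that the characters coming from the WYSIWYG content are HTML, not raw HTML.

Here is just one example. Keep in mind that I want to handle any character that could trigger this error.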

Companion Dual Massage – Two Seat Walk-In Tub

That dash in the middle is the HTML entity &ndash;

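Copying the HTML directly and pasting it into a plain-text input field or textarea triggers the error

invalid byte sequence for encoding "UTF8": 0x96

when trying to submit to a UTF8 database.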

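Clients may also copy trademark, copyright, or registered symbols.

I don't just want to strip them. I want to convert them.

I have tried all sorts of converters. I won't list every site I have been to.

Any ideas?

Worst case, I just take these 4 characters and convert them to something.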

score 0 · Accepted Answer

This is an encoding problem, not a problem with the HTML entities. When you copy data from HTML into a text box, the browser is not pasting in the entity like &ndash;, it's pasting in the actual character. It looks like the character you are getting is encoded in Windows-1252 (sometimes mistakenly referred to as ISO-8859-1), where the byte 0x96 is the en dash. Since the database is expecting UTF-8, it can't handle this byte.
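To make the diagnosis concrete, here is a minimal PHP sketch (PHP is an assumption here, chosen because the other answer uses it, and it relies on the mbstring extension). It shows that the lone byte 0x96 from the error message is not valid UTF-8, and that reading it as Windows-1252 yields the en dash.

```php
<?php
// The lone byte reported in the database error.
$raw = "\x96";

// 0x96 is a UTF-8 continuation byte with no lead byte, so on its own it is invalid.
$isValidUtf8 = mb_check_encoding($raw, 'UTF-8');

// Interpreted as Windows-1252, the same byte is the en dash (U+2013).
$converted = mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');

var_dump($isValidUtf8);           // bool(false)
echo bin2hex($converted), "\n";   // e28093, the UTF-8 bytes of U+2013
```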

There are a few possible reasons this might be happening. You didn't list what browser, language, web framework, or database you're using, so I'm going to offer a few suggestions, and hopefully one of them works. In general, it is best to use UTF-8 for your encoding at every stage; but if that's not possible, you either need to use a consistent encoding throughout all of the levels, or you need to convert.
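If converting turns out to be necessary, one possible shape for it is sketched below. The helper name toUtf8 is hypothetical, and the sketch assumes that any input which is not valid UTF-8 was submitted as Windows-1252; that assumption may not hold for every client.

```php
<?php
/**
 * Hypothetical helper: return $input as UTF-8.
 * Assumption: anything that is not valid UTF-8 is Windows-1252.
 */
function toUtf8(string $input): string
{
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input; // already valid UTF-8, leave it untouched
    }
    // Re-encode legacy bytes such as 0x96 (en dash) or 0x99 (trademark sign).
    return mb_convert_encoding($input, 'UTF-8', 'Windows-1252');
}

$legacy = "Before \x96 After";   // title containing a raw Windows-1252 en dash
echo toUtf8($legacy), "\n";      // the 0x96 byte is now a UTF-8 en dash
```

Checking for valid UTF-8 first keeps text that is already correct from being converted a second time.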

Since your database is using UTF-8, I'll assume that's the encoding that you want to use. One thing to check is whether your pages are being served as UTF-8. Check the headers on your HTTP response; there should be a Content-Type: text/html; charset=utf-8 header. If that is wrong, missing, or missing the charset=utf-8 part, then the browser may choose the wrong charset. One more thing that's good to do is add a <meta charset=utf-8> tag in your <head>; while this isn't necessary if you have the charset sent as part of the HTTP headers, it can help select the correct charset if the headers aren't present, or the document is loaded from a file: URL or the like, which doesn't have headers available.

While the browser should use the character set of the document when submitting the form, you can ensure that it submits using the correct charset by using the accept-charset attribute on the form: <form accept-charset=utf-8>. This will ensure that even if the page has no charset set in the headers, forms will submit data as UTF-8.
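The header, the meta tag, and the accept-charset attribute described above could be combined roughly as follows. This is only a sketch in PHP; the function name renderTitleForm, the field name title, and the endpoint /save.php are made up for illustration.

```php
<?php
// Hypothetical helper: builds a minimal page whose form is declared as UTF-8.
function renderTitleForm(string $action): string
{
    // Escape the action URL so it is safe inside an attribute.
    $action = htmlspecialchars($action, ENT_QUOTES, 'UTF-8');

    return <<<HTML
<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>Edit title</title>
</head>
<body>
  <form method="post" action="{$action}" accept-charset="utf-8">
    <input name="title" type="text">
    <button type="submit">Save</button>
  </form>
</body>
</html>
HTML;
}

// Declare the charset in the HTTP header before any output is sent.
header('Content-Type: text/html; charset=utf-8');

$html = renderTitleForm('/save.php'); // '/save.php' is a made-up endpoint
echo $html;
```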

Finally, even if all of that is correct, IE 5 through 8 will sometimes submit data in a different encoding than what the page is sent in, if the user has changed their encoding settings. To force it to send UTF-8 data, you can use a hidden form field that includes a character that cannot be encoded in a legacy encoding like Windows-1252. Some versions of Ruby on Rails famously used a snowman (☃) for this purpose, though it was later changed to a checkmark (✓) to be less puzzling. You can add a similar element to your form to force IE to use UTF-8: <input name="_utf8" type="hidden" value="✓">.
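On the server, the same hidden field can double as a sanity check. This goes beyond what the answer describes, so treat it as an optional extra: the sketch below assumes the hidden field is named _utf8 (the name is arbitrary) and simply tests whether the check mark arrived as its UTF-8 byte sequence. The function name submittedAsUtf8 is hypothetical.

```php
<?php
// The check mark U+2713 encoded as UTF-8.
const UTF8_MARKER = "\xE2\x9C\x93";

/**
 * Hypothetical check: did the hidden marker field arrive as UTF-8?
 * Assumes the form contains a hidden field (default name "_utf8") whose value is the check mark.
 */
function submittedAsUtf8(array $post, string $field = '_utf8'): bool
{
    return isset($post[$field]) && $post[$field] === UTF8_MARKER;
}

// Simulated submissions; in a real request handler you would pass $_POST.
var_dump(submittedAsUtf8(['_utf8' => "\xE2\x9C\x93", 'title' => 'x'])); // bool(true)
var_dump(submittedAsUtf8(['_utf8' => '&#10003;', 'title' => 'x']));     // bool(false)
var_dump(submittedAsUtf8(['title' => 'x']));                            // bool(false)
```

When the check fails, the application could reject the submission or attempt a conversion instead of passing invalid bytes to the database.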

If the above suggestions don't work, please let us know what browser, programming language, web framework, and database you are using, and try to provide a short, self-contained piece of sample code that demonstrates the problem.

score 0 · Accepted Answer

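Try this. Converting "old" data to UTF-8 took a bit of effort. By "old" I mean data from our old database, which could be UTF-8 or Latin-1, with characters either escaped or unescaped. The result is always a UTF-8 string containing the original characters (not entities).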

/**
 * Decodes HTML entities and converts the string to UTF-8 if it isn't UTF-8 already.
 *
 * @param string $string LATIN-1 or UTF-8 string that may contain HTML-encoded characters.
 * @return string UTF-8 string with all entities decoded.
 */
function tidyUtf8($string)
{
  // Round-trip the string through iconv. Bytes that are not valid UTF-8 are
  // dropped (or the call returns false), so the result differs from the input
  // whenever the input is not valid UTF-8.
  $utfCheckString = @iconv(
       'UTF-8',
       'UTF-8//IGNORE',
       $string
  );
  $isUtf = ($string === $utfCheckString);

  // If the string is not UTF-8, treat it as LATIN-1 and convert it to UTF-8.
  // Note: bytes 0x80-0x9F (such as 0x96) are control characters in ISO-8859-1;
  // use 'Windows-1252' in both calls below if the data came from Windows.
  if ($isUtf === false)
  {
       // Decode HTML entities to prevent double encoding later.
       // Decode only the ones that are valid LATIN-1 characters.
       $string = html_entity_decode($string, ENT_QUOTES, 'ISO-8859-1');
       $string = iconv('ISO-8859-1', 'UTF-8', $string);
  }

  // Decode all remaining HTML entities to prevent double encoding later.
  // This pass includes entities for characters outside LATIN-1.
  $string = html_entity_decode($string, ENT_QUOTES, 'UTF-8');

  return $string;
}

This function is designed to accept both UTF-8 and LATIN-1 (ISO-8859-1). You may not need the latter, so you can strip that part of the function and just use:

html_entity_decode($string, ENT_QUOTES, 'UTF-8');
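As a quick illustration of that one-liner, the sketch below decodes the kinds of entities mentioned in the question (dash, trademark, copyright, registered). The input string is invented for the example.

```php
<?php
// Sample input containing the named entities from the question.
$input = '&ndash; &trade; &copy; &reg;';

// Decode every entity into the corresponding UTF-8 character.
$plain = html_entity_decode($input, ENT_QUOTES, 'UTF-8');

echo $plain, "\n"; // – ™ © ®
```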
