javascript - What is the most convenient way to convert HTML to plain text while preserving line breaks (with JavaScript)?

Question

Basically, I just need the effect of copying HTML from a browser window and pasting it into a textarea element.

For example, I want this:

<p>Some</p>
<div>text<br />Some</div>
<div>text</div>

to become this:

Some
text
Some
text

score 19 · Accepted Answer

If that HTML is visible within your web page, you could do it with the user selection (or just a TextRange in IE). This does preserve line breaks, if not necessarily leading and trailing white space.

UPDATE 10 December 2012

However, the toString() method of Selection objects is not yet standardized and works inconsistently between browsers, so this approach is based on shaky ground and I don't recommend using it now. I would delete this answer if it weren't accepted.

Demo: http://jsfiddle.net/wv49v/

Code:

function getInnerText(el) {
    var sel, range, innerText = "";
    if (typeof document.selection != "undefined" && typeof document.body.createTextRange != "undefined") {
        // Old IE: read the text through a TextRange, without touching the user selection.
        range = document.body.createTextRange();
        range.moveToElementText(el);
        innerText = range.text;
    } else if (typeof window.getSelection != "undefined" && typeof document.createRange != "undefined") {
        // Other browsers: select the element's contents and stringify the selection.
        sel = window.getSelection();
        sel.selectAllChildren(el);
        innerText = "" + sel;
        // Note: this also clears whatever the user had selected before the call.
        sel.removeAllRanges();
    }
    return innerText;
}

score 7 · Answer

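I tried to find some code I wrote for this a while back. It worked pretty well. Let me outline what it did, and hopefully you can duplicate its behavior.

Replace images with their alt or title text.
Replace links with "text [link]".
Replace things that usually produce vertical whitespace: h1-h6, div, p, br, hr, etc. (I know, I know. These could actually be inline elements, but it works out well.)
Strip the rest of the tags and replace them with an empty string.

You could even extend this to format things like ordered and unordered lists. It really depends on how far you want to go.

EDIT

Found the code!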

// C#; requires: using System.Text.RegularExpressions;
public static string Convert(string template)
{
    template = Regex.Replace(template, "<img .*?alt=[\"']?([^\"']*)[\"']?.*?/?>", "$1"); /* Use image alt text. */
    template = Regex.Replace(template, "<a .*?href=[\"']?([^\"']*)[\"']?.*?>(.*?)</a>", "$2 [$1]"); /* Convert links to something useful; the lazy (.*?) keeps two links on one line separate. */
    template = Regex.Replace(template, "<(/p|/div|/h\\d|br)\\s*/?>", "\n"); /* Let's try to keep vertical whitespace intact; \s* (instead of \w?) lets "<br />" match. */
    template = Regex.Replace(template, "<[A-Za-z/][^<>]*>", ""); /* Remove the rest of the tags. */

    return template;
}
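
Since the question asks for JavaScript, here is a rough port of the C# method above. It is only a sketch: the function name `convertHtmlToText` is hypothetical, the patterns are tightened slightly (for example `[^>]*` instead of `.*?` inside tags), and regex-based stripping like this assumes reasonably well-formed markup.

```javascript
// Hypothetical JavaScript port of the C# Convert method; not the author's code.
function convertHtmlToText(html) {
  return html
    // Use the image's alt text in place of the image.
    .replace(/<img\s[^>]*?alt=["']?([^"']*)["']?[^>]*>/gi, "$1")
    // Turn links into "text [url]".
    .replace(/<a\s[^>]*?href=["']?([^"'>]*)["']?[^>]*>([\s\S]*?)<\/a>/gi, "$2 [$1]")
    // Closing block tags and <br> variants become newlines.
    .replace(/<(\/p|\/div|\/h\d|br)\s*\/?>/gi, "\n")
    // Remove whatever tags are left.
    .replace(/<[A-Za-z\/][^<>]*>/g, "");
}
```

Called on the question's markup written on one line, `convertHtmlToText("<p>Some</p><div>text<br />Some</div><div>text</div>")` returns `"Some\ntext\nSome\ntext\n"`. If the source HTML itself contains newlines between tags, those are kept as well, so you may want to strip them first.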

score 5 · Answer

I made a function based on this answer: https://stackoverflow.com/a/42254787/3626940

function htmlToText(html){
    // remove line breaks and tabs that come from the source code formatting
    html = html.replace(/\n/g, "");
    html = html.replace(/\t/g, "");

    // turn HTML breaks and table cells into real line breaks and tabs
    html = html.replace(/<\/td>/g, "\t");
    html = html.replace(/<\/table>/g, "\n");
    html = html.replace(/<\/tr>/g, "\n");
    html = html.replace(/<\/p>/g, "\n");
    html = html.replace(/<\/div>/g, "\n");
    html = html.replace(/<\/h[1-6]>/g, "\n"); // was /<\/h>/, which never matches a heading
    html = html.replace(/<br>/g, "\n"); html = html.replace(/<br( )*\/>/g, "\n");

    // parse the HTML and read back its text (this also decodes entities)
    var dom = (new DOMParser()).parseFromString('<!doctype html><body>' + html, 'text/html');
    return dom.body.textContent;
}

score 1 · Answer

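Based on chrmcpn's answer, I had to convert a basic HTML email template into a plain-text version as part of a build script in node.js. I had to use JSDOM to make it work, but here is my code: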

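const { JSDOM } = require("jsdom");

const htmlToText = (html) => {
    // drop line breaks and tabs that come from the template's own formatting
    html = html.replace(/\n/g, "");
    html = html.replace(/\t/g, "");

    // paragraphs and h1 headings end with a blank line, <br> with a single newline
    html = html.replace(/<\/p>/g, "\n\n");
    html = html.replace(/<\/h1>/g, "\n\n");
    html = html.replace(/<br>/g, "\n");
    html = html.replace(/<br( )*\/>/g, "\n");

    // let JSDOM parse the markup, then read the text content
    const dom = new JSDOM(html);
    let text = dom.window.document.body.textContent;

    // remove double spaces, spaces at the start of a line, and outer whitespace
    text = text.replace(/  /g, "");
    text = text.replace(/\n /g, "\n");
    text = text.trim();
    return text;
}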

score -2 · Answer

Three steps.

First get the html as a string.
Second, replace all <BR /> and <BR> with \r\n.
Third, use the regular expression "<(.|\n)*?>" to replace all markup with "".
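
The three steps above could be sketched in JavaScript roughly as follows. The function name `stripTags` is hypothetical, and the `<br>` pattern is made case-insensitive and whitespace-tolerant as an assumption about what "all <BR /> and <BR>" is meant to cover.

```javascript
// Minimal sketch of the three steps; the function name is hypothetical.
function stripTags(html) {
  // Step 1: `html` is already the markup as a string.
  // Step 2: replace <br>, <br/> and <br /> (in any letter case) with \r\n.
  const withBreaks = html.replace(/<br\s*\/?>/gi, "\r\n");
  // Step 3: remove all remaining markup using the regular expression from the answer.
  return withBreaks.replace(/<(.|\n)*?>/g, "");
}
```

Only `<br>` elements yield line breaks with this approach, so block elements run together: for the question's example on one line, `stripTags("<p>Some</p><div>text<br />Some</div><div>text</div>")` returns `"Sometext\r\nSometext"` rather than four separate lines.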
