javascript - How to get the whole document HTML as a string?

Question

Is there a way in JS to get the entire HTML within the html tags, as a string?

document.documentElement.??

score 375 · Accepted Answer

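MS added the outerHTML and innerHTML properties some time ago.

According to MDN, outerHTML is supported in Firefox 11, Chrome 0.2, Internet Explorer 4.0, Opera 7, Safari 1.3, Android, Firefox Mobile 11, IE Mobile, Opera Mobile, and Safari Mobile. outerHTML is in the DOM Parsing and Serialization specification.

See quirksmode for browser compatibility to find out what will work for you. All of them support innerHTML.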

var markup = document.documentElement.innerHTML;
alert(markup);

score 109 · Accepted Answer

You can do

new XMLSerializer().serializeToString(document)

in browsers newer than IE 9.

See https://caniuse.com/#feat=xml-serializer
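
As a rough sketch of how that call might be wrapped for environments where XMLSerializer is missing: the helper name `serializeDocument` and its `Serializer` parameter are assumptions made for illustration, not part of the answer above. Keep in mind that XMLSerializer produces XML-style output, so the result can differ slightly from what outerHTML returns.

```javascript
// Hypothetical helper: serialize `doc` with XMLSerializer when it is available,
// otherwise fall back to the root element's outerHTML (which omits the doctype).
function serializeDocument(doc, Serializer = globalThis.XMLSerializer) {
    if (typeof Serializer === 'function') {
        return new Serializer().serializeToString(doc);
    }
    return doc.documentElement.outerHTML;
}

// In a browser: console.log(serializeDocument(document));
```

The serializer is passed in as a parameter only so the function can be exercised outside a browser; in a page you would normally rely on the default.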

score 49 · Accepted Answer

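I believe document.documentElement.outerHTML should return that for you.

According to MDN, outerHTML is supported in Firefox 11, Chrome 0.2, Internet Explorer 4.0, Opera 7, Safari 1.3, Android, Firefox Mobile 11, IE Mobile, Opera Mobile, and Safari Mobile. outerHTML is in the DOM Parsing and Serialization specification.

The MSDN page on the outerHTML property notes that it is supported in IE 5+. Colin's answer links to the W3C quirksmode page, which offers a good comparison of cross-browser compatibility (for other DOM features too).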

score 48 · Accepted Answer

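I tried the various answers to see what is returned. I'm using the latest version of Chrome.

The suggestion document.documentElement.innerHTML; returned <head> ... </body>

Gaby's suggestion document.getElementsByTagName('html')[0].innerHTML; returned the same.

The suggestion document.documentElement.outerHTML; returned <html><head> ... </body></html>, which is everything apart from the doctype.

You can retrieve the doctype with document.doctype; this returns an object, not a string, so if you need to extract the details as a string for all doctypes up to and including HTML5, see: Get DocType of an HTML as string with Javascript

I only wanted HTML5, so the following was enough for me to create the whole document: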

alert('<!DOCTYPE HTML>' + '\n' + document.documentElement.outerHTML);
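
A minimal sketch of the same idea as a reusable function, assuming only the simple HTML5-style doctype matters (public and system identifiers are ignored). The name `documentToHtml` and its `doc` parameter are hypothetical, introduced here for illustration.

```javascript
// Hypothetical helper: rebuild a document's markup from its doctype and root element.
// `doc` is any object shaped like a DOM Document.
function documentToHtml(doc) {
    const doctype = doc.doctype; // null when the page has no <!DOCTYPE>
    // Only the doctype name is emitted; publicId/systemId are deliberately ignored.
    const prefix = doctype ? `<!DOCTYPE ${doctype.name}>\n` : '';
    return prefix + doc.documentElement.outerHTML;
}

// In a browser: console.log(documentToHtml(document));
```

Reading the name from document.doctype rather than hard-coding the declaration means a page without a doctype is reproduced without one.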

score 10 · Accepted Answer

You can also do:

document.getElementsByTagName('html')[0].innerHTML

You won't get the doctype or the html tag, but you will get everything else.

score 7 · Accepted Answer

document.documentElement.outerHTML

answered 2009-05-03T14:36:27.023

score 4 · Accepted Answer

Probably IE only:

webBrowser1.DocumentText

For FF from 1.0 onward:

//serialize current DOM-Tree incl. changes/edits to ss-variable
var ns = new XMLSerializer();
var ss= ns.serializeToString(document);
alert(ss.substr(0,300));

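This may work in FF. (It shows the very first 300 characters from the beginning of the source text, mostly doctype definitions.)

But be aware that FF's normal "Save As" dialog might not save the current state of the page, but rather the originally loaded X/h/tml source text! (Posting ss to some temporary file and redirecting to it might deliver a saveable source text that includes the changes/edits made to it earlier.)

That said, FF surprises with its good recovery on "back" and with the nice inclusion of states/values in "Save (as) ..." for input-like fields, textareas and so on, though not for elements in contenteditable/designMode.

If the document is not an XHTML or XML file (judged by MIME type, not just by file extension!), one may use document.open/write/close to set the appropriate content on the source layer, which will then be saved by the user's save dialog from FF's File/Save menu. See: http://www.w3.org/MarkUp/2004/xhtml-faq#docwrite and

https://developer.mozilla.org/en-US/docs/Web/API/document.write

Regardless of X(ht)ML questions, try "view-source:http://..." as the value of the src attribute of a (script-made!?) iframe, in order to access the iframe's document in FF:

<iframe-elementnode>.contentDocument; search for "mdn contentDocument" to find the appropriate members, such as textContent. I got that working years ago and would rather not dig for it again. If you still urgently need it, say so and I will have to dive in...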

score 3 · Accepted Answer

document.documentElement.innerHTML

answered 2009-05-03T14:37:47.897

score 2 · Accepted Answer

To also get things outside the <html>...</html>, most importantly the <!DOCTYPE ...> declaration, you could walk through document.childNodes, turning each node into a string:

const html = [...document.childNodes]
    .map(node => nodeToString(node))
    .join('\n') // could use '' instead, but whitespace should not matter.

function nodeToString(node) {
    switch (node.nodeType) {
        case node.ELEMENT_NODE:
            return node.outerHTML
        case node.TEXT_NODE:
            // Text nodes should probably never be encountered, but handling them anyway.
            return node.textContent
        case node.COMMENT_NODE:
            return `<!--${node.textContent}-->`
        case node.DOCUMENT_TYPE_NODE:
            return doctypeToString(node)
        default:
            throw new TypeError(`Unexpected node type: ${node.nodeType}`)
    }
}

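I published this code as document-outerhtml on npm.

Edit: note that the code above depends on a function doctypeToString; its implementation could be as follows (the code below is published on npm as doctype-to-string):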

function doctypeToString(doctype) {
    if (doctype === null) {
        return ''
    }
    // Checking with instanceof DocumentType might be neater, but how to get a
    // reference to DocumentType without assuming it to be available globally?
    // To play nice with custom DOM implementations, we resort to duck-typing.
    if (!doctype
        || doctype.nodeType !== doctype.DOCUMENT_TYPE_NODE
        || typeof doctype.name !== 'string'
        || typeof doctype.publicId !== 'string'
        || typeof doctype.systemId !== 'string'
    ) {
        throw new TypeError('Expected a DocumentType')
    }
    const doctypeString = `<!DOCTYPE ${doctype.name}`
        + (doctype.publicId ? ` PUBLIC "${doctype.publicId}"` : '')
        + (doctype.systemId
            ? (doctype.publicId ? `` : ` SYSTEM`) + ` "${doctype.systemId}"`
            : ``)
        + `>`
    return doctypeString
}

score 1 · Accepted Answer

I always use

document.getElementsByTagName('html')[0].innerHTML

It's probably not the right way, but I can understand it when I see it.

score 0 · Accepted Answer

Use document.documentElement.

The same question is answered here: https://stackoverflow.com/a/7289396/2164160

score 0 · Accepted Answer

I just need doctype html, and it should work in IE11, Edge and Chrome. I used the code below and it works fine.

function downloadPage(element, event) {
    var isChrome = /Chrome/.test(navigator.userAgent) && /Google Inc/.test(navigator.vendor);

    if ((navigator.userAgent.indexOf("MSIE") != -1) || (!!document.documentMode == true)) {
        document.execCommand('SaveAs', '1', 'page.html');
        event.preventDefault();
    } else {
        if(isChrome) {
            element.setAttribute('href','data:text/html;charset=UTF-8,'+encodeURIComponent('<!doctype html>' + document.documentElement.outerHTML));
        }
        element.setAttribute('download', 'page.html');
    }
}

And use it in your anchor tag like this.

<a href="#" onclick="downloadPage(this,event);" download>Download entire page.</a>
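
The string-building step in the Chrome branch can be isolated as a small function, sketched below. The name `buildHtmlDataUri` is hypothetical and is not part of the answer's code; it only reproduces how the href value is assembled.

```javascript
// Hypothetical helper: build the data: URI that the link's href is set to.
// encodeURIComponent escapes characters such as '<', '>', '#' and spaces,
// so the markup survives being placed inside a URI.
function buildHtmlDataUri(html) {
    return 'data:text/html;charset=UTF-8,' + encodeURIComponent('<!doctype html>' + html);
}

// In a browser, inside downloadPage:
//   element.setAttribute('href', buildHtmlDataUri(document.documentElement.outerHTML));
```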

Example

I just need doctype html and should work fine in IE11, Edge and Chrome. 

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

<p>
<a href="#" onclick="downloadPage(this,event);"  download><h2>Download entire page.</h2></a></p>

<p>Some image here</p>

<p><img src="https://placeimg.com/250/150/animals"/></p>

score 0 · Accepted Answer

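I am using outerHTML for elements (the main <html> container) and XMLSerializer for anything else, including <!DOCTYPE>, stray comments outside the <html> container, or whatever else might be there. Whitespace does not seem to be preserved outside the <html> element, so I add newlines by default with sep="\n".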

function get_document_html(sep="\n") {
    let html = "";
    let xml = new XMLSerializer();
    for (let n of document.childNodes) {
        if (n.nodeType == Node.ELEMENT_NODE)
            html += n.outerHTML + sep;
        else
            html += xml.serializeToString(n) + sep;
    }
    return html;
}

console.log(get_document_html().slice(0, 200));

score 0 · Accepted Answer

If you want to get everything except the DOCTYPE, this will work:

document.getElementsByTagName('html')[0].outerHTML;

Or, if you want the doctype too:

new XMLSerializer().serializeToString(document.doctype) + document.getElementsByTagName('html')[0].outerHTML;

score -2 · Accepted Answer

You have to iterate through the document's childNodes and get the outerHTML content of each.

In VBA it looks like this:

For Each e In document.ChildNodes
    Put ff, , e.outerHTML & vbCrLf
Next e

Using this lets you get all elements of the web page, including the <!DOCTYPE> node if it exists.

score -9 · Accepted Answer

The correct way is actually:

webBrowser1.DocumentText

answered 2010-10-29T15:05:31.660
