c# - C# HttpWebRequest - How to differentiate between HTML and XML pages without downloading?

Question

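I need to be able to tell whether a link (URL) points to an XML file (an RSS feed) or a regular HTML file just by looking at the headers or something similar, without downloading it.

Any good advice for me out there? :)

Thanks! Roy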

score 11 · Accepted Answer

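You could do just a HEAD request instead of a full POST/GET.

That will get you the headers for that page, which should include the content type. From that you should be able to tell whether it is text/html or XML.

There's a good example here.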

score 5 · Accepted Answer

Following up on Eoin Campbell's response, here is a code snippet that should do exactly that using the System.Net functionality:

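// WebRequest is not IDisposable, so only the response goes in a using block.
var request = System.Net.HttpWebRequest.Create(
    "http://tempuri.org/pathToFile");
request.Method = "HEAD";

using (var response = request.GetResponse())
{
    // Drop parameters such as "; charset=utf-8" before comparing.
    switch (response.ContentType.Split(';')[0].Trim().ToLowerInvariant())
    {
        case "text/xml":
            // ...
            break;
        case "text/html":
            // ...
            break;
    }
}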

Of course, this assumes that the web server publishes the content (MIME) type and does so correctly. But since you stated that you want a bandwidth-efficient way of doing this, I assume you don't want to download all the markup and analyse that! To be honest, the content type is usually set correctly in any case.

score 2 · Accepted Answer

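You can use the Content-Type header, and to save bandwidth you can ask the web server to give you only a specified part of the document. If the server includes an Accept-Ranges: bytes header in its response, you can use Range: bytes=0-9 to download only the first ten bytes (or even try to download nothing at all).

Also look into the HEAD verb instead of GET.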
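A minimal sketch of the ranged request described above, assuming the server honours Range headers. The class and method names (RangeProbe, BuildRangeRequest, Describe) are made up for illustration; HttpWebRequest.AddRange is the System.Net call that sets the header.

```csharp
using System;
using System.Net;

public static class RangeProbe
{
    // Builds (but does not send) a GET request for the first byteCount bytes.
    // Hypothetical helper; byteCount is assumed to be at least 1.
    public static HttpWebRequest BuildRangeRequest(string url, int byteCount)
    {
        if (byteCount < 1)
            throw new ArgumentOutOfRangeException("byteCount");

        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        // Range offsets are inclusive, so 0..byteCount-1 covers byteCount bytes.
        request.AddRange(0, byteCount - 1);
        return request;
    }

    // Sends the request and reports the content type plus whether the
    // server actually returned a partial body.
    public static string Describe(HttpWebRequest request)
    {
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            // 206 Partial Content means the range was honoured; a plain 200
            // means the server ignored it and is sending the whole document.
            bool partial = response.StatusCode == HttpStatusCode.PartialContent;
            return response.ContentType + (partial ? " (partial)" : " (full body)");
        }
    }
}
```

If the server answers with 200 rather than 206, closing the response early still avoids reading the rest of the body, but a HEAD request is the cleaner option in that case.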

score 1 · Accepted Answer

Check the headers in your HttpWebResponse object. The Content-Type header should read text/xml for an XML/RSS document and text/html for a standard web page.
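In practice the header often carries parameters (for example "text/html; charset=utf-8"), and feeds are frequently served as application/rss+xml or application/atom+xml rather than text/xml. A rough sketch of a more tolerant check follows; the names DocumentKind and ContentTypeClassifier are invented for this example, and treating application/xhtml+xml as HTML is a judgment call, not a rule.

```csharp
using System;

public enum DocumentKind { Html, Xml, Unknown }

public static class ContentTypeClassifier
{
    // Maps a raw Content-Type header value to a coarse document kind.
    public static DocumentKind Classify(string contentTypeHeader)
    {
        if (string.IsNullOrWhiteSpace(contentTypeHeader))
            return DocumentKind.Unknown;

        // Keep only the media type: drop "; charset=..." and normalise case.
        string mediaType = contentTypeHeader.Split(';')[0].Trim().ToLowerInvariant();

        // Checked first so that XHTML is not caught by the "+xml" rule below.
        if (mediaType == "text/html" || mediaType == "application/xhtml+xml")
            return DocumentKind.Html;

        // text/xml, application/xml, and structured "+xml" types such as
        // application/rss+xml or application/atom+xml.
        if (mediaType == "text/xml" ||
            mediaType == "application/xml" ||
            mediaType.EndsWith("+xml", StringComparison.Ordinal))
            return DocumentKind.Xml;

        return DocumentKind.Unknown;
    }
}
```

Calling ContentTypeClassifier.Classify(response.ContentType) on the response to a HEAD request would then replace the exact string comparison.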

score 0 · Accepted Answer

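You can't find out what file type it is just by looking at the URL.

I suggest you try checking the MIME type of the document you request, or read the first line and hope that the author included a Doctype.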
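The "read the first line" idea could look roughly like the sketch below. It only inspects a short prefix of the document that has already been read into a string; MarkupSniffer and Sniff are hypothetical names, and the list of markers is an assumption rather than an exhaustive rule. Note that XHTML pages may also begin with an XML declaration, so an "xml" result is only a hint.

```csharp
using System;

public static class MarkupSniffer
{
    // Guesses "xml", "html", or "unknown" from the first characters of a document.
    public static string Sniff(string prefix)
    {
        if (prefix == null)
            return "unknown";

        // Ignore a byte order mark and any leading whitespace.
        string s = prefix.TrimStart('\uFEFF', ' ', '\t', '\r', '\n');

        // An XML declaration or a feed root element suggests XML/RSS/Atom.
        if (s.StartsWith("<?xml", StringComparison.OrdinalIgnoreCase) ||
            s.StartsWith("<rss", StringComparison.OrdinalIgnoreCase) ||
            s.StartsWith("<feed", StringComparison.OrdinalIgnoreCase))
            return "xml";

        // An HTML doctype or root element suggests a regular web page.
        if (s.StartsWith("<!DOCTYPE html", StringComparison.OrdinalIgnoreCase) ||
            s.StartsWith("<html", StringComparison.OrdinalIgnoreCase))
            return "html";

        return "unknown";
    }
}
```

This works best as a fallback when the MIME type is missing or too generic to decide on.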

score 0 · Accepted Answer

Generally speaking, this is impossible, because it is possible (though unhelpful) to serve either HTML or XML files as application/octet-stream. Also, as noted by others, there are multiple valid XML MIME types. However, a HEAD request followed by a content type check could work sometimes:

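WebRequest req = WebRequest.Create(url);
req.Method = "HEAD"; // must be set before the request is sent
String contentType;
using (WebResponse resp = req.GetResponse())
{
  contentType = resp.ContentType;
}

// StartsWith tolerates a trailing "; charset=..." parameter
if(contentType.StartsWith("text/xml"))
  getXML(url);
else if(contentType.StartsWith("text/html"))
  getHTML(url);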

But if you're going to process it somehow either way, you can do:

WebRequest req = WebRequest.Create(url);
using (WebResponse resp = req.GetResponse())
{
  String contentType = resp.ContentType;

  // StartsWith tolerates a trailing "; charset=..." parameter
  if(contentType.StartsWith("text/xml"))
    processXML(resp.GetResponseStream());
  else if(contentType.StartsWith("text/html"))
    processHTML(resp.GetResponseStream());
  else
  {
    // process error condition
  }
}

Keep in mind, files are downloaded on an as-needed basis. So just asking for the response object does not cause the whole file to be downloaded.

score -3 · Accepted Answer

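Just read it in with a "text" reader. Then decide which fits best, for example by looking for whatever tags come to mind ;) and then hand it to your actual reader.

Or is that too simple?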
