c# - How to get a web page's title without downloading the entire page source

Question

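I'm looking for a way to get the title of a web page and store it as a string.

However, every solution I have found so far involves downloading the page's full source, which isn't practical for a large number of pages.

The only approach I can see is to limit the length of the string, or to download only a certain number of characters, or to stop once the tag has been reached, but presumably that would still be quite large?

Thanks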

score 19 · Accepted Answer

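Because the <title> tag lives inside the HTML itself, there is no way to find "just the title" without downloading the file. You should be able to download only part of the file, stopping once you have read the <title> tag or the </head> tag, but you still have to download (at least part of) the file.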

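This can be done with HttpWebRequest/HttpWebResponse by reading data from the response stream until we have read a <title></title> block or the </head> tag. I added the </head> check because in valid HTML the title block must appear inside the head block - so with this check we never parse the whole file (unless, of course, there is no head block).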

The following should do the job:

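using System;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

// `url` is assumed to hold the address of the page to inspect
string title = "";
try {
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

    // dispose the response as well as the stream so the connection is released
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (Stream stream = response.GetResponseStream()) {
        // compiled regex to check for a <title></title> block; Singleline lets the
        // title span several lines and [^>]* tolerates attributes on the tag
        Regex titleCheck = new Regex(@"<title[^>]*>\s*(.+?)\s*</title>",
            RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);
        int bytesToRead = 8192;
        byte[] buffer = new byte[bytesToRead];
        // a stateful decoder keeps multi-byte characters that are split across
        // two reads intact (this assumes the page is UTF-8 encoded)
        Decoder decoder = Encoding.UTF8.GetDecoder();
        char[] chars = new char[Encoding.UTF8.GetMaxCharCount(bytesToRead)];
        StringBuilder contents = new StringBuilder();
        int length = 0;
        while ((length = stream.Read(buffer, 0, bytesToRead)) > 0) {
            // decode the bytes just read and add them to the rest of the
            // contents that have been downloaded so far
            int charCount = decoder.GetChars(buffer, 0, length, chars, 0);
            contents.Append(chars, 0, charCount);

            string soFar = contents.ToString();
            Match m = titleCheck.Match(soFar);
            if (m.Success) {
                // we found a <title></title> match =]
                title = m.Groups[1].Value;
                break;
            } else if (soFar.IndexOf("</head>", StringComparison.OrdinalIgnoreCase) >= 0) {
                // reached end of head-block; no title found =[
                break;
            }
        }
    }
} catch (Exception e) {
    Console.WriteLine(e);
}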

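Update: updated the original source example to use a compiled Regex and a using statement for the Stream, for better efficiency and maintainability.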
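
For newer code, the same stop-early idea can probably be expressed with HttpClient. The following is only a sketch under a few assumptions: the class name TitleReader, the method names, and the maxBytes safety cap are all hypothetical, and the page is assumed to be UTF-8. It relies on HttpCompletionOption.ResponseHeadersRead so that GetAsync returns before the body has been buffered.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

public static class TitleReader   // hypothetical helper name
{
    private static readonly HttpClient Client = new HttpClient();

    private static readonly Regex TitleCheck = new Regex(
        @"<title[^>]*>\s*(.+?)\s*</title>",
        RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Reads the stream in chunks and stops at the first complete <title> block,
    // at </head>, or once maxBytes (an assumed safety cap) has been reached.
    public static async Task<string> ReadTitleAsync(Stream stream, int maxBytes = 65536)
    {
        var buffer = new byte[8192];
        var chars = new char[Encoding.UTF8.GetMaxCharCount(buffer.Length)];
        var decoder = Encoding.UTF8.GetDecoder();   // assumes UTF-8 content
        var contents = new StringBuilder();
        int total = 0, length;

        while (total < maxBytes &&
               (length = await stream.ReadAsync(buffer, 0, buffer.Length)) > 0)
        {
            total += length;
            contents.Append(chars, 0, decoder.GetChars(buffer, 0, length, chars, 0));

            string soFar = contents.ToString();
            Match m = TitleCheck.Match(soFar);
            if (m.Success) return m.Groups[1].Value;
            if (soFar.IndexOf("</head>", StringComparison.OrdinalIgnoreCase) >= 0) break;
        }
        return "";
    }

    // ResponseHeadersRead makes GetAsync return once the headers arrive, so the
    // body is only pulled from the network as ReadTitleAsync consumes it.
    public static async Task<string> GetTitleAsync(string url)
    {
        using (HttpResponseMessage response = await Client.GetAsync(
                   url, HttpCompletionOption.ResponseHeadersRead))
        using (Stream stream = await response.Content.ReadAsStreamAsync())
        {
            return await ReadTitleAsync(stream);
        }
    }
}
```

A caller would use it as `string title = await TitleReader.GetTitleAsync(url);`. Disposing the response before the body has been fully read should abandon the rest of the transfer, though how many bytes have already arrived by then depends on network buffering.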

score 2 · Accepted Answer

A simpler way to handle this is to download the page and then split it:

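    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    // returns Task<string> instead of async void so the caller can await the
    // download and observe any exception
    private async Task<string> getSite(string url)
    {
        using (HttpClient hc = new HttpClient())
        using (HttpResponseMessage response = await hc.GetAsync(new Uri(url, UriKind.Absolute)))
        {
            string source = await response.Content.ReadAsStringAsync();

            //process the source here

            return source;
        }
    }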

To process the source, you can use the method described in the article Getting Content From Between HTML Tags.
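
As a rough illustration of the processing step, the title can be pulled out of the downloaded string with a regular expression. This is a minimal sketch and not necessarily the technique the linked article uses; the names TitleParser and ExtractTitle are hypothetical, and a regex is only a reasonable fit here because the title element contains plain text.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class TitleParser   // hypothetical helper name
{
    private static readonly Regex TitleTag = new Regex(
        @"<title[^>]*>(.*?)</title>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the decoded, trimmed text of the first <title> element,
    // or an empty string when the source has none.
    public static string ExtractTitle(string source)
    {
        Match m = TitleTag.Match(source);
        if (!m.Success) return "";
        // HtmlDecode turns entities such as &amp; back into plain characters
        return WebUtility.HtmlDecode(m.Groups[1].Value).Trim();
    }
}
```

With the download method above, the call would look like `string title = TitleParser.ExtractTitle(source);`.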
