c# - C# web crawler/parser/spider

Question

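I'm new to C# and WinForms, and I want to create a web crawler (parser) that parses web pages and displays them hierarchically. I also don't know how to make the bot crawl to a specific hyperlink depth.

So I guess I have two questions:

How do I make the bot crawl to a specified link depth?
How do I display all the hyperlinks hierarchically?

P.S. It would be great if the answer came with a code example.

P.P.S. The form has one button (button1) and one rich text box (richTextBox1).

Here is my code. I know it's very ugly... (all of it sits in one button handler):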

public partial class Form1 : Form
{
    public Form1()
    {
        InitializeComponent();
    }

    private void button1_Click(object sender, EventArgs e)
    {

        // Go to this URL:
        string url = UrlTextBox.Text = "http://www.yahoo.com";
        if (!(url.StartsWith("http://") || url.StartsWith("https://")))
            url = "http://" + url;

        // Declarations (the request is created only after url has been assigned)
        HttpWebRequest request = (HttpWebRequest) WebRequest.Create(url);
        HttpWebResponse response = (HttpWebResponse) request.GetResponse();
        StreamReader sr = new StreamReader(response.GetResponseStream());
        Match m;
        string anotherTest = @"(((ht){1}tp[s]?://)[-a-zA-Z0-9@:%_\+.~#?&\\]+)";
        List<string> savedUrls = new List<string>();
        List<string> titles = new List<string>();

        // Scrape the whole HTML source:
        string s = sr.ReadToEnd();

        try
        {
            // Get Urls:
            m = Regex.Match(s, anotherTest,
                            RegexOptions.IgnoreCase | RegexOptions.Compiled,
                            TimeSpan.FromSeconds(1));

            while (m.Success)
            {
                savedUrls.Add(m.Groups[1].ToString());
                m = m.NextMatch();
            }

            // Get the title and show it (skipped when the page has no <title>):
            Match m2 = Regex.Match(s, @"<title>\s*(.+?)\s*</title>");
            if (m2.Success)
            {
                titles.Add(m2.Groups[1].Value);
                richTextBox1.Text += titles[0] + "\n";
            }

            //Show Urls:
            TrimUrls(ref savedUrls);
        }
        catch (RegexMatchTimeoutException)
        {
            Console.WriteLine("The matching operation timed out.");
        }

        sr.Close();
    }

    private void TrimUrls(ref List<string> urls)
    {
        List<string> d = urls.Distinct().ToList();
        foreach (var v in d)
        {
            if (v.IndexOf('.') != -1 && v != "http://www.w3.org")
            {
                richTextBox1.Text += v + "\n";
            }
        }
    }

}

One more question: does anyone know how to save this to XML as a tree?

score 2 · Accepted Answer

I would also strongly recommend the HTML Agility Pack.

With the Html Agility Pack, you can do something like this:

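var doc = new HtmlDocument();
doc.LoadHtml(html);
var urls = new List<String>();
// SelectNodes returns null when nothing matches, so check before iterating.
var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
if (anchors != null)
{
    foreach (var x in anchors)
        urls.Add(x.Attributes["href"].Value);
}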
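
The href values collected this way are often relative (for example `../about.html` or `news/today.html`), so a crawler usually has to resolve them against the address of the page they came from before following them. A minimal sketch using `System.Uri`; the `LinkResolver` class and its `Resolve` method are hypothetical names introduced here, not part of the Html Agility Pack:

```csharp
using System;

// Hypothetical helper (not part of Html Agility Pack): turns an href found on
// a page into an absolute http/https URL, or returns null if it is unusable.
public static class LinkResolver
{
    public static string Resolve(string pageUrl, string href)
    {
        if (string.IsNullOrWhiteSpace(href)) return null;

        Uri absolute;
        try
        {
            // Combines the page address with the (possibly relative) href.
            absolute = new Uri(new Uri(pageUrl), href);
        }
        catch (UriFormatException)
        {
            return null; // malformed page URL or href
        }

        // Skip mailto: and other non-web schemes.
        if (absolute.Scheme != Uri.UriSchemeHttp && absolute.Scheme != Uri.UriSchemeHttps)
            return null;

        // Drop the #fragment so the same page is not queued twice.
        return absolute.GetLeftPart(UriPartial.Query);
    }
}
```

Each value added to `urls` could be passed through `LinkResolver.Resolve(pageUrl, href)` and discarded when the result is null.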

Edit:

You can do something like this, but please add some exception handling.

public class ParsResult
{
    public ParsResult Parent { get; set; }
    public String Url { get; set; }
    public Int32 Depth { get; set; }
}
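
Because every `ParsResult` keeps a reference to its `Parent`, a flat list of results can be turned back into a tree, which also touches on the question about saving the links as XML. A rough sketch using LINQ to XML; the `CrawlXml` class and the `links`/`link` element names are made up for illustration, and `ParsResult` is repeated only so the sample compiles on its own:

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

public class ParsResult
{
    public ParsResult Parent { get; set; }
    public String Url { get; set; }
    public Int32 Depth { get; set; }
}

// Hypothetical helper: nests each result under the element created for its parent.
public static class CrawlXml
{
    public static XElement ToXml(IEnumerable<ParsResult> results)
    {
        var root = new XElement("links");
        var elements = new Dictionary<ParsResult, XElement>();

        // Assumes a parent appears in the list before its children, which
        // holds when results are recorded in the order they are discovered.
        foreach (var r in results)
        {
            var element = new XElement("link",
                new XAttribute("url", r.Url),
                new XAttribute("depth", r.Depth));
            elements[r] = element;

            XElement parentElement;
            if (r.Parent != null && elements.TryGetValue(r.Parent, out parentElement))
                parentElement.Add(element);
            else
                root.Add(element); // top-level link, or parent not in the list
        }
        return root;
    }
}
```

Calling `Save("links.xml")` on the returned `XElement` would write the nested structure to a file. The same parent lookup could presumably fill a WinForms `TreeView` instead of XML elements.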


private readonly List<ParsResult> _results = new List<ParsResult>();
private  Int32 _maxDepth = 5;
public  void Foo(String urlToCheck = null, Int32 depth = 0, ParsResult parent = null)
{
    if (depth >= _maxDepth) return;
    String html;
    using (var wc = new WebClient())
        html = wc.DownloadString(urlToCheck ?? parent.Url);

    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    var aNods = doc.DocumentNode.SelectNodes("//a");
    if (aNods == null || !aNods.Any()) return;
    foreach (var aNode in aNods)
    {
        var url = aNode.Attributes["href"];
        if (url == null)
            continue;
        var result = new ParsResult
        {
            Depth = depth,
            Parent = parent,
            Url = url.Value
        };
        _results.Add(result);
        Console.WriteLine("{0} - {1}", depth, result.Url);
        Foo(depth: depth + 1, parent: result);
    }
}
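
One thing the recursion above does not guard against is cycles: two pages that link to each other are fetched again and again until the depth limit is reached. A possible variation that remembers visited URLs is sketched below. To keep it independent of any download or parsing library, the link extraction is passed in as a delegate. The names `DepthCrawler`, `CrawlNode`, and `getLinks` are assumptions made for this sketch and do not appear in the code above:

```csharp
using System;
using System.Collections.Generic;

public class CrawlNode
{
    public string Url;
    public int Depth;
    public List<CrawlNode> Children = new List<CrawlNode>();
}

public static class DepthCrawler
{
    // getLinks is supplied by the caller, e.g. a method that downloads the
    // page and returns its absolute hrefs.
    public static CrawlNode Crawl(string startUrl, int maxDepth,
                                  Func<string, IEnumerable<string>> getLinks)
    {
        var visited = new HashSet<string> { startUrl };
        var root = new CrawlNode { Url = startUrl, Depth = 0 };
        Visit(root, maxDepth, getLinks, visited);
        return root;
    }

    private static void Visit(CrawlNode node, int maxDepth,
                              Func<string, IEnumerable<string>> getLinks,
                              HashSet<string> visited)
    {
        if (node.Depth >= maxDepth) return; // pages at the limit are not fetched

        List<string> links;
        try
        {
            // Copy into a list so errors from lazy sequences surface here.
            links = new List<string>(getLinks(node.Url) ?? new string[0]);
        }
        catch (Exception)
        {
            return; // a page that fails to load should not stop the crawl
        }

        foreach (var link in links)
        {
            if (!visited.Add(link)) continue; // already seen, skip it
            var child = new CrawlNode { Url = link, Depth = node.Depth + 1 };
            node.Children.Add(child);
            Visit(child, maxDepth, getLinks, visited);
        }
    }

    // Writes the tree with two spaces of indentation per depth level.
    public static void Print(CrawlNode node, Action<string> writeLine)
    {
        writeLine(new string(' ', node.Depth * 2) + node.Url);
        foreach (var child in node.Children) Print(child, writeLine);
    }
}
```

In a WinForms handler the tree could be shown with something like `DepthCrawler.Print(root, line => richTextBox1.AppendText(line + "\n"))`. Since this walk is depth-first, a page is recorded at the depth where it is first met, which is not necessarily its shallowest; a queue-based breadth-first loop would be the usual alternative if that matters.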

score -1 · Accepted Answer

If you need to parse this kind of structured data (XHTML), take a look at XPath: http://msdn.microsoft.com/en-us/library/ms256086.aspx
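
As a small illustration of the XPath approach, assuming the page really is well-formed XHTML (ordinary HTML often is not, and `XmlDocument.LoadXml` throws on it), the built-in `System.Xml` classes can select every `href` directly. The `XhtmlLinks` class is a hypothetical name used only for this sketch:

```csharp
using System.Collections.Generic;
using System.Xml;

public static class XhtmlLinks
{
    public static List<string> Extract(string xhtml)
    {
        var doc = new XmlDocument();
        doc.XmlResolver = null; // do not download external DTDs
        doc.LoadXml(xhtml);     // throws XmlException if the markup is not well-formed

        // local-name() sidesteps the XHTML namespace, so the query matches <a>
        // elements whether or not xmlns="http://www.w3.org/1999/xhtml" is declared.
        var urls = new List<string>();
        foreach (XmlNode attr in doc.SelectNodes("//*[local-name()='a']/@href"))
            urls.Add(attr.Value);
        return urls;
    }
}
```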

(You should also put your logic in dedicated objects rather than leaving it in the GUI layer. You'll thank yourself later.)
