c# - Image scraper in C#

Question

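I am trying to go through a web page's source code and add each <img src="http://www.dot.com/image.jpg" to an HtmlElementCollection. Then I try to iterate over every element in that collection with a foreach loop and download the image from its URL.

Here is what I have so far. My problem right now is that nothing gets downloaded, and I suspect my elements are not being added correctly by tag name. If they are, I can't seem to reference them for the download.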

public partial class Form1 : Form
{
    public Form1()
    {
        InitializeComponent();
    }

    public void button1_Click(object sender, EventArgs e)
    {
        string url = urlTextBox.Text;
        string sourceCode = WorkerClass.ScreenScrape(url);
        StreamWriter sw = new StreamWriter("sourceScraped.html");
        sw.Write(sourceCode);
    }

    private void button2_Click(object sender, EventArgs e)
    {
        string url = urlTextBox.Text;
        WebBrowser browser = new WebBrowser();
        browser.Navigate(url);
        HtmlElementCollection collection;
        List<HtmlElement> imgListString = new List<HtmlElement>();
        if (browser != null)
        {
            if (browser.Document != null)
            {
                collection = browser.Document.GetElementsByTagName("img");
                if (collection != null)
                {
                    foreach (HtmlElement element in collection)
                    {
                        WebClient wClient = new WebClient();
                        string urlDownload = element.FirstChild.GetAttribute("src");
                        wClient.DownloadFile(urlDownload, urlDownload.Substring(urlDownload.LastIndexOf('/')));
                    }
                }
            }
        }
    }
}

}

score 3 · Accepted Answer

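Right after you call Navigate, you assume the document is ready to be traversed and checked for images. In reality, loading takes some time, so browser.Document is not ready yet at that point. You need to wait until the document has finished loading.

Add a DocumentCompleted event handler to your browser object: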

 browser.DocumentCompleted += browser_DocumentCompleted;

and implement it as:

static void browser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // The sender is the WebBrowser whose document has finished loading.
    WebBrowser browser = (WebBrowser)sender;
    HtmlElementCollection collection;
    if (browser != null)
    {
        if (browser.Document != null)
        {
            collection = browser.Document.GetElementsByTagName("img");
            if (collection != null)
            {
                foreach (HtmlElement element in collection)
                {
                    // Read src from the <img> element itself, not from its FirstChild.
                    string urlDownload = element.GetAttribute("src");
                    // Skip <img> tags that have no src attribute.
                    if (string.IsNullOrEmpty(urlDownload)) continue;
                    using (WebClient wClient = new WebClient())
                    {
                        // + 1 drops the leading '/' so the file is saved relative to the current directory.
                        wClient.DownloadFile(urlDownload, urlDownload.Substring(urlDownload.LastIndexOf('/') + 1));
                    }
                }
            }
        }
    }
}
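
The handler above builds the local file name by cutting the URL at its last '/'. That breaks down when the URL carries a query string (for example image.jpg?size=large) or ends in a slash. Below is a minimal sketch of a more defensive variant. The class name ImageFileNames, the method ToLocalPath, and the fallback name "image" are hypothetical choices for illustration, not part of the original code, and the sketch does not filter characters that are invalid in Windows file names.

```csharp
using System;
using System.IO;

static class ImageFileNames
{
    // Hypothetical helper: maps an absolute image URL to a path inside targetFolder.
    public static string ToLocalPath(string imageUrl, string targetFolder)
    {
        // For http(s) URLs, LocalPath is the unescaped path without query string or fragment.
        var uri = new Uri(imageUrl, UriKind.Absolute);
        string name = Path.GetFileName(uri.LocalPath);

        // A URL ending in '/' has no file name; fall back to a placeholder (assumption).
        if (string.IsNullOrEmpty(name))
        {
            name = "image";
        }

        return Path.Combine(targetFolder, name);
    }
}
```

Inside the loop, the call would then look like wClient.DownloadFile(urlDownload, ImageFileNames.ToLocalPath(urlDownload, targetFolder)), where targetFolder is whatever directory you want the images in.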

score 0 · Accepted Answer

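For anyone interested, here is the solution. It is exactly what Damith said. I found Html Agility Pack to be rather poor; it was the first thing I tried. Waiting for DocumentCompleted ended up being a more workable solution for me, and this is my final code.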

private void button2_Click(object sender, EventArgs e)
{
    string url = urlTextBox.Text;
    WebBrowser browser = new WebBrowser();
    // Subscribe before navigating so the handler is in place when loading finishes.
    browser.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(DownloadFiles);
    browser.Navigate(url);
}

private void DownloadFiles(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // The browser is local to button2_Click, so recover it from the event sender.
    WebBrowser browser = (WebBrowser)sender;
    HtmlElementCollection collection;

    if (browser != null)
    {
        if (browser.Document != null)
        {
            collection = browser.Document.GetElementsByTagName("img");
            if (collection != null)
            {
                foreach (HtmlElement element in collection)
                {
                    string urlDownload = element.GetAttribute("src");
                    // Skip <img> tags with a missing or empty src.
                    if (urlDownload != null && urlDownload.Length != 0)
                    {
                        using (WebClient wClient = new WebClient())
                        {
                            // + 1 drops the leading '/' before appending the name to the folder.
                            wClient.DownloadFile(urlDownload, "C:\\users\\folder\\location\\" + urlDownload.Substring(urlDownload.LastIndexOf('/') + 1));
                        }
                    }
                }
            }
        }
    }
}

score 0 · Accepted Answer

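Take a look at Html Agility Pack.

What you need to do is download and parse the HTML, then process the elements you are interested in. It is a good tool for this kind of task.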
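
A rough sketch of that download, parse, and process flow, assuming the HtmlAgilityPack NuGet package is installed. The names ImageScraper, ExtractImageUrls, and DownloadImages are hypothetical and chosen only for this example. Unlike the WebBrowser control, this approach does not run JavaScript, so images inserted by scripts will not be found.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack" (assumed to be installed)

static class ImageScraper
{
    // Hypothetical helper: returns the absolute http(s) URL of every <img src="..."> in html.
    public static List<Uri> ExtractImageUrls(string html, Uri baseUri)
    {
        var result = new List<Uri>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//img[@src]");
        if (nodes == null) return result;

        foreach (HtmlNode img in nodes)
        {
            string src = img.GetAttributeValue("src", "");
            if (src.Length == 0) continue;

            // Resolve relative paths such as "/img/a.png" against the page URL.
            Uri absolute;
            if (!Uri.TryCreate(baseUri, src, out absolute)) continue;

            // Skip data: URIs and other schemes that cannot be fetched over HTTP.
            if (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps)
            {
                result.Add(absolute);
            }
        }
        return result;
    }

    // Hypothetical helper: downloads the page, then each image, into targetFolder.
    public static void DownloadImages(string pageUrl, string targetFolder)
    {
        Directory.CreateDirectory(targetFolder);
        using (var client = new WebClient())
        {
            string html = client.DownloadString(pageUrl);
            foreach (Uri imageUri in ExtractImageUrls(html, new Uri(pageUrl)))
            {
                string name = Path.GetFileName(imageUri.LocalPath);
                if (name.Length == 0) continue; // URL has no file name part
                client.DownloadFile(imageUri, Path.Combine(targetFolder, name));
            }
        }
    }
}
```

From the button handler in the question, this could be called as ImageScraper.DownloadImages(urlTextBox.Text, "images"), where "images" is a placeholder folder name. Keeping the parsing in ExtractImageUrls separate from the network calls makes it possible to check the extraction against a fixed HTML string without touching the web.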
