c# - How to get a table from Wikipedia

Question

I want to put a table from Wikipedia into an XML file and then parse it in C#. Is that possible? If so, can I save only the Title and Genre columns in the XML?

HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://en.wikipedia.org/wiki/2012_in_film");

HtmlNode node = doc.DocumentNode.SelectSingleNode("//table[@class='wikitable']");

score 1 · Accepted Answer

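You can use the following code: find the HTML tag you are looking for, then use a regular expression to parse the rest of the data. This code looks for tables with a width of 150 and collects every URL (and its link text) inside them.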

// Requires: using System.Text.RegularExpressions; using System.Windows.Forms;
// Build the link pattern once instead of once per matching table.
Regex linkX = new Regex("<a[^>]*?href=\"(?<href>[\\s\\S]*?)\"[^>]*?>(?<Title>[\\s\\S]*?)</a>", RegexOptions.IgnoreCase);

// Collect every <table> element in the loaded document.
HtmlElementCollection tables = webBrowser1.Document.GetElementsByTagName("table");
foreach (HtmlElement table in tables)
{
    // Only tables whose width attribute is "150" are of interest.
    string width = table.GetAttribute("width");
    if (width == "150")
    {
        // Each match exposes the named groups "href" and "Title".
        foreach (Match match in linkX.Matches(table.OuterHtml))
        {
            //rest of the code
        }
    }
}
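
To see what the link pattern captures, it can be run against a small HTML fragment outside the WebBrowser control. The following is a minimal sketch, not part of the original answer: the `LinkExtractor` class and `ExtractLinks` method are hypothetical names, and the pattern is copied from the answer above. Note that the `Title` group holds whatever markup sits between `<a ...>` and `</a>`, so nested tags are returned as-is.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Same pattern as in the answer, written as a verbatim string.
    private static readonly Regex LinkPattern = new Regex(
        @"<a[^>]*?href=""(?<href>[\s\S]*?)""[^>]*?>(?<Title>[\s\S]*?)</a>",
        RegexOptions.IgnoreCase);

    // Returns one (href, inner content) pair per anchor found in the fragment.
    public static List<KeyValuePair<string, string>> ExtractLinks(string html)
    {
        var links = new List<KeyValuePair<string, string>>();
        foreach (Match match in LinkPattern.Matches(html))
        {
            links.Add(new KeyValuePair<string, string>(
                match.Groups["href"].Value,
                match.Groups["Title"].Value));
        }
        return links;
    }
}
```

Inside the loop over tables, this could be called with the table element's `OuterHtml` instead of running the regex inline.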

score 1 · Accepted Answer

You can use the WebBrowser control:

// First navigate to your address
webBrowser1.Navigate("http://en.wikipedia.org/wiki/2012_in_film");
List<string> Genre = new List<string>();
List<string> Title = new List<string>();

// When the page has loaded (Navigate returns before the document is ready,
// so run this part from the DocumentCompleted event handler)
foreach (HtmlElement table in webBrowser1.Document.GetElementsByTagName("table"))
{
    // Equals matches only class="wikitable" exactly; tables with extra classes are skipped.
    if (table.GetAttribute("className").Equals("wikitable"))
    {
        foreach (HtmlElement tr in table.GetElementsByTagName("tr"))
        {
            // 1-based position of the current <td> within this row
            int columnCount = 1;
            foreach (HtmlElement td in tr.GetElementsByTagName("td"))
            {
                // Title
                if (columnCount == 4)
                {
                    Title.Add(td.InnerText);
                }
                // Genre
                if (columnCount == 7)
                {
                    Genre.Add(td.InnerText);
                }
                columnCount++;
            }
        }
    }
}

Now you have two lists (Genre and Title). You can simply convert them to an XML file.
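
One way to turn the two lists into XML is LINQ to XML. The sketch below is a minimal illustration, assuming the lists are index-aligned (the i-th title belongs to the i-th genre). The class name `FilmXml` and the element names `Films`, `Film`, `Title`, and `Genre` are hypothetical choices, not something the original answer specifies. If the lists differ in length, the extra entries of the longer one are ignored.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class FilmXml
{
    // Pairs titles[i] with genres[i] and wraps each pair in a <Film> element.
    public static XDocument Build(IReadOnlyList<string> titles, IReadOnlyList<string> genres)
    {
        // Assumption: only rows present in both lists are written.
        int count = Math.Min(titles.Count, genres.Count);

        var films = Enumerable.Range(0, count)
            .Select(i => new XElement("Film",
                // InnerText may be null for empty cells, so fall back to "".
                new XElement("Title", (titles[i] ?? string.Empty).Trim()),
                new XElement("Genre", (genres[i] ?? string.Empty).Trim())));

        return new XDocument(new XElement("Films", films));
    }
}
```

With the lists from the answer, `FilmXml.Build(Title, Genre).Save("films.xml");` would write the file, and `XDocument.Load("films.xml")` reads it back for parsing.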

score 1 · Accepted Answer

Also consider looking at the Wikipedia API to zero in on a specific section of the Wikipedia page:

https://en.wikipedia.org/w/api.php?action=parse&page=2012_in_film&mobileformat=html&section=1&prop=wikitext

The API documentation describes how to format the results for subsequent parsing.
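
A request like this could be issued from C# with `HttpClient`. The following is a rough sketch under stated assumptions: `WikiApi`, `BuildParseUrl`, and `FetchSectionAsync` are hypothetical names, `format=json` is added so the response is plain JSON rather than the API's default pretty-printed HTML view, the `mobileformat` parameter from the URL above is left out, and the User-Agent string is a placeholder that should be replaced with one identifying your application.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class WikiApi
{
    // Builds an action=parse URL for one section of a page, asking for wikitext as JSON.
    public static string BuildParseUrl(string page, int section)
    {
        return "https://en.wikipedia.org/w/api.php?action=parse"
             + "&page=" + Uri.EscapeDataString(page)
             + "&section=" + section
             + "&prop=wikitext&format=json";
    }

    // Downloads the raw JSON response; parsing the wikitext out of it is left to the caller.
    public static async Task<string> FetchSectionAsync(string page, int section)
    {
        using (var client = new HttpClient())
        {
            // Placeholder User-Agent (assumption); use one that identifies your tool.
            client.DefaultRequestHeaders.UserAgent.ParseAdd("WikiTableSample/0.1 (example)");
            return await client.GetStringAsync(BuildParseUrl(page, section));
        }
    }
}
```

For example, `string json = await WikiApi.FetchSectionAsync("2012_in_film", 1);` fetches the same section as the URL above. Which section number holds the table you want has to be checked against the page itself.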
