c# - What is the best way to store data from an HTML table?

Question

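I'm currently reading an HTML document using CsQuery. The document contains several HTML tables, and I need to read the data while preserving its structure. At the moment I just have a list of lists of lists of strings: a list of tables, each containing a list of rows, each containing a list of cells whose content is a string.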

 List<List<List<string>>> page_tables = document_div.Cq().Find("TABLE")
    .Select(table => table.Cq().Find("TR")
               .Select(tr => tr.Cq().Find("td")
                               .Select(td => td.InnerHTML).ToList())
               .ToList())
    .ToList();

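Is there a better way to store this data so that I can easily access a specific table, and specific rows and cells within it? I'm writing several methods that work with this page_tables object, so I need to settle on its structure first.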

score 2 · Accepted Answer

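Is there a better way to store this data so that I can easily access a specific table, and specific rows and cells within it?

In most cases, well-formed HTML fits nicely into an XML structure, so you could store it as an XML document. LINQ to XML then makes querying very easy: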

XDocument doc = XDocument.Parse("<html>...</html>");
var cellData = doc.Descendants("td").Select(x => x.Value);

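Following the comments, I feel it's necessary to point out a couple of other scenarios where this can fail: when HTML-encoded content is used, such as named HTML entities that XML does not define, and when valid HTML that doesn't require a closing tag is used, e.g. <br>.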

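(That said, these things could be handled with some pre-processing.)

All in all, this is by no means the most robust approach. However, if you can be sure the HTML you are parsing fits the bill, it is a pretty neat solution.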
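
To tie this back to the question of reaching a specific table, row, and cell, here is a minimal sketch of positional access with LINQ to XML. The class name XmlTableDemo and the helper GetCell are made up for illustration, and the sketch assumes the markup is already well-formed XML with lowercase tag names and no nested tables.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class XmlTableDemo
{
    // Hypothetical helper (not part of LINQ to XML): returns the text of one
    // cell by zero-based table, row, and cell index.
    // Assumes lowercase tags (XML names are case-sensitive) and no nested
    // tables, because Descendants("tr") would also pick up rows of inner tables.
    public static string GetCell(XDocument doc, int tableIndex, int rowIndex, int cellIndex)
    {
        return doc.Descendants("table").ElementAt(tableIndex)
                  .Descendants("tr").ElementAt(rowIndex)
                  .Elements("td").ElementAt(cellIndex)
                  .Value;
    }

    public static void Main()
    {
        var doc = XDocument.Parse(
            "<html><body><table>" +
            "<tr><td>1</td><td>Bob</td></tr>" +
            "<tr><td>2</td><td>Joe</td></tr>" +
            "</table></body></html>");

        // First table, second row, second cell.
        Console.WriteLine(GetCell(doc, 0, 1, 1)); // prints "Joe"
    }
}
```

ElementAt throws ArgumentOutOfRangeException when an index does not exist, so callers that cannot trust the input may prefer ElementAtOrDefault plus a null check.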

score 1 · Accepted Answer

You could go fully OOP and write some model classes:

// Code kept short, minimal ctors
public class Cell
{
    // Named Contents to match how the property is used in the code below
    public string Contents { get; set; }
    public Cell() { this.Contents = string.Empty; }
}

public class Row
{
    public List<Cell> Cells {get;set;}
    public Row() { this.Cells = new List<Cell>(); }
}

public class Table
{
    public List<Row> Rows {get;set;}
    public Table() { this.Rows = new List<Row>(); }
}

Then fill them up, for example:

var tables = new List<Table>();
foreach(var table in document_div.Cq().Find("TABLE"))
{
    var t = new Table();
    foreach(var tr in table.Cq().Find("TR"))
    {
        var r = new Row();
        foreach(var td in tr.Cq().Find("td"))
        {
            var c = new Cell();
            c.Contents = td.InnerHTML;
            r.Cells.Add(c);
        }
        t.Rows.Add(r);
    }
    tables.Add(t);
}

// Assuming the HTML was correct, now you have a cleanly organized 
// class structure representing the tables!

var aTable = tables.First();
var firstRow = aTable.Rows.First();
var firstCell = firstRow.Cells.First();
var firstCellContents = firstCell.Contents;
...

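I would probably go with this approach, because I always prefer to know exactly what my data looks like, especially when I'm parsing from external/unsafe/unreliable sources.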
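
Since the source may be unreliable, one possible extension is a bounds-checked lookup on top of the model, instead of chaining First() or raw indexers. The sketch below redeclares compact versions of the model classes so that it compiles on its own; TableLookup and TryGetCell are hypothetical names, not part of CsQuery or of the answer's code.

```csharp
using System;
using System.Collections.Generic;

// Compact versions of the model classes, repeated so the sketch is self-contained.
public class Cell { public string Contents { get; set; } = string.Empty; }
public class Row { public List<Cell> Cells { get; } = new List<Cell>(); }
public class Table { public List<Row> Rows { get; } = new List<Row>(); }

public static class TableLookup
{
    // Hypothetical helper: returns false instead of throwing when the
    // table, row, or cell index is out of range.
    public static bool TryGetCell(IReadOnlyList<Table> tables, int t, int r, int c, out string contents)
    {
        contents = null;
        if (t < 0 || t >= tables.Count) return false;
        var rows = tables[t].Rows;
        if (r < 0 || r >= rows.Count) return false;
        var cells = rows[r].Cells;
        if (c < 0 || c >= cells.Count) return false;
        contents = cells[c].Contents;
        return true;
    }

    public static void Main()
    {
        // Build one table with a single row holding a single cell.
        var table = new Table();
        var row = new Row();
        row.Cells.Add(new Cell { Contents = "Bob" });
        table.Rows.Add(row);
        var tables = new List<Table> { table };

        Console.WriteLine(TryGetCell(tables, 0, 0, 0, out var found)); // prints "True"
        Console.WriteLine(found);                                      // prints "Bob"
        Console.WriteLine(TryGetCell(tables, 0, 5, 0, out _));         // prints "False"
    }
}
```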

score 0 · Accepted Answer

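Is there a better way to store this data so that I can easily access a specific table, and specific rows and cells within it?

If you want easy access to the table data, create a class that holds the data from a table row, with well-named properties for the corresponding columns. For example, if you have a users table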

<table>
    <tr><td>1</td><td>Bob</td></tr>
    <tr><td>2</td><td>Joe</td></tr>
</table>

I would create the following class to hold the row data:

public class User
{
    public int Id { get; set; }
    public string Name { get; set; }
}

The second step is parsing the users from the HTML. I suggest using HtmlAgilityPack (available from NuGet) to parse the HTML:

HtmlDocument doc = new HtmlDocument();            
doc.Load("index.html");
var users = from r in doc.DocumentNode.SelectNodes("//table/tr")
            let cells = r.SelectNodes("td")
            select new User
            {
                Id = Int32.Parse(cells[0].InnerText),
                Name = cells[1].InnerText
            };
// NOTE: you can check cells count before accessing them by index

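Now you have a collection of strongly typed user objects (you can save them to a list, an array, or a dictionary, depending on how you are going to use them). For example: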

 var usersDictionary = users.ToDictionary(u => u.Id);
 // Getting user by id
 var user = usersDictionary[2];
 // now you can read user.Name
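
The note about checking the cell count can be turned into a small defensive mapper. The sketch below is library-free: it assumes the cell texts have already been pulled out of the HTML (for instance via InnerText), and UserRowParser and TryParseUser are hypothetical names introduced only for this example.

```csharp
using System;
using System.Collections.Generic;

public class User
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public static class UserRowParser
{
    // Hypothetical helper: maps the cell texts of one row to a User.
    // Returns false for rows with too few cells or a non-numeric Id.
    public static bool TryParseUser(IReadOnlyList<string> cells, out User user)
    {
        user = null;
        if (cells == null || cells.Count < 2) return false;
        if (!int.TryParse(cells[0]?.Trim(), out int id)) return false;
        user = new User { Id = id, Name = (cells[1] ?? string.Empty).Trim() };
        return true;
    }

    public static void Main()
    {
        var rows = new List<string[]>
        {
            new[] { "1", "Bob" },
            new[] { "2", "Joe" },
            new[] { "Id", "Name" }, // e.g. a header row that slipped through
            new[] { "3" }           // a row with a missing cell
        };

        // Keep only the rows that map cleanly, keyed by Id.
        var usersById = new Dictionary<int, User>();
        foreach (var cells in rows)
        {
            if (TryParseUser(cells, out User u)) usersById[u.Id] = u;
        }

        Console.WriteLine(usersById.Count);   // prints "2"
        Console.WriteLine(usersById[2].Name); // prints "Joe"
    }
}
```

Skipping malformed rows silently is a design choice; logging or collecting them may be more appropriate when the missing data matters.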

score 0 · Accepted Answer

Since you're parsing an HTML table, could you use an ADO.NET DataTable? If the content doesn't have too many row or column spans, this may be an option: you wouldn't have to roll your own structure, and it could easily be saved to a database, a list of entities, or whatever else you need. Plus, you get the benefit of strongly typed columns. As long as the HTML tables are consistent, I would prefer an approach like this, since it makes interoperability with the rest of the framework seamless and takes a lot less work.
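
A minimal sketch of the DataTable idea, assuming the cell texts of each row have already been extracted into string arrays and that there are no row or column spans. DataTableDemo, ToDataTable, and the generated column names are made up for illustration. Every column is stored as a string here; typed columns could be declared instead (for example with typeof(int)) once the layout is known.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class DataTableDemo
{
    // Hypothetical helper: copies rows of cell texts into a DataTable.
    // Columns are named Column0, Column1, ... and added on demand, so a
    // longer row simply widens the table (earlier rows get DBNull there).
    public static DataTable ToDataTable(string name, IEnumerable<string[]> rows)
    {
        var table = new DataTable(name);
        foreach (var cells in rows)
        {
            while (table.Columns.Count < cells.Length)
            {
                table.Columns.Add("Column" + table.Columns.Count, typeof(string));
            }

            var row = table.NewRow();
            for (int i = 0; i < cells.Length; i++)
            {
                row[i] = cells[i];
            }
            table.Rows.Add(row);
        }
        return table;
    }

    public static void Main()
    {
        var users = ToDataTable("users", new List<string[]>
        {
            new[] { "1", "Bob" },
            new[] { "2", "Joe" }
        });

        // A DataSet can hold one DataTable per HTML table on the page.
        var page = new DataSet("page");
        page.Tables.Add(users);

        // Table by name, row by index, cell by column index.
        Console.WriteLine(page.Tables["users"].Rows[1][1]); // prints "Joe"
    }
}
```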
