
I need to fetch a timesheet from a website and store it in a DataTable in my C# application.

The DataTable should be structured like this:

| Day | Time  | Status |
| 1   | 7:00  | IN     |
| 1   | 9:45  | OUT    |
| 1   | 10:15 | IN     |
| 1   | 15:45 | OUT    |
| 1   | 8:45  | TOTAL  |
| 2   | ..    | ..     |

The C# code for my DataTable:

DataTable table = new DataTable("Worksheet");
table.Columns.Add("Day");
table.Columns.Add("Time");
table.Columns.Add("Status");

I tried different variants, but I always end up messing up the data.

For testing purposes I made a new WinForms form with a TextBox (for the path to the page) and a Button (to start the process).

Then I want HtmlAgilityPack to fetch all the data. An example:

public string[] GREYsource;

public Form1()
{
    InitializeComponent();
}

private void btnSubmit_Click(object sender, EventArgs e)
{
    var doc = new HtmlAgilityPack.HtmlDocument();
    var fileName = txtPath.Text;                    // I downloaded the HTML-File
    doc.Load(fileName);

    string strGREYInner;

    foreach (HtmlNode td in doc.DocumentNode.SelectNodes("//tr[@class=\"tblDataGreyNH\"]"))
    {
        strGREYInner = td.InnerText.Trim();
        string shorted = strGREYInner.Replace("\t", "");
        string shorted2 = shorted.Replace("\n\n\n\n", "\n\n\n");
        string shorted3 = shorted2.Replace("\n\n\n", "\n\n");
        string shorted4 = shorted3.Replace("\n\n", "\n");
        GREYsource = shorted4.Split(new Char[] { '\n', });
    }

    foreach (string str in GREYsource)
    {
        ...
    }
}
  1. Problem: the result contains a lot of tabs (\t) and newlines (\n) that I need to trim.
  2. Problem: IMO this is not a good approach. It only grabs the total times.
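
For the first problem, one option is to collapse whitespace with a single regular expression instead of chaining Replace calls. This is only a minimal sketch: the class name CellText and the method Normalize are made up for illustration and are not part of HtmlAgilityPack.

```csharp
using System;
using System.Text.RegularExpressions;

static class CellText
{
    // Hypothetical helper (not part of HtmlAgilityPack): collapses every run
    // of whitespace (tabs, newlines, spaces) into one space and trims the ends.
    public static string Normalize(string raw)
    {
        if (string.IsNullOrEmpty(raw))
            return string.Empty;

        return Regex.Replace(raw, @"\s+", " ").Trim();
    }
}
```

With this, CellText.Normalize("\t\n 7:53 \n\n") returns "7:53", and a cell that contains only whitespace comes back as an empty string, which is easy to test for.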

It could be done better.

This is just one example of what I tried (the rest of my code is just a pile of junk).

I have attached the HTML structure below:

Overview (image): http://www.abload.de/img/overviewzoj18.png

In a bit more detail:

<html>
  <head>
  </head>
  <style type="text/css">
  </style>
  <body id="body" onload="handleMenuOverlapLogo();onload_column_expand();;firstElementFocus();">

    <.. some (java)scripts>             /* can be ignored, not needed */
    <.. some other divs>              /* can be ignored, not needed */
    <div id="rowContent">             /* This <div> contains the content i need */
      <div id="titleTab">             /* Title is not necessary */
      </div>                    
      <div id="rowContentInner">          /* Here the content starts */
        <table class="tblList">
          <tbody>
            <tr>              /* not necessary */
            <tr class="tblHeader">      /* not necessary */
            <tr class="tblHeader">      /* not necessary */
            <tr class="tblDataWhiteNH">   /*  IN : */
              <td class="tblHeader" style="font-weight: bold; text-align: right"> In </td>
              <td nowrap="">        /* "tblDataWhiteNH" always contains 7 "td nowrap" */
              <td nowrap="">
              <td nowrap="">        /* Example: if it contains a value */
                <table width="100%" border="0" align="center">
                <tbody>
                    <tr>
                      <td width="25%" align="left"> </td>
                      <td nowrap="" width="50%" align="center"> 7:53 </td>  /* value = 7:53 (THIS!) */
                      <td width="25%" align="right"> </td>
                    </tr>
                  </tbody>
                </table>
              </td>
              <td nowrap="">
              <td nowrap="">        /* Example: if it contains no value */
                <table width="100%" border="0" align="center">
                  <tbody>
                    <tr>
                      <td width="25%" align="left"> </td>
                      <td nowrap="" width="50%" align="center">       /* no value = 0:00 (THIS!) */
                      <td width="25%" align="right"> </td>
                    </tr>
                  </tbody>
                </table>
              </td>
              <td nowrap="">
              <td nowrap="">
            <tr class="tblDataWhiteNH">   /* OUT : */
              <td class="tblHeader" style="font-weight: bold; text-align: right"> Out </td>
              <td nowrap="">        /* "tblDataWhiteNH" always contains 7 "td nowrap" */
              <td nowrap="">
              <td nowrap="">        /* Example: if it contains a value */
                <table width="100%" border="0" align="center">
                <tbody>
                    <tr>
                      <td width="25%" align="left"> </td>
                      <td nowrap="" width="50%" align="center"> 7:53 </td>  /* value = 7:53 (THIS!) */
                      <td width="25%" align="right"> </td>
                    </tr>
                  </tbody>
                </table>
              </td>
              <td nowrap="">
              <td nowrap="">        /* Example: if it contains no value */
                <table width="100%" border="0" align="center">
                  <tbody>
                    <tr>
                      <td width="25%" align="left"> </td>
                      <td nowrap="" width="50%" align="center">       /* no value = 0:00 (THIS!) */
                      <td width="25%" align="right"> </td>
                    </tr>
                  </tbody>
                </table>
              </td>
              <td nowrap="">
              <td nowrap="">
            <tr class="tblDataGreyNH">    /*  IN : */
            <tr class="tblDataGreyNH">    /* OUT : */
            ...               /* "tblDataGreyNH" is built the same way as "tblDataWhiteNH". */
            ...               /* Sometimes there can be more "tblDataWhiteNH" and "tblDataGreyNH" rows. */
            ...               /* Usually there are just the "tblDataWhiteNH" rows (IN/OUT). */
            <tr class="tblHeader">      /* not necessary */
                            /* It continues, e.g., with "tblDataWhite" if the last row above the header was a "tblDatagrey" */
                            /* and vice versa ("grey" if there was a "white" before). */
            <tr class="tblDataWhiteNH">   /* Worked : */
              <td class="tblHeader" style="font-weight: bold; text-align: right"> Total Time </td>
              <td> 07:47 </td>      /* value = 7:47 (THIS!) */
              <td> 04:48 </td>      
              <td> 00:00 </td>      /* no value = 0:00 (THIS!) */
              <td> 00:00 </td>      
              <td> 07:42 </td>      
              <td> 00:00 </td>      
              <td> 00:00 </td>      
            </tr>
            <tr class="tblDataGreyNH">    /* Total : */
              <td class="tblHeader" style="font-weight: bold; text-align: right"> Regular Time </td>
              <td> 07:47 </td>      /* value = 7:47 (THIS!) */
              <td> 04:48 </td>      
              <td> </td>          /* no value = 0:00 (THIS!) */
              <td> </td>          
              <td> 07:42 </td>      
              <td> </td>
              <td> </td>
            </tr>
            <tr class="tblHeader">      /* not necessary */
            <tr valign="top">       /* not necessary */
          </tbody>
        </table>
      </div>
    </div>
  </body>
</html>

A copy of the original HTML: http://time.wnb.dk/123/

I hope someone can help me get this working.


OK, let me explain it with a picture: https://www.abload.de/img/eeeqnuwu.png
The picture shows the website and, below it, a table of what the result should look like.

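Declaring the DataTable is not the problem.
The main problem is that I can't get HtmlAgilityPack to return the correct results, and when it does return something, it is mostly wrong. Some of the SelectNodes expressions I tried produced garbled output after a while. So far I have not been able to get all of the data from the table on the website, only some values, and those usually have problems.
So I am really looking for someone who can take a look at this and perhaps help me find the right SelectNodes expressions.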

1 Answer

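I'm not sure I fully understand what you want to do, but here is some sample code to get you started. I strongly suggest you read up on XPath to understand it.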

        HtmlDocument doc = new HtmlDocument();
        doc.Load(yourFile);

        // get all TR with a specific class name, starting from root (/), and recursively (//)
        foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//tr[@class='tblDataGreyNH' or @class='tblDataWhiteNH']"))
        {
            // get the first TD child of the current row that has the 'tblHeader' class
            HtmlNode inOrOut = node.SelectSingleNode("td[@class='tblHeader']");
            if (inOrOut != null)
            {
                string io = inOrOut.InnerText.Trim();
                Console.WriteLine(io.ToUpper());
                if (io.Contains("Time"))
                {
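                    // normalize-space() trims and collapses whitespace (\r, \n, etc.)
                    // text() selects the node's text children
                    // SelectNodes returns null when nothing matches, so guard before looping
                    HtmlNodeCollection timeCells = node.SelectNodes("td[normalize-space(@class)='' and normalize-space(text())!='' and normalize-space(text())!='00:00']");
                    if (timeCells != null)
                    {
                        foreach (HtmlNode td in timeCells)
                        {
                            Console.WriteLine("value:" + td.InnerText.Trim());
                        }
                    }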
                }
            }

            // gets all TD below the current node that define the NOWRAP attribute
            HtmlNodeCollection tdNoWraps = node.SelectNodes("td[@nowrap]"); 
            if (tdNoWraps != null)
            {
                foreach (HtmlNode tdNoWrap in tdNoWraps)
                {
                    string value = tdNoWrap.InnerText.Trim();
                    if (value == string.Empty)
                        continue;

                    Console.WriteLine("value:" + value);
                }
            }
        }

For your sample page it outputs:

IN
value:7:47
value:7:46
value:7:45
value:7:51
OUT
value:15:35
value:15:33
value:12:38
value:8:59
IN
value:12:38
value:8:59
OUT
value:15:35
TOTAL TIME
value:07:48
value:07:47
value:07:50
value:01:08
REGULAR TIME
value:07:48
value:07:47
value:07:50
value:01:08
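
To get from this output to the DataTable in the question, the day number has to come from the position of each cell within its row. The sample code above skips empty cells, so the position is lost; for a DataTable you would pass every cell through, empty ones included. Below is a rough sketch of that mapping only, not a complete solution. WorksheetBuilder, CreateTable and AddRow are hypothetical names, and storing empty cells as "0:00" is an assumption based on the "no value = 0:00" comments in the HTML outline.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

static class WorksheetBuilder
{
    // Same three columns as the DataTable declared in the question.
    public static DataTable CreateTable()
    {
        var table = new DataTable("Worksheet");
        table.Columns.Add("Day");
        table.Columns.Add("Time");
        table.Columns.Add("Status");
        return table;
    }

    // Hypothetical helper: adds one DataTable row per cell of a single HTML row.
    // 'status' is the row header text (e.g. "IN"); 'cells' holds the cell texts
    // in day order, including the empty ones, so index + 1 is the day number.
    // Empty cells are stored as "0:00" (assumption, see the lead-in).
    public static void AddRow(DataTable table, string status, IList<string> cells)
    {
        for (int i = 0; i < cells.Count; i++)
        {
            string time = (cells[i] ?? string.Empty).Trim();
            if (time.Length == 0)
                time = "0:00";

            table.Rows.Add((i + 1).ToString(), time, status);
        }
    }
}
```

Inside the loop over the TR nodes you would collect the InnerText of each of the seven cells into a list and call WorksheetBuilder.AddRow(table, io.ToUpper(), cells). The rows then come out grouped by status rather than by day; if you need them ordered by day, you can sort afterwards, for example through table.DefaultView.Sort.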
answered 2012-09-13T12:17:55.153