php - PHP/HTML - Multi-page screen scraping, export to .txt with a comma between date and value

Question

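I'm trying to scrape a web page (see the code), along with the pages going back in time (you can see the date '20110509' in the page itself), to get simple strings of numbers. Through trial and error (I'm new to programming) I can't seem to figure out how to parse the specific data I want out of the table. I've been trying to use plain PHP/HTML without curl or anything similar. Is this possible? I think my main problem is working out the delimiters needed to pull the data out of the source.

What I want is for the program to start at the first page it can, say '20050101', and scan every page up to the current date, grabbing a specific value, e.g. "Latest close" (column), "Closing Arms" (row), and exporting that value with its date to a single .txt file, with the date and value separated by a comma. Each time the program runs, the date/value pairs should be appended to the existing text file.

I know many of the lines of code below are junk; that's part of my learning process.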

<html>
<title>HTML with PHP</title>
<body>

<?php

$rawdata = file_get_contents('http://online.wsj.com/mdc/public/page/2_3021-tradingdiary2-20110509.html?mod=mdc_pastcalendar');
//$data = substr(' ', $data);
//$begindate = '20050101';
//$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B"); 
//if (preg_match(' <td class="text"> ' , $data , $content)) {
//$content = str_replace($newlines

echo $rawdata;
///file_put_contents( 'NYSETRIN.html' , $content , FILE_APPEND);

?>

<b>some more html</b>

<?php
?>

</body>
</html>

score 3 · Accepted Answer

OK, let's do this. We'll first load the data into an HTML parser, then create an XPath object from it. XPath will help us navigate the HTML easily. So:

$date = "20110509";
$data = file_get_contents("http://online.wsj.com/mdc/public/page/2_3021-tradingdiary2-{$date}.html?mod=mdc_pastcalendar");

$doc = new DOMDocument();
@$doc->loadHTML($data);

$xpath = new DOMXPath($doc);
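As a side note on the `@` in front of `loadHTML`: a possible alternative is to let libxml collect its parse warnings internally, so they can be inspected or discarded without suppressing other errors. The sketch below uses a hypothetical one-table snippet with a deliberately stray closing tag in place of the real page.

```php
<?php
// Hypothetical sample markup with a stray </b>, standing in for the real page.
$html = '<table class="mdcTable"><tr><td>Latest close</td></b></tr></table>';

// Collect libxml parse warnings internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Look at what the parser complained about, then clear the buffer
// and restore the previous error-handling mode.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);
$tables = $xpath->query("//table[@class='mdcTable']");
echo "Parse warnings: " . count($errors) . ", tables found: {$tables->length}\n";
```

The rest of the walkthrough works the same way with either approach; only the handling of malformed markup differs.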

Now we need to grab some data. First, let's get all the data tables. Looking at the source, these tables are identified by the class mdcTable:

$result = $xpath->query("//table[@class='mdcTable']");
echo "Tables found: {$result->length}\n";

So far:

$ php test.php
Tables found: 5

OK, so we have the tables. Now we need to get a specific column. Let's use the Latest close column you mentioned:

$result = $xpath->query("//table[@class='mdcTable']/*/td[contains(.,'Latest close')]");
foreach($result as $td) {
  echo "Column contains: {$td->nodeValue}\n";
}

Result so far:

$ php test.php
Column contains: Latest close
Column contains: Latest close
Column contains: Latest close
... etc ...

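Now we need the column index so we can pick that column out of a specific row. We get it by counting all the preceding sibling elements and then adding one, because XPath element index selectors are 1-indexed, not 0-indexed: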

$result = $xpath->query("//table[@class='mdcTable']/*/td[contains(.,'Latest close')]");
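// Count only sibling cells in the same row; read ->length rather than count() so the result does not depend on the PHP version
$column_position = $xpath->query('preceding-sibling::*', $result->item(0))->length + 1;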
echo "Position is: $column_position\n";

The result:

$ php test.php
Position is: 2
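To see why the sibling count gives the column position, here is a minimal sketch on a hypothetical, simplified table (the real page's markup may differ). It also shows, for contrast, that the broader `preceding::*` axis counts elements from earlier rows as well, so it would not give a column position.

```php
<?php
// Hypothetical, simplified stand-in for one of the page's tables.
$html = <<<HTML
<table class="mdcTable">
  <tr><td class="colhead">NYSE</td></tr>
  <tr><td></td><td>Latest close</td><td>Previous close</td></tr>
  <tr><td>Closing Arms</td><td>1.26</td><td>0.98</td></tr>
</table>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$header = $xpath->query("//table[@class='mdcTable']/*/td[contains(.,'Latest close')]")->item(0);

// Element siblings that come before the header cell within its own row.
$siblings_before = $xpath->query('preceding-sibling::*', $header)->length;
$column_position = $siblings_before + 1; // XPath positions start at 1

// For contrast: preceding::* also includes elements from earlier rows.
$all_before = $xpath->query('preceding::*', $header)->length;

echo "Position is: $column_position\n";
echo "Siblings before: $siblings_before, all preceding elements: $all_before\n";
```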

Now we need to get our specific row:

$data_row = $xpath->query("//table[@class='mdcTable']/*/td[starts-with(.,'Closing Arms')]");
echo "Returned {$data_row->length} row(s)\n";

Here we use starts-with because the row label contains a UTF-8 symbol; matching only the start of the label makes this easier. Result so far:

$ php test.php
Returned 4 row(s)
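The difference between an exact match and starts-with can be sketched on a single row. The dagger used here is only a hypothetical stand-in for whatever symbol follows the label on the real page; it is written as an HTML entity so the example does not depend on the file's encoding.

```php
<?php
// Hypothetical row whose label ends in a non-ASCII symbol.
$html = '<table class="mdcTable">'
      . '<tr><td>Closing Arms&dagger;</td><td>1.26</td></tr>'
      . '</table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// An exact comparison fails because of the trailing symbol...
$exact = $xpath->query("//td[.='Closing Arms']");
// ...while a prefix match ignores whatever follows the label text.
$prefix = $xpath->query("//td[starts-with(.,'Closing Arms')]");

echo "Exact matches: {$exact->length}, prefix matches: {$prefix->length}\n";
```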

Now we use the column index to get the data we want:

$data_row = $xpath->query("//table[@class='mdcTable']/*/td[starts-with(.,'Closing Arms')]/../*[$column_position]");
foreach($data_row as $row) {
  echo "{$date},{$row->nodeValue}\n";
}

The result:

$ php test.php
20110509,1.26
20110509,1.40
20110509,0.32
20110509,1.01

This can now be written to a file. However, we don't yet have the markets these values apply to, so let's go ahead and grab those:

$headings = array();
$market_headings = $xpath->query("//table[@class='mdcTable']/*/td[@class='colhead'][1]");
foreach($market_headings as $market_heading) {
  $headings[] = $market_heading->nodeValue;
}

Now we can use a counter to look up which market we're on:

$data_row = $xpath->query("//table[@class='mdcTable']/*/td[starts-with(.,'Closing Arms')]/../*[$column_position]");
$i = 0;
foreach($data_row as $row) {
  echo "{$date},{$headings[$i]},{$row->nodeValue}\n";
  $i++;
}

The output:

$ php test.php
20110509,NYSE,1.26
20110509,Nasdaq,1.40
20110509,NYSE Amex,0.32
20110509,NYSE Arca,1.01

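Now over to you:

This could be turned into a function that takes a date
You'll need code to write out the file. Look at the filesystem functions for hints
This could be extended to use different columns and different rows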

score 2 · Accepted Answer

I recommend HTML Agility Pack, an HTML parser that is very handy for finding specific content in an HTML document.
