php - Web scraping: removing links that have no id/class attached

Question

Hi, I am scraping a website, but the result contains a lot of information I don't need. Here is my code:

<?php
require('phpQuery.php');
$url = 'http://www.nasdaq.com/screening/companies-by-name.aspx?letter=A';
$html = file_get_contents($url);
$pq = phpQuery::newDocumentHTML($html);
echo $pq['#CompanylistResults'];
?>

The result is:

<table id="CompanylistResults">
<tbody>
<tr>
<tr>
<td>
<a target="_blank" rel="nofollow" href="http://www.1800flowers.com">1-800 FLOWERS.COM, Inc.</a>
</td>
<td>
<td style="">$100.55M</td>
<td style="display:none"></td>
<td>United States</td>
<td>1999</td>
<td style="width:105px">Other Specialty Stores</td>

All I need is the text "1-800 FLOWERS.COM, Inc." and "$100.55M". How can I get just those?

score 0 · Accepted Answer

Try this code:

<?php
// The URL to scrape
$uri = 'http://www.nasdaq.com/screening/companies-by-name.aspx?letter=A';
// Fetch the HTML from the URL
$get = file_get_contents($uri);

// Locate the link that holds the company name
$pos1 = strpos($get, "<a target=\"_blank\" rel=\"nofollow\" href=\"http://www.1800flowers.com\">");
$pos2 = strpos($get, "</a>", $pos1);

// Locate the market-cap cell, searching from the end of the link so it comes from the same row
$pos3 = strpos($get, "<td style=\"\">", $pos2);
$pos4 = strpos($get, "</td>", $pos3);

// Cut out only the parts that are needed
$text = substr($get, $pos1, $pos2 - $pos1);
$text3 = substr($get, $pos3, $pos4 - $pos3);

// Strip the remaining tags; only the two values you are looking for should be left
$text1 = strip_tags($text);
$text2 = strip_tags($text3);
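
The string-position approach above only handles one hard-coded company. As a rough alternative, and not part of the original answer, the same two values could be pulled from every row with PHP's built-in DOMDocument and DOMXPath classes. This is a minimal sketch: the function name extractCompanies is hypothetical, the table id is taken from the question's output, and treating "the first cell whose text starts with $" as the market cap is an assumption about the page layout, which may have changed since.

```php
<?php
// Hypothetical helper: returns [company name, market cap] pairs
// for each row of the table with id="CompanylistResults".
function extractCompanies(string $html): array
{
    $doc = new DOMDocument();
    // Scraped pages are rarely valid HTML, so collect parser warnings silently.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $rows = [];

    // Select every row whose first cell contains a link (the company name).
    foreach ($xpath->query('//table[@id="CompanylistResults"]//tr[td[1]/a]') as $tr) {
        $name = $xpath->evaluate('string(td[1]/a)', $tr);
        // Assumption: the market cap is the first cell whose text starts with "$".
        $cap = $xpath->evaluate('string(td[starts-with(normalize-space(.), "$")][1])', $tr);
        $rows[] = [trim($name), trim($cap)];
    }

    return $rows;
}
```

It could then be called as extractCompanies(file_get_contents($url)) and the returned pairs looped over with foreach. Because the rows are matched by position and content rather than by id or class, the sketch does not depend on the cells carrying any attributes.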
