I'm using DOM to parse some websites. This is the markup I'm parsing:
<option value="A26JUYT14N57PY">Aleksander's Kindle Cloud Reader</option>
<option value="A13400OMTGFDRH">Aleksander's Kindle for PC</option>
<optgroup label="----OR----" style="color:#999;font-style:normal;font-weight:normal"> </optgroup>
<option value="add-new">Register a new Kindle</option>
My script is:
$options = $dom->getElementsByTagName('option');
foreach($options as $option)
{
    $attr = $option->getAttribute('value');
    $value = $option->nodeValue;
}
On my computer, which runs PHP 5.3.9, it works correctly:
$attr1 = "A26JUYT14N57PY";
$value1 = "Aleksander's Kindle Cloud Reader";
$attr2 = "A13400OMTGFDRH";
$value2 = "Aleksander's Kindle for PC";
$attr3 = "add-new";
$value3 = "Register a new Kindle";
But when I upload the script to the server, it no longer works (I'm not sure which PHP version the server runs, but it is < 5.3.0). The result is:
$attr1 = "A26JUYT14N57PY";
$value1 = "'";
$attr2 = "A13400OMTGFDRH";
$value2 = "'";
$attr3 = "add-new";
$value3 = "";
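So only the apostrophe is left of the strings in nodeValue. I think it has something to do with encoding, but I'm not sure. The strange thing is that only the nodeValues are wrong, while the value attributes are fine.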
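To narrow down where the text is lost, it may help to run the same small diagnostic on both machines. The following is only a sketch: the sample markup is a hypothetical reduction of the options shown above, and `$report` is a name introduced here for illustration. It prints the PHP and libxml versions, lists the child nodes of every option with their raw bytes in hex, and shows any warnings the parser recorded.

```php
<?php
// Hypothetical sample markup modelled on the options shown above.
$html = '<select name="devices">'
      . '<option value="A26JUYT14N57PY">Aleksander\'s Kindle Cloud Reader</option>'
      . '<option value="add-new">Register a new Kindle</option>'
      . '</select>';

// Report the versions that may differ between the two machines.
echo 'PHP: ', PHP_VERSION, "\n";
echo 'libxml: ', LIBXML_DOTTED_VERSION, "\n";

// Collect parser warnings instead of printing them immediately.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);

$report = array();
foreach ($dom->getElementsByTagName('option') as $option) {
    $parts = array();
    foreach ($option->childNodes as $child) {
        // nodeName is "#text" for text nodes; the hex dump exposes the raw bytes.
        $parts[] = $child->nodeName . ':' . bin2hex($child->nodeValue);
    }
    $report[$option->getAttribute('value')] = $parts;
}
print_r($report);

// Print any problems the parser recorded while reading the markup.
foreach (libxml_get_errors() as $error) {
    echo trim($error->message), "\n";
}
```

If the two machines report different libxml versions, or the server splits the option text into several child nodes, that would point at the parser rather than at the script itself.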
- - - - - - - Edit
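Here is the code that parses the web page (the source of the classes it uses is below). $page is the HTML source of the page as returned by cURL. I can't give you a direct URL, because the page is only available after logging in to Amazon.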
$dom = HtmlDomParser::getDomFromHtml($page);
$form = FormDomParser::getFormByName($dom,$this->amazon_config->buy_form_name);
if($form===false)
{
    throw new AmazonParseException("Couldn't parse buy form");
}
$select = FormDomParser::getSelectByName($dom,$this->amazon_config->buy_deliveryoptions_name);
if($select === false)
{
    throw new AmazonParseException("Couldn't parse options select");
}
$options = FormDomParser::getOptions($select);
$result = array();
foreach($options as $option)
{
    //$value = $option->childNodes->item(0)->nodeValue;
    //print_r($value);
    $device_id = $option->getAttribute('value');
    $device_name = $option->nodeValue;
    echo $device_id.' = '.$device_name.'<br>';
}
HtmlDomParser
// simple class for parsing html files with DOM
class HtmlDomParser
{
    // converts html (as string) to DOM object
    public static function getDomFromHtml($html)
    {
        $dom = new DOMDocument;
        $dom->loadHTML($html);
        return $dom;
    }
    // gets all occurrences of specified tag from dom object
    // these tags must contain specified (in attributes array) attributes
    public static function getTagsByAttributes($dom,$tag,$attributes = array())
    {
        $result = array();
        $elements = $dom->getElementsByTagName($tag);
        foreach($elements as $element)
        {
            $attributes_ok = true;
            foreach($attributes as $key => $value)
            {
                if($element->getAttribute($key)!=$value)
                {
                    $attributes_ok = false;
                    break;
                }
            }
            if($attributes_ok)
            {
                $result[] = $element;
            }
        }
        return $result;
    }
}
FormDomParser
class FormDomParser
{
    // gets form (as dom object) with specified name
    public static function getFormByName($dom,$form_name)
    {
        $attributes['name'] = $form_name;
        $forms = HtmlDomParser::getTagsByAttributes($dom,'form',$attributes);
        if(count($forms)<1)
        {
            return false;
        }
        else
        {
            return $forms[0];
        }
    }
    // gets all <input ...> tags from specified DOM object
    public static function getInputs($dom)
    {
        $inputs = HtmlDomParser::getTagsByAttributes($dom,'input');
        return $inputs;
    }
    // internal / converts array of DOM objects into associative array
    public static function convertInputsToArray($inputs)
    {
        $inputs_array = array();
        foreach($inputs as $input)
        {
            $name = $input->getAttribute('name');
            $value = $input->getAttribute('value');
            if($name!='')
            {
                $inputs_array[$name] = $value;
            }
        }
        return $inputs_array;
    }
    // gets all <select ...> tags from DOM object
    public static function getSelects($dom)
    {
        $selects = HtmlDomParser::getTagsByAttributes($dom,'select');
        return $selects;
    }
    // gets <select ...> tag with specified name from DOM object
    public static function getSelectByName($dom,$name)
    {
        $attributes['name'] = $name;
        $selects = HtmlDomParser::getTagsByAttributes($dom,'select',$attributes);
        if(count($selects)<1)
        {
            return false;
        }
        else
        {
            return $selects[0];
        }
    }
    // gets <option ...> tags from DOM object
    public static function getOptions($dom)
    {
        $options = HtmlDomParser::getTagsByAttributes($dom,'option');
        return $options;
    }
    // gets action value from form (as DOM object)
    public static function getAction($dom)
    {
        $action = $dom->getAttribute('action');
        if($action == "")
        {
            return false;
        }
        else
        {
            return $action;
        }
    }
}
- - - - - Edit
Here are the HTTP headers of the site I want to parse (as returned by cURL):
HTTP/1.1 200 OK
Date: Fri, 11 May 2012 08:54:23 GMT
Server: Server
x-amz-id-1: 0CHN2KA4VD4FTXF7K62J
p3p: policyref="http://www.amazon.com/w3c/p3p.xml",CP="CAO DSP LAW CUR ADM IVAo IVDo CONo OTPo OUR DELi PUBi OTRi BUS PHY ONL UNI PUR FIN COM NAV INT DEM CNT STA HEA PRE LOC GOV OTC "
x-frame-options: SAMEORIGIN
x-amz-id-2: fFWynUQG0oqudmoDO+2FEraC2H+wWl0p9RpOyGxwyXKOc9u/6f2v8ffWUFkaUKU6
Vary: Accept-Encoding,User-Agent
Content-Type: text/html; charset=ISO-8859-1
Set-cookie: ubid-main=190-8691333-9825146; path=/; domain=.amazon.com; expires=Tue, 01-Jan-2036 08:00:01 GMT
Set-cookie: session-id-time=2082787201l; path=/; domain=.amazon.com; expires=Tue, 01-Jan-2036 08:00:01 GMT
Set-cookie: session-id=187-8097468-1751521; path=/; domain=.amazon.com; expires=Tue, 01-Jan-2036 08:00:01 GMT
Transfer-Encoding: chunked
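Since the Content-Type header reports charset=ISO-8859-1, one thing that could be tried is handing DOMDocument an explicitly UTF-8 document. The sketch below is an assumption-laden illustration, not a confirmed fix: the sample option names in the question are plain ASCII, so the encoding may well not be the cause. `getDomFromLatin1Html` is a hypothetical helper that is not part of the classes above, the sample `$page` is made up, and the XML declaration prefix is a commonly used workaround rather than a documented parameter of loadHTML.

```php
<?php
// Hypothetical helper: re-encode an ISO-8859-1 page as UTF-8 before parsing.
function getDomFromLatin1Html($html)
{
    // Convert the ISO-8859-1 bytes to UTF-8.
    $utf8 = iconv('ISO-8859-1', 'UTF-8', $html);

    // Keep parser warnings for real-world markup out of the output.
    libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    // Common workaround: the prefix hints to the HTML parser that the input is UTF-8.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $utf8);
    libxml_clear_errors();
    return $dom;
}

// Made-up sample: \xE9 is "e with acute accent" in ISO-8859-1.
$page = "<select><option value=\"x\">Caf\xE9 Kindle</option></select>";
$dom = getDomFromLatin1Html($page);
$option = $dom->getElementsByTagName('option')->item(0);
$name = $option->nodeValue;

// Dump the bytes so the result can be compared between machines.
echo bin2hex($name), "\n";
```

Comparing the hex dump from the local machine with the one from the server would show whether both return the same bytes for the same input.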
- - - - - - - - - - - - Edit
I just switched to http://simplehtmldom.sourceforge.net and it works fine.