
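I have a task to parse http://www.olx.in/cars-cat-378 with regular expressions to extract the car, location, and price. I have seen many posts saying that regular expressions are a poor fit for parsing web pages, but at least this time I still have to use them. I tried the approach shown below, but it does not work.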

<?php

 /**
 * Initialize the cURL session
 */
 $ch = curl_init();


 /**
 * Set the URL of the page or file to download.
 */
 curl_setopt($ch, CURLOPT_URL, 'http://www.olx.in/cars-cat-378');

 /**
 * Ask cURL to return the contents in a variable instead of simply echoing them to  the browser.
 */
 curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

 /**
 * Execute the cURL session
 */
 $contents = curl_exec($ch);
 /**
 * Pattern intended to capture the fields of one listing item
 */
 $reg='/<div class="li .*?"><div class="row clearfix"><div class="c-1 table-cell"><div class="cropit">.*?<\/div><\/div><div class="second-column-container  table-cell"><h3><a .*?>(.*?)<\/a><\/h3><div class="c-4"><span>(.*?)<\/span> - <span>(.*?)<\/span> - <span>(.*?)<\/span> - <span>(.*?)<\/span><\/div><span class="itemlistinginfo clearfix"><a .*?>(.*?)<\/a><\/span><div .*?><\/div><\/div><div class="third-column-container table-cell">(.*?)<\/div><div class="fourth-column-container table-cell">(.*?)<\/div><\/div><\/div>/';

 preg_match($reg,$contents,$result);

 var_dump($result);

 /**
 * Close cURL session
 */
 curl_close ($ch);





?>

The HTML for each list item on the page is as follows:

<div class="li even">
       <div class="row clearfix">
           <div class="c-1 table-cell">
                <div class="cropit">
                    <a class="pics-lnk" href="http://newdelhi.olx.in/honda-prelude-2-door-sports-car-for-sale-iid-437128570">
                        <img src="http://images04.olx-st.com/ui/14/85/70/t_1347220402_437128570_4.jpg" width="111"
                            alt="HONDA PRELUDE,,2 DOOR ,,SPORTS CAR FOR SALE." title="HONDA PRELUDE,,2 DOOR ,,SPORTS CAR FOR SALE. - India"
                            height="83" style="margin-top:0px;" />
                    </a>
                </div>
            </div>
            <div class="second-column-container  table-cell">
        <h3>
        <a href="http://newdelhi.olx.in/honda-prelude-2-door-sports-car-for-sale-iid-437128570"  title="HONDA PRELUDE,,2 DOOR ,,SPORTS CAR FOR SALE. - India">
        HONDA PRELUDE,,2 DOOR ,,SPORTS CAR FOR SALE.</a>
        </h3>


        <div class="c-4">
        <span>Year: 1996</span> - <span>Make: Honda</span> - <span>Model: Prelude</span> - <span>66,400.00 km</span>    </div>
        <span class="itemlistinginfo clearfix">
        <a href="http://newdelhi.olx.in/cars-cat-378">Cars - Delhi</a>    </span>

        <div style="display:none;" class="fbfriends_loadme" id="fbfriends_loadme_437128570" rel="5656149"></div>

            </div>            
            <div class="third-column-container table-cell">
                                    र 2,65,000.00                              </div>
            <div class="fourth-column-container table-cell">
                                    Yesterday, 15:53                            </div>            
        </div>
    </div>

The regular expression I am using is:

/<div class="li .*?"><div class="row clearfix"><div class="c-1 table-cell"><div class="cropit">.*?<\/div><\/div><div class="second-column-container  table-cell"><h3><a .*?>(.*?)<\/a><\/h3><div class="c-4"><span>(.*?)<\/span> - <span>(.*?)<\/span> - <span>(.*?)<\/span> - <span>(.*?)<\/span><\/div><span class="itemlistinginfo clearfix"><a .*?>(.*?)<\/a><\/span><div .*?><\/div><\/div><div class="third-column-container table-cell">(.*?)<\/div><div class="fourth-column-container table-cell">(.*?)<\/div><\/div><\/div>/

1 Answer


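The problem is that the source you are parsing contains whitespace between tags, and your pattern does not allow for it, so it cannot match. You should sprinkle \s*? in wherever whitespace may appear.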
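As a minimal sketch of the whitespace point, the snippet below compares a pattern that expects tags to touch with one that tolerates whitespace between them. The `$html` fragment is a hypothetical cut-down version of the listing markup, and the variable names are illustrative. It uses greedy `\s*` and `[^>]*` for brevity; between tags, `\s*` and `\s*?` end up matching the same text.

```php
<?php
// Hypothetical fragment modelled on the <h3> part of one listing item.
$html = "<h3>\n        <a href=\"#\">HONDA PRELUDE</a>\n        </h3>";

// Expects <h3> to be followed immediately by <a ...>, as in the question.
$strict = '/<h3><a [^>]*>(.*?)<\/a><\/h3>/';

// Allows optional whitespace (including newlines) between the tags.
$tolerant = '/<h3>\s*<a [^>]*>(.*?)<\/a>\s*<\/h3>/';

$strictHit   = preg_match($strict, $html, $m1);   // 0: the newline after <h3> blocks the match
$tolerantHit = preg_match($tolerant, $html, $m2); // 1: $m2[1] holds the link text

echo "strict: $strictHit, tolerant: $tolerantHit\n";
```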

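The same applies to your <a .*?>(.*?)<\/a> block: . matches space characters but not newlines. Use <a .*?>\s*?(.*?)\s*?<\/a> instead. Whenever you skip over a large block, .*? will not do either, for the same reason. Use [\s\S]*? (whitespace or non-whitespace) instead.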
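The following sketch illustrates why `.*?` fails across lines and two ways around it. The `$block` and `$link` strings are hypothetical samples, not the live page. The `s` (dot-all) modifier is a standard PCRE alternative to `[\s\S]` that the answer does not mention. The last part shows a side effect worth knowing: with a lazy `\s*?` in front of the capture group, the leading indentation can end up inside the capture, so a greedy `\s*` or a `trim()` may be preferable.

```php
<?php
// Hypothetical multi-line block, similar to the "cropit" div in the question.
$block = "<div class=\"cropit\">\n  <a href=\"#\">\n    <img src=\"x.jpg\" />\n  </a>\n</div>";

$dot    = preg_match('/<div class="cropit">.*?<\/div>/', $block);      // 0: "." stops at "\n"
$any    = preg_match('/<div class="cropit">[\s\S]*?<\/div>/', $block); // 1: [\s\S] matches any character
$dotAll = preg_match('/<div class="cropit">.*?<\/div>/s', $block);     // 1: "s" lets "." match "\n"

// Hypothetical link whose text starts on the next line, as in the listing HTML.
$link = "<a href=\"#\">\n        HONDA PRELUDE</a>";

preg_match('/<a .*?>\s*?(.*?)\s*?<\/a>/', $link, $lazy);  // capture keeps the leading spaces
preg_match('/<a .*?>\s*(.*?)\s*<\/a>/', $link, $greedy);  // capture is the bare title

echo "dot: $dot, any: $any, dotAll: $dotAll\n";
echo "[" . $lazy[1] . "] vs [" . $greedy[1] . "]\n";
```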

Third, you are using preg_match, which only gives you the first match. You should use preg_match_all.
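Putting the three points together, here is a hedged sketch that pulls the title and price from every item rather than just the first. The `$html` sample is a hypothetical, heavily shortened stand-in for the real page (two items, made-up titles and prices), and `$reg` is an illustrative pattern, not a drop-in replacement for the full expression in the question.

```php
<?php
// Hypothetical two-item sample; the real page has more columns per item.
$html = <<<HTML
<div class="li even">
    <h3>
    <a href="#a">Car A</a>
    </h3>
    <div class="third-column-container table-cell">
        1,000.00    </div>
</div>
<div class="li odd">
    <h3>
    <a href="#b">Car B</a>
    </h3>
    <div class="third-column-container table-cell">
        2,000.00    </div>
</div>
HTML;

// \s* absorbs whitespace between tags; [\s\S]*? skips over multi-line content.
$reg = '/<h3>\s*<a [^>]*>\s*(.*?)\s*<\/a>\s*<\/h3>[\s\S]*?'
     . '<div class="third-column-container table-cell">\s*(.*?)\s*<\/div>/';

// preg_match stops after the first item.
$firstOnly = preg_match($reg, $html, $one);

// preg_match_all with PREG_SET_ORDER returns one array per item.
$count = preg_match_all($reg, $html, $all, PREG_SET_ORDER);

foreach ($all as $item) {
    echo $item[1] . " => " . $item[2] . "\n";
}
```

With `PREG_SET_ORDER`, each entry of `$all` holds the full match at index 0 followed by the captured groups, which keeps the fields of one listing together.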

Answered 2012-09-10T09:44:28.523