php - Reading an HTML file fetched with curl using XPath or ->nextSibling

Question

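How can I use DOMDocument to extract values such as "15", "Teaching Periods 1 and 2.", and "Max 150." from the sample HTML below? I have tried, but the text I want is preceded by several different tags, which makes it hard to extract everything in one pass. I also need to save all of the extracted data to a MySQL database.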

<P><SPAN STYLE="font-size: 16pt; font-weight: bold"><a name="CS1050"class="modtitle">CS1050 Fundamentals of Internet Computing</a></SPAN></p>
<P><B>Credit Weighting: </B>15<BR><BR>
<B>Teaching Period(s): </B>Teaching Periods 1 and 2.<BR><BR>
<B>No. of Students: </B>Max 150.<BR><BR>
<B>Pre-requisite(s): </B>None<BR><BR>
<B>Co-requisite(s): </B>None<BR><BR>
<B>Teaching Methods: </B>72 x 1hr(s) Lectures; 18 x 2hr(s) Practicals.<BR><BR>
<B>Module Co-ordinator: </B>Professor Gregory Provan, Department of Computer Science.     <BR><BR>
<B>Lecturer(s): </B> Mr Gavin Russell, Department of Computer Science.<BR><BR>
<B>Module Objective: </B>To introduce students to Internet computer systems, web design, and<BR>client-side programming.<BR><BR>
<B>Module Content: </B>This module provides an introduction to the key concepts of Internet computing. Starting with the fundamentals of computer systems and the Internet, students progress to learn how to design web sites and how to utilize simple client-side programming. Issues related to user interface design and human-computer interfacing (HCI) are covered. Broader issues related to the use of the Internet for Blogging and Social Networks are discussed. The practical element of the module allows students to develop skills necessary for web site design using simple client side programming.<BR><BR>
<B>Learning Outcomes: </B>On successful completion of this module, students should be able to:<BR>&middot; Understand the fundamental principles of computer systems and the Internet;<BR>&middot; Design web sites;<BR>&middot; Use simple client-side programming;<BR>&middot; Understand the principles of user interface design and human-computer interfaces.<BR><BR>
<B>Assessment: </B>Total Marks 300: End of Year Written Examination 240 marks; Continuous Assessment 60 marks (Departmental Tests; Assignments).<BR><BR>
<B>Compulsory Elements: </B>End of Year Written Examination; Continuous Assessment.<BR<BR>
<B>Penalties (for late submission of Course/Project Work etc.): </B>Work which is submitted late shall be assigned a mark of zero (or a Fail Judgement in the case of Pass/Fail modules).<BR><BR>
<B>Pass Standard and any Special Requirements for Passing Module: </B>40%.<BR><BR>
<B>End of Year Written Examination Profile: </B>1 x 3 hr(s) paper(s).<BR><BR>
<B>Requirements for Supplemental Examination: </B>1 x 3 hr(s) paper(s) to be taken in Autumn. The mark for Continuous Assessment is carried forward.</P>



                   MY SAMPLE CURL CODE

$content3 = $dom->getElementsByTagName('p');
$content4 = $dom->getElementsByTagName('b');

//===========================================
//=====  EXTRACT P STUFF ====================
//===========================================
foreach ($content3 as $value) {
    $contentnew[] = $value;
    print_r($value);

    echo "Attribute Value = ";
    echo $value->getAttribute('value');
    echo "<br />";

    // let's get hold of the text value from the node
    $mytempvariable = $value->nodeValue;
    print "CONTENT OF P NODE: \n\n$mytempvariable <br /> <br />\n\n\n";
}
echo "<br /> <br /> <br />";

//===========================================
//===== EXTRACT B STUFF =====================
//===========================================
foreach ($content4 as $value) {
    $contentnew[] = $value;

    echo "Attribute Value = ";
    echo $value->getAttribute('value');
    echo "<br />";

    print_r($value);
    // let's get hold of the text value from the node
    $mytempvariable = $value->nodeValue;
    print "CONTENT OF B NODE: \n\n$mytempvariable <br /> <br />\n\n\n";
}
echo "<br /> <br /> <br />";

I have heard that I can use ->nextSibling or XPath to extract the data that follows each b node, but I have not been able to get XPath to return all the data I need. Is this possible?

score 0 · Accepted Answer

You're very close:

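$result = array();
foreach ($dom->getElementsByTagName('b') as $node) {
    // Key: the label text without its trailing colon, e.g. "Credit Weighting".
    $key = preg_replace('/:\s*$/', '', $node->textContent);
    // Value: the text node right after </b>; guard against a <b> with no following sibling.
    $result[$key] = $node->nextSibling ? trim($node->nextSibling->textContent) : '';
}
var_dump($result);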
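
One caveat: ->nextSibling returns only the first node after each b element, so fields whose text is broken up by <BR> tags (such as "Module Objective" and "Learning Outcomes" in the sample) would be cut off at the first line break. The following is a minimal sketch of one way around that, not part of the original answer: it walks every following sibling until the next b element and joins the text it finds. The function name extractFields and the shortened sample HTML are hypothetical, and the sketch assumes all labels and values sit inside the same parent element, as they do in the question's markup.

```php
<?php
// Hypothetical helper (not from the original answer): maps each <b> label
// to all of the text that follows it, up to the next <b> element.
function extractFields(string $html): array
{
    $dom = new DOMDocument();
    // Scraped HTML is rarely valid; collect parser warnings instead of printing them.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $result = [];

    foreach ($xpath->query('//b') as $label) {
        // "Credit Weighting: " becomes "Credit Weighting".
        $key = preg_replace('/:\s*$/', '', trim($label->textContent));

        // Walk the siblings after </b> until the next <b> (or the end of the parent).
        $parts = [];
        for ($n = $label->nextSibling; $n !== null; $n = $n->nextSibling) {
            if ($n->nodeType === XML_ELEMENT_NODE && strtolower($n->nodeName) === 'b') {
                break;
            }
            $text = trim($n->textContent);
            if ($text !== '') {
                $parts[] = $text; // <br> elements have empty text and are skipped
            }
        }
        $result[$key] = implode(' ', $parts);
    }

    return $result;
}

// Shortened stand-in for the question's markup.
$html = '<P><B>Credit Weighting: </B>15<BR><BR>'
      . '<B>Module Objective: </B>To introduce students to web design, and<BR>client-side programming.<BR><BR>'
      . '<B>No. of Students: </B>Max 150.</P>';

$fields = extractFields($html);
print_r($fields);
```

With this approach the "Module Objective" entry keeps both of its lines, joined by a single space. The resulting array has one key per label, which should map fairly directly onto the columns of a database row.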
