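First, this is a follow-up to my earlier question. I'm reposting because the person whose answer I accepted on the original post suggested I do so, since he felt the question wasn't properly defined before. Attempt 2:
I'm trying to scrape information from this web page. For clarity, here is a selected block of the page source: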
<p class="titlestyle">ANT101H5 Introduction to Biological Anthropology and Archaeology
<span class='distribution'>(SCI)</span></p>
<span class='normaltext'>
Anthropology is the global and holistic study of human biology and behaviour, and includes four subfields: biological anthropology, archaeology, sociocultural anthropology and linguistics. The material covered is directed to answering the question: What makes us human? This course is a survey of biological anthropology and archaeology. [<span class='Helpcourse'
onMouseover="showtip(this,event,'24 Lectures')"
onMouseout="hidetip()">24L</span>, <span class='Helpcourse'
onMouseover="showtip(this,event,'12 Tutorials')"
onMouseout="hidetip()">12T</span>]<br>
<span class='title2'>Exclusion: </span><a href='javascript:OpenCourse("WEBCOURSENOTFOUND.html")'>ANT100Y5</a><br>
<span class='title2'>Prerequisite: </span><a href='javascript:OpenCourse("WEBCOURSEANT102H5.pl?fv=1")'>ANT102H5</a><br>
From the sample block above, I'd like to extract the following information:
ANT101H5 Introduction to Biological Anthropology and Archaeology
Exclusion: ANT100Y5
Prerequisite: ANT102H5
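I want to get this information for every course on the page. Bear in mind that some courses may additionally list "Corequisites", and some may list no prerequisites, corequisites, or exclusions at all.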
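I've been trying to write a suitable XPath expression for this task, but I can't seem to get it quite right.
So far, with help from Dimitre Novatchev, I've been able to use the following expression: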
sites = hxs.select("(//p[@class='titlestyle'])[2]/text()[1] | (//span[@class='title2'])[2]/text() | \
(//span[@class='title2'])[2]/following-sibling::a[1]/text() | (//span[@class='title2'])[3]/text() | \
(//span[@class='title2'])[3]/following-sibling::a[1]/text()")
However, it produces the following output, which seems to cover only the first course on the page:
[{"desc": "ANT101H5 Introduction to Biological Anthropology and Archaeology \n "},
{"desc": "Exclusion: "},
{"desc": "ANT100Y5"},
{"desc": "Prerequisite: "},
{"desc": "ANT102H5"}]
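To be absolutely clear, this output is correct only in the sense that it has the right information for the first course. I need the same correct information for every course listed on the page.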
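To make the target concrete, here is a rough sketch of the per-course grouping I'm after. It uses Python's standard-library html.parser instead of XPath, purely as an illustration and not as the solution I expect. The names CourseParser, extract_courses and SAMPLE are made up for this sketch, the second course in SAMPLE is invented, and the code assumes that every course starts with a p element of class titlestyle and that each label's links end at the next br, as in the page source shown earlier.

```python
from html.parser import HTMLParser


class CourseParser(HTMLParser):
    """Group each course title with the labelled links that follow it."""

    def __init__(self):
        super().__init__()
        self.courses = []    # one dict per course, in page order
        self._mode = None    # what the next text node belongs to
        self._label = None   # current label, e.g. "Exclusion"

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "p" and cls == "titlestyle":
            # Assumption: a titlestyle paragraph opens a new course.
            self.courses.append({"title": ""})
            self._mode, self._label = "title", None
        elif tag == "span" and cls == "title2":
            self._mode = "label"
        elif tag == "a" and self._label and self.courses:
            self._mode = "link"
        elif tag == "br":
            # Assumption: a line break ends the current label's links.
            self._label = None
        elif self._mode == "title":
            # A nested tag (the distribution span) ends the title text.
            self._mode = None

    def handle_endtag(self, tag):
        self._mode = None

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.courses:
            return
        course = self.courses[-1]
        if self._mode == "title":
            course["title"] += text
        elif self._mode == "label":
            self._label = text.rstrip(":").strip()
            course.setdefault(self._label, [])
        elif self._mode == "link":
            course[self._label].append(text)


def extract_courses(html):
    """Return a list of dicts, one per course found in the HTML string."""
    parser = CourseParser()
    parser.feed(html)
    parser.close()
    return parser.courses


# Trimmed sample; the second course is invented to show the "nothing listed" case.
SAMPLE = """
<p class="titlestyle">ANT101H5 Introduction to Biological Anthropology and Archaeology
<span class='distribution'>(SCI)</span></p>
<span class='normaltext'>Survey course.<br>
<span class='title2'>Exclusion: </span><a href='#'>ANT100Y5</a><br>
<span class='title2'>Prerequisite: </span><a href='#'>ANT102H5</a><br>
</span>
<p class="titlestyle">ANT102H5 Made-up Second Course</p>
<span class='normaltext'>No requirements listed.<br></span>
"""

for course in extract_courses(SAMPLE):
    print(course)
```

Each course comes out as its own record, so a course with no exclusions or prerequisites just has a title, and a "Corequisite" label would show up as one more key. That per-course structure is what I'd like the XPath version to produce.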
I'm so close, but I can't seem to figure out the last step.
I'd appreciate any help. Thanks in advance.