python - How to get text which has no HTML tag | Add multiple delimiters in split

Question

The following XPath selects the div element with class ajaxcourseindentfix, splits its text on "Prerequisite: ", and gives me everything after the prerequisite label.

div = soup.select("div.ajaxcourseindentfix")[0]
" ".join([word for word in div.stripped_strings]).split("Prerequisite: ")[-1]

My div can contain not only Prerequisite but also the following split points:

Pre- or corequisite
Corequisite
Requisite

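Now, as long as the text contains Prerequisite, the XPath above works fine, but as soon as any of the three labels above appears instead, the XPath fails and gives me the whole text.

Is there a way to put multiple delimiters in the XPath? Or how else can I solve this?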

Sample pages:

Corequisite URL: http://catalog.fullerton.edu/ajax/preview_course.php?catoid=16&coid=96106&show

Prerequisite URL: http://catalog.fullerton.edu/ajax/preview_course.php?catoid=16&coid=96564&show

Both: http://catalog.fullerton.edu/ajax/preview_course.php?catoid=16&coid=98590&show

[Old thread] - How to get text which has no HTML tag

score 1 · Accepted Answer

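Unless you specifically need XPath, this code solves your problem. I also recommend reading the BeautifulSoup documentation for the methods I use here.

.next_element and .next_sibling are very useful in these cases. Alternatively, .next_elements gives us a generator, which we have to convert to a list or consume in some other way that lets us work with it.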
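As a small illustration of how these attributes differ, here is a minimal sketch on a made-up HTML fragment. The markup is an assumption loosely modelled on a course preview, not the real catalog page, and it uses the built-in html.parser so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page is structured differently.
html = (
    "<div><h3>CPSC 000 - Sample</h3>"
    "<strong>Prerequisite:</strong> <a href='#'>CPSC 001</a>.</div>"
)
soup = BeautifulSoup(html, "html.parser")
h3 = soup.h3

# .next_element: the next node in document order (here, the string inside <h3>).
first_node = h3.next_element

# .next_sibling: the next node at the same level of the tree (here, <strong>).
sibling = h3.next_sibling

# .next_elements is a generator; list() makes it indexable.
nodes = list(h3.next_elements)

# Keep only the text nodes, skipping the Tag objects.
strings = [str(n) for n in nodes if isinstance(n, str)]
print(strings)
```

Indexing into list(h3.next_elements) works, but the positions depend on the exact markup, so filtering by node type or tag name tends to be less fragile than fixed indices.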

from bs4 import BeautifulSoup
import requests


url = 'http://catalog.fullerton.edu/ajax/preview_course.php?catoid=16&coid=96564&show'
makereq = requests.get(url).text

soup = BeautifulSoup(makereq, 'lxml')

whole = soup.find('td', {'class': 'custompad_10'})
# select the whole table cell (td); not strictly needed in this case
thedivs = whole.find_all('div')
# list of all <div> elements nested inside that cell

title_h3 = thedivs[2]
# select only the third div (index 2) and save it in a variable

mytitle = title_h3.h3
# using .h3 we can traverse (go to the child <h3> element)

mylist = list(mytitle.next_elements)
# title_h3.h3 is still part of the tree, so we collect every element that follows it

the_text = mylist[3]
# we can then select specific elements 
# from a generator that we've converted into a list (i.e. list(...))

prerequisite = mylist[6]

which_cpsc = mylist[8]

other_text = mylist[11]

print(the_text, ' is the text')
print(which_cpsc, other_text, ' is the cpsc and othertext ')
# this is for testing purposes

This solves both problems, and we don't have to use CSS selectors and those odd list manipulations. Everything is organic and works well.
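If the split-based approach from the question is preferred, multiple delimiters can be handled with a regular expression instead of str.split. The following is a minimal sketch, not part of the original answer: the label list, the helper name text_after_label, and the sample strings are assumptions, and the text would normally come from " ".join(div.stripped_strings).

```python
import re

# Assumed labels; adjust them to whatever the catalog pages actually use.
LABELS = ["Pre- or corequisite", "Prerequisite", "Corequisite"]

# Any label, an optional plural "s", a colon, then optional whitespace.
SPLIT_RE = re.compile(
    r"(?:%s)s?:\s*" % "|".join(re.escape(label) for label in LABELS),
    re.IGNORECASE,
)


def text_after_label(text):
    """Return everything after the first label, or None if no label occurs."""
    parts = SPLIT_RE.split(text, maxsplit=1)
    return parts[1] if len(parts) > 1 else None


# Hypothetical sample text, standing in for the joined stripped_strings.
sample = "CPSC 000 - Sample (3) Description. Prerequisite: CPSC 001. Corequisite: CPSC 002."
print(text_after_label(sample))
```

Returning None when no label is found avoids the original problem of silently getting the whole text back. To keep only the text after the last label, as the question's [-1] does, drop maxsplit and take parts[-1].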
