python - Searching a web page for HREF values using BS4

Question

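I am developing a third-party application that reads the source of a web page. From that source we need to collect only the href values that look like /aems/file/filegetrevision.do?fileEntityId. Is that possible? My code gives me all of the href values.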

HTML (part of the HTML)

<td width="50%">
<a href="/aems/file/filegetrevision.do?fileEntityId=10597525&cs=9b7sjueBiWLBEMj2ZU4I6fyQoPv-g0NLY9ETqP0gWk4.xyz">
screenshot.doc
</a>
</td>

Code

for a in soup.find_all('a', {"style": "display:inline; position:relative;"}, href=True):
    href = a['href'].strip()
    href = "https://xyz.test.com/" + href
    print(href)

Thanks

score 2 · Accepted Answer

Yes, just use an appropriate filter for the href attribute. For example:

def href_filter(href):
    # href can be None for <a> tags that have no href attribute
    return href is not None and '/aems/file/filegetrevision' in href

soup.find_all('a', href=href_filter)
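As a fuller illustration, here is a minimal runnable sketch of the function filter applied to the markup from the question. The second and third `<a>` tags and the name `is_revision_link` are made up for the example, and `html.parser` is just one possible parser choice.

```python
from bs4 import BeautifulSoup

# Sample markup: the first link comes from the question; the other two
# <a> tags are hypothetical and should be skipped by the filter.
html = """
<td width="50%">
<a href="/aems/file/filegetrevision.do?fileEntityId=10597525&cs=9b7sjueBiWLBEMj2ZU4I6fyQoPv-g0NLY9ETqP0gWk4.xyz">
screenshot.doc
</a>
<a href="/aems/help/index.do">help</a>
<a name="no-href">anchor without href</a>
</td>
"""

def is_revision_link(href):
    # Guard against None: some versions of Beautiful Soup call the
    # function with None when the tag has no href attribute.
    return href is not None and '/aems/file/filegetrevision' in href

soup = BeautifulSoup(html, 'html.parser')

# Collect only the href values accepted by the filter function.
matches = [a['href'].strip() for a in soup.find_all('a', href=is_revision_link)]

for href in matches:
    print(href)
```

Only the first link should be printed, because the function receives the attribute value rather than the whole tag and rejects everything that lacks the path fragment.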

Instead of a function, you can also use a compiled regular expression object as the filter:

import re

href_pattern = re.compile(some_regular_expression)
soup.find_all('a', href=href_pattern)
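A concrete version of the regular-expression approach might look like the sketch below. The pattern itself is an assumption about what the wanted links look like, the `fileview.do` link is a hypothetical non-matching example, and `urljoin` is used here (instead of string concatenation) to build the absolute URL from the host that appears in the question's code.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://xyz.test.com/"  # host taken from the question's code

# Assumed pattern: href starts with the revision path and has a numeric fileEntityId.
revision_href = re.compile(r'^/aems/file/filegetrevision\.do\?fileEntityId=\d+')

# Simplified sample markup; the second link is hypothetical.
html = '''
<a href="/aems/file/filegetrevision.do?fileEntityId=10597525">screenshot.doc</a>
<a href="/aems/file/fileview.do?fileEntityId=1">other</a>
'''

soup = BeautifulSoup(html, 'html.parser')

# Beautiful Soup tests the pattern against each href with search(),
# so the leading ^ anchors the match at the start of the value.
urls = [urljoin(BASE_URL, a['href']) for a in soup.find_all('a', href=revision_href)]

print(urls)
```

Using `urljoin` avoids the doubled slash that results from concatenating a base ending in `/` with an href starting with `/`.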

See the documentation: Kinds of filters
