1

This is the code I have, but it prints the whole paragraph. How can I print only the first sentence, up to the first period?

from bs4 import BeautifulSoup
import urllib.request,time

article = 'https://www.theguardian.com/science/2012/\
oct/03/philosophy-artificial-intelligence'

req = urllib.request.Request(article, headers={'User-agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read()

soup = BeautifulSoup(html,'lxml')

def print_intro():
    if len(soup.find_all('p')[0].get_text()) > 100:
        print(soup.find_all('p')[0].get_text())

This code prints:

To state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial. The brain is the only kind of object capable of understanding that the cosmos is even there, or why there are infinitely many prime numbers, or that apples fall because of the curvature of space-time, or that obeying its own inborn instincts can be morally wrong, or that it itself exists. Nor are its unique abilities confined to such cerebral matters. The cold, physical fact is that it is the only kind of object that can propel itself into space and back without harm, or predict and prevent a meteor strike on itself, or cool objects to a billionth of a degree above absolute zero, or detect others of its kind across galactic distances.

But I only want it to print:

To state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial.

Thanks for any help.

4 Answers

4

Split the text on the period. For a single split, str.partition() is faster than str.split() with a limit:

text = soup.find_all('p')[0].get_text()
if len(text) > 100:
    text = text.partition('.')[0] + '.'
print(text)

If you only need to process the first <p> element, use soup.find() instead:

text = soup.find('p').get_text()
if len(text) > 100:
    text = text.partition('.')[0] + '.'
print(text)

However, for the URL you gave, the sample text is in the second paragraph:

>>> soup.find_all('p')[1]
<p><span class="drop-cap"><span class="drop-cap__inner">T</span></span>o state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial. The brain is the only kind of object capable of understanding that the cosmos is even there, or why there are infinitely many prime numbers, or that apples fall because of the curvature of space-time, or that obeying its own inborn instincts can be morally wrong, or that it itself exists. Nor are its unique abilities confined to such cerebral matters. The cold, physical fact is that it is the only kind of object that can propel itself into space and back without harm, or predict and prevent a meteor strike on itself, or cool objects to a billionth of a degree above absolute zero, or detect others of its kind across galactic distances.</p>
>>> text = soup.find_all('p')[1].get_text()
>>> text.partition('.')[0] + '.'
'To state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial.'
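
The same idea can be tried without fetching the page. The following is only a minimal sketch and not part of the original answer: first_sentence is a hypothetical helper name and the sample strings are made up. It reuses the separator that str.partition() returns, so no period is appended when the text does not contain one.

```python
def first_sentence(text):
    """Return text up to and including the first period.

    Hypothetical helper for illustration; if there is no period,
    the whole text is returned unchanged.
    """
    head, sep, _rest = text.partition('.')
    # sep is '.' when a period was found and '' otherwise,
    # so head + sep never adds a period that was not there.
    return head + sep


sample = "The brain is unique. It can understand the cosmos."
print(first_sentence(sample))
print(first_sentence("No period here"))
```

With BeautifulSoup, the argument would be the string returned by get_text(), as in the session above.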
answered 2016-02-09T13:02:03.427
0

Split the paragraph on the first period. The argument 1 is maxsplit, which saves you the time of unnecessary extra splits.

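def print_intro():
    # Fetch the first paragraph's text once instead of calling find_all() twice
    my_paragraph = soup.find_all('p')[0].get_text()
    if len(my_paragraph) > 100:
        # maxsplit=1 stops splitting after the first period
        my_list = my_paragraph.split('.', 1)
        # split() drops the separator, so add the period back
        print(my_list[0] + '.')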
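
To show what the maxsplit argument changes, here is a small sketch on a made-up string (the variable names are illustrative and not from the answer). Note that split() removes the separator, so the period has to be added back if it is wanted in the output.

```python
sample = "First sentence. Second sentence. Third sentence."

# Without a limit, every period produces a split,
# including an empty string after the final period.
all_parts = sample.split('.')

# With maxsplit=1, splitting stops after the first period
# and the remainder is left in one piece.
limited = sample.split('.', 1)

# split() drops the separator, so re-append it.
first = limited[0] + '.'

print(len(all_parts))
print(limited)
print(first)
```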
answered 2016-02-09T13:12:32.843
0
def print_intro():
    paragraph = soup.find_all('p')[0].get_text()
    if len(paragraph) > 100:
        # Split on every period and keep the first piece
        phrase_list = paragraph.split('.')
        # split() drops the separator, so add the period back
        print(phrase_list[0] + '.')
answered 2016-02-09T13:06:45.047
-1

You can use find('.'), which returns the index of the first occurrence of what you are looking for.

So, if the paragraph is stored in a variable called paragraph:

sentence_index = paragraph.find('.')
# move one past the '.' so the slice includes it
sentence_index += 1
print(paragraph[0:sentence_index])

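Obviously the checks are missing here, such as verifying that the string stored in paragraph actually contains a '.', and so on. In any case, find() returns -1 if it does not find the substring you are looking for.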
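
One possible way to add that check is sketched below. This is an assumption about how the missing control part might look, not the answerer's code; first_sentence_by_index is a hypothetical name, and returning the whole paragraph when no period exists is just one reasonable choice.

```python
def first_sentence_by_index(paragraph):
    """Return paragraph up to and including its first period.

    Hypothetical helper: falls back to the whole paragraph
    when str.find() reports that there is no period.
    """
    end = paragraph.find('.')
    if end == -1:
        # No period found; without this guard the slice below
        # would be paragraph[:0], an empty string.
        return paragraph
    # end is the index of the period, so slice one past it.
    return paragraph[:end + 1]


print(first_sentence_by_index("One. Two."))
print(first_sentence_by_index("No period"))
```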

answered 2016-02-09T13:13:43.283