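The first sentence of a Wikipedia article almost always says what something is, was, are or were. So one possible solution is to not end the sentence until a linking verb (is, was, are, were) has been reached. Of course this won't be 100% accurate, but it is one possible approach: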

    def get_first_sentence(my_string):
        linking_verbs = {'was', 'is', 'are', 'were'}
        split_string = my_string.split(' ')
        first_sentence = []
        linked_verb_booly = False  # becomes True once a linking verb has been seen
        for ele in split_string:
            first_sentence.append(ele)
            if ele in linking_verbs:
                linked_verb_booly = True
            # Stop at the first token containing a period after the linking verb.
            if '.' in ele and linked_verb_booly:
                break
        return ' '.join(first_sentence)

Example 1:

Sir Winston Leonard Spencer-Churchill, KG, OM, CH, TD, PC, DL, FRS, Hon. RA (30 November 1874 – 24 January 1965) was a British politician and statesman known for his leadership of the United Kingdom during the Second World War. He is widely regarded as one of the great wartime leaders and served as Prime Minister twice. A noted statesman and orator, Churchill was also an officer in the British Army, a historian, a writer, and an artist.

    my_string_1 = 'Sir Winston Leonard Spencer-Churchill, KG, OM, CH, TD, PC, DL, FRS, Hon. RA (30 November 1874 – 24 January 1965) was a British politician and statesman known for his leadership of the United Kingdom during the Second World War. He is widely regarded as one of the great wartime leaders and served as Prime Minister twice. A noted statesman and orator, Churchill was also an officer in the British Army, a historian, a writer, and an artist.'
    first_sentence_1 = get_first_sentence(my_string_1)

Result:

    >>> first_sentence_1
    'Sir Winston Leonard Spencer-Churchill, KG, OM, CH, TD, PC, DL, FRS, Hon. RA (30 November 1874 \xe2\x80\x93 24 January 1965) was a British politician and statesman known for his leadership of the United Kingdom during the Second World War.'

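Example 2:

Python is a general-purpose, high-level programming language[11] whose design philosophy emphasizes code readability. Its syntax is said to be clear[12] and expressive.[13] Python has a large and comprehensive standard library.[14]

    my_string_2 = 'Python is a general-purpose, high-level programming language[11] whose design philosophy emphasizes code readability. Its syntax is said to be clear[12] and expressive.[13] Python has a large and comprehensive standard library.[14]'
    first_sentence_2 = get_first_sentence(my_string_2)

Result:

    >>> first_sentence_2
    'Python is a general-purpose, high-level programming language[11] whose design philosophy emphasizes code readability.'
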
Example 3:

China (Listeni/ˈtʃaɪnə/; Chinese: 中国; pinyin: Zhōngguó; see also Names of China), officially the People's Republic of China (PRC), is the world's most-populous country, with a population of over 1.3 billion. Covering approximately 9.6 million square kilometres, the East Asian state is the world's second-largest country by land area,[13] and the third- or fourth-largest in total area, depending on the definition of total area.[14]

    my_string_3 = "China (Listeni/ˈtʃaɪnə/; Chinese: 中国; pinyin: Zhōngguó; see also Names of China), officially the People's Republic of China (PRC), is the world's most-populous country, with a population of over 1.3 billion. Covering approximately 9.6 million square kilometres, the East Asian state is the world's second-largest country by land area,[13] and the third- or fourth-largest in total area, depending on the definition of total area.[14]"
    first_sentence_3 = get_first_sentence(my_string_3)

Result:

    >>> first_sentence_3
    "China (Listeni/\xcb\x88t\xca\x83a\xc9\xaan\xc9\x99/; Chinese: \xe4\xb8\xad\xe5\x9b\xbd; pinyin: Zh\xc5\x8dnggu\xc3\xb3; see also Names of China), officially the People's Republic of China (PRC), is the world's most-populous country, with a population of over 1.3"

You can see the limitation in the last example: the sentence is cut off too early because of the "." in 1.3.
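
One way to soften that limitation, as a rough sketch: only let a token close the sentence when the period is its last character, so a decimal such as 1.3 no longer counts. The name `get_first_sentence_strict` and the footnote-stripping pattern are my own assumptions for illustration, not part of the function above. Abbreviations that appear after the linking verb (for example "Dr.") would still end the sentence too early.

```python
import re

LINKING_VERBS = {'is', 'was', 'are', 'were'}

# Trailing footnote markers such as "[13]" (an assumption about the input format).
_FOOTNOTE = re.compile(r'(?:\[\d+\])+$')


def get_first_sentence_strict(my_string):
    """Hypothetical variant: a token ends the sentence only if it ends with '.'."""
    first_sentence = []
    seen_linking_verb = False
    for token in my_string.split(' '):
        first_sentence.append(token)
        if token in LINKING_VERBS:
            seen_linking_verb = True
        # Drop footnote markers so "billion.[5]" still counts as a sentence end.
        bare = _FOOTNOTE.sub('', token)
        if seen_linking_verb and bare.endswith('.'):
            break
    return ' '.join(first_sentence)


print(get_first_sentence_strict('The population is over 1.3 billion.[5] It is growing.'))
# prints: The population is over 1.3 billion.[5]
```
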
Also, the above could probably be done better with a regular expression.
Just an idea.
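
For what it's worth, here is a minimal sketch of what a regular-expression version might look like. It looks for the first linking verb as a whole word, then stops at the first period that is followed by whitespace or the end of the text (optionally with footnote markers like `[13]` in between). The function name `first_sentence_regex` and the pattern are assumptions for illustration only. Unlike the loop above, it also matches a verb that has punctuation attached, such as "is,".

```python
import re

# Shortest prefix that contains a linking verb (as a whole word) and then ends
# at a period followed by optional "[n]" footnote markers and whitespace or
# the end of the input. The period in "1.3" is skipped because a digit follows.
_FIRST_SENTENCE = re.compile(
    r'^.*?\b(?:is|was|are|were)\b.*?\.(?:\[\d+\])*(?=\s|$)',
    re.DOTALL,
)


def first_sentence_regex(text):
    """Hypothetical helper: return the first sentence, or the whole text if no match."""
    match = _FIRST_SENTENCE.search(text)
    return match.group(0) if match else text


print(first_sentence_regex('Pi is about 3.14 in value. It is irrational.'))
# prints: Pi is about 3.14 in value.
```

This has the same blind spot as the loop: an abbreviation after the linking verb still ends the sentence early.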