python - How to use the findall function to extract a specific URL from a text file in Python

Question

I have the following sample text:

Good Morning,

The link to your exam is https://uni.edu?hash=89234rw89yfw8fw89ef .Please complete it within the stipulated time.

If you have any issue, please contact us
https://www.uni.edu
https://facebook.com/uniedu

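What I want is to extract the URL of the exam link: https://uni.edu?hash=89234rw89yfw8fw89ef. I plan to use the re.findall() function, but I am having trouble writing a regular expression that extracts only that URL.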

import re

def find_exam_url(text_file):
    filename = open(text_file, "r")
    new_file = filename.readlines()
    word_lst = []

    for line in new_file:
        exam_url = re.findall('https?://', line) #use regex to extract exam url
    return exam_url

if __name__ == "__main__":
   print(find_exam_url("mytextfile.txt"))

The output I get is:

['http://']

instead of:

https://uni.edu?hash=89234rw89yfw8fw89ef

Any help with this would be appreciated.

score 0 · Accepted Answer

This regular expression works:

>>> re.findall(r'(https?://.*?)\s', s)
['https://uni.edu?hash=89234rw89yfw8fw89ef',
 'https://www.uni.edu',
 'https://facebook.com/uniedu']

where s is the text of the file (read with f.read()), and the pattern used is (https?://.*?)\s, a lazy match that runs until the next whitespace character.
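
One detail worth checking: because the pattern requires a whitespace character after each URL, a URL at the very end of the text with no trailing newline is skipped. The sketch below inlines the sample text from the question (without a final newline, which is an assumption made for illustration) and compares the pattern above with `https?://\S+`, an alternative that is not part of the original answer.

```python
import re

# Sample text from the question, inlined so the sketch runs without a file.
# Assumption: there is no trailing newline after the last URL.
s = (
    "Good Morning,\n"
    "\n"
    "The link to your exam is https://uni.edu?hash=89234rw89yfw8fw89ef "
    ".Please complete it within the stipulated time.\n"
    "\n"
    "If you have any issue, please contact us\n"
    "https://www.uni.edu\n"
    "https://facebook.com/uniedu"
)

# Pattern from the answer: lazy match that must be followed by whitespace.
lazy = re.findall(r'(https?://.*?)\s', s)

# Alternative: consume every non-whitespace character after the scheme.
non_space = re.findall(r'https?://\S+', s)

print(lazy)       # the last URL is missing because nothing follows it
print(non_space)  # all three URLs
```

If the file always ends with a newline, both patterns should return the same three URLs.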

If you only need the URL that is mentioned as the exam link, you can make the regular expression more specific:

>>> re.findall(r'exam.*(https?://.*?)\s', s)
['https://uni.edu?hash=89234rw89yfw8fw89ef']
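
This pattern relies on the word "exam" and the URL being on the same line, since `.` does not match a newline by default. A small sketch of that behaviour, using a shortened made-up hash (`abc123`) and hypothetical variable names:

```python
import re

pattern = r'exam.*(https?://.*?)\s'

# "exam" and the URL on one line, as in the question's sample text.
same_line = "The link to your exam is https://uni.edu?hash=abc123 .Please\n"
# Hypothetical variant where the URL has wrapped onto the next line.
split_lines = "The link to your exam is\nhttps://uni.edu?hash=abc123 .Please\n"

on_same_line = re.findall(pattern, same_line)
on_split_lines = re.findall(pattern, split_lines)
# re.DOTALL lets "." match newlines, so "exam.*" can reach the next line.
with_dotall = re.findall(pattern, split_lines, flags=re.DOTALL)

print(on_same_line)    # ['https://uni.edu?hash=abc123']
print(on_split_lines)  # []
print(with_dotall)     # ['https://uni.edu?hash=abc123']
```

Note that with re.DOTALL the greedy `.*` would run to the last URL in the text when several follow the word "exam", so the flag is only a reasonable choice when that is the URL you want.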

Alternatively, the exam link appears to carry an identifier as a URL parameter of the form ?hash=, so something like this is better:

>>> re.findall(r'(https?://.*\?hash=.*?)\s', s)
['https://uni.edu?hash=89234rw89yfw8fw89ef']
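
Putting this together, the function from the question might be rewritten roughly as follows. This is a sketch under the assumption that exam links are the only URLs containing `?hash=`; `extract_exam_urls` and `EXAM_URL` are hypothetical names, and the pattern uses `\S` instead of the lazy match above so that no trailing whitespace is required.

```python
import re

# Assumption: exam links are the only URLs carrying a "?hash=" parameter.
EXAM_URL = re.compile(r'https?://\S*\?hash=\S+')


def extract_exam_urls(text):
    """Return every URL in text that contains a ?hash= parameter."""
    return EXAM_URL.findall(text)


def find_exam_url(text_file):
    # Read the whole file at once instead of line by line, so results from
    # earlier lines are not overwritten; the with block closes the file.
    with open(text_file, "r") as f:
        return extract_exam_urls(f.read())
```

Calling find_exam_url("mytextfile.txt") on the sample text should then return a list with the single exam URL.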
