I have a list of urls that I would like to parse:

['https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf','http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm','http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm']

I would like to use a regular expression to create a new list containing the digits at the end of each URL, plus any letters that follow them before the file extension (some strings contain numbers in two positions, as the first string in the list above shows). The new list would look like:

['20170303', '20160929a', '20161005a']

This is what I've tried with no luck:

code = re.search(r'?[0-9a-z]*', urls)

Update:

Running -

[re.search(r'(\d+)\D+$', url).group(1) for url in urls]

I get the following error -

AttributeError: 'NoneType' object has no attribute 'group'

Also, it doesn't seem like this will pick up a letter after the numbers if a letter is there..!

4 Answers

0

You can use this regex: (\d+[a-z]*)\.

Regex demo

Output

20170303
20160929a
20161005a
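
As a minimal sketch of how that regex might be applied in Python (the names `ID_PATTERN` and `codes` are illustrative and not from the answer), `re.search` finds the leftmost place where digits, optionally followed by lowercase letters, sit directly before a dot:

```python
import re

# Pattern from the answer: digits, optional lowercase letters, then a literal dot.
# ID_PATTERN and codes are hypothetical names chosen for this sketch.
ID_PATTERN = re.compile(r"(\d+[a-z]*)\.")

urls = [
    "https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf",
    "http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm",
    "http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm",
]

codes = []
for url in urls:
    match = ID_PATTERN.search(url)
    # search() returns None when nothing matches, so guard before calling group()
    codes.append(match.group(1) if match else None)

print(codes)  # ['20170303', '20160929a', '20161005a']
```

The `None` guard avoids the `AttributeError` from the question. One caveat: because the leftmost match wins, a host name with a digit right before a dot (for example `www2.`) would be picked up first.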
answered 2017-06-23T16:51:47.107
0
# python3

import re
from os.path import basename
from urllib.parse import urlparse

def extract_id(url):
    path = urlparse(url).path
    resource = basename(path)
    # match from the first digit up to (not including) the next dot
    _id = re.search(r'\d[^.]*', resource)
    if _id:
        return _id.group(0)

urls =['https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf','http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm','http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm']

# Note: the ids list contains None for any URL where the pattern is not found
ids = [extract_id(url) for url in urls]

print(ids)

Output:

['20170303', '20160929a', '20161005a']
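
To illustrate why isolating the file name first matters, here is a small sketch that runs the same pattern against the whole URL and against the file name only. It uses `posixpath.basename` instead of `os.path.basename` so that `/` is always treated as the separator; that swap and the variable names are assumptions of this sketch.

```python
import re
from posixpath import basename  # assumption: always split on "/" regardless of OS
from urllib.parse import urlparse

url = ("https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/"
       "president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf")

filename = basename(urlparse(url).path)

# On the whole URL the match starts at the "2017" directory and runs to the next dot.
whole_url_match = re.search(r"\d[^.]*", url).group(0)
# On the file name alone only the trailing identifier is found.
filename_match = re.search(r"\d[^.]*", filename).group(0)

print(filename)         # lacker_speech_20170303.pdf
print(whole_url_match)  # 2017/pdf/lacker_speech_20170303
print(filename_match)   # 20170303
```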
answered 2017-06-23T16:53:50.613
0

Given:

>>> lios=['https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf','http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm','http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm']

You can do:

import re

for s in lios:
    m = re.search(r'(\d+\w*)\D+$', s)
    if m:
        print(m.group(1))

Prints:

20170303
20160929a
20161005a

This is based on this regex:

(\d+\w*)\D+$
  ^              one or more digits
     ^           zero or more word characters (letters, digits, underscore)
        ^        one or more non-digits
           ^     end of string
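
The same breakdown can be written into the pattern itself with `re.VERBOSE`, which ignores whitespace and allows `#` comments inside the regex. This is only a sketch; the names `TRAILING_ID` and `extracted` are made up for illustration.

```python
import re

# TRAILING_ID and extracted are hypothetical names for this sketch.
TRAILING_ID = re.compile(r"""
    (\d+\w*)   # group 1: one or more digits, then any trailing word characters
    \D+        # one or more non-digits, here the dot and file extension
    $          # end of string
""", re.VERBOSE)

lios = [
    "https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf",
    "http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm",
    "http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm",
]

extracted = []
for s in lios:
    m = TRAILING_ID.search(s)
    if m:
        extracted.append(m.group(1))

print(extracted)  # ['20170303', '20160929a', '20161005a']
```

The earlier `2017` in the first URL is skipped because `\D+$` requires that only non-digits follow the captured group up to the end of the string.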
answered 2017-06-23T16:47:22.803
-1
import re

patterns = {
    'url_refs': re.compile(r"(\d+[a-z]*)\."),  # YCF_L
}

def scan(iterable, pattern=None):
    """Scan for matches in an iterable."""
    for item in iterable:
        # if you want only one, add a comma:
        # reference, = pattern.findall(item)
        # but it's less reusable.
        matches = pattern.findall(item)
        yield matches

Then you can do:

hits = scan(urls, pattern=patterns['url_refs'])
references = (item[0] for item in hits)

Feed references to your other functions. You can process more items this way, and I would guess it is faster too.
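
One thing to watch with this approach: `findall` returns an empty list for a URL with no match, so indexing with `item[0]` would raise `IndexError` once the generator reaches it. A possible variation, sketched below with the hypothetical names `URL_REF` and `first_refs` and a made-up URL that has no identifier, yields `None` in that case:

```python
import re

# URL_REF and first_refs are hypothetical names for this sketch.
URL_REF = re.compile(r"(\d+[a-z]*)\.")

def first_refs(iterable, pattern=URL_REF):
    """Lazily yield the first match per item, or None when nothing matches."""
    for item in iterable:
        matches = pattern.findall(item)
        yield matches[0] if matches else None

urls = [
    "http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm",
    "http://www.federalreserve.gov/newsevents/speech/",  # made-up URL with no identifier
]

references = list(first_refs(urls))
print(references)  # ['20160929a', None]
```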

answered 2017-06-23T17:36:02.240