python - Python Beautiful Soup Regex Positive Lookbehind and Lookahead (?<=)(.*)(?=) Ignores Lookahead until Last Instance

Question

# python 3.7.3
import requests
import csv
from bs4 import BeautifulSoup
import re


url = "https://www.brownells.com/ammunition/handgun-ammo/usa-white-box-ammo-380-auto-95gr-fmj-prod95261.aspx"

response = requests.get(url,headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'})
src = response.content
soup = BeautifulSoup(src, 'lxml')
price = soup.find("script", {"id": "rawData"})
price = re.search(r'(?<=\\u003cspan\\u003e\$).*(?=\\u003c/span\\u003e\\u003cspan)', price.text)
print(price[0])
# print(price[1])
# There are multiple patterns to match in the string, and I'm planning to pull the 2nd or 3rd one, not just the first. 
# For simplicity I'm just pulling the first above.

Expected: 16.99

Actual: 16.99\u003c/span...\u003cspan\u003e$32.99 (the rest of the string, up to the last instance of \u003c/span\u003e\u003cspan)

I tested my regex in regexr and regex101, and it works there:

https://regexr.com/526t8

https://regex101.com/r/yzMkTg/103

I also tried the regex on a plain string, and it works fine:

import requests
import csv
from bs4 import BeautifulSoup
import re


price = "\\u003cspan\\u003e$321.99\\u003c/span\\u003e\\"
# \u003cspan\u003e$321.99\u003c/span\u003e\
print(price)
price = re.search(r'(?<=\\u003cspan\\u003e\$)(.*)(?=\\u003c/span\\u003e)', price)
print(price[0])
# print(price[1])
# \\u003cspan\\u003e$321.99\\u003c/span\\u003e\\
# View OSHA SDS\u003c/a\u003e\r\n        \u003c/section\u003e\r\n
# \u003cspan\u003e$321.99\u003c/span\u003e\

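Something about Beautiful Soup seems to trip things up and make the match run on to the last instance of the positive lookahead.

Why does Python's regex honor the positive lookbehind at the start of the match but ignore the positive lookahead until its last instance?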
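
The behaviour described above is consistent with `.*` being greedy: it consumes as much as it can and then backtracks only until the lookahead succeeds, which is at the last instance. The single-string test has only one closing `\u003c/span\u003e`, so it cannot show the difference. Below is a minimal sketch, assuming a made-up `text` that mimics the escaped markup (the sample string and the `greedy`/`lazy` names are hypothetical, not taken from the real page). It compares `.*` with the lazy `.*?` and uses `re.findall` to collect every price so the 2nd or 3rd can be picked by index.

```python
import re

# Hypothetical sample: literal backslash-u sequences, two prices,
# each followed by the same closing text used in the lookahead.
text = (
    r"\u003cspan\u003e$16.99\u003c/span\u003e\u003cspan\u003eeach\u003c/span\u003e"
    r"\u003cspan\u003e$32.99\u003c/span\u003e\u003cspan\u003ebox\u003c/span\u003e"
)

before = r"(?<=\\u003cspan\\u003e\$)"
after = r"(?=\\u003c/span\\u003e\\u003cspan)"

# Greedy: .* runs to the end, then backtracks to the LAST place the lookahead holds.
greedy = re.search(before + r".*" + after, text)

# Lazy: .*? expands one character at a time and stops at the FIRST place it holds.
lazy = re.search(before + r".*?" + after, text)

# All prices in order; index into the list to get the 2nd, 3rd, and so on.
prices = re.findall(before + r".*?" + after, text)

print(greedy.group(0))
print(lazy.group(0))
print(prices)
```

With this sample, the greedy match starts at the first price and ends at the second, the lazy match is just the first price, and `prices[1]` is the second price. Whether the real `rawData` script has the same shape is an assumption.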
