2

我有来自页面源的以下代码片段:

var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
  id: "pdfObjectContainer",
  width: "100%",
  height: "700px",
  pdfOpenParams: {
    navpanes: 0,
    statusbar: 1,
    toolbar: 1,
    view: "FitH"
  }
}).embed("pdf_placeholder"); 

'PDFObject('

在页面上是唯一的。我想使用 REGEX 检索 url 内容。在这种情况下,我需要得到

http://www.site.com/doc55.pdf

请帮忙。

4

7 回答 7

3

这是在不使用正则表达式的情况下解决问题的替代方法:

url,in_object = None, False
with open('input') as f:
    for line in f:
        in_object = in_object or 'PDFObject(' in line
        if in_object and 'url:' in line:
            url = line.split('"')[1]
            break
print url
于 2013-07-04T20:48:51.780 回答
0

为了能够找到“在其他事情之后发生的事情”,您需要匹配“包括换行符”的事情。为此,您使用 (dotall) 修饰符 - 在编译期间添加的标志。

因此,以下代码有效:

import re
r = re.compile(r'(?<=PDFObject).*?url:.*?(http.*?)"', re.DOTALL)
s = '''var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
  id: "pdfObjectContainer",
  width: "100%",
  height: "700px",
  pdfOpenParams: {
    navpanes: 0,
    statusbar: 1,
    toolbar: 1,
    view: "FitH"
  }
}).embed("pdf_placeholder"); '''

print r.findall(s)

解释:

r = re.compile(         compile regular expression
    r'                  treat this string as a regular expression
    (?<=PDFObject)      the match I want happens right after PDFObject
    .*?                 then there may be some other characters...
    url:                followed by the string url:
    .*?                 then match whatever follows until you get to the first instance (`?` : non-greedy match of
    (http:.*?)"         match the string http: up to (but not including) the first "
    ',                  end of regex string, but there's more...
    re.DOTALL)          set the DOTALL flag - this means the dot matches all characters
                        including newlines. This allows the match to continue from one line
                        to the next in the .*? right after the lookbehind
于 2013-07-04T21:02:47.087 回答
0

这有效:

import re

src='''\
var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
URL: "http://www.site.com/doc52.PDF",
  id: "pdfObjectContainer",
  width: "100%",
  height: "700px",
  pdfOpenParams: {
    navpanes: 0,
    statusbar: 1,
    toolbar: 1,
    view: "FitH"
  }
}).embed("pdf_placeholder"); '''   

print [m.group(1).strip('"') for m in 
        re.finditer(r'^url:\s*(.*)[\W]$',
        re.search(r'PDFObject\(\{(.*)',src,re.M | re.S | re.I).group(1),re.M|re.I)]

印刷:

['http://www.site.com/doc55.pdf', 'http://www.site.com/doc52.PDF']
于 2013-07-04T20:48:17.410 回答
0

正则表达式

new\s+PDFObject\(\{\s*url:\s*"[^"]+"

正则表达式图片

演示

仅提取网址

于 2013-07-04T21:04:13.847 回答
0

尽管其他答案可能看起来有效,但大多数人没有考虑到页面上唯一独特的东西是'PDFObject('。更好的正则表达式如下:

PDFObject\({\surl: "(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)",

它考虑到 'PDFObject(' 是唯一的并且包含一些基本的 URL 验证。

以下是如何在 python 中使用此正则表达式的示例

>>> import re
>>> strs = """var myPDF = new PDFObject({
... url: "http://www.site.com/doc55.pdf",
...   id: "pdfObjectContainer",
...   width: "100%",
...   height: "700px",
...   pdfOpenParams: {
...     navpanes: 0,
...     statusbar: 1,
...     toolbar: 1,
...     view: "FitH"
...   }
... }).embed("pdf_placeholder");"""
>>> re.search(r'PDFObject\({\surl: "(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)",',strs).group(1)
'http://www.site.com/doc55.pdf'

纯python(无正则表达式)替代方案是:

>>> unique = 'PDFObject({\nurl: "'
>>> start = strs.find(unique) + len(unique)
>>> end = start + strs[start:].find('"')
>>> strs[start:end]
'http://www.site.com/doc55.pdf'

没有正则表达式 oneliner:

>>> (lambda u:(lambda s:(lambda e:strs[s:e])(s+strs[s:].find('"')))(strs.find(u)+len(u)))('PDFObject({\nurl: "')
'http://www.site.com/doc55.pdf'
于 2013-07-04T20:46:20.700 回答
0

如果'PDFObject('是页面中的唯一标识符,则只需匹配下一个引用的第一个内容。

使用DOTALL 标志re.DOTALLre.S)和非贪婪星号(*?),您可以编写:

import re

snippet = '''                                    
var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
  id: "pdfObjectContainer",
  width: "100%",
  height: "700px",
  pdfOpenParams: {
    navpanes: 0,
    statusbar: 1,
    toolbar: 1,
    view: "FitH"
  }
}).embed("pdf_placeholder");
'''

# First version using unnamed groups
RE_UNNAMED = re.compile(r'PDFObject\(.*?"(.*?)"', re.S)

# Second version using named groups
RE_NAMED = re.compile(r'PDFObject\(.*?"(?P<url>.*?)"', re.S)

RE_UNNAMED.search(snippet, re.S).group(1)
RE_NAMED.search(snippet, re.S).group('url')
# result for both: 'http://www.site.com/doc55.pdf'

如果您不想编译您的正则表达式,因为它只使用过一次,只需使用以下语法:

re.search(r'PDFObject\(.*?"(.*?)"', snippet, re.S).group(1)
re.search(r'PDFObject\(.*?"(?P<url>.*?)"', snippet, re.S).group('url')

四种选择,一种应符合您的需要和口味!

于 2013-07-04T21:35:22.440 回答
0

使用后视和前瞻断言的组合

import re
re.search(r'(?<=url:).*?(?=",)', s).group().strip('" ')
'http://www.site.com/doc55.pdf'
于 2013-07-04T20:41:20.593 回答