regex - Oracle 11g Regular expression Multiple Instances of Pattern

Question

I need to extract every <a href="...">something.jpg</a> tag from a large string that may contain multiple instances of the tag, one occurrence at a time. I need to do this using regex on Oracle 11g.

Some facts about the input:

The string will always contain at least one <a> tag, and there is no maximum to how many it can contain.
The href will always contain an identifier of the form xid-[[:digit:]]+ (for example xid-1234_1).
The attributes in the tag can vary.

Example string:

<p>text about something important</p><p><a href="@X@EmbeddedFile.requestUrlStub@X@bbcswebdav/xid-1234_1" target="_blank">file.pdf</a> </p><p><a href="@X@EmbeddedFile.requestUrlStub@X@bbcswebdav/xid-1235_1" target="_blank">anotherfile.pptx</a> </p><p><a href="@X@EmbeddedFile.requestUrlStub@X@bbcswebdav/xid-1236_1" target="_blank">yetanotherfile.pdf</a> </p>

Given that string, I want to extract the three <a ...>...</a> blocks using
REGEXP_SUBSTR(<string>, '<pattern>', <start>, <occurrence>), adjusting the occurrence value to grab each of the three instances.

What I have so far is:

SELECT REGEXP_SUBSTR(main_data, '<a[[:print:]]+href="[[:print:]]+xid-1234_1"[[:print:]]+>[[:print:]]+</a>', 1, 1)
      FROM table

and the results I get from that are

<a href="@X@EmbeddedFile.requestUrlStub@X@bbcswebdav/xid-1234_1" target="_blank">file.pdf</a> </p><p><a href="@X@EmbeddedFile.requestUrlStub@X@bbcswebdav/xid-1235_1" target="_blank">anotherfile.pptx</a> </p><p><a href="@X@EmbeddedFile.requestUrlStub@X@bbcswebdav/xid-1236_1" target="_blank">yetanotherfile.pdf</a>

So the match starts at the first <a and runs all the way to the last </a>, when I need it to stop at the first </a>. Then, when I increment the occurrence to 2, it should grab the second <a>...</a> pair. Currently, setting the occurrence to 2 returns nothing.

Any help will be appreciated. Thank you

score 1 · Accepted Answer

Have you considered using Oracle's various XML facilities instead?

For example, place the text into a CLOB and then use xmltype() and extract() to get the elements out using an XPath query (see for example this question).

Generally, trying to extract nested data structures using regexes leads to unhappiness.
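
As a rough sketch of the XML route: this assumes the fragment becomes well-formed XML once it is wrapped in a single root element. The <root> wrapper, the column names, and the use of XMLTABLE in place of extract() are my own assumptions, not something the question specifies.

```sql
-- Sketch only: wrap the fragment in a hypothetical <root> element so that
-- XMLTYPE sees one document root, then return one row per <a> element.
SELECT a.href, a.label
  FROM (SELECT '<p>text about something important</p>'
            || '<p><a href="@X@EmbeddedFile.requestUrlStub@X@bbcswebdav/xid-1234_1" target="_blank">file.pdf</a> </p>'
            || '<p><a href="@X@EmbeddedFile.requestUrlStub@X@bbcswebdav/xid-1235_1" target="_blank">anotherfile.pptx</a> </p>'
               AS main_data
          FROM DUAL) t,
       XMLTABLE('/root//a'
                PASSING XMLTYPE('<root>' || t.main_data || '</root>')
                COLUMNS href  VARCHAR2(400) PATH '@href',
                        label VARCHAR2(400) PATH 'text()') a;
```

XMLTYPE raises a parse error if the stored HTML is not well-formed XML (for example an unclosed tag or a bare &), so this only helps when the markup is reliably clean.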

score 0 · Accepted Answer

Yes, the non-greedy operator ? is the solution:

SELECT REGEXP_SUBSTR(x,'<a href="(.*?)".*?>(.*?)</a>',1, 3, 'i', 0)
  FROM (SELECT '<p>text about something important</p><p><a href="@X@EmbeddedFile.requestUrlStub@X@bbcswebdav/xid-1234_1" target="_blank">file.pdf</a> </p><p><a href="@X@EmbeddedFile.requestUrlStub@X@bbcswebdav/xid-1235_1" target="_blank">anotherfile.pptx</a> </p><p><a href="@X@EmbeddedFile.requestUrlStub@X@bbcswebdav/xid-1236_1" target="_blank">yetanotherfile.pdf</a> </p>' as x FROM DUAL);

returns

<a href="@X@EmbeddedFile.requestUrlStub@X@bbcswebdav/xid-1236_1" target="_blank">yetanotherfile.pdf</a>

or the other tags if you change the 3 to 1 or 2.

If you replace the last argument (the subexpression index) 0 with 1, you get the first capture group, the contents of the href:

@X@EmbeddedFile.requestUrlStub@X@bbcswebdav/xid-1236_1

If you replace it with 2, you get

yetanotherfile.pdf
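
To pull out every instance in one query instead of changing the occurrence by hand, one option is to drive the occurrence argument with LEVEL and stop at REGEXP_COUNT (available from 11g). This is a sketch for a single input row; the aliases t, occurrence, href and label are made up for illustration.

```sql
-- Sketch: one output row per <a> tag in a single input string.
-- LEVEL supplies the occurrence argument; REGEXP_COUNT bounds the loop.
WITH t AS (
  SELECT '<p>text about something important</p>'
      || '<p><a href="@X@EmbeddedFile.requestUrlStub@X@bbcswebdav/xid-1234_1" target="_blank">file.pdf</a> </p>'
      || '<p><a href="@X@EmbeddedFile.requestUrlStub@X@bbcswebdav/xid-1235_1" target="_blank">anotherfile.pptx</a> </p>'
      || '<p><a href="@X@EmbeddedFile.requestUrlStub@X@bbcswebdav/xid-1236_1" target="_blank">yetanotherfile.pdf</a> </p>' AS x
    FROM DUAL
)
SELECT LEVEL AS occurrence,
       REGEXP_SUBSTR(x, '<a href="(.*?)".*?>(.*?)</a>', 1, LEVEL, 'i', 1) AS href,
       REGEXP_SUBSTR(x, '<a href="(.*?)".*?>(.*?)</a>', 1, LEVEL, 'i', 2) AS label
  FROM t
CONNECT BY LEVEL <= REGEXP_COUNT(x, '<a href="(.*?)".*?>(.*?)</a>', 1, 'i');
```

Run against a real table with many rows, a bare CONNECT BY LEVEL would cross-multiply the rows, so the hierarchy would need to be restricted per row (for instance by key) before relying on this.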

score 0 · Accepted Answer

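As @Jacques Chester points out, if you can use the XML support, it will be less painful.

If you cannot, try changing each + to +? to perform a non-greedy match.

The +? quantifier is part of the Perl-influenced extensions to Oracle regular expressions.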
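
Applied to the pattern from the question, that change might look like the sketch below. Replacing the hard-coded xid-1234_1 with a looser xid-[[:print:]]+? is my own adjustment so that every tag can match; the alias second_tag is likewise illustrative.

```sql
-- Sketch: the question's pattern with every + made non-greedy (+?), so each
-- match stops at the nearest </a>. Occurrence 2 asks for the second tag.
SELECT REGEXP_SUBSTR(
         x,
         '<a[[:print:]]+?href="[[:print:]]+?xid-[[:print:]]+?"[[:print:]]+?>[[:print:]]+?</a>',
         1, 2) AS second_tag
  FROM (SELECT '<p>text about something important</p>'
            || '<p><a href="@X@EmbeddedFile.requestUrlStub@X@bbcswebdav/xid-1234_1" target="_blank">file.pdf</a> </p>'
            || '<p><a href="@X@EmbeddedFile.requestUrlStub@X@bbcswebdav/xid-1235_1" target="_blank">anotherfile.pptx</a> </p>' AS x
          FROM DUAL);
```

Because [[:print:]] also matches < and >, this stays as fragile as the original on less regular markup. It only avoids the run-on match to the last </a>.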
