
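We have a CMS that writes blocks of HTML content to a SQL Server database. I know the table and column names where these HTML blocks are stored. Some of the HTML contains links to PDF files. Here is a fragment: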

<p>A deferred tuition payment plan, 
or view the <a href="/uploadedFiles/Tuition-Reimbursement-Deferred.pdf"
target="_blank">list</a>.</p>

I need to extract the PDF file names from all such HTML blocks. In the end I need a list like this:

Tuition-Reimbursement-Deferred.pdf
Some-other-file.pdf

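covering every PDF file name found in that column.

Any help is appreciated. Thanks.

UPDATE

I have received many replies, thank you very much, but I forgot to mention that we are still using SQL Server 2000 here. So this has to be done with SQL that runs on SQL Server 2000.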

4 Answers

Create this function:

create function dbo.extract_filenames_from_a_tags (@s nvarchar(max))
returns @res table (pdf nvarchar(max)) as
begin
-- assumes there are no single quotes or double quotes in the PDF filename
declare @i int, @j int, @k int, @tmp nvarchar(max);
set @i = charindex(N'.pdf', @s);
while @i > 0
begin
  select @tmp = left(@s, @i+3);
  select @j = charindex('/', reverse(@tmp)); -- directory delimiter
  select @k = charindex('"', reverse(@tmp)); -- start of href
  if @j = 0 or (@k > 0 and @k < @j) set @j = @k;
  select @k = charindex('''', reverse(@tmp)); -- start of href (single-quote*)
  if @j = 0 or (@k > 0 and @k < @j) set @j = @k;
  insert @res values (substring(@tmp, len(@tmp)-@j+2, len(@tmp)));
  select @s = stuff(@s, 1, @i+4, ''); -- remove up to ".pdf"
  set @i = charindex(N'.pdf', @s);
end
return
end
GO

A demo using the function:

declare @t table (html varchar(max));
insert @t values
  ('
<p>A deferred tuition payment plan, 
or view the <a href="/uploadedFiles/Tuition-Reimbursement-Deferred.pdf"
target="_blank">list</a>.</p>'),
  ('
<p>A deferred tuition payment plan, 
or view the <a href="Two files here-Reimbursement-Deferred.pdf"
target="_blank">list</a>.</p>And I use single quotes
   <a href=''/look/path/The second file.pdf''
target="_blank">list</a>');

select t.*, p.pdf
from @t t
cross apply dbo.extract_filenames_from_a_tags(html) p;

Result:

|HTML                  |                                       PDF |
--------------------------------------------------------------------
|<p>A deferred tui.... |        Tuition-Reimbursement-Deferred.pdf |
|<p>A deferred tui.... | Two files here-Reimbursement-Deferred.pdf |
|<p>A deferred tui.... |                       The second file.pdf |

SQL Fiddle demo
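
One caveat given the SQL Server 2000 requirement in the question: nvarchar(max) and CROSS APPLY were introduced in SQL Server 2005, and the multi-row VALUES used in the demo came later still. As a rough sketch of the same backward-scan idea using only older constructs, the loop below processes a single string held in a variable. The names @html, @head, and @found are made up for illustration, and the sketch assumes the content fits in varchar(8000).

```sql
-- Sketch only: @html stands in for one row's content (hypothetical variable).
DECLARE @html varchar(8000), @head varchar(8000), @i int, @d int
DECLARE @found TABLE (pdf varchar(8000))

SET @html = '<a href="/uploadedFiles/Tuition-Reimbursement-Deferred.pdf">list</a>'
          + '<a href=''/look/path/The second file.pdf''>list</a>'

SET @i = CHARINDEX('.pdf', @html)
WHILE @i > 0
BEGIN
    -- Everything before the current ".pdf"
    SET @head = LEFT(@html, @i - 1)
    -- Distance back to the nearest double quote, single quote, or slash
    SET @d = PATINDEX('%["''/]%', REVERSE(@head))
    -- Skip matches that have no delimiter in front of them
    IF @d > 0
        INSERT @found VALUES (RIGHT(@head, @d - 1) + '.pdf')
    -- Continue scanning after this ".pdf"
    SET @html = SUBSTRING(@html, @i + 4, 8000)
    SET @i = CHARINDEX('.pdf', @html)
END

-- Should list Tuition-Reimbursement-Deferred.pdf and The second file.pdf
SELECT pdf FROM @found
```

To cover a whole table on SQL Server 2000 you would presumably wrap this loop in a cursor over the content column. A text or ntext column would also have to be read in chunks of at most 8000 characters, which this sketch does not attempt.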

answered 2013-04-25T21:25:40.010

Well, it isn't pretty, but this works using standard Transact-SQL:

SELECT CASE WHEN CHARINDEX('.pdf', html) > 0
            THEN SUBSTRING(
                     html,
                     CHARINDEX('.pdf', html) -
                     PATINDEX(
                         '%["/]%',
                         REVERSE(SUBSTRING(html, 0, CHARINDEX('.pdf', html)))) + 1,
                     PATINDEX(
                         '%["/]%',
                         REVERSE(SUBSTRING(html, 0, CHARINDEX('.pdf', html)))) + 3)
            ELSE NULL
       END AS filename
FROM mytable

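If you like, you can extend the list of delimiter characters that may precede the file name, ["/] (which matches a double quote or a slash).

See the SQL Fiddle demo.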
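
For instance, adding the single quote to the delimiter set might look like the sketch below. The table variable @demo and its sample rows are invented for illustration; the expression is otherwise the same as the one above, with LEFT(html, n - 1) standing in for SUBSTRING(html, 0, n). Like the original query, it picks up only the first .pdf in each row.

```sql
-- Hypothetical sample data; @demo and its rows are illustrative only.
DECLARE @demo TABLE (html varchar(8000))
INSERT @demo VALUES ('<a href=''The second file.pdf''>list</a>')
INSERT @demo VALUES ('<a href="/uploadedFiles/Tuition-Reimbursement-Deferred.pdf">list</a>')

-- '%["''/]%' is the pattern %["'/]% : a double quote, a single quote, or a slash.
-- The single quote is doubled because it sits inside a T-SQL string literal.
SELECT SUBSTRING(
           html,
           CHARINDEX('.pdf', html)
             - PATINDEX('%["''/]%', REVERSE(LEFT(html, CHARINDEX('.pdf', html) - 1))) + 1,
           PATINDEX('%["''/]%', REVERSE(LEFT(html, CHARINDEX('.pdf', html) - 1))) + 3
       ) AS filename
FROM @demo
WHERE CHARINDEX('.pdf', html) > 0

-- Expected file names (row order is not guaranteed):
--   The second file.pdf
--   Tuition-Reimbursement-Deferred.pdf
```

With the original ["/] set, the first sample row has no matching delimiter before the file name, so the single quote has to be in the set for that href to be handled.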

answered 2013-04-25T21:43:46.160

How about treating that HTML as XML?

declare @t table (html varchar(max));
insert @t 
    select '
    <p>A deferred tuition payment plan, 
    or view the <a href="/uploadedFiles/Tuition-Reimbursement-Deferred.pdf"
    target="_blank">list</a>.</p>'
    union all
    select '
    <p>A deferred tuition payment plan, 
    or view the <a href="Two files here-Reimbursement-Deferred.pdf"
    target="_blank">list</a>.</p>And I use single quotes
       <a href=''/look/path/The second file.pdf''
    target="_blank">list</a>'

select  [filename] = reverse(left(reverse('/'+p.n.value('@href', 'varchar(100)')), charindex('/',reverse('/'+p.n.value('@href', 'varchar(100)')), 1) - 1))
from    (   select  cast(html as xml)
            from    @t
        ) x(doc)
cross
apply doc.nodes('//a') p(n);

Result:

filename
---------------------------------------------------------------
Tuition-Reimbursement-Deferred.pdf
Two files here-Reimbursement-Deferred.pdf
The second file.pdf
answered 2013-04-25T22:06:27.243

Try this:

DECLARE @XML XML = 
'<p>A deferred tuition payment plan, 
or view the <a href="/uploadedFiles/Tuition-Reimbursement-Deferred.pdf"
target="_blank">list</a>.</p>'

SELECT 
      ref_text = t.p.value('./a[1]', 'NVARCHAR(50)')
    , ref_filename = REVERSE(
                        LEFT(REVERSE(t.p.value('./a[1]/@href', 'NVARCHAR(50)')), 
                        CHARINDEX('/',REVERSE(t.p.value('./a[1]/@href', 'NVARCHAR(50)')), 1) - 1))
FROM @XML.nodes('/p') t(p)
answered 2013-04-26T05:25:23.387