oracle - Replace text which does not match a pattern in Oracle

Question

I have the following text in a CLOB in a table

Table name: tbl1
Columns
col1 - number (primary key)
col2 - clob (as below)

Row#1
-----
Col1 = 1
Col2 =
1331882981,ab123456,Some text here
which can run multiple lines and have a lot of text...
~1331890329,pqr123223,Some more text...

Row#2
-----
Col1 = 2
Col2 =
1331882981,abc333,Some text here
which can run multiple lines and have a lot of text...
~1331890329,pqrs23,Some more text...

Now I need to know how we can get the output below

Col1  Value
----  ---------------------
1     1331882981,ab123456
1     1331890329,pqr123223
2     1331882981,abc333
2     1331890329,pqrs23

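([0-9]{10},[a-z 0-9]+.), ==> This is the regular expression that matches "1331890329,pqrs23". I need to know how to remove the text that does not match this regex and then split the matches into multiple rows.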

EDIT#1
I am on Oracle 10.2.0.5.0, so I cannot use the REGEXP_COUNT function :-( Also, col2 is a very large CLOB.

EDIT#2
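I tried the query below, and it works for a few records (i.e. if I add a "where" clause). But when I remove the "where", it never returns any result. I tried putting it in a view and inserting into a table, and left it running overnight, but it still had not finished :(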

with t as (select col1, col2 from temp_table)
select col1,
       cast(substr(regexp_substr(col2, '[^~]+', 1, level), 1, 50) as
            varchar2(50)) data
  from t
connect by level <= length(col2) - length(replace(col2, '~')) + 1

EDIT#3

No. of chars in CLOB  Total
--------------------  -----
0 - 1k                3196
1k - 5k               2865
5k - 25k              661
25k - 100k            36
> 100k                2
--------------------  -----
Grand Total           6760

I have about 7k rows of CLOBs, distributed as shown above...

score 1 · Accepted Answer

Well, you could try something like this:

with v as
(
  select 1 col1, '1331882981,ab123456,Some text here
which can run multiple lines and have a lot of text...
~1331890329,pqr123223,Some more text...' col2 from dual
  union all
  select 2 col1, '133188298777,abc333,Some text here
which can run multiple lines and have a lot of text...
~1331890329,pqrs23,Some more text...' col2 from dual
)
select distinct col1, regexp_substr(col2, '([0-9]{10},[a-z 0-9]+)', 1, level) split
from v
connect by level <= REGEXP_COUNT(col2, '([0-9]{10},[a-z 0-9]+)')
order by col1
;

This gives:

1   1331882981,ab123456
1   1331890329,pqr123223
2   1331890329,pqrs23
2   3188298777,abc333

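EDIT: In 10g, REGEXP_COUNT does not exist, but there is a workaround. Here, I replace each match of the pattern with something I hope will never appear in the text (here, XYZXYZ, but you can pick something more complex to be safe), take the length difference against the same replacement done with an empty string, and then divide by the length of my marker (here, 6):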

with v as
(
  select 1 col1, '1331882981,ab123456,Some text here
which can run multiple lines and have a lot of text...
~1331890329,pqr123223,Some more text...' col2 from dual
  union all
  select 2 col1, '133188298777,abc333,Some text here
which can run multiple lines and have a lot of text...
~1331890329,pqrs23,Some more text...' col2 from dual
)
select distinct col1, regexp_substr(col2, '([0-9]{10},[a-z 0-9]+)', 1, level) split
from v
connect by level <= (length(REGEXP_REPLACE(col2, '([0-9]{10},[a-z 0-9]+)', 'XYZXYZ')) - length(REGEXP_REPLACE(col2, '([0-9]{10},[a-z 0-9]+)', ''))) / 6
order by col1
;
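
To see the counting arithmetic in isolation, here is a minimal sketch on a single made-up string. The literal, the column names s and p, and the output aliases are only for illustration and are not from the original post.

```sql
-- Hypothetical one-row demo of the "replace with a 6-character marker" count.
select length(regexp_replace(s, p, 'XYZXYZ'))      as len_marked,   -- each match becomes 6 chars
       length(regexp_replace(s, p, ''))            as len_removed,  -- each match becomes 0 chars
       (length(regexp_replace(s, p, 'XYZXYZ'))
        - length(regexp_replace(s, p, ''))) / 6    as match_count   -- difference / marker length
from (select '1331882981,ab123456,x~1331890329,pqr123223,y' s,
             '([0-9]{10},[a-z 0-9]+)' p
      from dual);
```

For this string the marked version is 17 characters long, the stripped version is 5, and (17 - 5) / 6 = 2 matches. Note that if a value consisted only of matches, the second regexp_replace would return an empty string, which Oracle treats as NULL, so wrapping that length in nvl(..., 0) is a reasonable safeguard.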

EDIT2: CLOBs (and LOBs in general) and regular expressions don't seem to work well together:

ORA-00932: inconsistent datatypes: expected - got CLOB

Converting the CLOB to a string (regexp_substr(to_char(col2), ...)) seems to solve the problem.
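
One caveat with to_char: it can return at most what fits in a VARCHAR2 (4000 bytes in SQL), which the larger CLOBs described in the question exceed. A minimal sketch, assuming it is acceptable to look only at the first 4000 characters of each CLOB and that the data is single-byte, takes a bounded slice with dbms_lob.substr first. The alias first_match is made up for illustration; temp_table is the table from the question.

```sql
-- Assumption: only the first 4000 characters of col2 are inspected.
-- dbms_lob.substr(lob, amount, offset) returns a VARCHAR2 slice of the CLOB.
select col1,
       regexp_substr(dbms_lob.substr(col2, 4000, 1),
                     '([0-9]{10},[a-z 0-9]+)', 1, 1) as first_match
from temp_table;
```

Matches that start beyond the slice are not seen, so this is a way to try the pattern on real rows rather than a complete solution for large CLOBs.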

EDIT 3: CLOBs don't like distinct either, so convert the split result to char in an inner query and then apply distinct in the outer query. Success!

select distinct col1, split from
(
    select col1, to_char(regexp_substr(col2, '([0-9]{10},[a-z 0-9]+)', 1, level)) split
    from temp_epn
    connect by level <= (length(REGEXP_REPLACE(col2, '([0-9]{10},[a-z 0-9]+)', 'XYZXYZ')) - length(REGEXP_REPLACE(col2, '([0-9]{10},[a-z 0-9]+)', ''))) / 6
    order by col1
);
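
A likely reason this kind of query never finishes on the full table: a connect by with only a level condition connects every row to every other row at each level, so the intermediate row count grows exponentially with the number of matches, and distinct only hides the duplicates at the end. A commonly used workaround, sketched here under the assumption that col1 is unique per row (it is the primary key in the question), restricts the hierarchy to the same source row. The sample rows are shortened versions of the question's data.

```sql
with v as
(
  select 1 col1, '1331882981,ab123456,Some text here~1331890329,pqr123223,Some more text...' col2 from dual
  union all
  select 2 col1, '1331882981,abc333,Some text here~1331890329,pqrs23,Some more text...' col2 from dual
)
select col1,
       regexp_substr(col2, '([0-9]{10},[a-z 0-9]+)', 1, level) split
from v
connect by level <= (length(regexp_replace(col2, '([0-9]{10},[a-z 0-9]+)', 'XYZXYZ'))
                   - length(regexp_replace(col2, '([0-9]{10},[a-z 0-9]+)', ''))) / 6
       and prior col1 = col1                    -- stay on the same source row
       and prior dbms_random.value is not null  -- non-deterministic term, so no CONNECT BY loop is reported
order by col1, level;
```

With the hierarchy confined to one row at a time, this should return one row per match (four rows for the sample above) without needing distinct.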

score 0 · Accepted Answer

The above solution didn't work for me; here is what I did.

update temp_table set col2=regexp_replace(col2,'([0-9]{10},[a-z0-9]+)','(\1)') ;
update temp_table set col2=regexp_replace(col2,'\),[\s\S]*~\(','(\1)$');
update temp_table set col2=regexp_replace(col2,'\).*?\(','$');
update temp_table set col2=replace(regexp_replace(col2,'\).*',''),'(','');

After these 4 update commands, col2 will contain something like

1 1331882981,ab123456$1331890329,pqr123223
2 1331882981,abc333$1331890329,pqrs23

Then I wrote a function to split this. The reason I went with a function is that I need to split on "$", and col2 still has more than 10k characters.

create or replace function parse( p_clob in clob ) return sys.odciVarchar2List
pipelined
as
        l_offset number := 1;
        l_clob   clob := translate( p_clob, chr(13)|| chr(10) || chr(9), '   ' ) || '$';
        l_hit    number;
begin
        loop
          --Find occurrence of "$" from l_offset
          l_hit := instr( l_clob, '$', l_offset );
          exit when nvl(l_hit,0) = 0;
          --Extract string from l_offset to l_hit
          pipe row ( substr(l_clob, l_offset , (l_hit - l_offset)) );
          --Move offset
          l_offset := l_hit+1;
        end loop;
        --A pipelined function ends with a bare return
        return;
end;

Then I call it like this:

select col1,
       REGEXP_SUBSTR(column_value, '[^,]+', 1, 1) col3,
       REGEXP_SUBSTR(column_value, '[^,]+', 1, 2) col4
  from temp_table, table(parse(temp_table.col2));
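
To check the parse function on its own before running it against the table, a minimal sketch is to feed it a literal in the "$"-separated shape shown above. This assumes the function has been compiled in the current schema; the literal is taken from the sample output for row 1.

```sql
-- Each "$"-separated piece comes back as one row in column_value;
-- the two regexp_substr calls then split that piece on the comma.
select column_value,
       regexp_substr(column_value, '[^,]+', 1, 1) col3,
       regexp_substr(column_value, '[^,]+', 1, 2) col4
from table(parse('1331882981,ab123456$1331890329,pqr123223'));
```

This should return two rows, with col3 holding the 10-digit number and col4 holding the identifier that follows it.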
