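Since the predefined punctuation character class is not suitable for the end of the string, use a custom character class there instead. The dot is left out on purpose. The single quote is added separately (escaping it does not work, and in this case it may be hard to find a suitable delimiter character for the q operator). The closing square bracket is added on its own, because Oracle does not seem to handle it properly when it is escaped. Finally, trailing consecutive dots are added explicitly: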
WITH T (id, col) AS (
SELECT 1, 'GEORGE & SON ' FROM DUAL UNION ALL
SELECT 2, '-GEORGE & SON' FROM DUAL UNION ALL
SELECT 3, '&GEORGE & SON' FROM DUAL UNION ALL
SELECT 4, 'GEORGE & SON..' FROM DUAL UNION ALL
SELECT 5, 'GEORGE & SON.' FROM DUAL UNION ALL
SELECT 6, '-GEORGE & SON S.A.' FROM DUAL UNION ALL
SELECT 7, 'GEORGE & SON!' FROM DUAL UNION ALL
SELECT 8, 'GEORGE & SON"' FROM DUAL UNION ALL
SELECT 9, 'GEORGE & SON#' FROM DUAL UNION ALL
SELECT 10, 'GEORGE & SON$' FROM DUAL UNION ALL
SELECT 11, 'GEORGE & SON%' FROM DUAL UNION ALL
SELECT 12, 'GEORGE & SON&' FROM DUAL UNION ALL
SELECT 13, 'GEORGE & SON(' FROM DUAL UNION ALL
SELECT 14, 'GEORGE & SON)' FROM DUAL UNION ALL
SELECT 15, 'GEORGE & SON*' FROM DUAL UNION ALL
SELECT 16, 'GEORGE & SON+' FROM DUAL UNION ALL
SELECT 17, 'GEORGE & SON,' FROM DUAL UNION ALL
SELECT 18, 'GEORGE & SON\' FROM DUAL UNION ALL
SELECT 19, 'GEORGE & SON-' FROM DUAL UNION ALL
SELECT 20, 'GEORGE & SON\' FROM DUAL UNION ALL
SELECT 21, 'GEORGE & SON/' FROM DUAL UNION ALL
SELECT 22, 'GEORGE & SON:' FROM DUAL UNION ALL
SELECT 23, 'GEORGE & SON;' FROM DUAL UNION ALL
SELECT 24, 'GEORGE & SON<' FROM DUAL UNION ALL
SELECT 25, 'GEORGE & SON=' FROM DUAL UNION ALL
SELECT 26, 'GEORGE & SON>' FROM DUAL UNION ALL
SELECT 27, 'GEORGE & SON?' FROM DUAL UNION ALL
SELECT 28, 'GEORGE & SON@' FROM DUAL UNION ALL
SELECT 29, 'GEORGE & SON[' FROM DUAL UNION ALL
SELECT 30, 'GEORGE & SON^' FROM DUAL UNION ALL
SELECT 31, 'GEORGE & SON_' FROM DUAL UNION ALL
SELECT 32, 'GEORGE & SON`' FROM DUAL UNION ALL
SELECT 33, 'GEORGE & SON{' FROM DUAL UNION ALL
SELECT 34, 'GEORGE & SON|' FROM DUAL UNION ALL
SELECT 35, 'GEORGE & SON}' FROM DUAL UNION ALL
SELECT 36, 'GEORGE & SON~' FROM DUAL UNION ALL
SELECT 37, 'GEORGE & SON''' FROM DUAL UNION ALL
SELECT 38, 'GEORGE & SON]' FROM DUAL)
SELECT *
  FROM T
WHERE REGEXP_LIKE(col, '(^[[:punct:]]|[-!"#$%&()*+,\/:;<=>?@[^_`{|}~' || '''' || ']$)|]$|\.\.$|(^[[:space:]]|[[:space:]]$)')
ORDER BY id
;
Updated requirements
Punctuation followed by a dot
Add an optional dot after the custom character class; change
'[-!"#$%&()*+,\/:;<=>?@[^_`{|}~' || '''' || ']$'
to
'[-!"#$%&()*+,\/:;<=>?@[^_`{|}~' || '''' || ']\.?$'
as in
WITH T (id, col) AS (
SELECT 40, 'GEORGE & SON^.' FROM DUAL UNION ALL
SELECT 41, 'GEORGE & SON_.' FROM DUAL UNION ALL
SELECT 42, 'GEORGE & SON`.' FROM DUAL UNION ALL
SELECT 43, 'GEORGE & SON{.' FROM DUAL UNION ALL
SELECT 44, 'GEORGE & SON|.' FROM DUAL UNION ALL
SELECT 45, 'GEORGE & SON}.' FROM DUAL UNION ALL
SELECT 46, 'GEORGE & SON~.' FROM DUAL UNION ALL
SELECT 47, 'GEORGE & SON''.' FROM DUAL UNION ALL
SELECT 48, 'GEORGE & SON].' FROM DUAL)
SELECT *
  FROM T
WHERE REGEXP_LIKE(col, '([-!"#$%&()*+,\/:;<=>?@[^_`{|}~' || '''' || ']\.?$)|]\.?$')
ORDER BY id
;
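Repeated whitespace and special characters (in any combination) inside the string
Originally, only leading and trailing occurrences were asked for ... ;-)
A sequence of two or more whitespace/punctuation characters is matched by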
[[:space:][:punct:]]{2,}
If you want to match it strictly inside the string, simply surround it with word characters:
\w[[:space:][:punct:]]{2,}\w
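Leading/trailing runs of whitespace are already matched as soon as a single whitespace character is found there, so there is no need to handle them explicitly.
This gives: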
WITH T (id, col) AS (
SELECT 50, 'GEORGE & SON ' FROM DUAL UNION ALL
SELECT 51, 'GEORGE & SON ' FROM DUAL UNION ALL
SELECT 52, ' GEORGE & SON' FROM DUAL UNION ALL
SELECT 53, ' GEORGE & SON' FROM DUAL UNION ALL
SELECT 54, 'GEORGE & SON' FROM DUAL UNION ALL
SELECT 55, 'GEORGE & SON S.A.' FROM DUAL UNION ALL
SELECT 56, 'GEORGE & SON S.A.' FROM DUAL UNION ALL
SELECT 60, ' GEORGE and SON' FROM DUAL UNION ALL
SELECT 61, ' ,GEORGE and SON' FROM DUAL UNION ALL
SELECT 62, ', GEORGE and SON' FROM DUAL UNION ALL
SELECT 63, 'GEORGE -- SON' FROM DUAL UNION ALL
SELECT 64, 'GEORGE --SON' FROM DUAL UNION ALL
SELECT 65, 'GEORGE & SON' FROM DUAL UNION ALL
SELECT 66, 'GEORGE + SON' FROM DUAL UNION ALL
SELECT 67, 'GEORGE and , SON' FROM DUAL UNION ALL
SELECT 68, 'GEORGE and , SON' FROM DUAL UNION ALL
SELECT 69, 'GEORGE and SON ,' FROM DUAL UNION ALL
SELECT 70, 'GEORGE and SON. ' FROM DUAL UNION ALL
SELECT 71, 'GEORGE and+-SON' FROM DUAL)
SELECT *
  FROM T
-- WHERE REGEXP_LIKE(col, '(^[[:punct:]]|[-!"#$%&()*+,\/:;<=>?@[^_`{|}~' || '''' || ']\.?$)|]$|\.\.$|(^[[:space:]]|[[:space:]]$)|[[:space:][:punct:]]{2,}')
WHERE REGEXP_LIKE(col, '(^[[:punct:]]|[-!"#$%&()*+,\/:;<=>?@[^_`{|}~' || '''' || ']\.?$)|]$|\.\.$|(^[[:space:]]|[[:space:]]$)|\w[[:space:][:punct:]]{2,}\w')
ORDER BY id
;
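But this produces false positives, most prominently GEORGE & SON. To some extent this can be avoided by replacing [:punct:] with a less inclusive set. The (final) choice will depend on whether false negatives or false positives are the bigger concern.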
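To see why a given value is flagged, it can help to display the matched substring next to the value. The following is only a sketch: the column aliases matched_part and narrow_match are made up for illustration, the sample rows are typed in by hand, and the reduced set [-+,.[:space:]] is just one hypothetical example of a less inclusive replacement for [:punct:], not a recommendation.

```sql
-- Diagnostic sketch: show what the "stray sequence" pattern actually matched.
WITH T (id, col) AS (
  SELECT 63, 'GEORGE -- SON'    FROM DUAL UNION ALL
  SELECT 65, 'GEORGE & SON'     FROM DUAL UNION ALL
  SELECT 67, 'GEORGE and , SON' FROM DUAL UNION ALL
  SELECT 71, 'GEORGE and+-SON'  FROM DUAL)
SELECT id,
       col,
       -- full [:punct:] class, as used in the query above
       REGEXP_SUBSTR(col, '\w[[:space:][:punct:]]{2,}\w') AS matched_part,
       -- hypothetical reduced set: hyphen, plus, comma, dot and whitespace only
       REGEXP_SUBSTR(col, '\w[-+,.[:space:]]{2,}\w')      AS narrow_match
  FROM T
 WHERE REGEXP_LIKE(col, '\w[[:space:][:punct:]]{2,}\w')
 ORDER BY id;
```

With the reduced set, GEORGE & SON should come back with an empty narrow_match, because & is not part of that set. The price is that any stray character outside the set goes unnoticed, which is the false-negative side of the trade-off.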
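Catch any sequence of punctuation and whitespace characters - but allow a single letter followed by a dot and a space
As said before, false positives have to be balanced against false negatives, one way or the other. This may be a good moment, though, to consider splitting the overall problem into smaller ones and handling them separately. Even if GEORGE and P. SON is perfectly acceptable, you may still want to have a look at, e.g., -GEORGE and P. SON. So let's focus on stray character sequences in the middle of the string, also keeping the earlier ' & ' case in mind and allowing enumerations (hence allowing the comma):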
WHERE
REGEXP_LIKE(col, '\w[[:space:][:punct:]]{2,}\w')
AND
NOT REGEXP_LIKE(col, ' [[:upper:]]\. \w')
AND
NOT INSTR(col, ', ') > 0
AND
NOT INSTR(col, ' & ') > 0
probably followed by
WHERE
REGEXP_LIKE(col, '\w[[:space:][:punct:]]{2,}\w')
AND
(REGEXP_LIKE(col, ' [[:upper:]]\. \w')
OR
INSTR(col, ', ') > 0
OR
INSTR(col, ' & ') > 0
)
in order to find, e.g., GEORGE and , SON among the many valid values. INSTR might be faster than REGEXP_LIKE, depending on the overall picture ...
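The two filters above can also be folded into a single query that labels every row, which makes the split easier to eyeball. This is a minimal sketch under assumptions: the column name verdict and the labels ok / review / suspicious are invented here, and the sample rows are typed in by hand in the style of the data above.

```sql
-- Sketch: classify each value instead of running two separate filters.
WITH T (id, col) AS (
  SELECT 1, 'GEORGE and SON'    FROM DUAL UNION ALL
  SELECT 2, 'GEORGE & SON'      FROM DUAL UNION ALL
  SELECT 3, 'GEORGE and P. SON' FROM DUAL UNION ALL
  SELECT 4, 'GEORGE and , SON'  FROM DUAL UNION ALL
  SELECT 5, 'GEORGE -- SON'     FROM DUAL UNION ALL
  SELECT 6, 'GEORGE and+-SON'   FROM DUAL)
SELECT id,
       col,
       CASE
         -- no run of two or more whitespace/punctuation characters inside the string
         WHEN NOT REGEXP_LIKE(col, '\w[[:space:][:punct:]]{2,}\w')
           THEN 'ok'
         -- a run exists, but one of the tolerated patterns is present too:
         -- the run may or may not be the tolerated one, so check by hand
         WHEN REGEXP_LIKE(col, ' [[:upper:]]\. \w')
           OR INSTR(col, ', ')  > 0
           OR INSTR(col, ' & ') > 0
           THEN 'review'
         -- a run exists and none of the tolerated patterns explains it
         ELSE 'suspicious'
       END AS verdict
  FROM T
 ORDER BY id;
```

The 'suspicious' branch corresponds to the first WHERE clause above and the 'review' branch to the second one; searched CASE evaluates its branches in order, so the tolerated patterns are only consulted for rows that contain a run at all.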
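A few more words on the mechanics
(i) [[:punct:][:space:]] essentially combines [[:punct:]] and [[:space:]] into a single character class. The order does not matter as far as matching against that class is concerned.
(ii)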
[-!"#$%&()*+,\/:;<=>?@[^_`{|}~' || '''' || ']
is
[-!"#$%&()*+,\/:;<=>?@[^_`{|}~]
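with the single quote added. If you put the quote in directly, Oracle takes it as the end of the argument value, and escaping the single quote with a backslash does not work ... so basically this is what "the single quote is added separately" above refers to.
Please comment if adjustments or further details are required.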
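Another way to get the single quote into the pattern, offered here as a sketch rather than as the approach used above, is the standard SQL rule that two consecutive single quotes inside a string literal stand for one quote character. The shortened class [~'] and the aliases literal_check and regex_check are only there to keep the example readable:

```sql
-- Sketch: concatenating '''' versus doubling the quote inside one literal.
SELECT CASE
         WHEN '[~' || '''' || ']' = '[~'']' THEN 'same'
         ELSE 'different'
       END AS literal_check,
       CASE
         -- the value ends with a single quote; the class contains ~ and '
         WHEN REGEXP_LIKE('GEORGE & SON''', '[~'']$') THEN 'match'
         ELSE 'no match'
       END AS regex_check
  FROM DUAL;
```

The concatenated form used above relies on the same rule: '''' is itself a literal consisting of one doubled quote. Which form reads better is largely a matter of taste.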