regex - RegEx pattern to extract authorization numbers

Question

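I'm using GSkinner's RegExr tool to help come up with a pattern that can find authorization numbers in a field that contains a lot of other junk. An authorization number is a string containing letters (sometimes), digits (always), and hyphens (sometimes); that is, an authorization always contains a digit somewhere, but does not always contain hyphens or letters. In addition, the authorization number can appear anywhere in the field I'm searching.

Examples of valid authorization numbers include: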

 5555834384734  ' All digits
 12110-AANM     ' Alpha plus digits, plus hyphens
 R-455545-AB-9  ' Alpha plus digits, plus multiple hyphens
 R-45-54A-AB-9  ' Alpha plus digits, plus multiple hyphens
 W892160        ' Alpha plus digits without hyphens

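Here is some sample data with extra junk. The junk is sometimes attached to the real authorization number with a hyphen or with no space at all, which makes it look like part of the number. The junk comes in predictable forms/words: REF, CHEST, IP, AMB, OBV, and HOLD, none of which are part of the authorization number.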

 5557653700 IP
 R025257413-001
 REF 120407175
 SNK601M71016
 U0504124 AMB
 W892160
 019870270000000
 00Q926K2
 A025229563
 01615217 AMB
 12042-0148
 SNK601M71016
 12096NHP174
 12100-ACDE
 12110-AANM
 12114AD5QIP
 REF-34555
 3681869/OBV ONL

Here is the pattern I'm using:

 "\b[a-zA-Z]*[\d]+[-]*[\d]*[A-Za-z0-9]*[\b]*"

I'm still learning RegExp, so it can no doubt be improved. It works for the cases above, but not for the following:

 REFA5-208-4990IP  'Extract the string 'A5-208-4990' without REF or IP
 OBV1213110379     'Extract the string '1213110379' without the OBV
 5520849900AMB     'Extract the string '5520849900' without AMB
 5520849900CHEST   'Extract the string '5520849900' without CHEST
 5520849900-IP     'Extract the string '5520849900' without -IP
 1205310691-OBV    'Extract the string without the -OBV
 R-025257413-001   'Numbers of this form should also be allowed.
 NO PCT 93660      'If string contains the word NO anywhere, it is not a match
 HOLDA5-208-4990   'If string contains the word HOLD anywhere, it is not a match

Can anyone help?

For testing purposes, here is a Sub that creates a table containing the sample input data:

 Sub CreateTestAuth()

 Dim dbs As Database
 Set dbs = CurrentDb

 With dbs
     .Execute "CREATE TABLE tbl_test_auth " _
         & "(AUTHSTR CHAR);"
     .Execute " INSERT INTO tbl_test_auth " _
         & "(AUTHSTR) VALUES " _
         & "('5557653700 IP');"
     .Execute " INSERT INTO tbl_test_auth " _
         & "(AUTHSTR) VALUES " _
         & "(' R025257413-001');"
     .Execute " INSERT INTO tbl_test_auth " _
         & "(AUTHSTR) VALUES " _
         & "('REF 120407175');"
     .Execute " INSERT INTO tbl_test_auth " _
         & "(AUTHSTR) VALUES " _
         & "('SNK601M71016');"
     .Execute " INSERT INTO tbl_test_auth " _
         & "(AUTHSTR) VALUES " _
         & "('U0504124 AMB');"
     .Execute " INSERT INTO tbl_test_auth " _
         & "(AUTHSTR) VALUES " _
         & "('3681869/OBV ONL');"
     .Execute " INSERT INTO tbl_test_auth " _
         & "(AUTHSTR) VALUES " _
         & "('REFA5-208-4990IP');"
     .Execute " INSERT INTO tbl_test_auth " _
         & "(AUTHSTR) VALUES " _
         & "('5520849900AMB');"
     .Execute " INSERT INTO tbl_test_auth " _
         & "(AUTHSTR) VALUES " _
         & "('5520849900CHEST');"
     .Execute " INSERT INTO tbl_test_auth " _
         & "(AUTHSTR) VALUES " _
         & "('5520849900-IP');"
     .Execute " INSERT INTO tbl_test_auth " _
         & "(AUTHSTR) VALUES " _
         & "('1205310691-OBV');"
     .Execute " INSERT INTO tbl_test_auth " _
         & "(AUTHSTR) VALUES " _
         & "('HOLDA5-208-4990');"
     .Close
 End With
 End Sub

score 1 · Accepted Answer

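OK, at first I thought the extra requirement would make the regex longer.
But with a positive lookahead it is actually almost the same size. Just the regex this time: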
\b(?=.*\d)([a-z0-9]+(?:-[a-z0-9]+)*)\b

Or broken down with comments (ignore the whitespace):

\b                     # Word start
  (?=.*\d)             # A number has to follow somewhere after this point
  (                    # Start capture group
    [a-z0-9]+          # At least one alphanum
    (?:-[a-z0-9]+)*    # Possibly more attached with hyphen
  )                    # End capture group
\b                     # Word end

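Note, however, that not all regex flavors support variable-width lookahead. I don't know about the one in VBA.

Second caveat: the (?=) is also satisfied if the digit appears after the end of the word. So in
DONT-RECOGNIZE-ME but-1-5ay-yes
the DONT-RECOGNIZE-ME part will be captured as well, even though it contains no digit itself.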
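
To make the caveat above concrete, here is a minimal sketch in Python's `re` module (not VBA, whose RegExp object may behave differently). `ANSWER` is the pattern from this answer with a case-insensitive flag added, since its classes only list `a-z`. `SCOPED` and the helper `tokens` are my own illustrative names, and the scoped lookahead is an assumption of mine rather than part of the answer: it only lets the lookahead scan the current token.

```python
import re

# Pattern from the answer. IGNORECASE is added because the classes only list a-z.
ANSWER = re.compile(r"\b(?=.*\d)([a-z0-9]+(?:-[a-z0-9]+)*)\b", re.IGNORECASE)

# Assumed variant (not from the answer): the lookahead may only scan the
# current token, so a digit later in the field no longer counts.
SCOPED = re.compile(r"\b(?=[a-z0-9-]*\d)([a-z0-9]+(?:-[a-z0-9]+)*)\b", re.IGNORECASE)


def tokens(pattern, text):
    """Return every hyphen-joined alphanumeric run the pattern captures."""
    return [m.group(1) for m in pattern.finditer(text)]


print(tokens(ANSWER, "R-45-54A-AB-9"))
print(tokens(ANSWER, "REF 120407175"))   # REF is captured too: a digit follows later
print(tokens(SCOPED, "REF 120407175"))   # only the token that contains a digit
print(tokens(SCOPED, "DONT-RECOGNIZE-ME but-1-5ay-yes"))
```

Neither pattern strips junk that is glued to the number: `5520849900AMB` or `5520849900-IP` still come back whole, so the junk words need a separate removal step.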

score 0 · Accepted Answer

Your sample input file (the path to this file should be passed to Function<GetMatches> as inputFilePath):

5557653700 IP
R025257413-001
REF 120407175
SNK601M71016
U0504124 AMB
W892160
019870270000000
00Q926K2
A025229563
01615217 AMB
12042-0148
SNK601M71016
12096NHP174
12100-ACDE
12110-AANM
12114AD5QIP
REF-34555
3681869/OBV ONL

Here is the junk list, saved in a file (the path to this file should be passed to Function<GetMatches> as replaceDBPath):

^REF
IP$
^OBV
AMB$
CHEST$
-OBV$
^.*(NO|HOLD).*$

Here is the .bas module:

Option Explicit
'This example uses the following references:
'Microsoft VBScript Regular Expressions 5.5 and Microsoft Scripting Runtime

Private fso As New Scripting.FileSystemObject
Private re As New VBScript_RegExp_55.RegExp

Private Function GetJunkList(fpath$) As String()
0     On Error GoTo errHandler
1     If fso.FileExists(fpath) Then
2         Dim junkList() As String, mts As MatchCollection, mt As Match, pos&, tmp$
3         tmp = fso.OpenTextFile(fpath).ReadAll()
4         With re
5             .Global = True
6             .MultiLine = True
7             .Pattern = "[^\r\n]+"
8             Set mts = .Execute(tmp)
9             ReDim junkList(mts.Count - 1)
10            For Each mt In mts
11                junkList(pos) = mt.Value
12                pos = pos + 1
13            Next mt
14        End With
15        GetJunkList = junkList
16    Else
17        MsgBox "File not found at:" & vbCr & fpath
18    End If
19    Exit Function
errHandler:
     Dim Msg$
     With Err
         Msg = "Error '" & .Number & " " & _
        .Description & "' occurred in " & _
        "Function<GetJunkList>" & IIf(Erl <> 0, " at line " & CStr(Erl) & ".", ".")
     End With
     MsgBox Msg, vbCritical
End Function

Public Function GetMatches(replaceDBPath$, inputFilePath$) As String()
0     On Error GoTo errHandler
1     Dim junks() As String, junkPat$, tmp$, results() As String, pos&, mts As MatchCollection, mt As Match
2     junks = GetJunkList(replaceDBPath)
3     tmp = fso.OpenTextFile(inputFilePath).ReadAll
4
5     With re
6        .Global = True
7        .MultiLine = True
8        .IgnoreCase = True
9        For pos = LBound(junks) To UBound(junks)
10           junkPat = junks(pos)
11           .Pattern = junkPat
12           'replace junk with []
13           tmp = .Replace(tmp, "")
14       Next pos
15
16       'trim lines [if all input data in one line]
17       .Pattern = "^[ \t]*|[ \t]*$"
18       tmp = .Replace(tmp, "")
19
20       'create array using provided pattern
21       pos = 0
22       .Pattern = "\b[a-z]*[\d]+\-*\d*[a-z0-9]*\b"
23       Set mts = .Execute(tmp)
24       ReDim results(mts.Count - 1)
25       For Each mt In mts
26           results(pos) = mt.Value
27           pos = pos + 1
28       Next mt
29    End With
30
31    GetMatches = results
32    Exit Function
errHandler:
     Dim Msg$
     With Err
         Msg = "Error '" & .Number & " " & _
        .Description & "' occurred in " & _
        "Function<GetMatches>" & IIf(Erl <> 0, " at line " & CStr(Erl) & ".", ".")
     End With
     MsgBox Msg, vbCritical
End Function

And a sample tester:

Public Sub tester()
    Dim samples() As String, s
    samples = GetMatches("C:\Documents and Settings\Cylian\Desktop\junks.lst", "C:\Documents and Settings\Cylian\Desktop\sample.txt")
    For Each s In samples
        MsgBox s
    Next
End Sub

It can be called from the Immediate window:

tester

Hope this helps.
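
For readers who want to try the idea outside of Access, the following is a rough Python sketch of the same flow: apply each junk pattern per line, trim, then extract. The junk patterns and the extraction pattern are copied from this answer; the names `JUNK_PATTERNS`, `AUTH`, and `get_matches` are mine, and the junk list is inlined rather than read from a file to keep the sketch self-contained.

```python
import re

# Junk patterns copied from the junk file in this answer.
JUNK_PATTERNS = [r"^REF", r"IP$", r"^OBV", r"AMB$", r"CHEST$", r"-OBV$", r"^.*(NO|HOLD).*$"]

# Extraction pattern used in GetMatches (originally from the question).
AUTH = re.compile(r"\b[a-z]*\d+-*\d*[a-z0-9]*\b", re.IGNORECASE)


def get_matches(text):
    """Strip junk line by line, trim each line, then collect the matches."""
    flags = re.MULTILINE | re.IGNORECASE
    for junk in JUNK_PATTERNS:
        text = re.sub(junk, "", text, flags=flags)
    # Trim leading and trailing blanks on every line.
    text = re.sub(r"^[ \t]+|[ \t]+$", "", text, flags=flags)
    return [m.group(0) for m in AUTH.finditer(text)]


sample = "5557653700 IP\nREF 120407175\n5520849900AMB\nNO PCT 93660\n1205310691-OBV"
print(get_matches(sample))
```

One thing this makes visible: the extraction pattern allows only a single hyphen group, so a value such as `A5-208-4990` comes back in pieces. A pattern that repeats the hyphenated part would probably be a better fit for those values.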

score 0 · Accepted Answer

Because of the extra filtering, I would use a two-step approach.

using System;
using System.Linq;
using System.Text.RegularExpressions;

var splitter = new Regex(@"[\t\n\r]+", RegexOptions.Multiline);
const string INPUT = @"REFA5-208-4990IP
       OBV1213110379
       5520849900AMB
       5520849900CHEST
       5520849900-IP
       1205310691-OBV
       R-025257413-001
       NO PCT 93660
       HOLDA5-208-4990";
string[] lines = splitter.Split(INPUT);

var blacklist = new[] { "NO", "HOLD" };
var ignores = new[] { "REF", "IP", "CHEST", "AMB", "OBV" };

var filtered = from line in lines
         where blacklist.All(black => line.IndexOf(black) < 0)
         select ignores.Aggregate(line, (acc, remove) => acc.Replace(remove, ""));

var authorization = new Regex(@"\b([a-z0-9]+(?:-[a-z0-9]+)*)\b", RegexOptions.IgnoreCase);
foreach (string s in filtered)
{
  Console.Write("'{0}' ==> ", s);
  var match = authorization.Match(s);
  if (match.Success)
  {
    Console.Write(match.Value);
  }
  Console.WriteLine();
}

Prints:

'A5-208-4990' ==> A5-208-4990
' 1213110379' ==> 1213110379
' 5520849900' ==> 5520849900
' 5520849900' ==> 5520849900
' 5520849900-' ==> 5520849900
' 1205310691-' ==> 1205310691
' R-025257413-001' ==> R-025257413-001

score 0 · Accepted Answer

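The \b at the start is a problem. You also need to account for some spaces and some dashes. Try this: "[a-zA-Z|\s|-]*[\d]+[-]*[\d]*[A-Za-z0-9]*[\b]*". I only ran it against the authorization numbers.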
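
A side note on the character classes in that pattern, illustrated with Python's `re` module (an assumption on my part that VBA's RegExp object treats them the same way, which I have not verified): inside `[...]` the pipe is a literal character rather than alternation, and `\b` means backspace rather than a word boundary. `ORIGINAL` and `SIMPLIFIED` are illustrative names; the simplified form is my own rewrite.

```python
import re

# The pattern as posted in this answer.
ORIGINAL = re.compile(r"[a-zA-Z|\s|-]*[\d]+[-]*[\d]*[A-Za-z0-9]*[\b]*")

# Assumed simplification: no pipes in the class, no [\b]* at the end.
SIMPLIFIED = re.compile(r"[a-zA-Z\s-]*\d+-*\d*[A-Za-z0-9]*")

# Inside [...] the pipe is an ordinary character, so "|" itself is accepted.
print(bool(re.fullmatch(r"[a-zA-Z|\s|-]*", "AB|-")))

# Inside [...] \b is the backspace character (0x08), not a word boundary.
print(bool(re.fullmatch(r"[\b]", "\x08")))

# On a valid number both forms match the same text.
print(ORIGINAL.search("R-025257413-001").group(0))
print(SIMPLIFIED.search("R-025257413-001").group(0))
```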

score 0 · Accepted Answer

Sometimes it is easier to keep things loose rather than sticking rigidly to one approach or another. :)

Try this:

1 - Add this function

Public Function RemoveJunk(ByVal inputValue As String, ParamArray junkWords() As Variant) As String
    Dim junkWord
    For Each junkWord In junkWords
        inputValue = Replace(inputValue, junkWord, "", , , vbBinaryCompare)
    Next
    RemoveJunk = inputValue
End Function

2 - Now your task is simple. See the example below for how to use it:

Sub Sample()
    Dim theText As String
    theText = " REFA5-208-4990IP blah blah "
    theText = RemoveJunk(theText, "-REF", "REF", "-IP", "IP", "-OBV", "OBV") '<-- complete this in a similar way

    Debug.Print theText

    '' -- now apply the regexp here --


End Sub

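Completing the RemoveJunk call is a bit tricky: put the longer junk words before the shorter ones. For example, "-OBV" should come before "OBV".

Give it a try and see whether it solves your problem.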
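
The ordering rule can also be enforced in code instead of by hand. Here is a small Python sketch of that idea (the function name `remove_junk` mirrors RemoveJunk above, but the sorting step is my own addition, not part of this answer): the junk words are sorted longest first before they are removed, so the caller can pass them in any order.

```python
def remove_junk(value, junk_words):
    """Remove every junk word from value, trying longer words first."""
    # Longest first, so "-OBV" is removed before "OBV" can leave a stray hyphen.
    for word in sorted(junk_words, key=len, reverse=True):
        value = value.replace(word, "")  # case-sensitive, like vbBinaryCompare
    return value


print(remove_junk("1205310691-OBV", ["OBV", "-OBV"]))
print(remove_junk("REFA5-208-4990IP", ["IP", "REF", "-IP", "-REF"]))
```

Like the VBA version, this removes the junk words wherever they occur in the string, so it is only safe as long as a genuine authorization number never contains one of them.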
