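I'm using GSkinner's RegExr tool to help come up with a pattern that can find authorization numbers in a field that contains a lot of other junk. An authorization number is a string containing letters (sometimes), digits (always), and hyphens (sometimes). In other words, an authorization always contains a digit somewhere, but not always hyphens or letters. In addition, the authorization number can be located anywhere in the field I'm searching.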

Examples of valid authorization numbers include:

 5555834384734  ' All digits
 12110-AANM     ' Alpha plus digits, plus hyphens
 R-455545-AB-9  ' Alpha plus digits, plus multiple hyphens
 R-45-54A-AB-9  ' Alpha plus digits, plus multiple hyphens
 W892160        ' Alpha plus digits without hyphens

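Here is some sample data with the extra junk, which is sometimes attached to the real authorization number with a hyphen or with no space at all, making it look like part of the number. The junk comes in predictable forms/words: REF, CHEST, IP, AMB, OBV, and HOLD, which are not part of the authorization number.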

 5557653700 IP
 R025257413-001
 REF 120407175
 SNK601M71016
 U0504124 AMB
 W892160
 019870270000000
 00Q926K2
 A025229563
 01615217 AMB
 12042-0148
 SNK601M71016
 12096NHP174
 12100-ACDE
 12110-AANM
 12114AD5QIP
 REF-34555
 3681869/OBV ONL

Here is the pattern I'm using:

 "\b[a-zA-Z]*[\d]+[-]*[\d]*[A-Za-z0-9]*[\b]*"

I'm still learning RegExp, so it can no doubt be improved. It works for the cases above, but not for the following:

 REFA5-208-4990IP  'Extract the string 'A5-208-4990' without REF or IP
 OBV1213110379     'Extract the string '1213110379' without the OBV
 5520849900AMB     'Extract the string '5520849900' without AMB
 5520849900CHEST   'Extract the string '5520849900' without CHEST
 5520849900-IP     'Extract the string '5520849900' without -IP
 1205310691-OBV    'Extract the string without the -OBV
 R-025257413-001   'Numbers of this form should also be allowed.
 NO PCT 93660      'If string contains the word NO anywhere, it is not a match
 HOLDA5-208-4990   'If string contains the word HOLD anywhere, it is not a match

Can anyone help?

For testing purposes, here is a Sub that creates a table containing the sample input data:

 Sub CreateTestAuth()

 Dim dbs As Database
 Set dbs = CurrentDb

 With dbs
     .Execute "CREATE TABLE tbl_test_auth " _
         & "(AUTHSTR CHAR);"
     .Execute " INSERT INTO tbl_test_auth " _
         & "(AUTHSTR) VALUES " _
         & "('5557653700 IP');"
     .Execute " INSERT INTO tbl_test_auth " _
         & "(AUTHSTR) VALUES " _
         & "(' R025257413-001');"
     .Execute " INSERT INTO tbl_test_auth " _
         & "(AUTHSTR) VALUES " _
         & "('REF 120407175');"
     .Execute " INSERT INTO tbl_test_auth " _
         & "(AUTHSTR) VALUES " _
         & "('SNK601M71016');"
     .Execute " INSERT INTO tbl_test_auth " _
         & "(AUTHSTR) VALUES " _
         & "('U0504124 AMB');"
     .Execute " INSERT INTO tbl_test_auth " _
         & "(AUTHSTR) VALUES " _
         & "('3681869/OBV ONL');"
     .Execute " INSERT INTO tbl_test_auth " _
         & "(AUTHSTR) VALUES " _
         & "('REFA5-208-4990IP');"
     .Execute " INSERT INTO tbl_test_auth " _
         & "(AUTHSTR) VALUES " _
         & "('5520849900AMB');"
     .Execute " INSERT INTO tbl_test_auth " _
         & "(AUTHSTR) VALUES " _
         & "('5520849900CHEST');"
     .Execute " INSERT INTO tbl_test_auth " _
         & "(AUTHSTR) VALUES " _
         & "('5520849900-IP');"
     .Execute " INSERT INTO tbl_test_auth " _
         & "(AUTHSTR) VALUES " _
         & "('1205310691-OBV');"
     .Execute " INSERT INTO tbl_test_auth " _
         & "(AUTHSTR) VALUES " _
         & "('HOLDA5-208-4990');"
     .Close
 End With
 End Sub

5 Answers

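OK, at first I thought the extra requirements would make the regex longer,
but with a positive lookahead it is actually almost the same size. This time just the regex: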
\b(?=.*\d)([a-z0-9]+(?:-[a-z0-9]+)*)\b

Or broken down with comments (whitespace ignored):

\b                     # Word start
  (?=.*\d)             # A number has to follow somewhere after this point
  (                    # Start capture group
    [a-z0-9]+          # At least one alphanum
    (?:-[a-z0-9]+)*    # Possibly more attached with hyphen
  )                    # End capture group
\b                     # Word end

But note that not all regex flavors support variable-width lookahead. I don't know about VBA's.

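Second note: the (?=) is also satisfied if the digit appears after the end of the word. So in
DONT-RECOGNIZE-ME but-1-5ay-yes
DONT-RECOGNIZE-ME is captured too, even though it contains no digit.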
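
To see that second caveat in action, here is a small Python sketch (Python rather than VBA, purely for illustration). `STRICT` is a hypothetical variant that is not part of the answer above: its lookahead may only cross letters, digits, and hyphens, so the digit has to sit inside the token itself. The case-insensitive flag is an assumption, since the character classes list only lowercase letters.

```python
import re

# Pattern from the answer. IGNORECASE is an assumption: the classes only list a-z.
PATTERN = re.compile(r"\b(?=.*\d)([a-z0-9]+(?:-[a-z0-9]+)*)\b", re.IGNORECASE)

# Hypothetical stricter variant: the lookahead can only scan alphanumerics
# and hyphens, so a digit in a later word no longer satisfies it.
STRICT = re.compile(r"\b(?=[a-z0-9-]*\d)([a-z0-9]+(?:-[a-z0-9]+)*)\b", re.IGNORECASE)

text = "DONT-RECOGNIZE-ME but-1-5ay-yes"

print(PATTERN.findall(text))  # ['DONT-RECOGNIZE-ME', 'but-1-5ay-yes']
print(STRICT.findall(text))   # ['but-1-5ay-yes']
```

This is only a sketch and has not been checked against every input form in the question.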

answered 2012-05-30T20:39:23.927

Your sample input file (the path to this file should be passed to Function<GetMatches> as inputFilePath):

5557653700 IP
R025257413-001
REF 120407175
SNK601M71016
U0504124 AMB
W892160
019870270000000
00Q926K2
A025229563
01615217 AMB
12042-0148
SNK601M71016
12096NHP174
12100-ACDE
12110-AANM
12114AD5QIP
REF-34555
3681869/OBV ONL

Here is the junk list, saved in a file (the path to this file should be passed to Function<GetMatches> as replaceDBPath):

^REF
IP$
^OBV
AMB$
CHEST$
-OBV$
^.*(NO|HOLD).*$

Here is the .bas module:

Option Explicit
'This example uses the following references:
'Microsoft VBScript Regular Expressions 5.5 and Microsoft Scripting Runtime

Private fso As New Scripting.FileSystemObject
Private re As New VBScript_RegExp_55.RegExp

Private Function GetJunkList(fpath$) As String()
0     On Error GoTo errHandler
1     If fso.FileExists(fpath) Then
2         Dim junkList() As String, mts As MatchCollection, mt As Match, pos&, tmp$
3         tmp = fso.OpenTextFile(fpath).ReadAll()
4         With re
5             .Global = True
6             .MultiLine = True
7             .Pattern = "[^\r\n]+"
8             Set mts = .Execute(tmp)
9             ReDim junkList(mts.Count - 1)
10            For Each mt In mts
11                junkList(pos) = mt.Value
12                pos = pos + 1
13            Next mt
14        End With
15        GetJunkList = junkList
16    Else
17        MsgBox "File not found at:" & vbCr & fpath
18    End If
19    Exit Function
errHandler:
     Dim Msg$
     With Err
         Msg = "Error '" & .Number & " " & _
        .Description & "' occurred in " & _
        "Function<GetJunkList>" & IIf(Erl <> 0, " at line " & CStr(Erl) & ".", ".")
     End With
     MsgBox Msg, vbCritical
End Function

Public Function GetMatches(replaceDBPath$, inputFilePath$) As String()
0     On Error GoTo errHandler
1     Dim junks() As String, junkPat$, tmp$, results() As String, pos&, mts As MatchCollection, mt As Match
2     junks = GetJunkList(replaceDBPath)
3     tmp = fso.OpenTextFile(inputFilePath).ReadAll
4
5     With re
6        .Global = True
7        .MultiLine = True
8        .IgnoreCase = True
9        For pos = LBound(junks) To UBound(junks)
10           junkPat = junks(pos)
11           .Pattern = junkPat
12           'replace junk with an empty string
13           tmp = .Replace(tmp, "")
14       Next pos
15
16       'trim lines [if all input data in one line]
17       .Pattern = "^[ \t]*|[ \t]*$"
18       tmp = .Replace(tmp, "")
19
20       'create array using provided pattern
21       pos = 0
22       .Pattern = "\b[a-z]*[\d]+\-*\d*[a-z0-9]*\b"
23       Set mts = .Execute(tmp)
24       ReDim results(mts.Count - 1)
25       For Each mt In mts
26           results(pos) = mt.Value
27           pos = pos + 1
28       Next mt
29    End With
30
31    GetMatches = results
32    Exit Function
errHandler:
     Dim Msg$
     With Err
         Msg = "Error '" & .Number & " " & _
        .Description & "' occurred in " & _
        "Function<GetMatches>" & IIf(Erl <> 0, " at line " & CStr(Erl) & ".", ".")
     End With
     MsgBox Msg, vbCritical
End Function

And a sample tester:

Public Sub tester()
    Dim samples() As String, s
    samples = GetMatches("C:\Documents and Settings\Cylian\Desktop\junks.lst", "C:\Documents and Settings\Cylian\Desktop\sample.txt")
    For Each s In samples
        MsgBox s
    Next
End Sub

It can be called from the Immediate window:

tester
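
For comparison, the same junk-stripping idea can be sketched in a few lines of Python. This is a rough illustration, not a port of the module above. The names `extract_auth`, `REJECT`, and `TOKEN` are hypothetical, the NO/HOLD rule is pulled out of the junk list into a separate check, trailing spaces and hyphens left behind by a removal are trimmed (an added assumption), and the final pattern is swapped for a hyphen-joined one so that multi-hyphen numbers such as A5-208-4990 come back in one piece.

```python
import re

# Junk patterns taken from the list above, minus the NO/HOLD line.
JUNK_PATTERNS = [r"^REF", r"IP$", r"^OBV", r"AMB$", r"CHEST$", r"-OBV$"]

# Any line containing NO or HOLD is rejected outright (assumed reading of the rule).
REJECT = re.compile(r"NO|HOLD", re.IGNORECASE)

# Swapped-in token pattern: alphanumeric runs joined by single hyphens,
# with at least one digit somewhere inside the token.
TOKEN = re.compile(r"\b(?=[a-z0-9-]*\d)[a-z0-9]+(?:-[a-z0-9]+)*\b", re.IGNORECASE)


def extract_auth(line):
    """Return the authorization number found in one field value, or None."""
    line = line.strip()
    if REJECT.search(line):
        return None
    for junk in JUNK_PATTERNS:
        # Remove the junk, then trim spaces/hyphens it may leave dangling.
        line = re.sub(junk, "", line, flags=re.IGNORECASE).strip(" -")
    match = TOKEN.search(line)
    return match.group(0) if match else None


for sample in ["REFA5-208-4990IP", "5520849900-IP", "NO PCT 93660"]:
    print(sample, "->", extract_auth(sample))
# REFA5-208-4990IP -> A5-208-4990
# 5520849900-IP -> 5520849900
# NO PCT 93660 -> None
```

Note that anchored removals like `IP$` will also trim a genuine number that happens to end in those letters, so the junk list needs to be reviewed against real data.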

Hope this helps.

answered 2012-05-30T06:07:34.900

Because of the extra filtering, I would use a two-step approach.

var splitter = new Regex(@"[\t\n\r]+", RegexOptions.Multiline);
const string INPUT = @"REFA5-208-4990IP
       OBV1213110379
       5520849900AMB
       5520849900CHEST
       5520849900-IP
       1205310691-OBV
       R-025257413-001
       NO PCT 93660
       HOLDA5-208-4990";
string[] lines = splitter.Split(INPUT);

var blacklist = new[] { "NO", "HOLD" };
var ignores = new[] { "REF", "IP", "CHEST", "AMB", "OBV" };

var filtered = from line in lines
         where blacklist.All(black => line.IndexOf(black) < 0)
         select ignores.Aggregate(line, (acc, remove) => acc.Replace(remove, ""));

var authorization = new Regex(@"\b([a-z0-9]+(?:-[a-z0-9]+)*)\b", RegexOptions.IgnoreCase);
foreach (string s in filtered)
{
  Console.Write("'{0}' ==> ", s);
  var match = authorization.Match(s);
  if (match.Success)
  {
    Console.Write(match.Value);
  }
  Console.WriteLine();
}

Prints:

'A5-208-4990' ==> A5-208-4990
' 1213110379' ==> 1213110379
' 5520849900' ==> 5520849900
' 5520849900' ==> 5520849900
' 5520849900-' ==> 5520849900
' 1205310691-' ==> 1205310691
' R-025257413-001' ==> R-025257413-001
answered 2012-05-29T13:28:23.407

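The \b at the start is a problem. You also need to allow for some whitespace and some dashes. Try this: "[a-zA-Z|\s|-]*[\d]+[-]*[\d]*[A-Za-z0-9]*[\b]*". I only ran it against the authorization numbers.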
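
Two details of that pattern are easy to misread, so here is a short Python check (Python is used only for illustration; `SIMPLIFIED` and `same_matches` are hypothetical names). Inside a character class, `\b` means the backspace character rather than a word boundary, and `|` is a literal pipe rather than alternation. For input without backspace characters, the pattern should therefore behave the same as the shorter form below.

```python
import re

ORIGINAL = r"[a-zA-Z|\s|-]*[\d]+[-]*[\d]*[A-Za-z0-9]*[\b]*"
# Hypothetical shorter spelling: duplicate "|" dropped, "[\b]*" removed.
SIMPLIFIED = r"[a-zA-Z|\s-]*\d+-*\d*[A-Za-z0-9]*"

# [\b] matches a backspace (0x08), not a word boundary.
print(bool(re.fullmatch(r"[\b]", "\x08")))          # True
# A pipe inside a class is just one more literal character.
print(bool(re.fullmatch(r"[a-zA-Z|\s|-]", "|")))    # True


def same_matches(value):
    """Compare what both spellings find in one sample string."""
    return re.findall(ORIGINAL, value) == re.findall(SIMPLIFIED, value)


for sample in ["REF 120407175", "5520849900-IP", "R-025257413-001"]:
    print(sample, same_matches(sample))  # True for each sample
```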

answered 2012-05-29T13:08:57.640

Sometimes it's easier to keep things loose rather than rigidly sticking to one way or another. :)

Try this:

1 - Add this function

Public Function RemoveJunk(ByVal inputValue As String, ParamArray junkWords() As Variant) As String
    Dim junkWord
    For Each junkWord In junkWords
        inputValue = Replace(inputValue, junkWord, "", , , vbBinaryCompare)
    Next
    RemoveJunk = inputValue
End Function

2 - Now your task is simple. See the sample below for how to use it:

Sub Sample()
    Dim theText As String
    theText = " REFA5-208-4990IP blah blah "
    theText = RemoveJunk(theText, "-REF", "REF", "-IP", "IP", "-OBV", "OBV") '<-- complete this in a similar way

    Debug.Print theText

    '' -- now apply the regexp here --


End Sub

Completing the RemoveJunk call is a little tricky: put the longer junk words before the shorter ones. For example, "-OBV" should come before "OBV".
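
One way to avoid getting the order wrong by hand is to sort the junk words by length before replacing. A minimal Python sketch of that idea follows; `remove_junk` is a hypothetical stand-in for the VBA function above, not a translation of it.

```python
def remove_junk(value, junk_words):
    """Strip every junk word from value, longest words first."""
    # Longest first, so "-OBV" is removed before the shorter "OBV".
    for junk in sorted(junk_words, key=len, reverse=True):
        value = value.replace(junk, "")
    return value


JUNK = ["REF", "-REF", "IP", "-IP", "OBV", "-OBV"]

print(repr(remove_junk(" REFA5-208-4990IP blah blah ", JUNK)))
# ' A5-208-4990 blah blah '
print(repr(remove_junk("1205310691-OBV", JUNK)))
# '1205310691'
# Removing the short word first leaves the hyphen behind:
print(repr("1205310691-OBV".replace("OBV", "").replace("-OBV", "")))
# '1205310691-'
```

Like the VBA version, this is a plain case-sensitive substring replace, so it also removes "IP" or "REF" when those letters occur inside a genuine number.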

Give it a try and see if it solves your problem.

answered 2012-05-29T13:28:30.537