
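I tried asking this a few days ago. Admittedly I didn't word the question well or post any code at first, and it was closed. So I'm trying again here, because honestly this is driving me insane, fast. :)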

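I'm trying to implement this Address Parser, which was originally a console-based C# program. I have successfully converted it into a standalone WPF program that consists only of a TextBox for input, a Button to trigger the parse, and a TextBlock to display the result. While writing this post I trimmed the output down to what I need in my main program, and it still works fine. I've included the entire code below.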

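My next step was to port it into my main program, which I did by copy/paste. However, when I run that program, it hangs after the button is pressed. Eventually VS reports an error that the process has gone too long without pumping messages, and the memory usage shown in Task Manager climbs gradually from ~70k to 3,000,000. In response, I moved the parsing method onto a background worker, hoping to take the load off the main thread. That did stop the program from freezing, but the background thread just does the same thing: RAM usage climbs and nothing is ever returned.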

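So now I'm at an impasse. I know the problem is somewhere in the var result = parser.ParseAddress(input); statement, because when I set a breakpoint on every line of code, that is the last one hit. But I don't understand why this would cause a problem in one WPF program and not in the other.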

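If necessary, I'd be more than happy to post the main program's full source code somewhere, but I can't imagine that posting roughly 20 different class files and the project's code here would be a good idea. :)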

Standalone WPF application

namespace AddressParseWPF
{
    /// <summary>
    /// Interaction logic for MainWindow.xaml
    /// </summary>
    public partial class MainWindow : Window
    {
        public MainWindow()
        {
            InitializeComponent();
        }

        public void Execute()
        {
            AddressParser.AddressParser parser = new AddressParser.AddressParser();
            var input = inputTextBox.Text;

            var result = parser.ParseAddress(input);

            if (result == null)
            {
                outputTextBlock.Text = "ERROR. Input could not be parsed.";
            }
            else
            {
                outputTextBlock.Text = (result.StreetLine + ", " + result.City + ", " + result.State + "  " + result.Zip);
            }
        }

        private void actionButton_Click(object sender, RoutedEventArgs e)
        {
            Execute();
        }
    }
}

Main program that the parser was ported to

public void ExecuteAddressParse()
{
    AddressParser.AddressParser parser = new AddressParser.AddressParser();
    var input = inputTextBox.Text;

    var result = parser.ParseAddress(input);

    if (result == null)
    {
        outputTextBlock.Text = "ERROR. Input could not be parsed.";
    }
    else
    {
        outputTextBlock.Text = (result.StreetLine + ", " + result.City + ", " + result.State + "  " + result.Zip);
    }
}       

private void actionButton_Click(object sender, RoutedEventArgs e)
{
    ExecuteAddressParse();
}

ParseAddress method

public AddressParseResult ParseAddress(string input)
{
    if (!string.IsNullOrWhiteSpace(input))
    {
        var match = addressRegex.Match(input.ToUpperInvariant());
        if (match.Success)
        {
            var extracted = GetApplicableFields(match);
            return new AddressParseResult(Normalize(extracted));
        }
    }

    return null;
}

Regex match method

private static void InitializeRegex()
{
    var suffixPattern = new Regex(
        string.Join(
            "|",
            new [] {
                string.Join("|", suffixes.Keys), 
                string.Join("|", suffixes.Values.Distinct())
            }),
        RegexOptions.Compiled);

    var statePattern = 
        @"\b(?:" + 
        string.Join(
            "|",
            new [] {
                string.Join("|", states.Keys.Select(x => Regex.Escape(x))),
                string.Join("|", states.Values)
            }) +
        @")\b";

    var directionalPattern =
        string.Join(
            "|",
            new [] {
                string.Join("|", directionals.Keys),
                string.Join("|", directionals.Values),
                string.Join("|", directionals.Values.Select(x => Regex.Replace(x, @"(\w)", @"$1\.")))
            });

    var zipPattern = @"\d{5}(?:-?\d{4})?";

    var numberPattern =
        @"(
            ((?<NUMBER>\d+)(?<SECONDARYNUMBER>(-[0-9])|(\-?[A-Z]))(?=\b))    # Unit-attached
            |(?<NUMBER>\d+[\-\ ]?\d+\/\d+)                                   # Fractional
            |(?<NUMBER>\d+-?\d*)                                             # Normal Number
            |(?<NUMBER>[NSWE]\ ?\d+\ ?[NSWE]\ ?\d+)                          # Wisconsin/Illinois
          )";

    var streetPattern =
        string.Format(
            CultureInfo.InvariantCulture,
            @"
                (?:
                  # special case for addresses like 100 South Street
                  (?:(?<STREET>{0})\W+
                     (?<SUFFIX>{1})\b)
                  |
                  (?:(?<PREDIRECTIONAL>{0})\W+)?
                  (?:
                    (?<STREET>[^,]*\d)
                    (?:[^\w,]*(?<POSTDIRECTIONAL>{0})\b)
                   |
                    (?<STREET>[^,]+)
                    (?:[^\w,]+(?<SUFFIX>{1})\b)
                    (?:[^\w,]+(?<POSTDIRECTIONAL>{0})\b)?
                   |
                    (?<STREET>[^,]+?)
                    (?:[^\w,]+(?<SUFFIX>{1})\b)?
                    (?:[^\w,]+(?<POSTDIRECTIONAL>{0})\b)?
                  )
                )
            ",
            directionalPattern,
            suffixPattern);

    var rangedSecondaryUnitPattern =
        @"(?<SECONDARYUNIT>" +
        string.Join("|", rangedSecondaryUnits.Keys) +
        @")(?![a-z])";
    var rangelessSecondaryUnitPattern =
        @"(?<SECONDARYUNIT>" +
        string.Join(
            "|",
            string.Join("|", rangelessSecondaryUnits.Keys)) +
        @")\b";
    var allSecondaryUnitPattern = string.Format(
        CultureInfo.InvariantCulture,
        @"
            (
                (:?
                    (?: (?:{0} \W*)
                        | (?<SECONDARYUNIT>\#)\W*
                    )
                    (?<SECONDARYNUMBER>[\w-]+)
                )
                |{1}
            ),?
        ",
         rangedSecondaryUnitPattern,
         rangelessSecondaryUnitPattern);

    var cityAndStatePattern = string.Format(
        CultureInfo.InvariantCulture,
        @"
            (?:
                (?<CITY>[^\d,]+?)\W+
                (?<STATE>{0})
            )
        ",
        statePattern);
    var placePattern = string.Format(
        CultureInfo.InvariantCulture,
        @"
            (?:{0}\W*)?
            (?:(?<ZIP>{1}))?
        ",
        cityAndStatePattern,
        zipPattern);

    var addressPattern = string.Format(
        CultureInfo.InvariantCulture,
        @"
            ^
            # Special case for APO/FPO/DPO addresses
            (
                [^\w\#]*
                (?<STREETLINE>.+?)
                (?<CITY>[AFD]PO)\W+
                (?<STATE>A[AEP])\W+
                (?<ZIP>{4})
                \W*
            )
            |
            # Special case for PO boxes
            (
                \W*
                (?<STREETLINE>(P[\.\ ]?O[\.\ ]?\ )?BOX\ [0-9]+)\W+
                {3}
                \W*
            )
            |
            (
                [^\w\#]*    # skip non-word chars except # (eg unit)
                (  {0} )\W*
                   {1}\W+
                (?:{2}\W+)?
                   {3}
                \W*         # require on non-word chars at end
            )
            $           # right up to end of string
        ",
        numberPattern,
        streetPattern,
        allSecondaryUnitPattern,
        placePattern,
        zipPattern);
    addressRegex = new Regex(
        addressPattern,
        RegexOptions.Compiled | 
        RegexOptions.Singleline | 
        RegexOptions.IgnorePatternWhitespace);
}

3 Answers


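Does the regex work when you omit the RegexOptions.Compiled flag?

The answer is yes.

So why?

It seems the regex compiler is slow for (some?) large patterns.

That is the trade-off you have to make.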
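One way to check this trade-off on your own machine is to time the Regex constructor with and without the flag. The sketch below is only an illustration: the class and method names are made up for this example, and it assumes the compile cost shows up when the Regex object is constructed.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the address parser.
public static class CompiledFlagDemo
{
    // Constructs a Regex with the given options and returns how long construction took.
    public static TimeSpan TimeConstruction(string pattern, RegexOptions options)
    {
        var stopwatch = Stopwatch.StartNew();
        var regex = new Regex(pattern, options);
        stopwatch.Stop();
        GC.KeepAlive(regex); // keep the instance alive until timing is done
        return stopwatch.Elapsed;
    }
}
```

Calling TimeConstruction(addressPattern, RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace) and then the same call with RegexOptions.Compiled added should show whether compilation is what the program is waiting on.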

answered 2012-05-21T16:07:25.990

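Some of the regex subexpressions are ill-formed (as @Justin Morgan mentioned).
This is usually the result of joining together reusable regex fragments, which
makes me cringe.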

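However, if you are going to use this approach, it is always a good idea to
print out the actual regex after it has been constructed. Then, once it is formatted,
test it against samples, independently of your main program. It is much easier to fix that way.
If you see a suspicious subexpression, try to make it fail at that point, or more
generally try inserting a failure near the end of the sample. If it takes more than
the blink of an eye to fail, it is backtracking badly.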
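The "more than the blink of an eye" check can be automated outside the main program. The following is a minimal sketch, assuming .NET Framework 4.5 or later, where the Regex constructor accepts a match timeout. RegexProbe and FinishesWithin are names invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical test helper for probing a generated pattern in isolation.
public static class RegexProbe
{
    // Returns true if the match attempt completes (match or no match) within the limit,
    // and false if the engine is still working when the timeout expires.
    public static bool FinishesWithin(string pattern, string sample, TimeSpan limit)
    {
        var regex = new Regex(
            pattern,
            RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace,
            limit);
        try
        {
            regex.IsMatch(sample);
            return true;
        }
        catch (RegexMatchTimeoutException)
        {
            return false;
        }
    }
}
```

Feeding it the printed address pattern and a sample with a deliberate failure near the end, e.g. RegexProbe.FinishesWithin(addressPattern, sample, TimeSpan.FromMilliseconds(250)), gives a quick yes/no on whether that sample triggers heavy backtracking.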

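Backtracking isn't bad, though. It has great benefits; without it, some things
could not be matched at all. The trick is to isolate subexpressions whose outcome does not
affect the results around them, and then restrict their backtracking.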
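One mechanism .NET offers for restricting backtracking is the atomic group, (?>...). The answer does not say which technique it has in mind, so treat this as an assumption. The sketch uses a deliberately tiny pattern to show the effect: once an atomic group has matched, the engine will not go back into it to give characters up.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo class; the patterns are illustrative, not taken from the address regex.
public static class AtomicGroupDemo
{
    // \d+ can hand back its last digit so that the trailing \d has something to match.
    public const string Backtracking = @"^\d+\d$";

    // (?>\d+) keeps every digit it consumed, so the trailing \d can never match.
    public const string Atomic = @"^(?>\d+)\d$";

    public static bool Matches(string pattern, string input)
    {
        return Regex.IsMatch(input, pattern);
    }
}
```

Matches(Backtracking, "12345") succeeds, while Matches(Atomic, "12345") fails immediately instead of retrying shorter runs of digits. That also shows the risk: wrap only subexpressions that never need to give characters back.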

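I went to the USPS site and grabbed a few sample states/suffixes/directionals/secondary units,
enough to generate the address regex.
Below is a cleaned-up version of the regex generated from your code.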

Good luck!

 ^
   # Special case for APO/FPO/DPO addresses
   (
      [^\w\#]*
      (?<STREETLINE> .+? )
      (?<CITY> [AFD] PO )
      \W+
      (?<STATE> A [AEP] )
      \W+
      (?<ZIP> \d{5} (?: -? \d{4} )? )
      \W*
   )
 |         
   # Special case for PO boxes
   (
      \W*
      (?<STREETLINE> ( P [\.\ ]? O [\.\ ]? \  )? BOX \  [0-9]+ )
      \W+
      (?:
          (?:
              (?<CITY> [^\d,]+? )
              \W+
              (?<STATE>
                 \b
                 (?:AL|AK|AS|AZ|AR|Alabama|Alaska|American Samoa|Arizona|Arkansas)
                 \b
              )
          )
          \W*
      )?
      (?:
          (?<ZIP> \d{5} (?: -? \d{4} )? )
      )?
      \W*
   )
 |          
   (
       [^\w\#]*    # skip non-word chars except # (eg unit)
       (
         (
              (
                (?<NUMBER> \d+ )
                (?<SECONDARYNUMBER> (-[0-9]) | (\-?[A-Z]) )
                (?=\b)
              )                                                  # Unit-attached
           |          
             (?<NUMBER> \d+ [\-\ ]? \d+ \/ \d+ )                 # Fractional
           |
             (?<NUMBER> \d+ -? \d* )                             # Normal Number
           |
             (?<NUMBER>[NSWE]\ ?\d+\ ?[NSWE]\ ?\d+)              # Wisconsin/Illinois
         )
       )
       \W*

       (?:
           # special case for addresses like 100 South Street
           (?:
               (?<STREET>North|East|South|West|Northeast|Southeast|Northwest|Southwest|N|E|S|W|NE|SE|NW|SW|N\.|E\.|S\.|W\.|N\.E\.|S\.E\.|N\.W\.|S\.W\.)
               \W+
               (?<SUFFIX>ALLEY|ALY|ALLY|ALLEE|ALLEY|ALY)
               \b
           )
         |
           (?:
               (?<PREDIRECTIONAL>North|East|South|West|Northeast|Southeast|Northwest|Southwest|N|E|S|W|NE|SE|NW|SW|N\.|E\.|S\.|W\.|N\.E\.|S\.E\.|N\.W\.|S\.W\.)
               \W+
           )?
           (?:
                (?<STREET> [^,]* \d )
                (?:
                   [^\w,]*
                   (?<POSTDIRECTIONAL>North|East|South|West|Northeast|Southeast|Northwest|Southwest|N|E|S|W|NE|SE|NW|SW|N\.|E\.|S\.|W\.|N\.E\.|S\.E\.|N\.W\.|S\.W\.)
                   \b
                )
             |
                (?<STREET> [^,]+ )
                (?:
                    [^\w,]+
                    (?<SUFFIX>ALLEY|ALY|ALLY|ALLEE|ALLEY|ALY)
                    \b
                )
                (?:
                    [^\w,]+
                    (?<POSTDIRECTIONAL>North|East|South|West|Northeast|Southeast|Northwest|Southwest|N|E|S|W|NE|SE|NW|SW|N\.|E\.|S\.|W\.|N\.E\.|S\.E\.|N\.W\.|S\.W\.)
                    \b
                )?
             |
                (?<STREET> [^,]+? )
                (?:
                    [^\w,]+
                    (?<SUFFIX>ALLEY|ALY|ALLY|ALLEE|ALLEY|ALY)
                    \b
                )?
                (?:
                    [^\w,]+
                    (?<POSTDIRECTIONAL>North|East|South|West|Northeast|Southeast|Northwest|Southwest|N|E|S|W|NE|SE|NW|SW|N\.|E\.|S\.|W\.|N\.E\.|S\.E\.|N\.W\.|S\.W\.)
                    \b
                )?
           )
       )           

       \W+        

       (?:      
           (
               (
                  :?
                  (?:
                      (?:
                         (?<SECONDARYUNIT>APT|BLDG|DEPT|FL|HNGR|LOT|PIER|RM|SLIP|SPC|STOP|STE|TRLR|UNIT)
                         (?! [a-z] )
                         \W*
                       )
                    |
                       (?<SECONDARYUNIT> \# )
                       \W*
                  )
                  (?<SECONDARYNUMBER> [\w-]+ )
               )
             |
               (?<SECONDARYUNIT>BSMT|FRNT|LBBY|LOWR|OFC|PH|REAR|SIDE|UPPR)
               \b
           )
           ,?
           \W+
       )?

       (?:
           (?:
               (?<CITY> [^\d,]+? )
               \W+
               (?<STATE>
                  \b
                  (?:AL|AK|AS|AZ|AR|Alabama|Alaska|American Samoa|Arizona|Arkansas)
                  \b
               )
           )
           \W*
       )?

       (?:
           (?<ZIP> \d{5} (?: -? \d{4} )? )
       )?

       \W*         # require on non-word chars at end
   )
 $           # right up to end of string

C# code

   public static void InitializeRegex()
    {
        Dictionary<string, string> suffixes = new Dictionary<string, string>()
        {
          {"ALLEY",  "ALLEE"},
          {"ALY",  "ALLEY"},
          {"ALLY",  "ALY"},
        };

        var suffixPattern = new Regex(
            string.Join(
                "|",
                new[] {
            string.Join("|", suffixes.Keys.ToArray()), 
            string.Join("|", suffixes.Values.Distinct().ToArray())
        }),
            RegexOptions.Compiled);

        //Console.WriteLine("\n"+suffixPattern);

        Dictionary<string, string> states = new Dictionary<string, string>()
        {
           {"AL", "Alabama"},
           {"AK", "Alaska"},
           {"AS",  "American Samoa"},
           {"AZ",  "Arizona"},
           {"AR", "Arkansas"}
        };

        var statePattern =
            @"\b(?:" +
            string.Join(
                "|",
                new[] {
            string.Join("|", states.Keys.Select(x => Regex.Escape(x)).ToArray()),
            string.Join("|", states.Values.ToArray())
        }) +
            @")\b";

        //Console.WriteLine("\n" + statePattern);

        Dictionary<string, string> directionals = new Dictionary<string, string>()
        {
           {"North", "N" },
           {"East", "E" },
           {"South", "S" },
           {"West", "W" },
           {"Northeast", "NE" },
           {"Southeast", "SE" },
           {"Northwest", "NW" },
           {"Southwest", "SW" }
        };

        var directionalPattern =
            string.Join(
                "|",
                new[] {
            string.Join("|", directionals.Keys.ToArray()),
            string.Join("|", directionals.Values.ToArray()),
            string.Join("|", directionals.Values.Select(x => Regex.Replace(x, @"(\w)", @"$1\.")).ToArray())
        });

        //Console.WriteLine("\n" + directionalPattern);

        var zipPattern = @"\d{5}(?:-?\d{4})?";

        //Console.WriteLine("\n" + zipPattern);

        var numberPattern =
            @"(
                ((?<NUMBER>\d+)(?<SECONDARYNUMBER>(-[0-9])|(\-?[A-Z]))(?=\b))    # Unit-attached
                |(?<NUMBER>\d+[\-\ ]?\d+\/\d+)                                   # Fractional
                |(?<NUMBER>\d+-?\d*)                                             # Normal Number
                |(?<NUMBER>[NSWE]\ ?\d+\ ?[NSWE]\ ?\d+)                          # Wisconsin/Illinois
             )";

        //Console.WriteLine("\n" + numberPattern);

        var streetPattern =
            string.Format(
                CultureInfo.InvariantCulture,
                @"
                    (?:
                      # special case for addresses like 100 South Street
                      (?:(?<STREET>{0})\W+
                         (?<SUFFIX>{1})\b)
                      |
                      (?:(?<PREDIRECTIONAL>{0})\W+)?
                      (?:
                        (?<STREET>[^,]*\d)
                        (?:[^\w,]*(?<POSTDIRECTIONAL>{0})\b)
                       |
                        (?<STREET>[^,]+)
                        (?:[^\w,]+(?<SUFFIX>{1})\b)
                        (?:[^\w,]+(?<POSTDIRECTIONAL>{0})\b)?
                       |
                        (?<STREET>[^,]+?)
                        (?:[^\w,]+(?<SUFFIX>{1})\b)?
                        (?:[^\w,]+(?<POSTDIRECTIONAL>{0})\b)?
                      )
                    )
                ",
                directionalPattern,
                suffixPattern);

        //Console.WriteLine("\n" + streetPattern);


        Dictionary<string, string> rangedSecondaryUnits = new Dictionary<string, string>()
        {
            {"APT",  "APARTMENT"},
            {"BLDG", "BUILDING"}, 
            {"DEPT", "DEPARTMENT"}, 
            {"FL",   "FLOOR"}, 
            {"HNGR", "HANGAR"}, 
            {"LOT",  "LOT"}, 
            {"PIER", "PIER"}, 
            {"RM",   "ROOM"}, 
            {"SLIP", "SLIP"}, 
            {"SPC",  "SPACE"}, 
            {"STOP", "STOP"}, 
            {"STE",  "SUITE"}, 
            {"TRLR", "TRAILER"}, 
            {"UNIT", "UNIT"} 
        };
        var rangedSecondaryUnitPattern =
            @"(?<SECONDARYUNIT>" +
            string.Join("|", rangedSecondaryUnits.Keys.ToArray()) +
            @")(?![a-z])";

        //Console.WriteLine("\n" + rangedSecondaryUnitPattern);


        Dictionary<string, string> rangelessSecondaryUnits = new Dictionary<string, string>()
        {
            {"BSMT", "BASEMENT"},
            {"FRNT", "FRONT"},
            {"LBBY", "LOBBY"},
            {"LOWR", "LOWER"},
            {"OFC",  "OFFICE"},
            {"PH",   "PENTHOUSE"},
            {"REAR", "REAR"},
            {"SIDE", "SIDE"},
            {"UPPR", "UPPER"}
        };

        var rangelessSecondaryUnitPattern =
            @"(?<SECONDARYUNIT>" +
            string.Join("|", rangelessSecondaryUnits.Keys.ToArray()) +
            @")\b";

        //Console.WriteLine("\n" + rangelessSecondaryUnitPattern);

        var allSecondaryUnitPattern = string.Format(
            CultureInfo.InvariantCulture,
            @"
                (
                    (:?
                        (?: (?:{0} \W*)
                            | (?<SECONDARYUNIT>\#)\W*
                        )
                        (?<SECONDARYNUMBER>[\w-]+)
                    )
                    |{1}
                ),?
            ",
             rangedSecondaryUnitPattern,
             rangelessSecondaryUnitPattern);

        //Console.WriteLine("\n" + allSecondaryUnitPattern);

        var cityAndStatePattern = string.Format(
            CultureInfo.InvariantCulture,
            @"
                (?:
                    (?<CITY>[^\d,]+?)\W+
                    (?<STATE>{0})
                )
            ",
            statePattern);

        //Console.WriteLine("\n" + cityAndStatePattern);

        var placePattern = string.Format(
            CultureInfo.InvariantCulture,
            @"
                (?:{0}\W*)?
                (?:(?<ZIP>{1}))?
            ",
            cityAndStatePattern,
            zipPattern);

        //Console.WriteLine("\n" + placePattern);

        var addressPattern = string.Format(
            CultureInfo.InvariantCulture,
            @"
                ^
                # Special case for APO/FPO/DPO addresses
                (
                    [^\w\#]*
                    (?<STREETLINE>.+?)
                    (?<CITY>[AFD]PO)\W+
                    (?<STATE>A[AEP])\W+
                    (?<ZIP>{4})
                    \W*
                )
                |
                # Special case for PO boxes
                (
                    \W*
                    (?<STREETLINE>(P[\.\ ]?O[\.\ ]?\ )?BOX\ [0-9]+)\W+
                    {3}
                    \W*
                )
                |
                (
                    [^\w\#]*    # skip non-word chars except # (eg unit)
                    (  {0} )\W*
                       {1}\W+
                    (?:{2}\W+)?
                       {3}
                    \W*         # require on non-word chars at end
                )
                $           # right up to end of string
            ",
            numberPattern,
            streetPattern,
            allSecondaryUnitPattern,
            placePattern,
            zipPattern);

        Console.WriteLine("\n-----------------------------\n\n" + addressPattern);

        var addressRegex = new Regex(
            addressPattern,
            RegexOptions.Compiled |
            RegexOptions.Singleline |
            RegexOptions.IgnorePatternWhitespace);

    }
answered 2012-05-22T22:09:36.667

Gradually increasing resource usage like this is a smoking gun for catastrophic backtracking. Basically, if you have something like this part:

(?<CITY>[^\d,]+?)\W+

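...then there is ambiguity about which part of the input matches which part of the pattern. Almost anything that matches \W can also match [^\d,]. If the input fails to match on the first pass, the engine will go back and try different permutations of these two groups, which consumes resources.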

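For example, say the "city" part of your input is followed by a whole bunch of spaces. A long run of spaces will match both [^\d,]+? and \W+, so it is unclear whether the CITY group should include the spaces. Based on the lazy/greedy behavior of these quantifiers, the engine will first try to fit the city name into [^\d,]+?\W+. Then it will move on and try to match the rest of the input.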

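If the rest of the input matches on the first try, great. But if the match fails, the engine has to go back and try again, this time matching one of the spaces with [^\d,]+? and capturing it as part of your CITY group. If that fails, it retries with two spaces, and so on.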
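The retry behavior described above can be observed directly with a cut-down pattern. The sketch below is not the parser's real regex: it keeps only the CITY fragment, hard-codes AL as the state, and uses invented names. Changing \W+ to a single \W forces the engine through exactly the retries described, so the extra spaces end up inside the CITY capture.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo; simplified stand-ins for the CITY/STATE part of the address pattern.
public static class CityAmbiguityDemo
{
    // \W+ absorbs all the spaces on the first attempt, so CITY stays clean.
    public const string GreedySeparator = @"^(?<CITY>[^\d,]+?)\W+(?<STATE>AL)$";

    // A single \W cannot absorb three spaces, so the lazy CITY group is
    // extended one character at a time until the rest of the pattern fits.
    public const string SingleSeparator = @"^(?<CITY>[^\d,]+?)\W(?<STATE>AL)$";

    // Returns the CITY capture, or null when the pattern does not match.
    public static string CaptureCity(string pattern, string input)
    {
        var match = Regex.Match(input, pattern);
        return match.Success ? match.Groups["CITY"].Value : null;
    }
}
```

With the input "MOBILE   AL" (three spaces), GreedySeparator captures "MOBILE", while SingleSeparator captures "MOBILE  " with two trailing spaces. Both [^\d,] and \W are willing to take the same characters, which is the ambiguity the engine has to explore whenever a later part of the pattern fails.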

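You usually see this become a problem with nested quantifiers, e.g. ([ABC]+)*. I don't see any of that going on in your pattern, but I may have missed something among all the string.Format calls. My guess is that it is such a long pattern, with so many quantifiers and alternations to backtrack through (and so many groups to store), that even single-level iteration is killing you. I'd bet you take the biggest performance hit with long input strings that match most of the pattern but fail to match all of it.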

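Compiling the regex might help in this situation, and you should do it. However, I doubt that will cut it when your application is taking a thousand (or however many) hits at once. There will also be certain input strings that cause a lot of backtracking and hit you even harder performance-wise. My biggest recommendation is to find and resolve the ambiguities in the pattern.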

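Look for places where you have many quantifiers like * and + close together, and make sure there are clear, non-optional delimiters between them (e.g. \d+-?\d* from your NUMBER group would perform better as \d+(-\d*)?, or better yet \d+(-\d+)?\b). Finally, make sure the delimiters can't match the tokens next to them. For a contrived example, something like \W+\ \W+ will churn for a long time if you feed it a long run of whitespace.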
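To make the NUMBER suggestion concrete, here is a small sketch comparing the original fragment with the tightened one on the same inputs. The class and constant names are invented for this example, and the fragments are tested on their own rather than inside the full address pattern.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo comparing two versions of the NUMBER fragment.
public static class NumberPatternDemo
{
    // Original fragment: the hyphen and the trailing digits are independently optional.
    public const string Original = @"\d+-?\d*";

    // Tightened fragment: a hyphen is only accepted when digits follow it,
    // and the number must end at a word boundary.
    public const string Tightened = @"\d+(-\d+)?\b";

    // Returns the text of the first match, or an empty string if there is none.
    public static string FirstMatch(string pattern, string input)
    {
        return Regex.Match(input, pattern).Value;
    }
}
```

Both fragments return "123-45" for the input "123-45". For "123-" the original returns "123-", dangling hyphen included, while the tightened version returns "123". There are fewer ways to split the same text, so the engine has fewer alternatives to retry when a later part of the pattern fails.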

answered 2012-05-21T16:42:27.953