python - How do I create a regex pattern to extract a character from a list of variously structured strings?

Question

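I'm using a regex to extract the letter "u" from address strings, but only when it is used as an abbreviation (u, u., U, U., and so on). The problem I'm running into is that my list of strings is messy and full of errors. I've tried to pull out what I need across the various errors I see in the data. I know I must be missing something small, but any help is appreciated.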

I have tried these regexes:

(\s(u|U)?.?,?.?\s) <- looks a bit funky
[^\w+][uU]
[^\w+][uU][^tca]

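I have another idea for solving this: split the address apart (street, number, and so on), fix the street part, and glue it back together. I have had some luck pulling out just the number part: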

(\d+-\d+|\d+/*\w*|(-))

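However, I'd like to understand where I went wrong in the regex that is supposed to pick out the "u". Regex101.com has been my best friend, and I wouldn't have gotten this far without it.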

import re

test_strings = [
    "Holics u 5/a",
    "Holics U 5/a",
    "Holics u5/a",
    "Huolics u 5/a",
    "Holics u. 5/a",
    "Holuics u5",
    "Holics and other stuff u more stuff after 5",
    "Houlics utca 5"
]

# two regex patterns I have considered 

print("First regex pattern ------------------------------------")
pattern = r"[^\w+][uU]"
replacement_text = " utca "

for item in test_strings:
    print(re.sub(pattern,replacement_text,item))

print("\nSecond regex pattern ------------------------------------")
pattern = r"[^\w+][uU][^tca]"
replacement_text = " utca "

for item in test_strings:
    print(re.sub(pattern,replacement_text,item))

Output of the code above:

First regex pattern:

Holics utca  5/a
Holics utca  5/a
Holics utca 5/a
Huolics utca  5/a
Holics utca . 5/a
Holuics utca 5
Holics and other stuff utca  more stuff after 5
Houlics utca tca 5 # <-------------------------------- issue

Second regex pattern:

Holics utca 5/a
Holics utca 5/a
Holics utca /a # <----------------------------------- issue
Huolics utca 5/a
Holics utca  5/a
Holuics utca  <-------------------------------------- issue
Holics and other stuff utca more stuff after 5
Houlics utca 5

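With the first regex pattern, everything works except the last line ("Houlics utca tca 5"). When I try to write an expression that accounts for strings already containing "utca", I lose the number in strings like "Holics u5/a".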
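A quick way to see what each pattern actually consumes is to list its matches with re.findall. This is only a diagnostic sketch, and the helper name `consumed` is made up for illustration. Note that inside a character class the `+` is a literal plus sign, so `[^\w+]` means "any character that is neither a word character nor a plus".

```python
import re

def consumed(pattern: str, text: str) -> list[str]:
    """Return every substring the pattern matches (and re.sub would replace)."""
    return re.findall(pattern, text)

# First pattern: nothing checks what follows the "u", so the "u" of "utca" matches.
print(consumed(r"[^\w+][uU]", "Houlics utca 5"))      # [' u']

# Second pattern: [^tca] consumes one extra character, here the digit.
print(consumed(r"[^\w+][uU][^tca]", "Holics u5/a"))   # [' u5']
print(consumed(r"[^\w+][uU][^tca]", "Holuics u5"))    # [' u5']
```

The character after the "u" has to be checked without being consumed, which is the job of a lookahead.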

In most cases, I want the result to be:

Holics u. 5/a -----> Holics utca 5/a

One last note: I already have functions that remove periods and whitespace.

score 1 · Accepted Answer

You can use

re.sub(r'\b[uU](?=\b|\d)\.?\s*', 'utca ', s)

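Details

\b - a word boundary
[uU] - u or U
(?=\b|\d) - immediately to the right of the current position there must be a word boundary or a digit
\.? - an optional dot
\s* - zero or more whitespace characters.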
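As a minimal sketch, here is the pattern above applied to the test strings from the question. The names `ABBREV_U` and `expand_u` are illustrative only and appear nowhere in the original code.

```python
import re

# Pattern from the answer, compiled once; the constant name is an assumption.
ABBREV_U = re.compile(r'\b[uU](?=\b|\d)\.?\s*')

def expand_u(address: str) -> str:
    """Replace a standalone u / U / u. (plus trailing whitespace) with 'utca '."""
    return ABBREV_U.sub('utca ', address)

test_strings = [
    "Holics u 5/a",
    "Holics U 5/a",
    "Holics u5/a",
    "Huolics u 5/a",
    "Holics u. 5/a",
    "Holuics u5",
    "Holics and other stuff u more stuff after 5",
    "Houlics utca 5",
]

for item in test_strings:
    print(expand_u(item))

# "Holics u5/a"    -> "Holics utca 5/a"  (digit is only looked at, not consumed)
# "Holics u. 5/a"  -> "Holics utca 5/a"  (dot and space are consumed)
# "Houlics utca 5" -> "Houlics utca 5"   (the u of "utca" is followed by a letter)
```

Because the replacement always ends in a space, a "u" at the very end of a string would leave a trailing space, so a final strip() may be worth adding.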

Alternatively, you can use

re.sub(r'\b[uU](?=\b|(?![^\W\d_]))\.?\s*', 'utca ', s)

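See the regex demo and another regex demo.

Here, (?![^\W\d_]) replaces the digit requirement: instead of requiring a digit after the u, it fails the match if the next character is a letter.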
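The two patterns agree on the question's test strings. They differ only when the "u" is followed by a word character that is neither a letter nor a digit, such as an underscore. The sketch below shows this with a made-up input, "Holics u_5", which is not from the original data. The constant names are also illustrative.

```python
import re

# Both patterns are taken from the answer; only the names are assumptions.
DIGIT_AHEAD = re.compile(r'\b[uU](?=\b|\d)\.?\s*')
NO_LETTER_AHEAD = re.compile(r'\b[uU](?=\b|(?![^\W\d_]))\.?\s*')

samples = [
    "Holics u5/a",     # from the question
    "Houlics utca 5",  # from the question
    "Holics u_5",      # hypothetical input, for contrast only
]

for s in samples:
    first = DIGIT_AHEAD.sub('utca ', s)
    second = NO_LETTER_AHEAD.sub('utca ', s)
    print(f"{s!r}: {first!r} | {second!r}")

# 'Holics u5/a':    'Holics utca 5/a' | 'Holics utca 5/a'
# 'Houlics utca 5': 'Houlics utca 5'  | 'Houlics utca 5'
# 'Holics u_5':     'Holics u_5'      | 'Holics utca _5'
```

The first form is stricter and the second is more permissive, so the choice depends on which stray characters actually occur in the data.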
