First, I'm using EditPad Pro for my regex cleaning, so any answers given should work within that environment.

I get a large spreadsheet full of data that I have to clean every day. I've got the cleanup down to a couple of different regexes that I run, and this works, but I'm curious whether it's possible to reduce them to a single regex.

Here is some sample data:

3-CPC_114851_70095_70095_CAN-bre
3-CPC_114851_70095_70095_CAN
b11-ao1-113775-bre
b7-ao-114441
b7-ao-114441-bre
b7-ao1-114441
b7-ao1-114441-bre
http://go.nlvid.com/results1/?http://bo
go.nlv/results1/?click
b4-sm-1359
b6-sm-1356-bre
1359_195_1453814569-bre
1356_104_1456856729
b15-rad-8905
b15-rad-8905-bre

Here is how the above data needs to end up:

114851-bre
114851
113775-bre
114441
114441-bre
114441
114441-bre
http://go.nlvid.com/results1/
go.nlv/results1/
sm-1359
sm-1356-bre
sm-1359-bre
sm-1356
rad-8905
rad-8905-bre

So, there are numerous rules, such as:

  • In cases of more than 2 underscores, the result needs to contain only the value immediately after the first underscore, and everything from the dash onwards.
  • In cases where the string contains "-ao-", "-ao1-", everything prior to the final numeric string should be removed.
  • If a question mark is present, everything from the mark onwards should be removed.
  • If the string contains "-sm-" or "-rad-", everything prior to those alpha strings should be removed.
  • If the string contains exactly 2 underscores, everything after the first numeric string up to a dash (if present) should be removed, and the string "sm-" should be prepended.

Additionally there is other data that must be left untouched, including but not limited to:

113535|24905|24905

as well as many variations on this pattern of xxxxxx|yyyyy|zzzzz (and not always those string lengths)

This may be asking way too much of regex, I'm not sure as I'm not great with it. But I've seen some pretty impressive things done with it, so I thought I'd put this out to the community and see what you come back with.

2 Answers

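Jonathan, I can wrap all of these into a single regex except the last one (where you prepend "sm-" to strings that don't contain "sm"). That isn't possible in this case because there is no "sm" in the input to capture and reuse in the replacement, and because EPP has no "conditional replacement" syntax.

That said, you can achieve what you want in EPP with two regexes and a macro that chains them.

Here's how.

The following solution was tested in EPP.

Regex 1

  1. Press Ctrl + Shift + F to enter search/replace mode.
  2. Enter the following Search and Replace in the corresponding boxes.
  3. At the top right of the search bar, click the Favorite Searches drop-down, select "Add", and give it a name, for example Regex 1.

Search: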

(?mx)^
(?=(?:[^_\r\n]*?_){3})[^_\r\n]+?_([^_\r\n]+)[^-\r\n]+(-[^\r\n]+)?
|
[^\r\n]*?-ao1?-\D*([^\r\n]+)
|
([^\r\n?]*)(?=\?)[^\r\n]+
|
[^\r\n]*?-((?:sm|rad)-[^\r\n]+)

Replace:

\1\2\3\4\5
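
The replacement `\1\2\3\4\5` works because only the groups belonging to the alternative that matched capture anything; the rest come out empty. Here is a rough way to see that outside EPP, using Python's `re` module as a stand-in. This assumes Python's engine treats this pattern the same way EPP's does, and the helper names `apply_regex_1` and `participating_groups` are made up for illustration.

```python
import re

# Regex 1 from above. The (?mx) prefix is passed as flags instead.
REGEX_1 = re.compile(
    r"""
    ^(?=(?:[^_\r\n]*?_){3})[^_\r\n]+?_([^_\r\n]+)[^-\r\n]+(-[^\r\n]+)?
    |
    [^\r\n]*?-ao1?-\D*([^\r\n]+)
    |
    ([^\r\n?]*)(?=\?)[^\r\n]+
    |
    [^\r\n]*?-((?:sm|rad)-[^\r\n]+)
    """,
    re.MULTILINE | re.VERBOSE,
)


def apply_regex_1(text: str) -> str:
    """Run the Regex 1 search/replace over text (hypothetical helper)."""
    # Python 3.5+ substitutes an empty string for groups that did not match.
    return REGEX_1.sub(r"\1\2\3\4\5", text)


def participating_groups(line: str) -> list:
    """Return the numbers of the capture groups that took part in the match."""
    match = REGEX_1.search(line)
    if match is None:
        return []
    return [i for i, g in enumerate(match.groups(), start=1) if g is not None]


if __name__ == "__main__":
    for line in [
        "3-CPC_114851_70095_70095_CAN-bre",
        "b7-ao1-114441-bre",
        "go.nlv/results1/?click",
        "b15-rad-8905",
        "1359_195_1453814569-bre",  # two underscores: left for Regex 2
    ]:
        print(line, participating_groups(line), apply_regex_1(line))
```

Lines with exactly two underscores match none of the four alternatives, so they pass through Regex 1 unchanged.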

Regex 2

Same steps 1-2-3 as above.

Search:

^(?!(?:[^_\r\n]*?_){3})(?=(?:[^_\r\n]*?_){2})(\d+)(?:[^-\r\n]+(-[^\r\n]+)?)

Replace:

sm-\1\2
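
Regex 2 relies on two lookaheads: the negative one rejects lines with three or more underscores, and the positive one requires at least two, so only lines with exactly two underscores match. A small Python sketch of that behaviour, again assuming Python's `re` behaves like EPP here (in Python, `re.MULTILINE` is needed for `^` to match at every line start); `apply_regex_2` is a hypothetical name.

```python
import re

# Regex 2 from above, unchanged.
REGEX_2 = re.compile(
    r"^(?!(?:[^_\r\n]*?_){3})(?=(?:[^_\r\n]*?_){2})(\d+)(?:[^-\r\n]+(-[^\r\n]+)?)",
    re.MULTILINE,
)


def apply_regex_2(text: str) -> str:
    """Run the Regex 2 search/replace over text (hypothetical helper)."""
    # Group 1 is the leading number, group 2 the optional "-suffix".
    return REGEX_2.sub(r"sm-\1\2", text)


if __name__ == "__main__":
    for line in [
        "1359_195_1453814569-bre",        # exactly two underscores: rewritten
        "1356_104_1456856729",            # exactly two underscores: rewritten
        "3-CPC_114851_70095_70095_CAN",   # four underscores: rejected by (?!...)
        "113535|24905|24905",             # no underscores: rejected by (?=...)
    ]:
        print(line, "->", apply_regex_2(line))
```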

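Chaining Regex 1 and Regex 2

  1. Top menu: Macros, Record Macro, give it a name.
  2. Click the Favorite Searches drop-down and select Regex 1.
  3. Click Replace All.
  4. Click the Favorite Searches drop-down and select Regex 2.
  5. Click Replace All.
  6. Macros, Stop Recording.
  7. Whenever you want to run the replacement sequence, pull it up by name under the Macros menu.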
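
If you ever need the same two-pass sequence outside EPP, the macro amounts to running one replace-all after the other. A minimal Python sketch under the same assumption as before (Python's `re` standing in for EPP's engine); the function name `clean` is hypothetical.

```python
import re

# Pass 1: the four alternatives of Regex 1 ((?mx) given as flags).
REGEX_1 = re.compile(
    r"""
    ^(?=(?:[^_\r\n]*?_){3})[^_\r\n]+?_([^_\r\n]+)[^-\r\n]+(-[^\r\n]+)?
    |
    [^\r\n]*?-ao1?-\D*([^\r\n]+)
    |
    ([^\r\n?]*)(?=\?)[^\r\n]+
    |
    [^\r\n]*?-((?:sm|rad)-[^\r\n]+)
    """,
    re.MULTILINE | re.VERBOSE,
)

# Pass 2: lines with exactly two underscores get "sm-" prepended.
REGEX_2 = re.compile(
    r"^(?!(?:[^_\r\n]*?_){3})(?=(?:[^_\r\n]*?_){2})(\d+)(?:[^-\r\n]+(-[^\r\n]+)?)",
    re.MULTILINE,
)


def clean(text: str) -> str:
    """Apply Regex 1, then Regex 2, mirroring the order recorded in the macro."""
    after_first_pass = REGEX_1.sub(r"\1\2\3\4\5", text)
    return REGEX_2.sub(r"sm-\1\2", after_first_pass)


if __name__ == "__main__":
    sample = "\n".join([
        "3-CPC_114851_70095_70095_CAN-bre",
        "b7-ao-114441",
        "go.nlv/results1/?click",
        "b6-sm-1356-bre",
        "1359_195_1453814569-bre",
        "113535|24905|24905",  # pipe-delimited data should come back untouched
    ])
    print(clean(sample))
```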

Testing this

I tested my "Jonathan macro" against your input. Here is the result:

114851-bre
114851
113775-bre
114441
114441-bre
114441
114441-bre
http://go.nlvid.com/results1/
go.nlv/results1/
sm-1359
sm-1356-bre
sm-1359-bre
sm-1356
rad-8905
rad-8905-bre
Answered 2014-04-24T23:30:03.373

Try this:

  1. Toggle the search panel: SHIFT+CTRL+F
  2. Search: .*?((?:sm-|rad-)?(?:(?:\d+|[\w\.]+\/.*?))(?:-\w+)?$)
  3. Replace: $1
  4. Check REGEXWORDS
  5. Click Replace All or press CTRL+ALT+F3

See the image below:

[Image: EditPad search and replace]

Answered 2014-04-24T19:17:33.510