regex - Unexpected result from a URL regular expression

Question

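I am trying to match part of a URL. The URL has already been processed and contains only the domain name.

For example:

The URL I have now is business.time.com, and I want to get rid of the top-level domain (.com). The result I want is business.time.

I am using the following code: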

gawk '{
match($1, /[a-zA-Z0-9\-\.]+[^(.com|.org|.edu|.gov|.mil)]/, where)
print where[0]
print where[1]
}' test

The file test contains four lines:

business.time.com
mybest.try.com
this.is.a.example.org
this.is.another.example.edu

I expect this:

business.time

mybest.try

this.is.a.example

this.is.another.example

However, the output is:

business.t

mybest.try

this.is.a.examp

this.is.another.examp

Can anyone tell me what is wrong and what I should do?

Thanks

score 1 · Accepted Answer

Why not use the dot as the field separator and do this: awk -F. 'sub(FS $NF,x)' test

Or, for something more readable: rev test | cut -d. -f 2- | rev
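One caveat with the awk one-liner: FS $NF is used as a dynamic regex, so the dot matches any character and the pattern is not anchored to the end of the line. For an input such as welcome.com, the leftmost match would be lcom rather than the trailing .com. The following is a minimal sketch of an anchored variant, not the answerer's own code; the function name strip_last_label is hypothetical, and lines without a dot are passed through unchanged.

```shell
# strip_last_label: hypothetical helper. Reads hostnames on stdin, one per
# line, and removes the final dot-separated label (for example ".com").
strip_last_label() {
    # \.      a literal dot
    # [^.]*   any run of characters that are not dots
    # $       anchored to the end of the line
    awk '{ sub(/\.[^.]*$/, ""); print }'
}

printf '%s\n' business.time.com welcome.com | strip_last_label
# prints:
# business.time
# welcome
```

Because the pattern is anchored with $, only the last label can be removed, whatever the earlier labels contain.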

score 0 · Accepted Answer

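The problem is that [^] only excludes single characters, not expressions, so what you effectively have is this regex:

match($1, /[a-zA-Z0-9\-\.]+[^()|.cedgilmoruv]/, where)

That is why it cannot match the ime.com part of business.time.com: every one of those characters is in the [^] expression.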
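To see the character-class behaviour in isolation, here is a small sketch that runs the question's pattern through POSIX awk's two-argument match() with RSTART and RLENGTH, rather than gawk's three-argument form. The function name show_match is hypothetical, and the first bracket expression is written as [a-zA-Z0-9.-] to avoid the backslash escapes; it accepts the same characters.

```shell
# show_match: hypothetical helper. Prints the text matched by the question's
# pattern for each input line.
show_match() {
    awk '{
        # Inside [^...], the parentheses, pipes, dots and letters are just a
        # set of single characters to exclude, not a list of alternatives.
        if (match($0, /[a-zA-Z0-9.-]+[^(.com|.org|.edu|.gov|.mil)]/))
            print substr($0, RSTART, RLENGTH)
    }'
}

printf '%s\n' business.time.com | show_match
# prints: business.t
```

The match has to end on a character outside the excluded set, and t is the last such character in business.time.com, which accounts for the truncated output.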

I could not find a good way to do a negative match in gawk, but I did put together the following, which I hope works for you:

gawk '{
match($1, /([a-zA-Z0-9\-\.]+)(\.com|\.org|\.edu|\.gov|\.mil)/, where)
print where[0]
print where[1]
print where[2]
}' test

So the first part ends up in where[1], and where[2] holds the top-level domain:

business.time.com
business.time
.com
mybest.try.com
mybest.try
.com
this.is.a.example.org
this.is.a.example
.org
this.is.another.example.edu
this.is.another.example
.edu

score 0 · Accepted Answer

You can do it like this:

rev domains.txt | cut -d '.' -f 2- | rev

However, if you have more complex suffixes to remove, you can use sed with an explicit list, anchored to the end of the line:

sed -r 's/\.(com(\.hk)?|org|edu|net|gov|mil)$//' domains.txt
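As a rough illustration of the explicit-list approach on sample input, the sketch below anchors the pattern to the end of the line, so that a label which merely begins with one of the listed suffixes (such as company) is left alone. It uses sed -E, the extended-regex flag accepted by both GNU and BSD sed; the function name strip_tld is hypothetical.

```shell
# strip_tld: hypothetical helper. Reads hostnames on stdin and removes a
# trailing suffix taken from an explicit list.
strip_tld() {
    # The trailing $ ensures only a suffix at the very end of the line is
    # removed; (\.hk)? also covers two-part endings such as .com.hk.
    sed -E 's/\.(com(\.hk)?|org|edu|net|gov|mil)$//'
}

printf '%s\n' www.company.com shop.example.com.hk | strip_tld
# prints:
# www.company
# shop.example
```

Lines that end in a suffix not on the list are printed unchanged, which makes gaps in the list easy to spot.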
