regex - Extracting a URL from a string with a regular expression in a shell script

Question

I need to extract a URL that is wrapped in <strong> tags. It is a simple regular expression, but I don't know how to do this in a shell script. Here is an example:

line="<strong>http://www.example.com/index.php</strong>"
url=$(echo $line | sed -n '/strong>(http:\/\/.+)<\/strong/p')

I need "http://www.example.com/index.php" in the $url variable.

I am using BusyBox.

score 1 · Accepted Answer


This might work:

url=$(echo $line | sed -r 's/<strong>([^<]+)<\/strong>/\1/')

answered 2012-09-11T14:29:13.047

score 0 · Accepted Answer


url=$(echo $line | sed -n 's!<strong>\(http://[^<]*\)</strong>!\1!p')

answered 2012-09-11T14:29:13.920
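The attempt in the question prints nothing useful because /pattern/p only prints whole matching lines; it is the s command that rewrites a line down to the captured group. A minimal sketch of that idea, assuming a POSIX sed (basic regular expressions, so the group is written with backslashed parentheses). The function name extract_url is hypothetical, and the leading and trailing .* are an addition here so that text around the tags is dropped as well:

```shell
#!/bin/sh
# extract_url is a hypothetical helper name, not part of the original answers.
extract_url() {
    # -n suppresses sed's default output; the trailing p prints the line
    # only when the substitution actually matched.
    printf '%s\n' "$1" | sed -n 's!.*<strong>\(http://[^<]*\)</strong>.*!\1!p'
}

url=$(extract_url 'See <strong>http://www.example.com/index.php</strong> here')
echo "$url"      # http://www.example.com/index.php

none=$(extract_url 'no tags here')
echo "[$none]"   # [] because unmatched input produces no output
```

Using ! as the delimiter avoids having to escape the slashes in the URL and in the closing tag.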

score 0 · Accepted Answer

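You don't have to escape forward slashes with backslashes (unless / is also the delimiter of the sed command); only backslashes need to be escaped in a regular expression. You should also use non-greedy matching with the ? operator so that you don't capture too much when the HTML source contains multiple strong tags.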

strong>(http://.+?)</strong
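One caveat: the POSIX regular expressions used by sed have no lazy quantifiers, so +? is not available there. A sketch of the usual workaround, a negated character class, assuming a sed that accepts -E for extended regular expressions (-r is the older spelling of the same option in GNU and BusyBox sed). The a.example and b.example hosts are made up for illustration:

```shell
#!/bin/sh
# Two tagged URLs on one line; both hosts are placeholders.
line='<strong>http://a.example/</strong> and <strong>http://b.example/</strong>'

# Greedy: .+ is free to run past the first closing tag, so the capture
# can include markup and text between the two URLs.
greedy=$(printf '%s\n' "$line" | sed -E 's!<strong>(http://.+)</strong>.*!\1!')

# Negated class: [^<]+ cannot cross a "<", so the capture has to end
# at the first </strong>.
first=$(printf '%s\n' "$line" | sed -E 's!<strong>(http://[^<]+)</strong>.*!\1!')

echo "greedy: $greedy"
echo "first:  $first"
```

The second form should behave the same regardless of how many tags follow on the line, which is what the non-greedy suggestion is after.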

score 0 · Accepted Answer

Update: since busybox uses ash, solutions that assume bash features may not work. Something slightly longer but still POSIX-compliant will work:

url=${line#<strong>}  # $line minus the initial "<strong>"
url=${url%</strong>}  # Remove the trailing "</strong>"
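The two expansions above assume the string begins and ends exactly with the tags. If there may be text around them, a variation along the following lines might help. This is a sketch using only POSIX parameter expansion, and the sample sentence is made up:

```shell
#!/bin/sh
line='See <strong>http://www.example.com/index.php</strong> for details'

# "#" with a leading * removes the shortest prefix ending in <strong>.
url=${line#*"<strong>"}

# "%%" with a trailing * removes the longest suffix starting at </strong>.
url=${url%%"</strong>"*}

echo "$url"   # http://www.example.com/index.php
```

Quoting the tag inside the expansion keeps it literal, while the unquoted * remains a wildcard. With several tagged URLs in the string, this picks the first one.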

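If you are using bash (or another shell with similar features), you can combine extended pattern matching with parameter substitution. (I don't know which of these features busybox supports.)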

# Turn on extended pattern support
shopt -s extglob

# ?(\/) matches an optional forward slash; like /? in a regex
# Expand $line, but remove all occurrences of <strong> or </strong>
# from the expansion
url=${line//<?(\/)strong>}
