regex - Capture only the IP address using R

Question

I have R objects that contain domain names and IP addresses. For example:

11.22.44.55.test.url.com.localhost

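I am using a regular expression in R to capture the IP address. My problem is that when there is no match, the whole input string is returned as the output. This becomes a problem when I work with a very large data set. I currently use the following regex call: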

sub("([0-9]+)\\.([0-9]+)\\.([0-9]+)\\.([0-9]+).*","\\1.\\2.\\3.\\4","11.22.44.55.test.url.com.localhost")

This gives me:

11.22.44.55

But if I have the following:

sub("([0-9]+)\\.([0-9]+)\\.([0-9]+)\\.([0-9]+).*","\\1.\\2.\\3.\\4","11.22.44.test.url.com.localhost")

then it gives me:

11.22.44.test.url.com.localhost

which is not what I want, because the string contains no IP address. Is there a way to solve this?

score 3 · Accepted Answer

You can preprocess with grep to keep only the strings that are formatted the way you want, then use gsub on those strings.

x <- c("11.22.44.55.test.url.com.localhost", "11.22.44.test.url.com.localhost")
gsub("((\\d+\\.){3}\\d+)(.*)", "\\1",  grep("(\\d+\\.){4}", x, value=TRUE))
#[1] "11.22.44.55"

score 1 · Accepted Answer

Actually, your code is working as designed. When sub() fails to match, it returns the original string. From the manual:

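For sub and gsub return a character vector of the same length and with the same attributes as x (after possible coercion to character). Elements of character vectors x which are not substituted will be returned unchanged (including any declared encoding). If useBytes = FALSE a non-ASCII substituted result will often be in UTF-8 with a marked encoding (e.g., if there is a UTF-8 input, and in a multibyte locale unless fixed = TRUE). Such strings can be re-encoded by enc2native.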

Emphasis added.
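Since sub() hands non-matching elements back untouched, one way to get a clean result is to test for a match first and substitute only where one exists. Here is a minimal sketch in base R; the helper name extract_ip and the choice of NA for non-matching strings are my own assumptions, not something from the question.

```r
# Hypothetical helper (not from the original post): return the leading
# dotted quad of each string, or NA when the string does not start with one.
extract_ip <- function(x) {
  pattern <- "^((\\d+\\.){3}\\d+).*"
  # sub() alone would return non-matching strings unchanged, so use
  # grepl() to decide which elements get the substituted value.
  ifelse(grepl(pattern, x), sub(pattern, "\\1", x), NA_character_)
}

x <- c("11.22.44.55.test.url.com.localhost",
       "11.22.44.test.url.com.localhost")
extract_ip(x)
# [1] "11.22.44.55" NA
```

Unlike filtering with grep(..., value = TRUE), this keeps the result the same length as the input, which can be convenient when the values live in a data frame column.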

score 0 · Accepted Answer

You can try this pattern:

(?:\d{1,3}+\.){3}+\d{1,3}

I have tested it in Java:

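import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IpExtractor {
    // Three groups of 1-3 digits followed by a dot, then a final 1-3 digits.
    static final Pattern p = Pattern.compile("(?:\\d{1,3}+\\.){3}+\\d{1,3}");

    public static void main(String[] args) {
        final String s1 = "11.22.44.55.test.url.com.localhost";
        final String s2 = "11.24.55.test.url.com.localhost";
        System.out.println(getIps(s1));
        System.out.println(getIps(s2));
    }

    // Collects every substring that matches the pattern; empty list if none.
    public static List<String> getIps(final String string) {
        final Matcher m = p.matcher(string);
        final List<String> strings = new ArrayList<>();
        while (m.find()) {
            strings.add(m.group());
        }
        return strings;
    }
}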

Output:

[11.22.44.55]
[]
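For R users, the same idea (return the matches themselves instead of substituting) can be sketched with base R's regmatches() and gregexpr(). This is my own port and was not part of the Java test above; perl = TRUE is assumed to be needed so that the possessive quantifiers in the pattern are accepted.

```r
x <- c("11.22.44.55.test.url.com.localhost",
       "11.24.55.test.url.com.localhost")

# Same pattern as in the Java code, written as an R string literal.
pattern <- "(?:\\d{1,3}+\\.){3}+\\d{1,3}"

# gregexpr() locates every match in each string; regmatches() extracts them.
# The result is a list with one character vector per input element.
ips <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
ips
# [[1]]
# [1] "11.22.44.55"
#
# [[2]]
# character(0)
```

Strings without an IP address yield an empty character vector rather than the original string, which mirrors the empty list in the Java output.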

score 0 · Accepted Answer

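Look at the gsubfn or strapply functions in the gsubfn package. When you want to return the match rather than replace it, these functions work better than sub.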
