java - Removing URLs from text using Java

Question

How can I remove the URLs present in a text such as this example

String str="Fear psychosis after #AssamRiots - http://www.google.com/LdEbWTgD http://www.yahoo.com/mksVZKBz";

using a regular expression?

I want to remove all URLs in the text, but it is not working. My code is:

String pattern = "(http(.*?)\\s)";
Pattern pt = Pattern.compile(pattern);
Matcher namemacher = pt.matcher(input);
if (namemacher.find()) {
  str=input.replace(namemacher.group(0), "");
}

score 22 · Accepted Answer

For an input String that contains URLs:

private String removeUrl(String commentstr)
    {
        String urlPattern = "((https?|ftp|gopher|telnet|file|Unsure|http):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
        Pattern p = Pattern.compile(urlPattern,Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(commentstr);
        int i = 0;
        while (m.find()) {
            commentstr = commentstr.replaceAll(m.group(i),"").trim();
            i++;
        }
        return commentstr;
    }

score 5 · Accepted Answer

Well, you haven't provided any information about your text, so assuming it looks like this: "Some text here http://www.example.com some text there", you can do this:

String yourText = "blah-blah";
String cleartext = yourText.replaceAll("http.*?\\s", " ");

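This removes every sequence that starts with "http" up to and including the first whitespace character, replacing it with a single space.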
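
One caveat with a pattern that ends in \\s: a URL at the very end of the text has no trailing whitespace, so it is left in place, which is likely why the last link in the question's example survives. A minimal sketch that matches up to the next whitespace or the end of the input instead; the class and method names are made up for illustration, and the pattern is a simplification rather than a full URL grammar:

```java
class HttpUrlRemover {
    // Hypothetical helper: removes http/https URLs, including one at the end of the text.
    static String removeHttpUrls(String text) {
        // "https?://\\S+" matches the scheme and then every non-whitespace character,
        // so it does not depend on a trailing space being present.
        return text.replaceAll("https?://\\S+", "").trim();
    }

    public static void main(String[] args) {
        String str = "Fear psychosis after #AssamRiots - http://www.google.com/LdEbWTgD http://www.yahoo.com/mksVZKBz";
        System.out.println(removeHttpUrls(str));
    }
}
```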

You should read the Javadoc for the String class. It will make things clear.

score 4 · Accepted Answer

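How do you define a URL? You may want to filter not only http://, but also https:// and other protocols such as ftp://, rss://, or custom protocols.

Maybe this regular expression will do the job: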

[\S]+://[\S]+

Explanation:

one or more non-whitespace characters
followed by the string "://"
followed by one or more non-whitespace characters
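
Applied in Java, that expression might be used as in the sketch below. The class and method names are hypothetical, and the second replaceAll (collapsing leftover runs of whitespace) is an addition not present in the answer. Because \S+ is greedy on both sides, any punctuation glued to the URL, such as a surrounding parenthesis, is removed along with it.

```java
class AnySchemeUrlRemover {
    // Hypothetical helper: removes any "scheme://rest" token, whatever the scheme is.
    static String removeUrls(String text) {
        return text
                .replaceAll("\\S+://\\S+", "")   // drop every token that contains "://"
                .replaceAll("\\s{2,}", " ")      // collapse the gaps left behind
                .trim();
    }

    public static void main(String[] args) {
        System.out.println(removeUrls("see ftp://ideone.com/K3Cut and https://example.com/x now"));
    }
}
```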

score 4 · Accepted Answer

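Note that if your URL contains characters such as & and \, the answer above will not work, because replaceAll cannot handle those characters. What worked for me was to remove those characters in a new string variable, then remove the same characters from the result of m.find(), and call replaceAll on the new string variable.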

private String removeUrl(String commentstr)
{
    // rid of ? and & in urls since replaceAll can't deal with them
    String commentstr1 = commentstr.replaceAll("\\?", "").replaceAll("\\&", "");

    String urlPattern = "((https?|ftp|gopher|telnet|file|Unsure|http):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
    Pattern p = Pattern.compile(urlPattern,Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(commentstr);
    int i = 0;
    while (m.find()) {
        commentstr = commentstr1.replaceAll(m.group(i).replaceAll("\\?", "").replaceAll("\\&", ""),"").trim();
        i++;
    }
    return commentstr;
}
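
The underlying reason is that String.replaceAll treats its first argument as a regular expression, so a character such as ? inside a matched URL is read as regex syntax rather than as literal text. A possible alternative to stripping those characters from the whole comment is to quote the match with Pattern.quote, which leaves the rest of the text untouched. This is a sketch under assumptions: the URL pattern is a simplified stand-in and the class name is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiteralUrlRemover {
    // Simplified stand-in pattern, not the one from the answer.
    private static final Pattern URL =
            Pattern.compile("(https?|ftp)://\\S+", Pattern.CASE_INSENSITIVE);

    static String removeUrl(String commentstr) {
        Matcher m = URL.matcher(commentstr);
        String result = commentstr;
        while (m.find()) {
            // Pattern.quote makes the matched URL a literal pattern,
            // so '?', '&', '(' and similar characters lose their regex meaning.
            result = result.replaceAll(Pattern.quote(m.group(0)), "");
        }
        return result.trim();
    }
}
```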

score 0 · Accepted Answer

m.group(0) should be replaced with an empty string, rather than m.group(i) where i is incremented on every call to m.find(), as is done in one of the answers above.

private String removeUrl(String commentstr)
{
    String urlPattern = "((https?|ftp|gopher|telnet|file|Unsure|http):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
    Pattern p = Pattern.compile(urlPattern,Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(commentstr);
    StringBuffer sb = new StringBuffer(commentstr.length());
    while (m.find()) {
        m.appendReplacement(sb, "");
    }
    // Copy the text after the last match; without this the tail is lost.
    m.appendTail(sb);
    return sb.toString();
}
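
For reference, an appendReplacement loop that always appends an empty replacement, followed by appendTail, produces the same result as Matcher.replaceAll(""). A shorter sketch on that basis, using a simplified stand-in pattern rather than the one above and a hypothetical class name:

```java
import java.util.regex.Pattern;

class MatcherReplaceAllExample {
    // Compiled once and reused; simplified stand-in for the longer pattern.
    private static final Pattern URL =
            Pattern.compile("(https?|ftp|file)://\\S+", Pattern.CASE_INSENSITIVE);

    static String removeUrl(String commentstr) {
        // Replaces every match with the empty string in one pass.
        return URL.matcher(commentstr).replaceAll("").trim();
    }
}
```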

score 0 · Accepted Answer

As @Ev0oD mentioned, the code works fine except for the following tweet that I was processing: RT @_Val83_: The cast of #ThorRagnarok playing "Ragnarok Paper Scissors" #TomHiddleston #MarkRuffalo (https://t.co /k9nYBu3QHu)

At the point where the token is removed: commentstr = commentstr.replaceAll(m.group(i),"").trim();

I ran into the following error:

java.util.regex.PatternSyntaxException: Unmatched closing ')' near index 22

where m.group(i) is https://t.co /k9nYBu3QHu)
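
This kind of failure can be reproduced in isolation: when the matched text ends with an unbalanced ")", replaceAll tries to compile it as a regex and throws. The sketch below uses a made-up URL and hypothetical names, and shows String.replace as one way around it, since replace treats its first argument as literal text.

```java
import java.util.regex.PatternSyntaxException;

class UnbalancedParenExample {
    // Returns true if replaceAll rejects the token because it is not a valid regex.
    static boolean failsAsRegex(String text, String token) {
        try {
            text.replaceAll(token, "");
            return false;
        } catch (PatternSyntaxException e) {
            return true;
        }
    }

    // String.replace does no regex parsing, so the ")" is harmless here.
    static String removeLiteral(String text, String token) {
        return text.replace(token, "").trim();
    }

    public static void main(String[] args) {
        String text = "(see http://example.com/a)";   // made-up input
        String token = "http://example.com/a)";       // match that swallowed the ")"
        System.out.println(failsAsRegex(text, token));
        System.out.println(removeLiteral(text, token));
    }
}
```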

score -3 · Accepted Answer

If you are able to use Python instead, you may find a better solution with code like this:

import re
text = "<hello how are you ?> then ftp and mailto and gopher and file ftp://ideone.com/K3Cut rthen you "
result = re.sub(r"ftp\S+", "", text)
print(result)
