regex - how to read only URL from txt file in MATLAB

Question

I have a text file containing multiple URLs along with other information about each URL. How can I read the text file and save only the URLs in an array so that I can download them? I want to use

C = textscan(fileId, formatspec);

What should I specify in formatspec as the format for a URL?

score 4 · Accepted Answer

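This is not a job for textscan; you should use regular expressions for this. In MATLAB, regular expressions are described here. For URLs, see also here or here for examples in other languages.

Here is an example in MATLAB: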

% This string is obtained through textscan or something
str = {...
    'pre-URL garbage http://www.example.com/index.php?query=test&otherStuf=info more stuff here'
    'other foolish stuff ftp://localhost/home/ruler_of_the_world/awesomeContent.py 1 2 3 4 misleading://';
};


% find URLs    
C = regexpi(str, ...
    ['((http|https|ftp|file)://|www\.|ftp\.)',...
    '[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]'], 'match');

C{:}

Result:

ans = 
    'http://www.example.com/index.php?query=test&otherStuf=info'
ans = 
    'ftp://localhost/home/ruler_of_the_world/awesomeContent.py'

Note that this regex requires you to include the protocol, or to have a leading www. or ftp. in the URL. Something like example.com/universal_remote.cgi?redirect= is not matched.
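To make that limitation concrete, here is a small sketch that runs the same pattern on two made-up strings (the surrounding words are only illustrative): one with a bare domain and one with a leading www.

```matlab
% Same pattern as in the example above
pattern = ['((http|https|ftp|file)://|www\.|ftp\.)', ...
    '[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]'];

% Bare domain: no protocol and no leading www. or ftp., so nothing matches
bare = regexpi('see example.com/universal_remote.cgi?redirect= here', ...
    pattern, 'match');

% A leading www. is enough, even without a protocol
www = regexpi('see www.example.com/page here', pattern, 'match');

disp(isempty(bare))   % true when the bare domain was not matched
disp(www)
```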

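You could go on making the regex cover more and more cases. However, eventually you'll stumble upon the most important conclusion (as reached here, for example, which is where I got my regex from): given the full definition of what precisely constitutes a valid URL, there is no single regex that will always match every valid URL. That is, you can dream up valid URLs that are not captured by any of the regexes shown.

But please keep in mind that this last statement is more theoretical than practical: those unmatchable URLs are valid, but not often encountered in practice :) In other words, if your URLs have a fairly standard form, you're pretty much covered by the regex I gave you.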
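Applied to the original question, the regex can be run on the whole file at once. The following is a minimal sketch, not a definitive implementation: the file name and its two lines are made up, and the file is written to a temporary location only so that the snippet is self-contained. For a single char input, regexpi with 'match' returns a flat cell array of matches, which is probably the "array of URLs" the question asks for.

```matlab
% Same pattern as in the example above
pattern = ['((http|https|ftp|file)://|www\.|ftp\.)', ...
    '[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]'];

% Hypothetical input file, created here only for illustration
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', 'pre-URL garbage http://www.example.com/index.php?query=test more stuff');
fprintf(fid, '%s\n', 'other stuff ftp://localhost/home/awesomeContent.py 1 2 3');
fclose(fid);

% Read the whole file into one char vector and extract every match
txt  = fileread(fname);
urls = regexpi(txt, pattern, 'match');   % 1-by-N cell array of char vectors

delete(fname);   % clean up the temporary file
celldisp(urls)
```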

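Now, I fooled around a bit with the Java suggestion by pm89. As I suspected, it is an order of magnitude slower than just a regex, because you introduce another "sticky layer" into the code (in my timings, it was about 40x slower, excluding the imports). Here's my version: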

import java.net.URL;
import java.net.MalformedURLException;

str = {...
    'pre-URL garbage http://www.example.com/index.php?query=test&otherStuf=info more stuff here'
    'pre--URL garbage example.com/index.php?query=test&otherStuf=info more stuff here'
    'other foolish stuff ftp://localhost/home/ruler_of_the_world/awesomeContent.py 1 2 3 4 misleading://';
};


% Attempt to convert each item into a URL.
for ii = 1:numel(str)    
    cc = textscan(str{ii}, '%s');
    for jj = 1:numel(cc{1})
        try
            url = java.net.URL(cc{1}{jj})

        catch ME
            % rethrow any non-URL-related errors
            if isempty(regexpi(ME.message, 'MalformedURLException'))
                rethrow(ME);
            end

        end
    end
end

Result:

url =
    'http://www.example.com/index.php?query=test&otherStuf=info'
url =
    'ftp://localhost/home/ruler_of_the_world/awesomeContent.py'

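I'm not really familiar with java.net.URL, but apparently it is also unable to find URLs without a leading protocol or standard domain (for example, example.com/path/to/page).

This snippet can undoubtedly be improved upon, but I would urge you to consider why you'd want to do this for a longer, inherently slower and far uglier solution :)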
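If you want to check the speed difference on your own machine, a rough timing sketch along these lines may help. It assumes MATLAB was started with Java support; the test string and the repetition count nRep are arbitrary choices, and the numbers will vary between machines and releases.

```matlab
% Pattern from the regex example above
pattern = ['((http|https|ftp|file)://|www\.|ftp\.)', ...
    '[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]'];
str  = 'pre-URL garbage http://www.example.com/index.php?query=test more stuff here';
nRep = 1000;   % arbitrary number of repetitions

% Regex approach
tic
for k = 1:nRep
    m = regexpi(str, pattern, 'match'); %#ok<NASGU>
end
tRegex = toc;

% java.net.URL approach: try every whitespace-separated token
tic
for k = 1:nRep
    words = strsplit(str);
    for jj = 1:numel(words)
        try
            u = java.net.URL(words{jj}); %#ok<NASGU>
        catch
            % token is not a URL according to java.net.URL; ignore it
        end
    end
end
tJava = toc;

fprintf('regex: %.4f s, java.net.URL: %.4f s, ratio: %.1f\n', ...
    tRegex, tJava, tJava / tRegex);
```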

score 3 · Accepted Answer

As I suspected, you can use java.net.URL, according to this answer.

To implement the same code in MATLAB:

First read the file into a string, using fileread for example:

str = fileread('Sample.txt');

Then split the text on whitespace using strsplit:

spl_str = strsplit(str);

Finally, use java.net.URL to detect the URLs:

for k = 1:length(spl_str)
    try
       url = java.net.URL(spl_str{k})
       % Store or save the URL contents here
    catch e
       % it's not a URL.
    end
end
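One possible way to fill in the "store" step is to collect the detected URLs in a cell array. This is only a sketch: the input text is made up and stands in for the result of fileread, and the variable name urls is an assumption for illustration.

```matlab
% Stand-in for the text returned by fileread (hypothetical content)
str = sprintf('junk http://www.example.com/a.html more\nftp://localhost/file.txt end');
spl_str = strsplit(str);

urls = {};   % detected URLs, stored as char vectors
for k = 1:numel(spl_str)
    try
        url = java.net.URL(spl_str{k});
        urls{end+1} = char(url); %#ok<AGROW>
    catch
        % java.net.URL rejected this token, so it is not treated as a URL
    end
end

disp(urls)
```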

You can write the URL contents to a file using urlwrite. But first convert the URL obtained from java.net.URL to char:

url = java.net.URL(spl_str{k});
urlwrite(char(url), 'test.html');
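To download several URLs rather than one, each needs its own output file. The sketch below uses websave, which newer MATLAB releases provide as the recommended replacement for urlwrite; the URL list, the page_NNN.html naming scheme and the doDownload switch are all assumptions for illustration. The switch is off by default so the snippet runs without network access.

```matlab
% Hypothetical list of URLs, e.g. collected in an earlier step
urls = {'http://www.example.com/index.php', 'http://www.example.com/other.html'};

doDownload = false;   % set to true to actually fetch the pages (needs network access)

outFiles = cell(size(urls));
for k = 1:numel(urls)
    % One numbered output file per URL (naming scheme is arbitrary)
    outFiles{k} = sprintf('page_%03d.html', k);
    if doDownload
        try
            websave(outFiles{k}, urls{k});
        catch err
            warning('Download failed for %s: %s', urls{k}, err.message);
        end
    end
end

disp(outFiles)
```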

Hope that helps.
