java - Trying to extract URLs from a website using Web Harvest

Question

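I am trying to extract the URLs of a website that has no sitemap. I am using the Web Harvest tool.

I know nothing about Java or coding. Could someone help me use this tool?

I want it to run against a specific website (for example, example.com) and extract every URL from that site.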

score 1 · Accepted Answer

Example.com isn't a great example, since it only has one link! :)

Here is my code, with some comments:

<?xml version="1.0" encoding="UTF-8"?>

<config>
        <!-- 1: provide inputs           -->  
        <script><![CDATA[
                url="http://stackoverflow.com/questions/17635763/trying-to-extract-urls-from-a-website-using-web-harvest";

                output_path = "C:/webharvest/"; 
                file_name = "urllist.txt";              
                output_file = output_path + file_name;                  

            ]]></script>

        <!-- 5 : save the resulting list in a variable       -->    
        <var-def name="urls">
            <!-- 4 : select only links (outputs a list variable)         -->    
            <xpath expression='//a/@href'>
                <!-- 3 : convert it to XML, for querying         --> 
                <html-to-xml>
                    <!-- 2 : load the page       -->  
                    <http url="${url}"/>
                </html-to-xml>
            </xpath>
        </var-def>

        <!-- 7: write to output file         -->  
        <file action="write" path="${output_file}">
            <!-- 6 : convert the list variable into a string with each link on a new line        -->  
            <text delimiter="${sys.cr}${sys.lf}">
            <var name="urls" />
            </text>
        </file>              

</config>
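
The config above writes each `href` value exactly as it appears in the page, so relative links such as `/about` stay relative and repeated links are listed more than once. As a rough sketch of one way to post-process that, here is plain Java using only the standard library (it does not use Web-Harvest; the class name `LinkExtractor` and method `extractLinks` are hypothetical, and the regex is only an approximation of the `//a/@href` XPath). It resolves every link against the page URL and drops duplicates:

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Web-Harvest.
class LinkExtractor {

    // Rough stand-in for the //a/@href XPath: captures the quoted href value of <a> tags.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns absolute, de-duplicated links in the order they first appear in the HTML.
    static List<String> extractLinks(String baseUrl, String html) {
        URI base = URI.create(baseUrl);
        Set<String> seen = new LinkedHashSet<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(2).trim();
            // Skip empty values, in-page anchors and non-navigational schemes.
            if (href.isEmpty() || href.startsWith("#")
                    || href.startsWith("javascript:") || href.startsWith("mailto:")) {
                continue;
            }
            try {
                // Turns "/about" into "http://host/about"; absolute links pass through unchanged.
                seen.add(base.resolve(href).toString());
            } catch (IllegalArgumentException e) {
                // Malformed href (for example, one containing spaces): ignore it.
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        String html = "<a href=\"/about\">About</a> "
                + "<a href='http://example.org/x'>X</a> "
                + "<a href=\"/about\">About again</a>";
        List<String> links = extractLinks("http://example.com/index.html", html);
        // One link per line, using the same CR+LF delimiter as the config.
        System.out.println(String.join("\r\n", links));
    }
}
```

To cover a whole site rather than a single page, the same step would have to be repeated for every link that points back to the same host; that loop is left out of this sketch.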

score 0 · Accepted Answer

You should browse the Web-Harvest user manual at http://web-harvest.sourceforge.net/manual.php, which contains several examples.
