java - How to separate lines when using the html() method in Jsoup

Question

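I ran into a problem when capturing elements by tag in Jsoup. The string returned by links.html(), as used in String crawlingNode = links.html();, is written to the .txt file as one continuous string, with no spaces or line separators. In the console, however, the links are shown one per line. So I need to ask: is there a way to use the html() method so that the links are also separated one per line in the .txt file? It makes no sense to me that the returned value shows up separated on the console while I cannot get the same result in the .txt file.

PS: Sorry for not giving a shorter version, but the code is fully runnable. Focus on the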

Elements links = doc.getElementsByTag("cite");
String crawlingNode = links.html();
crawlingNode = crawlingNode.replaceAll("(?=<).*?(>=?)", ""); // Remove undesired HTML tags
System.out.println(crawlingNode);
httptest.WriteOnFile(writer, crawlingNode);

part, which contains the problem I want to solve. Thanks in advance!

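import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;
import javax.swing.JOptionPane;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class httptest {

    static File file;
    File folder = null;
    String crawlingNode, date, timeZone, Tag = "Google Node";
    static BufferedWriter writer = null;
    static httptest ht;

    public httptest() throws IOException {
        // Output folder and file are named after the current date and time zone
        date = new SimpleDateFormat("yyyy.MM.dd hh-mm-ss").format(new Date());
        folder = new File("queries/downloads/" + date + " " + TimeZone.getDefault().getDisplayName());
        file = new File(folder.getPath() + "\\" + date + " " + Tag + ".txt");
        folder.mkdir();
    }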

    private void GetLinks() throws IOException {

        Document doc = Jsoup.connect("http://google.com/search?q=mamamia")
                .userAgent("Mozilla/5.0 (X11; U; Linux x86_64; en-GB; rv:1.8.1.6) Gecko/20070723 Iceweasel/2.0.0.6 (Debian-2.0.0.6-0etch1)")
                .cookie("auth", "token")
                .timeout(3000)
                .get();

        Elements links = doc.getElementsByTag("cite");
        String crawlingNode = links.html();
        crawlingNode = crawlingNode.replaceAll("(?=<).*?(>=?)", ""); // Remove undesired HTML tags
        System.out.println(crawlingNode);
        httptest.WriteOnFile(writer, crawlingNode);
    }


    private static void OpenWriter(File file) {
        try {
            writer = new BufferedWriter(new FileWriter(file));
        } catch (IOException e) {
            JOptionPane.showMessageDialog(null, "Failed to open URL Writer");
            e.printStackTrace();
        }
    }

    private static void WriteOnFile(BufferedWriter writer, String crawlingNode) {
        try {
            writer.write(crawlingNode);
        } catch (IOException e) {
            JOptionPane.showMessageDialog(null, "Failed to write URL Node");
            e.printStackTrace();
        }
    }


    private static void CloseWriter(BufferedWriter writer) {
        try {
            writer.close();
        } catch (IOException e) {
            JOptionPane.showMessageDialog(null, "Unable to close URL Writer");
            System.err.println(e);
        }
    }

    public static void main(String[] args) throws IOException {
        ht = new httptest();
        httptest.OpenWriter(file);
        ht.GetLinks();
        httptest.CloseWriter(writer);
    }

}

score 1 · Accepted Answer

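Replacing parts of the string is not an effective solution. Instead, we need to build a separate string for each element and use the text() method to retrieve its link text. In any case, the code that worked for me is the following: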

Elements links = doc.getElementsByTag("cite");
String crawlingNode = links.html();
crawlingNode = crawlingNode.replaceAll("(?=<).*?(>=?)", ""); // Remove undesired HTML tags

for (Element link : links) {
    // text() returns the element's text without markup; append a separator per link
    String linkText = link.text() + System.lineSeparator();
    System.out.println(linkText);
    httptest.WriteOnFile(writer, linkText);
}
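As a rough sketch of the same idea that runs without Jsoup or network access: the list below stands in for the values you would get from link.text(), and BufferedWriter.newLine() writes the platform's line separator after each entry. The class name LinkLineWriter, the method writeLines and the sample URLs are made up for illustration and are not part of the original code.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;

public class LinkLineWriter {

    // Writes each entry followed by the platform line separator.
    public static void writeLines(BufferedWriter writer, List<String> linkTexts) throws IOException {
        for (String text : linkTexts) {
            writer.write(text);
            writer.newLine(); // "\r\n" on Windows, "\n" on Unix-like systems
        }
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical values standing in for the result of link.text()
        List<String> linkTexts = Arrays.asList("example.com/first", "example.com/second");

        // A StringWriter keeps the sketch self-contained; a FileWriter would be used for a real file
        StringWriter out = new StringWriter();
        try (BufferedWriter writer = new BufferedWriter(out)) {
            writeLines(writer, linkTexts);
        }
        System.out.print(out);
    }
}
```

In the question's code the same effect should be reachable by calling writer.newLine() after writer.write(...) inside WriteOnFile, though I have only tried the standalone version above.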

score 1 · Accepted Answer

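The lines in crawlingNode are separated by the Unix line separator \n. Windows uses \r\n, so you will not see the line breaks in, for example, Notepad. You can either use a different editor or replace the separators.

crawlingNode = crawlingNode.replace("\n", System.getProperty("line.separator"));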
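A slightly more defensive variant of that replacement, as a sketch only: it matches both \r\n and a bare \n, so text that already contains Windows line endings does not end up with doubled \r characters. The class name LineEndingNormalizer, the method normalize and the sample string are assumptions made for this example.

```java
import java.util.regex.Matcher;

public class LineEndingNormalizer {

    // Replaces every "\r\n" or bare "\n" with the given separator.
    public static String normalize(String text, String separator) {
        // quoteReplacement guards against '$' or '\' in the separator being treated specially
        return text.replaceAll("\\r?\\n", Matcher.quoteReplacement(separator));
    }

    public static void main(String[] args) {
        // Hypothetical content, shaped like what links.html() returns: entries joined by "\n"
        String crawlingNode = "example.com/first\nexample.com/second";

        // Convert to the separator of the platform the program runs on
        String forFile = normalize(crawlingNode, System.lineSeparator());
        System.out.println(forFile);
    }
}
```

Passing the result of normalize(...) to WriteOnFile instead of the raw crawlingNode should give a file that Notepad displays line by line.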

score 0 · Accepted Answer

You might want to try adding a for statement to go through the elements one at a time.

for (Element link : links) {
    String crawlingNode = link.html();
    crawlingNode = crawlingNode.replaceAll("(?=<).*?(>=?)", ""); // Remove undesired HTML tags
    System.out.println(crawlingNode);
    httptest.WriteOnFile(writer, crawlingNode);
}

That said, I'm not 100% sure the .html() method works on a single element. You will have to try it yourself. Let me know how it goes.
