java - I need to write a Java program that prints certain content from a website (such as the title), but I need to strip out the tags

Question

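My main problem is parsing the website's content in my program. So far I have gotten it to print out the page source. Also, if the URL the user enters does not contain "http://", I need to add it. I really don't understand how to parse strings.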

import java.net.*;
import java.io.*;
import java.util.Scanner;

public class Project6 {
  public static void main(String[] args) throws Exception {

    Scanner sc = new Scanner(System.in);
    System.out.print("Please enter the URL. ");
    String web = sc.nextLine();
    String foo = "http://allrecipes.com/";

    // does "web" have an allrecipes.com url?
    // if it doesn't, then exit
    if (web.equals(foo)) {
      StringBuilder s = new StringBuilder();
      URL recipes = new URL(web);
      BufferedReader in = new BufferedReader(new InputStreamReader(recipes.openStream()));

      String inputLine;

      // print the raw page source line by line
      while ((inputLine = in.readLine()) != null)
        System.out.println(inputLine);
      in.close();
    } else {
      System.out.println("I'm sorry, but that is not a valid allrecipes.com URL.");
      System.exit(0);
      // does "web" start with "http://"
      // if it doesn't, add it
    }
  }
}

score 1 · Accepted Answer

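Parsing HTML yourself is not a good idea. I recommend the jsoup library, which makes parsing a page and selecting elements much easier.

With jsoup, your code could look something like this: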

Document doc = Jsoup.connect(web).get();
Elements title = doc.select("title");

It is concise and readable, and if needed you can easily parse/select other elements (for example, with more complex CSS selectors such as #recipes > div #recipe-title).

score 0 · Accepted Answer

You are looking for a web crawler. Just to name a few: JSoup and Selenium (both use CSS selectors to retrieve elements), and crawler4j (which I have not used).

score 0 · Accepted Answer

In that case, your if condition should be:

if (web.equals(foo) || web.equals(foo.replace("http://", ""))) {

}

The test above passes if web equals

http://allrecipes.com/

or

allrecipes.com/

As a side note: I guess the trailing / in http://allrecipes.com/ is not necessary.
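The two-way comparison above can also be folded into one normalization step, which covers the "add http:// if it is missing" part of the question as well. Here is a minimal sketch, assuming a hypothetical helper class `UrlNormalizer` (not part of the original code) that prepends the scheme when it is absent and drops one trailing slash before comparing:

```java
public class UrlNormalizer {
    // Hypothetical helper: prepend "http://" when no scheme is present
    // and remove a single trailing slash so both spellings compare equal.
    public static String normalize(String input) {
        String url = input.trim();
        if (!url.startsWith("http://") && !url.startsWith("https://")) {
            url = "http://" + url;
        }
        if (url.endsWith("/")) {
            url = url.substring(0, url.length() - 1);
        }
        return url;
    }

    public static void main(String[] args) {
        String foo = "http://allrecipes.com"; // no trailing slash
        System.out.println(normalize("allrecipes.com/").equals(foo));       // true
        System.out.println(normalize("http://allrecipes.com").equals(foo)); // true
        System.out.println(normalize("example.com").equals(foo));           // false
    }
}
```

With this, the condition shrinks to a single `normalize(web).equals(foo)`, and the normalized string is the one to pass on to `new URL(...)`.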

score 0 · Accepted Answer

Match the input against `foo`:

Scanner sc = new Scanner(System.in);
System.out.print("Please enter the URL. ");
String web = sc.nextLine(); // suppose the user types "allrecipes.com"
String foo = "http://allrecipes.com"; // no trailing / needed, unlike http://allrecipes.com/

// does "web" have an allrecipes.com url?
// if it doesn't, then exit
// equals() compares literally; matches() would treat the user's input as a regex
if (foo.equals(web) || foo.equals("http://" + web)) {
 ..........
}

In the case above, the program proceeds only if the user has entered allrecipes.com or http://allrecipes.com.
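An exact comparison like the one above accepts only the home page. If the goal is to accept any page on the site (the question's comment asks whether the input "has an allrecipes.com url"), one option is to compare the host instead of the whole string. Below is a rough sketch using the standard `java.net.URI` class; the class name `HostCheck` and the method `isAllrecipesUrl` are made up for illustration, and accepting subdomains such as www.allrecipes.com is an assumption about what should count as valid:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

public class HostCheck {
    // Hypothetical helper: true when the URL's host is allrecipes.com
    // or a subdomain of it (assumption: subdomains count as valid).
    public static boolean isAllrecipesUrl(String input) {
        String url = input.trim();
        // add the scheme first so URI can locate the host part
        if (!url.startsWith("http://") && !url.startsWith("https://")) {
            url = "http://" + url;
        }
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return false;
            }
            host = host.toLowerCase(Locale.ROOT);
            return host.equals("allrecipes.com") || host.endsWith(".allrecipes.com");
        } catch (URISyntaxException e) {
            // input that cannot be parsed as a URI is treated as invalid
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isAllrecipesUrl("allrecipes.com"));
        System.out.println(isAllrecipesUrl("http://evil.com/allrecipes.com"));
    }
}
```

Checking the host rather than using `contains("allrecipes.com")` avoids accepting URLs that merely mention the domain somewhere in their path.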
