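I've read several Stack Overflow posts about getting all the words between HTML tags, and they left me confused. Some people recommend regular expressions written for one specific tag, while others mention parsing techniques. I'm basically trying to build a web crawler. So far I download the HTML of a given link into a string, and I extract the links from that HTML, which is stored in my data string. Now I want to crawl deeper and extract the words on the pages behind all the links I extracted. I have two questions. First, how can I get the words on each web page while ignoring tags and JavaScript? Second, how would I crawl through the links recursively?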
This is how I get the HTML into a string:
// Requires: using System; using System.IO; using System.Net; using System.Text;
// `data` is assumed to be a string field of the enclosing class.
public void getting_html_code_of_link()
{
    string urlAddress = "http://google.com";
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(urlAddress);
    // The using blocks dispose the response and reader even if reading throws.
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    {
        if (response.StatusCode == HttpStatusCode.OK)
        {
            // Fall back to UTF-8 when the server declares no character set.
            Encoding encoding = string.IsNullOrEmpty(response.CharacterSet)
                ? Encoding.UTF8
                : Encoding.GetEncoding(response.CharacterSet);
            using (Stream receiveStream = response.GetResponseStream())
            using (StreamReader readStream = new StreamReader(receiveStream, encoding))
            {
                data = readStream.ReadToEnd();
            }
            Console.WriteLine(data);
        }
    }
}
And this is how I extract the link references from the page at the URL I gave:
// Requires: using System; using System.Text; using System.Text.RegularExpressions;
public void regex_ka_kaam()
{
    StringBuilder sb = new StringBuilder();
    // Capture only the URL inside href="..." instead of everything up to the next '>'.
    // The earlier pattern "http://.*?>" also swallowed the closing quote and bracket,
    // and the inner IsMatch check was always true, so every match was appended twice.
    Regex hrefs = new Regex("href\\s*=\\s*[\"'](https?://[^\"'>\\s]+)", RegexOptions.IgnoreCase);
    foreach (Match m in hrefs.Matches(data))
    {
        sb.Append(m.Groups[1].Value);
        sb.Append(" ");
    }
    Console.WriteLine(sb);
}
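To make the two questions concrete, here is a rough sketch of what I think the missing pieces could look like. It is only an assumption of one possible approach, not a tested solution: the names `CrawlerSketch`, `ExtractWords`, `ExtractLinks`, `Crawl`, `fetch`, and `onPage` are all made up for illustration. It strips `<script>`/`<style>` blocks and tags with regular expressions, which is crude (a real HTML parser such as HtmlAgilityPack would probably be more robust), and it will not see text that JavaScript adds to the page at runtime. The recursion is bounded by a depth limit and a set of visited URLs.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper class; none of these names come from the code above.
public static class CrawlerSketch
{
    // Drops <script>/<style> blocks and comments, then every remaining tag,
    // and returns the words that are left.
    public static List<string> ExtractWords(string html)
    {
        string text = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        text = Regex.Replace(text, @"<!--.*?-->", " ", RegexOptions.Singleline);
        text = Regex.Replace(text, @"<[^>]+>", " ");
        text = WebUtility.HtmlDecode(text); // turn &amp; and friends back into characters

        var words = new List<string>();
        foreach (Match m in Regex.Matches(text, @"[\p{L}\p{N}']+"))
            words.Add(m.Value);
        return words;
    }

    // Returns absolute http/https URLs found in <a href="...">, resolved against baseUri.
    public static List<string> ExtractLinks(string html, Uri baseUri)
    {
        var links = new List<string>();
        foreach (Match m in Regex.Matches(html, @"<a\b[^>]*?\bhref\s*=\s*[""']([^""'#]+)",
            RegexOptions.IgnoreCase))
        {
            Uri absolute;
            if (Uri.TryCreate(baseUri, m.Groups[1].Value, out absolute) &&
                (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps))
                links.Add(absolute.AbsoluteUri);
        }
        return links;
    }

    // Depth-limited recursive crawl. `url` is assumed to be an absolute URL.
    // `fetch` downloads a page; `onPage` receives each page's URL and words.
    public static void Crawl(string url, int depth, HashSet<string> visited,
        Func<string, string> fetch, Action<string, List<string>> onPage)
    {
        if (depth < 0) return;
        var uri = new Uri(url);
        if (!visited.Add(uri.AbsoluteUri)) return; // already crawled

        string html;
        try { html = fetch(uri.AbsoluteUri); }
        catch (Exception) { return; } // skip pages that fail to download

        onPage(uri.AbsoluteUri, ExtractWords(html));
        foreach (string link in ExtractLinks(html, uri))
            Crawl(link, depth - 1, visited, fetch, onPage);
    }
}
```

I pass the download step in as `fetch` so the HttpWebRequest code above (changed to return the string) could be plugged in; for a quick try, something like `CrawlerSketch.Crawl("http://google.com", 1, new HashSet<string>(), new WebClient().DownloadString, (u, w) => Console.WriteLine(u + ": " + w.Count + " words"))` should work. Keeping the depth small matters, because the number of pages grows very quickly with each level.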