
I am looking for a way to extract basic stats (total count, density, count in links, hrefs) for words on an arbitrary website, ideally a Python-based solution.

While it is easy to parse a specific website using, say, BeautifulSoup and determine where the bulk of the content is, it requires you to define the location of the content in the DOM tree ahead of processing. This is easy for, say, hrefs or any arbitrary tag, but it gets more complicated when determining where the rest of the data (not enclosed in well-defined markers) is.
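For what it's worth, here is a minimal sketch of that BeautifulSoup approach in Python (the URL is a placeholder; it assumes requests and beautifulsoup4 are installed): strip script/style tags, then pull the visible text, link text, and hrefs without knowing the DOM layout ahead of time:

    # Sketch: word counts, link-text counts, and hrefs from an arbitrary page.
    import requests
    from bs4 import BeautifulSoup
    from collections import Counter

    html = requests.get("http://example.com").text
    soup = BeautifulSoup(html, "html.parser")

    # Drop script/style so their contents don't pollute the word counts.
    for tag in soup(["script", "style"]):
        tag.decompose()

    words = soup.get_text().lower().split()
    link_words = " ".join(a.get_text() for a in soup.find_all("a")).lower().split()
    hrefs = [a["href"] for a in soup.find_all("a", href=True)]

    print(Counter(words).most_common(10), len(link_words), len(hrefs))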

If I understand correctly, robots used by the likes of Google (GoogleBot?) are able to extract data from any website to determine keyword density. My scenario is similar: obtain the info related to the words that define what the website is about (i.e. after removing js, links and filler).
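To make "density" concrete: it is just occurrences of a word divided by the total number of content words. A hedged sketch, reusing words and Counter from the snippet above and a deliberately tiny hand-rolled stopword list (a real pipeline would use something like NLTK's):

    # Sketch: keyword density = count(word) / total content words.
    STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

    def keyword_density(words):
        content = [w for w in words if w.isalpha() and w not in STOPWORDS]
        counts = Counter(content)
        return {w: c / len(content) for w, c in counts.items()}

    for w, d in sorted(keyword_density(words).items(), key=lambda kv: -kv[1])[:10]:
        print(f"{w}: {d:.2%}")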

My question is: are there any libraries or web APIs that would allow me to get statistics on meaningful words from any given page?


2 Answers


There are no APIs that I know of, but there are a few libraries you could use as tools.

You should count the meaningful words yourself and keep a record of them over time.

You can also start from something like this (C#):

    // Count occurrences of a word in the downloaded page, not in the URL string.
    // Requires: using System.Text.RegularExpressions;
    string link = "http://www.website.com/news/Default.asp";
    string itemToSearch = "Word";

    using (var client = new System.Net.WebClient())
    {
        string html = client.DownloadString(link);
        int count = Regex.Matches(html, Regex.Escape(itemToSearch)).Count;
        MessageBox.Show(count.ToString());
    }
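Since the question asks for Python, roughly the same idea there (again counting matches in the downloaded page; the URL and search word are placeholders):

    # Python equivalent: download the page, count occurrences of a word.
    import re
    import requests

    link = "http://www.website.com/news/Default.asp"
    item_to_search = "Word"

    html = requests.get(link).text
    print(len(re.findall(re.escape(item_to_search), html)))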
Answered 2013-03-30T13:36:39.470

There are multiple libraries that deal with more advanced processing of web articles; this question should be a duplicate of this one.
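As one example of such a library, a minimal sketch with newspaper3k (pip install newspaper3k), which extracts the main article text and drops navigation, scripts, and other boilerplate; the URL is a placeholder:

    # Sketch: pull the main article text, then compute word stats on it.
    from collections import Counter
    from newspaper import Article

    article = Article("http://www.website.com/news/Default.asp")
    article.download()
    article.parse()

    words = article.text.lower().split()
    print(len(words), Counter(words).most_common(10))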

Answered 2015-07-24T09:20:20.647