
I need to store a distinct URL for an external webpage

I need to put the URL into the database. I don't want to store the same page twice so I need to strip all fluff off the URL.

# if I have
url_1 = "http://scientificamerican.com/royal-baby/?utm_campaign=promo"

# and
url_2 = "http://scientificamerican.com/royal-baby/?utm_source=email"

# then they should map to:
url_canonical = "http://scientificamerican.com/royal-baby/"

...it's not as simple as just stripping query parameters though

In order to get a single canonical URL regardless of what was on it I tried stripping the query string. The problem is that there are still CMSs which use the query string.

e.g.

url_1 = "https://www.scientificamerican.com/article.cfm?id=obama-budget"

# strip the query string and it becomes
url_1 = "https://www.scientificamerican.com/article.cfm"

# which is obviously the same for all articles :(

Is there any Rails tool for getting a page's canonical URL?

This is obviously a problem that a number of people have had to solve, not least the search engines. How do you reduce the URL down such that all that remains is the data for the page?


1 Answer


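You can't. There is no way to know which query parameters are needed to tell one URL from another. You can obviously strip a number of known parameters deliberately (e.g. utm_campaign and friends), but not all of them.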
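As a rough sketch of the "strip the known ones" part, something like the following could work using only Ruby's standard `uri` library. The `strip_tracking_params` name and the `TRACKING_PARAMS` pattern are made up for illustration, and the pattern only covers the `utm_` family from the question; you would have to extend it yourself.

```ruby
require "uri"

# Hypothetical denylist: only the utm_* family mentioned in the question.
TRACKING_PARAMS = /\Autm_/

# Removes query parameters whose name matches TRACKING_PARAMS and
# keeps everything else (e.g. "id") untouched.
def strip_tracking_params(url)
  uri = URI.parse(url)
  return url if uri.query.nil?

  kept = URI.decode_www_form(uri.query).reject { |key, _| key.match?(TRACKING_PARAMS) }
  # Drop the "?" entirely when nothing is left.
  uri.query = kept.empty? ? nil : URI.encode_www_form(kept)
  uri.to_s
end
```

With this, both `royal-baby` URLs from the question collapse to the same string, while `article.cfm?id=obama-budget` keeps its `id`. It still cannot tell you whether some unknown parameter matters, which is the real problem.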

Your best bet is to load the page's HTML and look for the canonical link element. If it is present, you have your canonical URL.

http://en.wikipedia.org/wiki/Canonical_link_element
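A minimal sketch of that lookup, assuming you already have the page's HTML as a string. The `canonical_url_from_html` helper is hypothetical, and it uses a regex scan purely to stay within the standard library; a real HTML parser would be more robust against unusual markup.

```ruby
require "uri"

# Returns the href of the first <link rel="canonical"> found in html,
# resolved against page_url, or nil if the page does not declare one.
# Regex-based on purpose (stdlib only); this is not a full HTML parser.
def canonical_url_from_html(html, page_url)
  html.scan(/<link\b[^>]*>/i).each do |tag|
    next unless tag =~ /\brel\s*=\s*["']?canonical["']?/i

    href = tag[/\bhref\s*=\s*["']([^"']+)["']/i, 1]
    # href may be relative, so resolve it against the page's own URL.
    return URI.join(page_url, href).to_s if href
  end
  nil
end
```

When this returns `nil` the page gives you no hint, and you are back to falling back on the URL you fetched (optionally with the known tracking parameters removed).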

answered 2013-07-24T16:05:53.093