Google supports wildcards in robots.txt. The following directive in robots.txt will prevent Googlebot from crawling any page that has any parameters:
Disallow: /*?
That won't stop many other spiders from crawling those URLs, because wildcards are not part of the standard robots.txt specification.
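For reference, a complete robots.txt using that rule might look something like this (a minimal sketch; the example URLs below reuse your maxi-dress page and the product_type parameter, and the parameter value is only illustrative):

User-agent: Googlebot
Disallow: /*?

# Blocked:  http://www.site.com/shop/maxi-dress?product_type=sale
# Allowed:  http://www.site.com/shop/maxi-dress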
Google may take a while to remove URLs you have blocked from its search index, and the extra URLs could remain indexed for months. You can speed that up by using the "Remove URLs" feature in Webmaster Tools once the URLs are blocked, but it is a manual process in which you have to paste in each individual URL that you want removed.
Using this robots.txt rule may also hurt your site's Google rankings if Googlebot doesn't find the version of the URL without parameters. If you commonly link to the versions with parameters, you probably don't want to block them in robots.txt. It would be better to use one of the other options below.
A better option is to use the rel="canonical" link tag on each of your pages.
So both your example URLs would have the following in the head section:
<link rel="canonical" href="http://www.site.com/shop/maxi-dress">
That tells Googlebot not to index all the variations of the page, but only the "canonical" version of the URL that you choose. Unlike blocking with robots.txt, Googlebot will still be able to crawl all your pages and assign value to them, even when they use a variety of URL parameters.
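To make that concrete, here is a sketch assuming one of your URLs is a parameterized version of the maxi-dress page (the product_type value is only an example):

<!-- In the head of http://www.site.com/shop/maxi-dress?product_type=sale -->
<link rel="canonical" href="http://www.site.com/shop/maxi-dress">

Google then treats the parameterized URL as a duplicate and consolidates its indexing and ranking signals onto the clean URL.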
Another option is to log into Google Webmaster Tools and use the "URL Parameters" feature that is in the "Crawl" section.
Once there, click on "Add parameter". You can set "product_type" to "Does not affect page content" so that Google doesn't crawl and index pages with that parameter.
Do the same for each of the parameters that you use that don't change the page.