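I'm building a Dockerised record-and-playback system to help me record websites, so that I can design scrapers against a local copy rather than the live site. This means I don't flood a website with automated requests, and it has the added advantage that I don't need a network connection to work.
Internally I'm using the Java-based WireMock, which records from a queue of site fetches made with Wget. I'm using the WireMock API to read various pieces of information from the mappings it records.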
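However, I've noticed from a mapping response that domain information does not seem to be recorded (except where it happens to appear in the response headers). See the following response from __admin/mappings: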
{
  "result": {
    "ok": true,
    "list": [
      {
        "id": "794d609f-99b9-376d-b6b8-04dab161c023",
        "uuid": "794d609f-99b9-376d-b6b8-04dab161c023",
        "request": {
          "url": "/robots.txt",
          "method": "GET"
        },
        "response": {
          "status": 404,
          "bodyFileName": "body-robots.txt-j9qqJ.txt",
          "headers": {
            "Server": "nginx/1.0.15",
            "Date": "Wed, 04 Jan 2017 21:04:40 GMT",
            "Content-Type": "text/html",
            "Connection": "keep-alive"
          }
        }
      },
      {
        "id": "e246fac2-f9ad-3799-b7b7-066941408b8b",
        "uuid": "e246fac2-f9ad-3799-b7b7-066941408b8b",
        "request": {
          "url": "/about/careers/",
          "method": "GET"
        },
        "response": {
          "status": 200,
          "bodyFileName": "body-about-careers-GhVqy.txt",
          "headers": {
            "Server": "nginx/1.0.15",
            "Date": "Wed, 04 Jan 2017 21:04:35 GMT",
            "Content-Type": "text/html",
            "Last-Modified": "Wed, 04 Jan 2017 12:52:12 GMT",
            "Connection": "keep-alive",
            "X-CACHE-URI": "/about/careers/",
            "Accept-Ranges": "bytes"
          }
        }
      },
      {
        "id": "def378f5-a93c-333e-9663-edcd30c936d7",
        "uuid": "def378f5-a93c-333e-9663-edcd30c936d7",
        "request": {
          "url": "/about/careers/feed/",
          "method": "GET"
        },
        "response": {
          "status": 200,
          "bodyFileName": "body-careers-feed-Fd2fO.xml",
          "headers": {
            "Server": "nginx/1.0.15",
            "Date": "Wed, 04 Jan 2017 21:04:45 GMT",
            "Content-Type": "application/rss+xml; charset=UTF-8",
            "Transfer-Encoding": "chunked",
            "Connection": "keep-alive",
            "X-Powered-By": "PHP/5.3.3",
            "Vary": "Cookie",
            "X-Pingback": "http://www.example.com/xmlrpc.php",
            "Last-Modified": "Thu, 06 Jun 2013 14:01:52 GMT",
            "ETag": "\"765fc03186b121a764133349f8b716df\"",
            "X-Robots-Tag": "noindex, follow",
            "Link": "<http://www.example.com/?p=2680>; rel=shortlink",
            "X-CACHE-URI": "null cache"
          }
        }
      },
      {
        "id": "616ca6d7-6e57-4c10-8b57-f6f3dabc0930",
        "uuid": "616ca6d7-6e57-4c10-8b57-f6f3dabc0930",
        "request": {
          "method": "ANY"
        },
        "response": {
          "status": 200,
          "proxyBaseUrl": "http://www.example.com"
        },
        "priority": 10
      }
    ]
  }
}
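The only clear record of the URL is in the final entry, against proxyBaseUrl, and given that I have to specify a URL in the console call, I am now worried that if I record against a different domain, the domain each response came from will be lost.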
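To make the observation concrete, here is a minimal Python sketch that walks a mappings response and reports which entries carry any domain information. It assumes the response shape shown above (a "result" object containing a "list" of mappings); the function name domain_hints and the trimmed sample data are purely illustrative and not part of WireMock.

```python
def domain_hints(admin_response):
    """Return (request label, proxyBaseUrl or None) for each mapping.

    Assumes the wrapper shape shown in the response above:
    {"result": {"list": [ ...mappings... ]}}.
    """
    hints = []
    for mapping in admin_response["result"]["list"]:
        request = mapping.get("request", {})
        response = mapping.get("response", {})
        # Recorded stubs only carry a path; the catch-all proxy stub has no url.
        label = request.get("url", "<any url>")
        hints.append((label, response.get("proxyBaseUrl")))
    return hints


# Trimmed-down copy of the response above (illustrative sample data).
sample = {
    "result": {
        "ok": True,
        "list": [
            {
                "request": {"url": "/robots.txt", "method": "GET"},
                "response": {"status": 404},
            },
            {
                "request": {"method": "ANY"},
                "response": {
                    "status": 200,
                    "proxyBaseUrl": "http://www.example.com",
                },
                "priority": 10,
            },
        ],
    }
}

for label, hint in domain_hints(sample):
    print(label, "->", hint or "no domain recorded")
```

Run against the full response, only the catch-all proxy entry would report a domain; every recorded stub would come back with just a path.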
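That would mean that in playback mode WireMock can only play back from one domain, and I would have to restart it and point it at another cache in order to play back a different site. That doesn't work for my use case, so is there a way around this problem?
(I have done some work with Mountebank and would be willing to switch to it, though I've found WireMock generally easier to use. My limited understanding of Mountebank is that it suffers from the same single-domain problem, though I'm happy to be corrected on that. If abandoning WireMock is the only way forward, I'm happy to swap to any robust, open-source, API-driven recording HTTP proxy.)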