I have a list of URLs and I want to retrieve all of their web pages. This is what I do:
for each url:
    getaddrinfo(hostname, port, &hints, &res); // DNS
    // create socket
    sockfd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    connect(sockfd, res->ai_addr, res->ai_addrlen);
    creatGET();
    /* for example:
       GET / HTTP/1.1\r\n
       Host: stackoverflow.cn\r\n
       ...
    */
    writeHead(); // send GET head to host
    recv();      // get the webpage content
end
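In real C code, one iteration of that loop looks roughly like the sketch below. fetch_one(), the fixed port "80", the Connection: close header, the buffer sizes and reading until the server closes the connection are illustrative simplifications, not my exact creatGET()/writeHead()/recv() helpers:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

/* One fetch per URL: resolve, connect, send one GET, read the reply.
   fetch_one() and the fixed-size buffers are illustrative only. */
static int fetch_one(const char *hostname, const char *path, FILE *out)
{
    struct addrinfo hints, *res;
    memset(&hints, 0, sizeof hints);
    hints.ai_family   = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo(hostname, "80", &hints, &res) != 0)    /* DNS */
        return -1;

    /* create socket and connect */
    int sockfd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (sockfd < 0 || connect(sockfd, res->ai_addr, res->ai_addrlen) < 0) {
        if (sockfd >= 0) close(sockfd);
        freeaddrinfo(res);
        return -1;
    }
    freeaddrinfo(res);

    /* creatGET() + writeHead(): build and send the request head */
    char req[1024];
    snprintf(req, sizeof req,
             "GET %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n",
             path, hostname);
    send(sockfd, req, strlen(req), 0);

    /* recv(): read until the server closes the connection */
    char buf[4096];
    ssize_t n;
    while ((n = recv(sockfd, buf, sizeof buf, 0)) > 0)
        fwrite(buf, 1, (size_t)n, out);

    close(sockfd);
    return 0;
}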
I noticed that many of the URLs are under the same host, for example:
http://job.01hr.com/j/f-6164230.html
http://job.01hr.com/j/f-6184336.html
http://www.012yy.com/gangtaiju/32692/
http://www.012yy.com/gangtaiju/35162/
So I wonder: can I connect() to each host only once, and then just creatGET(), writeHead() and recv() once for each URL? That might save a lot of time. So I changed my program like this:
split urls into groups by their host;
for each group:
    get hostname of the group;
    getaddrinfo(hostname, port, &hints, &res);
    sockfd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    connect(sockfd, res->ai_addr, res->ai_addrlen);
    for each url in the group:
        creatGET();
        writeHead();
        recv();
    end
end
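In C, the structure of this second version is roughly the sketch below. fetch_group(), the paths array, the fixed port "80", the output file names and the buffer sizes are illustrative only; each response is still read with a single recv(), exactly as in the pseudocode:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

/* One connection per host: connect once, then send one GET per path.
   fetch_group() and the fixed-size buffers are illustrative only. */
static int fetch_group(const char *hostname, const char **paths, size_t npaths)
{
    struct addrinfo hints, *res;
    memset(&hints, 0, sizeof hints);
    hints.ai_family   = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo(hostname, "80", &hints, &res) != 0)    /* DNS once per host */
        return -1;

    int sockfd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (sockfd < 0 || connect(sockfd, res->ai_addr, res->ai_addrlen) < 0) {
        if (sockfd >= 0) close(sockfd);
        freeaddrinfo(res);
        return -1;
    }
    freeaddrinfo(res);

    char req[1024], buf[65536];
    for (size_t i = 0; i < npaths; i++) {
        /* creatGET() + writeHead(): one request per URL on the same socket */
        snprintf(req, sizeof req,
                 "GET %s HTTP/1.1\r\nHost: %s\r\n\r\n",
                 paths[i], hostname);
        send(sockfd, req, strlen(req), 0);

        /* recv(): a single read per URL, as in the pseudocode above */
        ssize_t n = recv(sockfd, buf, sizeof buf, 0);
        if (n > 0) {
            char name[64];
            snprintf(name, sizeof name, "page_%zu.html", i);
            FILE *out = fopen(name, "wb");
            if (out) { fwrite(buf, 1, (size_t)n, out); fclose(out); }
        }
    }

    close(sockfd);
    return 0;
}

The only real difference from the first version is that getaddrinfo()/socket()/connect() are hoisted out of the per-URL loop.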
Unfortunately, I found that my program only retrieves the first web page of each group; the rest all come back as empty files. Am I missing something? Maybe the sockfd needs some kind of reset before each recv()?
Thanks for your generous help.