每个人。
我需要解析一个为每个链接设置了 java cookie 的网页。我可以解析正常的搜索,并显示每个产品并将其导入 mysql 数据库。
我能够使用以下代码从搜索结果中抓取每个产品及其元素:
这就是我所拥有的:
require 'rubygems'
require 'logger'
require 'mechanize'
require 'mysql2'
agent = WWW::Mechanize.new{|a| a.log = Logger.new(STDERR) }
#agent.set_proxy('a-proxy', '8080')
agent.read_timeout = 60
def add_cookie(agent, uri, cookie)
uri = URI.parse(uri)
Mechanize::Cookie.parse(uri, cookie) do |cookie|
agent.cookie_jar.add(uri, cookie)
end
end
# get main page
page = agent.get "http://www.site.com.mx"
# get login form
form = page.forms.first
form.correo_ingresar = "user"
form.password = "password"
# submit login form
page = agent.submit form
# parse cookies
myarray = page.body.scan(/SetCookie\(\"(.+)\", \"(.+)\"\)/)
# set session cookies
myarray.each do |item|
add_cookie(agent, 'http://www.site.com.mx', "#{item[0]}=#{item[1]}; path=/; domain=www.site.com.mx")
end
# show 1000 search results per page
add_cookie(agent, 'http://www.site.com.mx', "tampag=1000; path=/; domain=www.site.com.mx")
# order results
add_cookie(agent, 'http://www.site.com.mx', "orden_articulos=existencias asc; path=/; domain=www.site.com.mx")
# section results
add_cookie (agent, 'http://www.site.com.mx', "codigoseccion_buscar=14; path=/; domain=www.site.com.mx")
# get main page
page = agent.get "http://www.site.com.mx/tienda/index.php"
search_form = page.forms.first
search_result = agent.submit search_form
doc = Nokogiri::HTML(search_result.body)
rows = doc.css("table.articulos tr")
i = 0
details = rows.collect do |row|
detail = {}
[
[:sku, 'td[3]/text()'],
[:desc, 'td[4]/text()'],
[:qty, 'td[5]/text()'],
[:qty2, 'td[5]/p/b/text()'],
[:price, 'td[6]/text()']
].collect do |name, xpath|
detail[name] = row.at_xpath(xpath).to_s.strip
end
i = i + 1
detail
end
# walk through paginator links
links = doc.css("a.paginar").map {|l| "http://www.site.com.mx#{l['href']}"}.uniq!
links.each do |l|
page = agent.get l
doc = Nokogiri::HTML(page.body)
rows = doc.css("table.articulos tr")
rows.each do |row|
detail = {}
[
[:sku, 'td[3]/text()'],
[:desc, 'td[4]/text()'],
[:qty, 'td[5]/text()'],
[:qty2, 'td[5]/p/b/text()'],
[:price, 'td[6]/text()']
].collect do |name, xpath|
detail[name] = row.at_xpath(xpath).to_s.strip
end
details << detail
end
end
# update db
client = Mysql2::Client.new(:host => "localhost", :username => "myusername", :password => "mypassword", :database => "mydatabase")
details.each do |d|
if d[:sku] != ""
price = d[:price].split
if price[1] == "D"
currency = 144
else
currency = 168
end
cost = price[0].gsub(",", "").to_f
if d[:qty] == ""
qty = d[:qty2]
else
qty = d[:qty]
end
results = client.query("SELECT * FROM jos_vm_product WHERE product_sku = '#{d[:sku]}' LIMIT 1;")
if results.count == 1
product = results.first
client.query("UPDATE jos_vm_product SET product_sku = '#{d[:sku]}', product_name = '#{d[:desc]}', product_desc = '#{d[:desc]}', product_in_stock = '#{qty}' WHERE product_id =
#{product['product_id']};")
client.query("UPDATE jos_vm_product_price SET product_price = '#{cost}', product_currency = '#{currency}' WHERE product_id = '#{product['product_id']}';")
else
client.query("INSERT INTO jos_vm_product(product_sku, product_name, product_desc, product_in_stock) VALUES('#{d[:sku]}', '#{d[:desc]}', '#{d[:desc]}', '#{qty}');")
last_id = client.last_id
client.query("INSERT INTO jos_vm_product_price(product_id, product_price, product_currency) VALUES('#{last_id}', '#{cost}', #{currency});")
end
end
end
现在我不想搜索我想从类别列表中解析:
链接到主页:http://www.site.com.mx/tienda/articulos.php?opcion=lineas&seccion_mostrar=11 这显示了这样的表格(所有内容都包含链接) 顶部名称: ACCESORIOS 是 ACCESORIOS 类别的链接,下面列出的粗体名称是子类别,粗体名称下面的是品牌。如果我点击 ACCESORIOS,它将显示每个品牌和每个子类别混合在一起,依此类推。
ACCESORIOS
Accesorios Multimedia(6)
ACTECK DE MEXICO (5), MANHATTAN (1)
Accesorios P/impres。Punto De Venta(1)
EPSON CORPORATION (1)
Accesorios Para Cableados De Patch Panels(1) 智能
网络解决方案 (1)
Accesorios Para Camaras Digitales(1) 曼哈顿
(1)
Accesorios Para Computadoras De Escritorio(32)
ACTECK DE MEXICO ), GENERICA (1), MANHATTAN (28), TARGUS (1)
Accesorios Para Computadoras Portatiles(60)
ACTECK DE MEXICO (3), GENIUS (2), HP COMERCIAL (2), HP IMPRESION (1), MANHATTAN (17) ), PERFECT CHOICES (32), SOLIDEX (1), TARGUS (1), TECH ZONE (1)
Accesorios Para Ipod(3)
ACTECK DE MEXICO (1), PERFECT CHOICES (2)
Accesorios Para Mesas(3)
MANHATTAN (2), PERFECT CHOICES (1)
Accesorios Para Redes(13)
INTELLINET NETWORK SOLUTIONS (5), MANHATTAN (8)
Accesoriso Para Celulares(14)
BLACKBERRY (14)
Adaptador Bluetooth(6)
ACTECK DE MEXICO (1), MANHATTAN (2), PERFECT CHOICES (3)
Adaptadores Para Mouse Y Teclado(3)
MANHATTAN (2), PERFECT CHOICES (1)
Audifono/diademas Y Microfonos(49)
ACTECK DE MEXICO (14), BTO (1), GENIUS (3) )、罗技 (2)、曼哈顿 (11)、完美选择 (18)
这是每个链接都有 cookie 的表的代码,这就是为什么我一直很难抓取它。
<table width="95%" cellspacing="0" cellpadding="3" border="0">
<tbody>
<tr>
<td valign="top" align="left" style="font-family: verdana; font-size: 12px" colspan="2"><a onClick="fijar_filtro('codigoseccion_buscar','11')" href="javascript:void(0)" class="busquedas"><b>ACCESORIOS</b></a></td>
</tr>
<tr>
<td width="20" valign="top" align="left"></td>
<td valign="top" align="left" style="font-family: verdana; font-size: 12px"><a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','338')" href="javascript:void(0)" class="busquedas"><b>Accesorios Multimedia</b>(6)</a><br>
<a onClick="SetCookie('codigolinea_buscar','338');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (5)</a>, <a onClick="SetCookie('codigolinea_buscar','338');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (1)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','540')" href="javascript:void(0)" class="busquedas"><b>Accesorios P/impres. Punto De Venta</b>(1)</a><br>
<a onClick="SetCookie('codigolinea_buscar','540');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','106');" href="javascript:void(0)" class="busquedas">EPSON CORPORATION (1)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','542')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Cableados De Patch Panels</b>(1)</a><br>
<a onClick="SetCookie('codigolinea_buscar','542');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','635');" href="javascript:void(0)" class="busquedas">INTELLINET NETWORK SOLUTIONS (1)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','361')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Camaras Digitales</b>(1)</a><br>
<a onClick="SetCookie('codigolinea_buscar','361');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (1)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','277')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Computadoras De Escritorio</b>(32)</a><br>
<a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (2)</a>, <a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','530');" href="javascript:void(0)" class="busquedas">GENERICA (1)</a>, <a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (28)</a>, <a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','586');" href="javascript:void(0)" class="busquedas">TARGUS (1)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','357')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Computadoras Portatiles</b>(60)</a><br>
<a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (3)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','167');" href="javascript:void(0)" class="busquedas">GENIUS (2)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','694');" href="javascript:void(0)" class="busquedas">HP COMERCIAL (2)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','107');" href="javascript:void(0)" class="busquedas">HP IMPRESION (1)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (17)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (32)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','212');" href="javascript:void(0)" class="busquedas">SOLIDEX (1)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','586');" href="javascript:void(0)" class="busquedas">TARGUS (1)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','691');" href="javascript:void(0)" class="busquedas">TECH ZONE (1)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1302')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Ipod</b>(3)</a><br>
<a onClick="SetCookie('codigolinea_buscar','1302');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (1)</a>, <a onClick="SetCookie('codigolinea_buscar','1302');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (2)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1175')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Mesas</b>(3)</a><br>
<a onClick="SetCookie('codigolinea_buscar','1175');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (2)</a>, <a onClick="SetCookie('codigolinea_buscar','1175');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (1)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','292')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Redes</b>(13)</a><br>
<a onClick="SetCookie('codigolinea_buscar','292');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','635');" href="javascript:void(0)" class="busquedas">INTELLINET NETWORK SOLUTIONS (5)</a>, <a onClick="SetCookie('codigolinea_buscar','292');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (8)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1378')" href="javascript:void(0)" class="busquedas"><b>Accesoriso Para Celulares</b>(14)</a><br>
<a onClick="SetCookie('codigolinea_buscar','1378');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','714');" href="javascript:void(0)" class="busquedas">BLACKBERRY (14)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1313')" href="javascript:void(0)" class="busquedas"><b>Adaptador Bluetooth</b>(6)</a><br>
<a onClick="SetCookie('codigolinea_buscar','1313');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (1)</a>, <a onClick="SetCookie('codigolinea_buscar','1313');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (2)</a>, <a onClick="SetCookie('codigolinea_buscar','1313');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (3)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','555')" href="javascript:void(0)" class="busquedas"><b>Adaptadores Para Mouse Y Teclado</b>(3)</a><br>
<a onClick="SetCookie('codigolinea_buscar','555');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (2)</a>, <a onClick="SetCookie('codigolinea_buscar','555');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (1)</a><br>
</td>
</tr>
</tbody>
</table>
所以问题是我应该在我的代码中添加什么才能访问每个链接?如果它使用 java cookie。
使用的 Cookie:
名称、值范围
codigoseccion_buscar、11-30
codigomarca_buscar、100-736
codigolinea_buscar、15-1385