
I have a huge .csv file with the following headers:

timestamp,url,ip

Embedded in the URL requests are the YouTube video IDs that I need to extract.

Input

"26 Jul 2013 00:01:01 UTC","http://r2---sn-nwj7km7e.c.youtube.com/videoplayback?algorithm=throttle-factor&burst=40&clen=255192903&cp=U0hWSVhMUV9GTUNONl9QRlVHOlBSTXhMQ2FtRVRy&cpn=lwn6qrn2_oDOCQl_&dur=4259.840&expire=1374813613&factor=1.25&fexp=900223%2C912307%2C911419%2C932217%2C914028%2C916624%2C919515%2C909546%2C929117%2C929121%2C929906%2C929907%2C925720%2C925722%2C925718%2C925714%2C929917%2C929919%2C912521%2C904830%2C919373%2C904122%2C919387%2C936303%2C909549%2C900816%2C936301%2C912711%2C935000&gcr=in&gir=yes&id=10ff11582e78027b&ip=132.93.92.117&ipbits=8&itag=134&keepalive=yes&key=yt1&lmt=1368924664324037&ms=au&mt=1374793074&mv=m&nh=EAI&range=143196160-144138239&ratebypass=yes&signature=78B2B03AFE619C43E61B30AC228B9C33990B2D89.CADEA7BA4F49AF7C0CB9D6A0C7E4EB277AA338F2&source=youtube&sparams=algorithm%2Cburst%2Cclen%2Ccp%2Cdur%2Cfactor%2Cgcr%2Cgir%2Cid%2Cip%2Cipbits%2Citag%2Clmt%2Csource%2Cupn%2Cexpire&sver=3&upn=S4gwbSmbOGM","192.168.101.2",
"26 Jul 2013 00:02:31 UTC","http://www.youtube.com/watch?v=3hSSRHJYHVY","192.168.101.6"
"26 Jul 2013 00:02:34 UTC","http://www.youtube.com/player_204?ei=lrzxUberMOq_kwLnsoGwDQ&plid=AATiXtvkD53nSs3J&fv=WIN%2011,6,602,180&l_ns=1&len=138&l_state=3&fmt=134&lact=1598&slots=sst~0;sidx~0;at~1_3&ad_flags=1&event=ad&cid=7317&el=detailpage&art=2.24&mt=0&fexp=933900,901439,924368,914070,916612,929305,909546,929117,929121,929906,929907,925720,925722,925718,925714,929917,929919,912521,904830,919373,904122,932216,908534,919387,936303,909549,900816,936301,912711,935000&sidx=0&scoville=1&ad_event=3&sst=0&allowed=1_2,1_2_1,1_1,1_3&v=3hSSRHJYHVY&ad_sys=GDFP&rt=1.002&ns=yt&cpn=-gf8Awba9stlT85b&at=1_3&ad_id=16345549","192.168.101.9"
"26 Jul 2013 00:09:02 UTC","http://www.youtube.com/watch?v=e3oP5NtjlEQ","192.168.101.7",

I can almost do this in bash, but I would like to do it in Ruby (still learning).

cut -d , -f 2 urls.csv | grep watch?v=

Output

"http://www.youtube.com/watch?v=chzEn7TmzJA"
"http://www.youtube.com/watch?v=wAVl_IJV5eI&list=PL34B86ECEC1703D6F"
"http://www.youtube.com/watch?v=8t2s9HSrkl8&list=PL34B86ECEC1703D6F"
"http://www.youtube.com/watch?v=ssdqClUH00c"
"http://www.youtube.com/watch?v=nLIH9cA-Ftg&feature=c4-overview-vl&list=PL1Gpi18n3tsp1GkZ9h4kKKoiJmOSyWpc4"

The YouTube video ID is basically the 11 characters after watch?v=, up to the first &.

Thanks.

Update

require 'csv'
require 'addressable/uri'

#read lines from csv, headers on
lines = CSV.readlines("test.csv", :headers=>true)

#print csv column with headers 'Date and Time' and 'Url'
#p lines ['Date and Time']
#p lines['Url']
#timestamp = lines ['Date and Time']
urls = lines['Url']

# for each line (url) query value
urls.each do |url|
  v = Addressable::URI.parse(url).query_values["v"]
  if (v)
     puts v # prints value if found
  end
end

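The code above prints the video ID from every request that carries one, not only from watch?v= requests, so there are a lot of duplicates.

How can I make it print only the videos whose URL contains watch?v= (together with the timestamp and IP)? Those are the requests that indicate a video was actually played. Thanks.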


2 Answers


Support for slicing and dicing URIs is limited in Ruby's core uri class. An alternative is addressable/uri:

require 'addressable/uri'
uri=Addressable::URI.parse('http://www.youtube.com/watch?v=nLIH9cA-Ftg&feature=c4-overview-vl&list=PL1Gpi18n3tsp1GkZ9h4kKKoiJmOSyWpc4')
uri.query_values["v"] #query_values returns key-value pairs of query components
=> "nLIH9cA-Ftg"

Here is a snippet:

urls=["http://www.youtube.com/watch?v=chzEn7TmzJA", "http://www.youtube.com/watch?v=wAVl_IJV5eI&list=PL34B86ECEC1703D6F", "http://www.youtube.com/watch?v=8t2s9HSrkl8&list=PL34B86ECEC1703D6F", "http://www.youtube.com/watch?v=ssdqClUH00c", "http://www.youtube.com/watch?v=nLIH9cA-Ftg&feature=c4-overview-vl&list=PL1Gpi18n3tsp1GkZ9h4kKKoiJmOSyWpc4"]

urls.each do |url|
  v = Addressable::URI.parse(url).query_values["v"]
  puts v
end

This returns:

chzEn7TmzJA
wAVl_IJV5eI
8t2s9HSrkl8
ssdqClUH00c
nLIH9cA-Ftg

You can install addressable/uri with: sudo gem install addressable
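
For the narrower rule stated in the question (the 11 characters after watch?v=), a regular expression may be enough and needs no gem. The following is only a minimal sketch: the helper name `watch_video_id` and the constant `WATCH_ID` are hypothetical, and it assumes that IDs consist of letters, digits, `-` and `_`, and that `v` is the first query parameter. Unlike query parsing, it will miss URLs where `v` appears later in the query string.

```ruby
# Hypothetical pattern: "/watch?v=" followed by exactly 11 ID characters,
# which must be followed by "&", "#", or the end of the string.
# Assumption: IDs contain only letters, digits, "-" and "_".
WATCH_ID = %r{/watch\?v=([A-Za-z0-9_-]{11})(?=[&#]|\z)}

# Hypothetical helper: returns the video ID, or nil when the URL is not a
# /watch?v=... request.
def watch_video_id(url)
  match = WATCH_ID.match(url)
  match && match[1]
end

urls = [
  "http://www.youtube.com/watch?v=chzEn7TmzJA",
  "http://www.youtube.com/watch?v=wAVl_IJV5eI&list=PL34B86ECEC1703D6F",
  "http://www.youtube.com/player_204?event=ad&v=3hSSRHJYHVY&ad_sys=GDFP"
]

urls.each do |url|
  id = watch_video_id(url)
  puts id if id # the player_204 request is skipped
end
```

With these three URLs the loop prints chzEn7TmzJA and wAVl_IJV5eI only.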

answered 2013-08-10T12:32:26.487

In Ruby on Rails:

You can try this:

 require 'csv'
 lines = CSV.readlines("path to csv file")

Then you can iterate over the rows:

require 'uri'
require 'cgi'

n = 1  # zero-based position of the URL column in the CSV (assumed: 1 for timestamp,url,ip)

lines.each do |row|
  url = row[n]
  uri = URI.parse(url)
  uri_params = CGI.parse(uri.query.to_s)  # to_s guards against URLs without a query string
  video_code = uri_params['v'].first      # nil when the URL has no v parameter

  # video_code is the video ID from the YouTube URL; use it however you need
end
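
To keep only the requests where a video was actually played, and to carry the timestamp and IP along, one option is to check the URL path before reading `v`. This is a rough sketch using only the standard library, not a definitive solution: the helper name `played_video_id`, the inline sample data, and the header names `timestamp`, `url` and `ip` are assumptions, so adjust them to the real file.

```ruby
require 'csv'
require 'uri'

# Hypothetical helper: returns the v parameter for /watch requests, else nil.
def played_video_id(url)
  uri = URI.parse(url)
  return nil unless uri.path == '/watch' && uri.query
  URI.decode_www_form(uri.query).to_h['v']
rescue URI::InvalidURIError
  nil # skip URLs that the stdlib parser rejects
end

# Inline sample standing in for the real file; header names are assumed.
data = <<~SAMPLE
  timestamp,url,ip
  "26 Jul 2013 00:02:31 UTC","http://www.youtube.com/watch?v=3hSSRHJYHVY","192.168.101.6"
  "26 Jul 2013 00:02:34 UTC","http://www.youtube.com/player_204?event=ad&v=3hSSRHJYHVY","192.168.101.9"
  "26 Jul 2013 00:09:02 UTC","http://www.youtube.com/watch?v=e3oP5NtjlEQ","192.168.101.7"
SAMPLE

# Collect [timestamp, video ID, IP] for rows whose URL is a /watch request.
played = []
CSV.parse(data, headers: true).each do |row|
  id = played_video_id(row['url'])
  played << [row['timestamp'], id, row['ip']] if id
end

played.each { |fields| puts fields.join(',') }
```

For the real file, `CSV.foreach("urls.csv", headers: true)` reads one row at a time, which is gentler on memory than `CSV.readlines` when the file is huge.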
answered 2013-08-10T11:56:01.950