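I have a large CSV file (1.7 GB, roughly 4 million rows I believe). The file is a dump from a Cisco IronPort of all traffic over a certain range. My end goal is to import the text into SQL/Access or one of the data-modeling applications so that I can show the browsing habits of the unique IDs in the file (actually 2 files).
The import into SQL blows up because one of the URLs has a comma in it. My idea is to rewrite the URL column to drop everything after the TLD (foo.com/blah,tracking?ref=!superuselessstuff becomes foo.com).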
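To make the intended rewrite concrete, here is a minimal Python sketch of the trimming step on its own. It assumes every URL carries a scheme (as the sample rows do) and that some values may be wrapped in stray double quotes; the names `HOST_ONLY` and `trim_url` are made up for illustration.

```python
import re

# Keep only scheme://host. Leading double quotes are tolerated because
# some rows wrap the URL in extra quotes.
HOST_ONLY = re.compile(r'^"*(\w+)://([^/]*)/.*$')

def trim_url(url: str) -> str:
    """Return scheme://host, or the input unchanged if there is no path to strip."""
    return HOST_ONLY.sub(r"\1://\2", url)

print(trim_url("hxxp://foo.com/blah,tracking?ref=!superuselessstuff"))
```

A URL that has no slash after the host does not match the pattern and passes through untouched.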
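A coworker came up with the following two PowerShell scripts. The first one works great, but the 1.7 GB file slowed my system to a crawl and it never finished (it ran for 48 hours without completing). The second one finished, but made the text harder to work with. Help?
Sample of the source data: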
"Begin Date"|"End Date"|"Time (GMT -05:00)"|"URL"|"CONTENT TYPE"|"URL CATEGORY"|"DESTINATION IP"|"Disposition"|"Policy Name"|"Policy Type"|"Application Type"|"User"|"User Type"
"2013-07-24 05:00 GMT"|"2013-07-25 04:59 GMT"|"1374728377"|"hxxp://mediadownloads.mlb.com/mlbam/2013/06/23/mlbtv_bosdet_28278793_1800K.mp4"|"video/mp4"|"Sports and Recreation"|"165.254.94.168"|"Allow"|"Generics"|"Access"|"Media"|"DOMAIN\gen7@Domain 10.XXX.XXX.XXX"|"[-]"
"2013-07-24 05:00 GMT"|"2013-07-25 04:59 GMT"|"1374728376"|"hxxp://stats.pandora.com/v1?callback=jQuery17102006296486278092_1374683921429&type=promo_box&action=auto_scroll&source=PromoBoxView&listener_id=84313100&_=1374728377192"|"text/javascript"|"Streaming Audio"|"208.85.40.44"|"Allow"|"Generics"|"Access"|"Media"|"DOMAIN\gen7@Domain 10.XXX.XXX.XXX"|"[-]"
"2013-07-24 05:00 GMT"|"2013-07-25 04:59 GMT"|"1374728357"|"hxxp://b.scorecardresearch.com/p?c1=1&c2=3005352&c3=&c4=mlb&c5=02&c6=&c10=&c7=hxxp%3A//wapc.mlb.com/det/play/%3Fcontent_id%3D29083605%26topic_id%3D8878748%26c_id%3Ddet&c8=Video%3A%20Recap%3A%20BOS%203%2C%20DET%2010%20%7C%20MLB.com%20Multimedia&c9=hxxp%3A//detroit.tigers.mlb.com/index.jsp%3Fc_id%3Ddet&rn=0.36919005215168&cv=2.0"|"image/gif"|"Business and Industry"|"207.152.125.91"|"Allow"|"Generics"|"Access"|"-"|"DOMAIN\gen7@Domain 10.XXX.XXX.XXX"|"[-]"
"2013-07-24 05:00 GMT"|"2013-07-25 04:59 GMT"|"1374728356"|"hxxp://lt150.tritondigital.com/lt?guid=VEQyNX4wMmIzY2FmZi1mMmExLTQ5OWQtODM5NS1kMjE0ZTkwMzMyMTY%3D&yob=1978&gender=M&zip=55421&hasads=0&devcat=WEB&devtype=WEB&cb=13747283558794766"|"text/plain"|"Business and Industry"|"208.92.52.90"|"Allow"|"Generics"|"Access"|"-"|"DOMAIN\GEN1@Domain 10.XXX.XXX.XXX"|"[-]"
"2013-07-24 05:00 GMT"|"2013-07-25 04:59 GMT"|"1374728356"|"""hxxp://an.mlb.com/b/ss/mlbglobal08,mlbtigers/1/H.26/s93606666143392?AQB=1&ndh=1&t=24%2F6%2F2013%2023%3A59%3A17%203%20300&fid=0DDFB0A0676D5241-080519A2C0D076F2&ce=UTF-8&ns=mlb&pageName=Major%20League%20Baseball%3A%20Multimedia%3A%20Video%20Playback%20Page&g=hxxp%3A%2F%2Fwapc.mlb.com%2Fdet%2Fplay%2F%3Fcontent_id%3D29083605%26topic_id%3D8878748%26c_id%3Ddet&cc=USD&events=event2%2Cevent28%2Cevent4&v13=Video%20Playback%20Page&c24=mlbglobal08%2Cmlbtigers&v28=28307515%7CFLASH_1200K_640X360&c49=mlb.mlb.com&v49=mlb.mlb.com&pe=lnk_o&pev1=hxxp%3A%2F%2FmyGenericURL&pev2=VPP%20Game%20Recaps&s=1440x900&c=32&j=1.6&v=Y&k=Y&bw=1440&bh=719&AQE=1"""|"image/gif"|"Sports and Recreation"|"66.235.133.11"|"Allow"|"Generics"|"Access"|"-"|"DOMAIN\gen7@Domain 10.XXX.XXX.XXX"|"[-]"
"2013-07-24 05:00 GMT"|"2013-07-25 04:59 GMT"|"1374728356"|"hxxp://ad.auditude.com/adserver/e?type=podprogress&br=4&z=50389&u=e91d539c7acb7daed69ab3fcdb2a4ea0&pod=id%3A4%2Cctype%3Al%2Cptype%3At%2Cdur%3A200%2Clot%3A5%2Cedur%3A0%2Celot%3A0%2Ccpos%3A3&advancepattern=1&l=1374710168&cid=1922976207&event=complete&uid=RzsxnCYcRkiQ6p9YxyRdEQ&s=e9c06908&t=1374728168"|"-"|"Advertisements"|"63.140.50.240"|"Allow"|"Generics"|"Access"|"-"|"DOMAIN\gen7@Domain 10.XXX.XXX.XXX"|"[-]"
The first script eats resources but produces the expected output. It looks like this:
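$filename = 'Dump.csv'
# Read every row into memory before processing
$csv = Import-Csv $filename -Delimiter '|'
$csv | ForEach-Object {
    # Keep only scheme://host, dropping the path and query string
    $url = $_.URL
    $_.URL = $url -replace '^\"*(\w*)://([^/]*)/(.*)$','$1://$2'
}
$csv | Export-Csv 'DumpParsed.csv'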
It spits out this:
"Begin Date","End Date","Time (GMT -05:00)","URL","CONTENT TYPE","URL CATEGORY","DESTINATION IP","Disposition","Policy Name","Policy Type","Application Type","User","User Type"
"2013-07-24 05:00 GMT","2013-07-25 04:59 GMT","1374728377","hxxp://mediadownloads.mlb.com","video/mp4","Sports and Recreation","165.254.94.168","Allow","Generics","Access","Media","DOMAIN\gen7@Domain 10.XXX.XXX.XXX","[-]"
"2013-07-24 05:00 GMT","2013-07-25 04:59 GMT","1374728376","hxxp://stats.pandora.com","text/javascript","Streaming Audio","208.85.40.44","Allow","Generics","Access","Media","DOMAIN\gen7@Domain 10.XXX.XXX.XXX","[-]"
"2013-07-24 05:00 GMT","2013-07-25 04:59 GMT","1374728357","hxxp://b.scorecardresearch.com","image/gif","Business and Industry","207.152.125.91","Allow","Generics","Access","-","DOMAIN\gen7@Domain 10.XXX.XXX.XXX","[-]"
"2013-07-24 05:00 GMT","2013-07-25 04:59 GMT","1374728356","hxxp://lt150.tritondigital.com","text/plain","Business and Industry","208.92.52.90","Allow","Generics","Access","-","DOMAIN\GEN1@Domain 10.XXX.XXX.XXX","[-]"
"2013-07-24 05:00 GMT","2013-07-25 04:59 GMT","1374728356","hxxp://an.mlb.com","image/gif","Sports and Recreation","66.235.133.11","Allow","Generics","Access","-","DOMAIN\gen7@Domain 10.XXX.XXX.XXX","[-]"
"2013-07-24 05:00 GMT","2013-07-25 04:59 GMT","1374728356","hxxp://ad.auditude.com","-","Advertisements","63.140.50.240","Allow","Generics","Access","-","DOMAIN\gen7@Domain 10.XXX.XXX.XXX","[-]"
The second script runs significantly faster, but outputs malformed data that SQL does not like.
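$filename = 'Dump.csv'
# Stream rows through the pipeline one at a time
Import-Csv $filename -Delimiter '|' | ForEach-Object {
    $_.URL = $_.URL -replace '^\"*(\w*)://([^/]*)/(.*)$','$1://$2'
    # "$_" converts the whole row object to a string, which is not CSV
    Add-Content 'DumpParsed.csv' "$_"
}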
The not-so-pretty output:
@{Begin Date=2013-07-24 05:00 GMT; End Date=2013-07-25 04:59 GMT; Time (GMT -05:00)=1374728377; URL=hxxp://mediadownloads.mlb.com; CONTENT TYPE=video/mp4; URL CATEGORY=Sports and Recreation; DESTINATION IP=165.254.94.168; Disposition=Allow; Policy Name=Generics; Policy Type=Access; Application Type=Media; User=DOMAIN\gen7@Domain 10.XXX.XXX.XXX; User Type=[-]}
@{Begin Date=2013-07-24 05:00 GMT; End Date=2013-07-25 04:59 GMT; Time (GMT -05:00)=1374728357; URL=hxxp://b.scorecardresearch.com; CONTENT TYPE=image/gif; URL CATEGORY=Business and Industry; DESTINATION IP=207.152.125.91; Disposition=Allow; Policy Name=Generics; Policy Type=Access; Application Type=-; User=DOMAIN\gen7@Domain 10.XXX.XXX.XXX; User Type=[-]}
@{Begin Date=2013-07-24 05:00 GMT; End Date=2013-07-25 04:59 GMT; Time (GMT -05:00)=1374728356; URL=hxxp://lt150.tritondigital.com; CONTENT TYPE=text/plain; URL CATEGORY=Business and Industry; DESTINATION IP=208.92.52.90; Disposition=Allow; Policy Name=Generics; Policy Type=Access; Application Type=-; User=DOMAIN\GEN1@Domain 10.XXX.XXX.XXX; User Type=[-]}
@{Begin Date=2013-07-24 05:00 GMT; End Date=2013-07-25 04:59 GMT; Time (GMT -05:00)=1374728356; URL=hxxp://an.mlb.com; CONTENT TYPE=image/gif; URL CATEGORY=Sports and Recreation; DESTINATION IP=66.235.133.11; Disposition=Allow; Policy Name=Generics; Policy Type=Access; Application Type=-; User=DOMAIN\gen7@Domain 10.XXX.XXX.XXX; User Type=[-]}
@{Begin Date=2013-07-24 05:00 GMT; End Date=2013-07-25 04:59 GMT; Time (GMT -05:00)=1374728356; URL=hxxp://ad.auditude.com; CONTENT TYPE=-; URL CATEGORY=Advertisements; DESTINATION IP=63.140.50.240; Disposition=Allow; Policy Name=Generics; Policy Type=Access; Application Type=-; User=DOMAIN\gen7@Domain 10.XXX.XXX.XXX; User Type=[-]}
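For comparison, what I am after is something that streams like the second script but still writes properly quoted CSV like the first. A rough sketch of that in Python, using the standard `csv` module, might look like the following. The function name `rewrite_rows` and the three-column demo data are invented for illustration; the real file would be opened with `open(..., newline="")` and whatever encoding the dump actually uses, which I have not checked.

```python
import csv
import io
import re

# Keep only scheme://host; tolerate stray leading double quotes.
HOST_ONLY = re.compile(r'^"*(\w+)://([^/]*)/.*$')

def rewrite_rows(src, dst, url_column="URL"):
    """Stream pipe-delimited rows from src to dst as fully quoted CSV,
    trimming the URL column. Only one row is held in memory at a time."""
    reader = csv.reader(src, delimiter="|")
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL, lineterminator="\n")
    header = next(reader)
    url_index = header.index(url_column)
    writer.writerow(header)
    count = 0
    for row in reader:
        row[url_index] = HOST_ONLY.sub(r"\1://\2", row[url_index])
        writer.writerow(row)
        count += 1
    return count

# Tiny in-memory demo with a shortened version of the problem row.
sample = io.StringIO(
    '"Time"|"URL"|"User"\n'
    '"1374728356"|"""hxxp://an.mlb.com/b/ss/mlbglobal08,mlbtigers/1?AQB=1"""|"gen7"\n'
)
out = io.StringIO()
rows = rewrite_rows(sample, out)
print(out.getvalue())
```

Because every field is quoted on output, a comma surviving in any other column should not break a comma-delimited import either.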
Any other ideas? I know a little PowerShell and a tiny bit of SQL, but I'm open to anything else.