
I have a large CSV file (1.7 GB, roughly 4 million rows I believe). The file is a dump of all traffic within a certain scope from a Cisco IronPort. My end goal is to import the text into SQL/Access or one of the data modeling applications so that I can show the browsing habits of the unique IDs in the file (it's actually 2 files).

The SQL import blows up because one of the URLs contains a comma. My idea is to rewrite the URL column to drop everything after the TLD (so foo.com/blah,tracking?ref=!superuselessstuff becomes foo.com).

A colleague came up with the two PowerShell snippets below. The first one works as intended, but the 1.7 GB file dragged my system to a crawl and it never finished (it ran for 48 hours without completing). The second one finishes, but produces text that is much harder to work with. Help?

Sample source data:

 "Begin Date"|"End Date"|"Time (GMT -05:00)"|"URL"|"CONTENT TYPE"|"URL CATEGORY"|"DESTINATION IP"|"Disposition"|"Policy Name"|"Policy Type"|"Application Type"|"User"|"User Type"
 "2013-07-24 05:00 GMT"|"2013-07-25 04:59 GMT"|"1374728377"|"hxxp://mediadownloads.mlb.com/mlbam/2013/06/23/mlbtv_bosdet_28278793_1800K.mp4"|"video/mp4"|"Sports and Recreation"|"165.254.94.168"|"Allow"|"Generics"|"Access"|"Media"|"DOMAIN\gen7@Domain 10.XXX.XXX.XXX"|"[-]"
 "2013-07-24 05:00 GMT"|"2013-07-25 04:59 GMT"|"1374728376"|"hxxp://stats.pandora.com/v1?callback=jQuery17102006296486278092_1374683921429&type=promo_box&action=auto_scroll&source=PromoBoxView&listener_id=84313100&_=1374728377192"|"text/javascript"|"Streaming Audio"|"208.85.40.44"|"Allow"|"Generics"|"Access"|"Media"|"DOMAIN\gen7@Domain 10.XXX.XXX.XXX"|"[-]"
 "2013-07-24 05:00 GMT"|"2013-07-25 04:59 GMT"|"1374728357"|"hxxp://b.scorecardresearch.com/p?c1=1&c2=3005352&c3=&c4=mlb&c5=02&c6=&c10=&c7=hxxp%3A//wapc.mlb.com/det/play/%3Fcontent_id%3D29083605%26topic_id%3D8878748%26c_id%3Ddet&c8=Video%3A%20Recap%3A%20BOS%203%2C%20DET%2010%20%7C%20MLB.com%20Multimedia&c9=hxxp%3A//detroit.tigers.mlb.com/index.jsp%3Fc_id%3Ddet&rn=0.36919005215168&cv=2.0"|"image/gif"|"Business and Industry"|"207.152.125.91"|"Allow"|"Generics"|"Access"|"-"|"DOMAIN\gen7@Domain 10.XXX.XXX.XXX"|"[-]"
 "2013-07-24 05:00 GMT"|"2013-07-25 04:59 GMT"|"1374728356"|"hxxp://lt150.tritondigital.com/lt?guid=VEQyNX4wMmIzY2FmZi1mMmExLTQ5OWQtODM5NS1kMjE0ZTkwMzMyMTY%3D&yob=1978&gender=M&zip=55421&hasads=0&devcat=WEB&devtype=WEB&cb=13747283558794766"|"text/plain"|"Business and Industry"|"208.92.52.90"|"Allow"|"Generics"|"Access"|"-"|"DOMAIN\GEN1@Domain 10.XXX.XXX.XXX"|"[-]"
 "2013-07-24 05:00 GMT"|"2013-07-25 04:59 GMT"|"1374728356"|"""hxxp://an.mlb.com/b/ss/mlbglobal08,mlbtigers/1/H.26/s93606666143392?AQB=1&ndh=1&t=24%2F6%2F2013%2023%3A59%3A17%203%20300&fid=0DDFB0A0676D5241-080519A2C0D076F2&ce=UTF-8&ns=mlb&pageName=Major%20League%20Baseball%3A%20Multimedia%3A%20Video%20Playback%20Page&g=hxxp%3A%2F%2Fwapc.mlb.com%2Fdet%2Fplay%2F%3Fcontent_id%3D29083605%26topic_id%3D8878748%26c_id%3Ddet&cc=USD&events=event2%2Cevent28%2Cevent4&v13=Video%20Playback%20Page&c24=mlbglobal08%2Cmlbtigers&v28=28307515%7CFLASH_1200K_640X360&c49=mlb.mlb.com&v49=mlb.mlb.com&pe=lnk_o&pev1=hxxp%3A%2F%2FmyGenericURL&pev2=VPP%20Game%20Recaps&s=1440x900&c=32&j=1.6&v=Y&k=Y&bw=1440&bh=719&AQE=1"""|"image/gif"|"Sports and Recreation"|"66.235.133.11"|"Allow"|"Generics"|"Access"|"-"|"DOMAIN\gen7@Domain 10.XXX.XXX.XXX"|"[-]"
 "2013-07-24 05:00 GMT"|"2013-07-25 04:59 GMT"|"1374728356"|"hxxp://ad.auditude.com/adserver/e?type=podprogress&br=4&z=50389&u=e91d539c7acb7daed69ab3fcdb2a4ea0&pod=id%3A4%2Cctype%3Al%2Cptype%3At%2Cdur%3A200%2Clot%3A5%2Cedur%3A0%2Celot%3A0%2Ccpos%3A3&advancepattern=1&l=1374710168&cid=1922976207&event=complete&uid=RzsxnCYcRkiQ6p9YxyRdEQ&s=e9c06908&t=1374728168"|"-"|"Advertisements"|"63.140.50.240"|"Allow"|"Generics"|"Access"|"-"|"DOMAIN\gen7@Domain 10.XXX.XXX.XXX"|"[-]"

The first snippet, which eats resources but spits out exactly what I expect, is this:

 $filename = 'Dump.csv'
 $csv = Import-csv $filename -Delimiter '|'
 $csv | foreach {
     $url = $_.URL
     $_.URL = $url -replace '^\"*(\w*)://([^/]*)/(.*)$','$1://$2'
 }
 $csv | Export-Csv 'DumpParsed.csv'

It spits out this:

 "Begin Date","End Date","Time (GMT -05:00)","URL","CONTENT TYPE","URL CATEGORY","DESTINATION IP","Disposition","Policy Name","Policy Type","Application Type","User","User Type"
 "2013-07-24 05:00 GMT","2013-07-25 04:59 GMT","1374728377","hxxp://mediadownloads.mlb.com","video/mp4","Sports and Recreation","165.254.94.168","Allow","Generics","Access","Media","DOMAIN\gen7@Domain 10.XXX.XXX.XXX","[-]"
 "2013-07-24 05:00 GMT","2013-07-25 04:59 GMT","1374728376","hxxp://stats.pandora.com","text/javascript","Streaming Audio","208.85.40.44","Allow","Generics","Access","Media","DOMAIN\gen7@Domain 10.XXX.XXX.XXX","[-]"
 "2013-07-24 05:00 GMT","2013-07-25 04:59 GMT","1374728357","hxxp://b.scorecardresearch.com","image/gif","Business and Industry","207.152.125.91","Allow","Generics","Access","-","DOMAIN\gen7@Domain 10.XXX.XXX.XXX","[-]"
 "2013-07-24 05:00 GMT","2013-07-25 04:59 GMT","1374728356","hxxp://lt150.tritondigital.com","text/plain","Business and Industry","208.92.52.90","Allow","Generics","Access","-","DOMAIN\GEN1@Domain 10.XXX.XXX.XXX","[-]"
 "2013-07-24 05:00 GMT","2013-07-25 04:59 GMT","1374728356","hxxp://an.mlb.com","image/gif","Sports and Recreation","66.235.133.11","Allow","Generics","Access","-","DOMAIN\gen7@Domain 10.XXX.XXX.XXX","[-]"
 "2013-07-24 05:00 GMT","2013-07-25 04:59 GMT","1374728356","hxxp://ad.auditude.com","-","Advertisements","63.140.50.240","Allow","Generics","Access","-","DOMAIN\gen7@Domain 10.XXX.XXX.XXX","[-]"

The second snippet runs noticeably faster, but it outputs malformed data that SQL does not like.

 $filename = 'Dump.csv'
 Import-csv $filename -Delimiter '|' | foreach {
     $_.URL = $_.URL -replace '^\"*(\w*)://([^/]*)/(.*)$','$1://$2'
     Add-Content 'DumpParsed.csv' "$_"
 }

The not-so-pretty output:

 @{Begin Date=2013-07-24 05:00 GMT; End Date=2013-07-25 04:59 GMT; Time (GMT -05:00)=1374728377; URL=hxxp://mediadownloads.mlb.com; CONTENT TYPE=video/mp4; URL CATEGORY=Sports and Recreation; DESTINATION IP=165.254.94.168; Disposition=Allow; Policy Name=Generics; Policy Type=Access; Application Type=Media; User=DOMAIN\gen7@Domain 10.XXX.XXX.XXX; User Type=[-]}
 @{Begin Date=2013-07-24 05:00 GMT; End Date=2013-07-25 04:59 GMT; Time (GMT -05:00)=1374728357; URL=hxxp://b.scorecardresearch.com; CONTENT TYPE=image/gif; URL CATEGORY=Business and Industry; DESTINATION IP=207.152.125.91; Disposition=Allow; Policy Name=Generics; Policy Type=Access; Application Type=-; User=DOMAIN\gen7@Domain 10.XXX.XXX.XXX; User Type=[-]}
 @{Begin Date=2013-07-24 05:00 GMT; End Date=2013-07-25 04:59 GMT; Time (GMT -05:00)=1374728356; URL=hxxp://lt150.tritondigital.com; CONTENT TYPE=text/plain; URL CATEGORY=Business and Industry; DESTINATION IP=208.92.52.90; Disposition=Allow; Policy Name=Generics; Policy Type=Access; Application Type=-; User=DOMAIN\GEN1@Domain 10.XXX.XXX.XXX; User Type=[-]}
 @{Begin Date=2013-07-24 05:00 GMT; End Date=2013-07-25 04:59 GMT; Time (GMT -05:00)=1374728356; URL=hxxp://an.mlb.com; CONTENT TYPE=image/gif; URL CATEGORY=Sports and Recreation; DESTINATION IP=66.235.133.11; Disposition=Allow; Policy Name=Generics; Policy Type=Access; Application Type=-; User=DOMAIN\gen7@Domain 10.XXX.XXX.XXX; User Type=[-]}
 @{Begin Date=2013-07-24 05:00 GMT; End Date=2013-07-25 04:59 GMT; Time (GMT -05:00)=1374728356; URL=hxxp://ad.auditude.com; CONTENT TYPE=-; URL CATEGORY=Advertisements; DESTINATION IP=63.140.50.240; Disposition=Allow; Policy Name=Generics; Policy Type=Access; Application Type=-; User=DOMAIN\gen7@Domain 10.XXX.XXX.XXX; User Type=[-]}

Any other ideas? I know a little PowerShell and a tiny bit of SQL, but I'm open to anything else.


4 Answers


Your second solution runs faster because it does not hold the whole file in memory. You could try changing it like this:

 $filename = 'Dump.csv'
 Import-csv $filename -Delimiter '|' |
     foreach { $_.URL = $_.URL -replace '^\"*(\w*)://([^/]*)/(.*)$','$1://$2'; $_ } |
     Export-Csv 'DumpParsed.csv'
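One further tweak worth considering (my addition, not part of the original answer, and assuming Windows PowerShell's default behavior): Export-Csv writes a "#TYPE ..." comment as the first line of the output unless you pass -NoTypeInformation, and that extra line can upset a SQL import. A minimal variation:

 # -NoTypeInformation keeps Export-Csv from writing a "#TYPE ..." comment
 # as the first line of the output file, which a SQL import may choke on
 $filename = 'Dump.csv'
 Import-csv $filename -Delimiter '|' |
     foreach { $_.URL = $_.URL -replace '^\"*(\w*)://([^/]*)/(.*)$','$1://$2'; $_ } |
     Export-Csv 'DumpParsed.csv' -NoTypeInformation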
Answered on 2013-09-03T04:50:57.207

First of all, if you do this:

$csv = Import-csv $filename -Delimiter '|'

you load the entire file into memory as objects constructed from the fields, so the memory consumption and the poor performance are no surprise. The second approach is not bad, but it should emit CSV; as it stands it just dumps the string form of the objects it creates. You could try this:

$filename = 'Dump.csv'
# the trailing ; $_ emits the modified object so it continues down the pipeline
Import-csv $filename -Delimiter '|' |
    Foreach {$_.URL = $_.URL -replace '^\"*(\w*)://([^/]*)/(.*)$','$1://$2'; $_} |
    ConvertTo-Csv -NoTypeInformation | Out-File DumpParsed.csv -Enc UTF8 -Append

By the way, it would be interesting to see whether skipping the CSV processing altogether speeds things up significantly, e.g.:

Get-Content $filename | Foreach {$_ -replace '\"*(\w*)://([^/]*)/[^"]*"(.*)','$1://$2"$3'} |
    Out-File DumpParsed.csv -Enc UTF8

I'm only guessing at the original encoding of the log file; chances are it is ASCII.
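If the raw-line approach pays off, a further tweak worth trying (my sketch, assuming the same regex is adequate for your data) is to batch the reads with -ReadCount. Get-Content then emits arrays of lines instead of one string at a time, and -replace applies to each element of the array, which usually cuts pipeline overhead on a file this size:

# -ReadCount 2000 sends lines down the pipeline in blocks of 2000;
# -replace on an array rewrites every element of the block
Get-Content $filename -ReadCount 2000 |
    Foreach {$_ -replace '\"*(\w*)://([^/]*)/[^"]*"(.*)','$1://$2"$3'} |
    Out-File DumpParsed.csv -Enc UTF8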

Answered on 2013-09-03T04:52:21.983

Have you tried using a stream writer for the output, and walking the file line by line instead of importing it as CSV? Something like this:

$filename = "Dump.csv"
$out      = "C:\path\to\out-file.csv" # full path required here

$stream = [System.IO.StreamWriter] $out

Get-Content $filename `
    | % {
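        # collapse the quoted URL field down to "scheme://host"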
        $line = $_ -replace '\"+(\w*)://([^/]*)/(.*?)\"+','"$1://$2"'
        $stream.WriteLine($line)
    }

$stream.close()
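Taking that idea one step further (my sketch, not part of the original answer): pairing the StreamWriter with a .NET StreamReader avoids the Get-Content pipeline entirely, which tends to help on multi-gigabyte files. The paths below are placeholders.

$in  = "C:\path\to\Dump.csv"        # placeholder paths; use full paths for both
$out = "C:\path\to\out-file.csv"

$reader = [System.IO.StreamReader] $in
$writer = [System.IO.StreamWriter] $out

# read one line at a time so memory use stays flat regardless of file size
while (($line = $reader.ReadLine()) -ne $null) {
    $writer.WriteLine(($line -replace '\"+(\w*)://([^/]*)/(.*?)\"+','"$1://$2"'))
}

$reader.Close()
$writer.Close()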

If you are importing into SQL Server, you can set the TextQualified property to true; it will then treat everything inside the quotes as a single string, including the extra commas.

Answered on 2013-09-03T09:24:10.440

If your database import is choking on the comma, would simply replacing that comma be an option? Something like this:

Get-Content 'Dump.csv' | % { $_ -replace ',','%2C' } | Out-File 'DumpParsed.csv'

Or like this, if other fields contain literal commas that you want to keep:

# the trailing ; $_ passes the modified row on to Export-Csv
Import-Csv 'Dump.csv' -Delimiter '|' `
  | % { $_.URL = $_.URL -replace ',','%2C'; $_ } `
  | Export-Csv 'DumpParsed.csv' -Delimiter '|'

%2C is the URL encoding for a comma.
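If you later need the original URLs back for analysis, the substitution is easy to reverse; a quick check in PowerShell (my addition, using a made-up URL):

# %2C decodes back to a literal comma
[uri]::UnescapeDataString('hxxp://foo.com/blah%2Ctracking')
# -> hxxp://foo.com/blah,tracking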

Answered on 2013-09-03T09:33:29.590