I'm trying to import awstats files into a database. I have a large number of files; the largest is about 10 MB and contains roughly 200k lines. Each file is divided into sections, one example of which looks like this:
BEGIN_GENERAL 8
LastLine 20150101000000 1379198 369425288 17319453580950
FirstTime 20141201000110
LastTime 20141231235951
LastUpdate 20150101000142 12317 0 12316 0 0
TotalVisits 146425
TotalUnique 87968
MonthHostsKnown 0
MonthHostsUnknown 103864
END_GENERAL
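For context, each section is delimited by `BEGIN_<NAME> <count>` and `END_<NAME>` markers, with space-separated records in between. A minimal sketch of reading one such section into an array (the `parseSection` helper below is illustrative only, not the actual `AwstatsDataParser`):

```php
<?php

// Illustrative sketch: read one BEGIN_X ... END_X section from an awstats
// dump into an associative array keyed by each record's first column.
function parseSection(array $lines, string $name): array
{
    $inSection = false;
    $rows = [];
    foreach ($lines as $line) {
        $line = trim($line);
        if (str_starts_with($line, "BEGIN_{$name}")) {
            $inSection = true;
            continue;
        }
        if ($line === "END_{$name}") {
            break;
        }
        if ($inSection && $line !== '') {
            // Split the record on whitespace; the first token is the key.
            $parts = preg_split('/\s+/', $line);
            $rows[array_shift($parts)] = $parts;
        }
    }
    return $rows;
}

$lines = explode("\n", <<<TXT
BEGIN_GENERAL 8
TotalVisits 146425
TotalUnique 87968
END_GENERAL
TXT);

$general = parseSection($lines, 'GENERAL');
echo $general['TotalVisits'][0]; // 146425
```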
That is one small section with little data; there are very large sections containing thousands of lines. I'm using Laravel and MySQL in this project and saving the sections in a table as JSON. Here is the controller code that saves the file data to the database:
<?php

namespace App\Http\Controllers;

use Validator;
use App\Models\Site;
use Illuminate\Http\Request;
use App\Helpers\AwstatsDataParser;
use App\Jobs\ProcessNewSiteStats;

class SiteController extends Controller
{
    private $dir_path;

    public function __construct()
    {
        $this->dir_path = config('settings.files_path');
    }

    /**
     * Store a newly created resource in storage.
     *
     * @param  \Illuminate\Http\Request  $request
     * @return \Illuminate\Http\Response
     */
    public function store(Request $request)
    {
        $request->validate([
            'title' => 'required|string',
            'domain' => 'required|regex:/(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z0-9][a-z0-9-]{0,61}[a-z0-9]/i|unique:sites,domain',
        ]);

        $site = Site::create([
            'title' => $request->title,
            'domain' => $request->domain,
            'status' => true,
        ]);

        ProcessNewSiteStats::dispatch($site);

        return back()->with('success', 'Site created successfully');
    }
}
This controller saves the site and dispatches a job that imports the current month's file data into the database:
<?php

namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Queue\SerializesModels;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use App\Models\Site;
use App\Models\Webstat;
use App\Helpers\AwstatsDataParser;

class ProcessNewSiteStats implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    private $site;
    private $dir_path;

    /**
     * Create a new job instance.
     *
     * @return void
     */
    public function __construct(Site $site)
    {
        $this->site = $site;
        $this->dir_path = config('settings.files_path');
    }

    /**
     * Execute the job.
     *
     * @return void
     */
    public function handle()
    {
        if (is_dir($this->dir_path)) {
            $year = date('Y');
            $month = date('m');
            $fileName = "awstats{$month}{$year}.{$this->site->domain}.txt";
            $files_path = "{$this->dir_path}/{$fileName}";
            if (file_exists($files_path)) {
                $parser = new AwstatsDataParser($files_path);
                $time = collect($parser->TIME);
                $webstat = Webstat::where('file_name', $fileName)->first();
                if (!$webstat) {
                    $data = [
                        'file_name' => $fileName,
                        'month' => $month,
                        'year' => $year,
                        'total_visits' => $parser->GENERAL['TotalVisits'],
                        'total_unique' => $parser->GENERAL['TotalUnique'],
                        'total_hosts_known' => $parser->GENERAL['MonthHostsKnown'],
                        'total_hosts_unknown' => $parser->GENERAL['MonthHostsUnknown'],
                        'page_count' => $time->sum('Pages'),
                        'hit_count' => $time->sum('Hits'),
                        'bandwidth_count' => $time->sum('Bandwidth'),
                        'not_viewed_page_count' => $time->sum('NotViewedPages'),
                        'not_viewed_hit_count' => $time->sum('NotViewedHits'),
                        'not_viewed_bandwidth_count' => $time->sum('NotViewedBandwidth'),
                        'general' => $parser->GENERAL,
                        'time' => $parser->TIME,
                        'day' => $parser->DAY,
                        'login' => $parser->LOGIN,
                        'robot' => $parser->ROBOT,
                        'worms' => $parser->WORMS,
                        'email_sender' => $parser->EMAILSENDER,
                        'email_receiver' => $parser->EMAILRECEIVER,
                        'sider' => $parser->SIDER,
                        'domain' => $parser->DOMAIN,
                        'session' => $parser->SESSION,
                        'file_types' => $parser->FILETYPES,
                        'visitor' => $parser->VISITOR,
                        'downloads' => $parser->DOWNLOADS,
                        'os' => $parser->OS,
                        'browser' => $parser->BROWSER,
                        'screen_size' => $parser->SCREENSIZE,
                        'unknown_referer' => $parser->UNKNOWNREFERER,
                        'unknown_referer_browser' => $parser->UNKNOWNREFERERBROWSER,
                        'origin' => $parser->ORIGIN,
                        'se_referrals' => $parser->SEREFERRALS,
                        'page_refs' => $parser->PAGEREFS,
                        'search_words' => $parser->SEARCHWORDS,
                        'keywords' => $parser->KEYWORDS,
                        'misc' => $parser->MISC,
                        'errors' => $parser->ERRORS,
                        'cluster' => $parser->CLUSTER,
                        'sider_404' => $parser->SIDER_404,
                        'plugin_geoip_city_maxmind' => json_encode($parser->PLUGIN_geoip_city_maxmind, JSON_INVALID_UTF8_SUBSTITUTE),
                        'is_sync' => true,
                    ];
                    $this->site->webstats()->create($data);
                }
            }
        }
    }
}
This code works fine for small files, but not for large data. I usually get errors about the MySQL server having gone away, i.e. `max_allowed_packet`-type errors.
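For what it's worth, that error pattern matches a single INSERT whose packet exceeds the server's `max_allowed_packet`. A rough pre-flight size check can confirm this before the query is sent (the `payloadBytes` helper and the 16 MB limit below are illustrative assumptions; the real limit can be read with `SELECT @@max_allowed_packet`):

```php
<?php

// Hypothetical pre-flight check: estimate how many bytes a row adds to the
// INSERT packet, so oversized JSON columns can be caught before MySQL
// rejects the query ("server has gone away" / max_allowed_packet).
function payloadBytes(array $row): int
{
    $bytes = 0;
    foreach ($row as $value) {
        // Arrays end up JSON-encoded by Eloquent's array/json casts.
        $bytes += strlen(is_array($value) ? json_encode($value) : (string) $value);
    }
    return $bytes;
}

$row = [
    'file_name' => 'awstats122014.example.com.txt',
    'general'   => ['TotalVisits' => 146425, 'TotalUnique' => 87968],
];

// Compare against the server limit (assumed 16 MB here for illustration).
$limit = 16 * 1024 * 1024;
var_dump(payloadBytes($row) < $limit); // bool(true)
```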
I have tried the following improvements:
- Saving the data in chunks (e.g. splitting it into three parts, inserting the required columns first, then updating the row with the remaining data)
- Increasing memory limits, execution time, etc.
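The chunked-save workaround described in the first bullet above can be sketched as a generic column-batching step (the `chunkColumns` helper and the byte budget are illustrative assumptions, not part of the actual code; batch 1 would go into `create()` and the remaining batches into `update()` calls on the same row):

```php
<?php

// Sketch of the chunking idea: split a row's columns into batches whose
// encoded size stays under a byte budget, so no single INSERT/UPDATE
// packet has to carry the whole multi-megabyte payload.
function chunkColumns(array $row, int $budget): array
{
    $batches = [];
    $current = [];
    $size = 0;
    foreach ($row as $column => $value) {
        $bytes = strlen(is_array($value) ? json_encode($value) : (string) $value);
        // Start a new batch when adding this column would exceed the budget.
        if ($current && $size + $bytes > $budget) {
            $batches[] = $current;
            $current = [];
            $size = 0;
        }
        $current[$column] = $value;
        $size += $bytes;
    }
    if ($current) {
        $batches[] = $current;
    }
    return $batches;
}

$row = [
    'month' => '12',
    'time'  => array_fill(0, 100, ['Pages' => 1, 'Hits' => 2]),
    'sider' => array_fill(0, 100, ['url' => '/index.html']),
];
echo count(chunkColumns($row, 1024)); // 3 — each large section gets its own batch
```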
But I'm looking for a proper way to store this data, because on top of it I have to write a scheduler and several other jobs that can import many files in a single run. Any good ideas or suggestions would be much appreciated.
Thanks!