我们有 200Gb 的文件,其中填充了除以“\n”的文本行。我们的服务器有板载 Linux、gcc、8GB RAM 和无限的硬盘空间。从我们的角度来看,要求是在 C++ 中实现对这个文件的行进行字典排序的最有效方法。
我已经正确地为输入文件(最大 2GB)按字典顺序对文本行进行了外部排序。但是当我使用大于 2GB 的文件进行测试时,输出不正确。最大文件大小需要测试大于 200GB。
下面是我的代码,它适用于小于 2GB 的文件大小。程序有 3 个参数作为命令行参数:输入文件名、输出文件名和内存限制(以字节为单位)(我们将使用不同的内存限制进行测试,而不仅仅是 8GB) 程序应该能够在简单的边界情况下正常工作,例如小文件、空文件、大于 200GB 的文件。它应该适用于非 ascii 符号,但我们可以假设文件中不存在零字节。我们还可以假设内存限制远大于最长行的大小。
这是代码。请帮我指出文件大小大于 2GB 时无法产生正确结果的原因。如果您能给我下面的代码的修改版本以使其适用于所有情况,我将不胜感激。
#include <stdio.h>
#include <stdlib.h>
#include <fstream>
#include <sstream>
#include <iostream>
#include <algorithm>
#include <vector>
#include <string>
#include <iterator>
//using namespace std;
#define FILESIZE 200000000000 // size of the file on disk
#define TOTAL_MEM 7000000000 // max items the memory buffer can hold
typedef std::vector<std::string> DataContainer;
// Function to sort lines of text according to their length
/*bool compare(std::string &a, std::string &b)
{
if (a.size() < b.size())
return true;
else if (a.size() > b.size())
return false;
else
return a < b;
//return strcasecmp( a.c_str(), b.c_str() ) <= 0;
}
long estimateSizeOfBlocks(long sizeoffile, int maxtmpfiles, long maxMemory)
{
}
*/
std::string ToString(int val) {
std::stringstream stream;
stream << val;
return stream.str();
}
void ExternalSort(std::string infilepath, std::string outfilepath)
{
std::string line, new_line;
std::string tfile = "temp-file-";
int i, j, k;
DataContainer v;
std::ifstream infile;
if(!infile.is_open())
{
std::cout << "Unable to open file\n";
}
infile.open(infilepath, std::ifstream::in);
// Check if input file is empty
if (infile.peek() == std::ifstream::traits_type::eof())
{
std::cout << "File is empty";
}
//std::vector<std::string> x(std::istream_iterator<std::string>(infile), {});
// Get size of the file
infile.seekg (0, infile.end);
int input_file_size = infile.tellg();
infile.seekg (0, infile.beg);
std::cout << "The size of the file chosen is (in bytes): " << input_file_size << "\n";
// Estimate required number of runs to read all data in the input file
int runs_count;
if(input_file_size % TOTAL_MEM > 0)
runs_count = input_file_size/TOTAL_MEM + 1; // if file size is fixed at 200GB, we just need to define FILESIZE: runs_count = FILESIZE/TOTAL_MEM
else
runs_count = input_file_size/TOTAL_MEM;
// Iterate through the elements in the file
for(i = 0; i < runs_count; i++)
{
// Step 1: Read M-element chunk at a time from the file
for (j = 0; j < (TOTAL_MEM < input_file_size ? TOTAL_MEM : input_file_size); j++)
{
while(std::getline(infile, line))
{
// If line is empty, ignore it
if(line.empty())
continue;
new_line = line + "\n";
// Line contains string of length > 0 then save it in vector
if(new_line.size() > 0)
v.push_back(new_line);
}
}
// Step 2: Sort M elements
sort(v.begin(), v.end()); //sort(v.begin(), v.end(), compare);
// Step 3: Create temporary files and write sorted data into those files.
std::ofstream tf;
tf.open(tfile + ToString(i) + ".txt", std::ofstream::out | std::ofstream::app);
std::ostream_iterator<std::string> output_iterator(tf, "\n");
std::copy(v.begin(), v.end(), output_iterator);
/* for(vector<std::string>::const_iterator it = v.begin(); it != v.end(); ++it)
tf << *it << "\n";
*/
tf.close();
}
infile.close();
// Step 4: Now open each file and merge them, then write back to disk
DataContainer vn;
for(i = 0; i < runs_count; i++)
{
std::ifstream sortedFiles[runs_count];
std::ostringstream filename;
filename << tfile << i << ".txt";
sortedFiles[i].open(filename.str(), std::ifstream::in);
//sortedFiles[i].open(tfile + ToString(i) + ".txt", std::ifstream::in);
std::string st, new_st;
while(std::getline(sortedFiles[i], st))
{
// If line is empty, ignore it
if(st.empty())
continue;
new_st = st + "\n";
if(new_st.size() > 0)
vn.push_back(new_st);
}
sortedFiles[i].close();
}
// Step 5: Merge total sorted data
sort(vn.begin(), vn.end());
std::ofstream outfile;
outfile.open(outfilepath, std::ofstream::out | std::ofstream::app);
std::ostream_iterator<std::string> output_iterator(outfile, "\n");
std::copy(vn.begin(), vn.end(), output_iterator);
outfile.close(); // Close final sorted file
}
int main(int argc, char** argv) // e.g. argv[1] = "E:\\data.txt" argv[2] = "E:\\sorted_data.txt"
{
std::cout << "You have entered " << argc
<< " arguments:\n";
for (int i = 0; i < argc; ++i)
std::cout << argv[i] << "\n";
ExternalSort(argv[1], argv[2]);
return 0;
}