c++ - Sorting a file with 55K rows and varying Columns

Question

I want to find a programmatic solution using C++.

I have 900 files, each 27MB in size (just to convey the scale).

Each file has 55K rows and a varying number of columns, but the header line names the columns.

I want to sort the rows with respect to the value in one column.

I wrote a sorting algorithm for this (definitely a newbie attempt, you may say). It works for a small number of rows, but fails for larger inputs.

Here is the code. First, the basic functions I defined for use in the main code:

int getNumberOfColumns(const string& aline)
{
 int ncols=0;
 istringstream ss(aline);
 string s1;
 while(ss>>s1) ncols++;
 return ncols;
}

vector<string> getWordsFromSentence(const string& aline)
{
 vector<string>words;
 istringstream ss(aline);
 string tstr;
 while(ss>>tstr) words.push_back(tstr);
 return words;
}

bool findColumnName(vector<string> vs, const string& colName)
{
 vector<string>::iterator it = find(vs.begin(), vs.end(), colName);
 if ( it != vs.end()) 
 return true;
 else return false;
}

int getIndexForColumnName(vector<string> vs, const string& colName)
{
 if ( !findColumnName(vs,colName) ) return -1;
 else {
  vector<string>::iterator it = find(vs.begin(), vs.end(), colName);
 return it - vs.begin();
 }
}

// I like recursive functions, so I tried to write one here.
// It worked for small inputs, say 20 rows, but for 55K rows it core dumps.
void sort2D(vector<string>vn, vector<string> &srt, int columnIndex)
{
  vector<double> pVals;
 for ( int i = 0; i < vn.size(); i++) {
  vector<string>meancols = getWordsFromSentence(vn[i]);
  pVals.push_back(stringToDouble(meancols[columnIndex]));
 }

        srt.push_back(vn[max_element(pVals.begin(), pVals.end())-pVals.begin()]);
        if (vn.size() > 1 ) {
        vn.erase(vn.begin()+(max_element(pVals.begin(), pVals.end())-pVals.begin()) );
        vector<string> vn2 = vn;
 //cout<<srt[srt.size() -1 ]<<endl;
        sort2D(vn2 , srt, columnIndex);
        }
}

Now the main code:

 for ( int i = 0; i < TissueNames.size() -1; i++)
 {
  for ( int j = i+1; j < TissueNames.size(); j++)
  {
   //string fname = path+"/gse7307_Female_rma"+TissueNames[i]+"_"+TissueNames[j]+".txt";
   //string fname2 = sortpath2+"/gse7307_Female_rma"+TissueNames[i]+"_"+TissueNames[j]+"Sorted.txt";
   string fname = path+"/gse7307_Male_rma"+TissueNames[i]+"_"+TissueNames[j]+".txt";
   string fname2 = sortpath2+"/gse7307_Male_rma"+TissueNames[i]+"_"+TissueNames[j]+"4Columns.txt";
   vector<string>AllLinesInFile;
   BioInputStream fin(fname);
   string aline;
   getline(fin,aline);
   replace (aline.begin(), aline.end(), '"',' ');
   string headerline = aline;
   vector<string> header = getWordsFromSentence(aline);

   int pindex = getIndexForColumnName(header,"p-raw");
   int xcindex = getIndexForColumnName(header,"xC");
   int xeindex = getIndexForColumnName(header,"xE");
   int prbindex = getIndexForColumnName(header,"X");

   string newheaderline = "X\txC\txE\tp-raw";
   BioOutputStream fsrt(fname2);
   fsrt<<newheaderline<<endl;

   int newpindex=3;
   while ( getline(fin, aline) ){

   replace (aline.begin(), aline.end(), '"',' ');
   istringstream ss2(aline);
   string tstr;
   ss2>>tstr;
   tstr = ss2.str().substr(tstr.length()+1);
   vector<string> words = getWordsFromSentence(tstr);
   string values = words[prbindex]+"\t"+words[xcindex]+"\t"+words[xeindex]+"\t"+words[pindex];
    AllLinesInFile.push_back(values);
   }

   vector<string>SortedLines; 
   sort2D(AllLinesInFile, SortedLines,newpindex);

   for ( int si = 0; si < SortedLines.size(); si++)
    fsrt<<SortedLines[si]<<endl;
   cout<<"["<<i<<","<<j<<"] = "<<SortedLines.size()<<endl;
  }
 }

Can someone suggest a better way of doing this? Why does it fail for larger inputs?

The function of primary interest for this question is sort2D.

thanks for the time and patience.

prasad.

score 2 · Accepted Answer

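I'm not sure why your code is crashing, but recursion in this case only makes the code less readable. I doubt it is a stack overflow, however, because you aren't using much stack space in each call.

C++ already has std::sort, so why not use that instead? You could do it like this: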

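// functor to compare 2 strings by the numeric value in one column
// (std::binary_function was removed in C++17; std::sort does not need the base class)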
class CompareStringByValue : public std::binary_function<string, string, bool>
{
public:
    CompareStringByValue(int columnIndex) : idx_(columnIndex) {}
    bool operator()(const string& s1, const string& s2) const
    {
        double val1 = stringToDouble(getWordsFromSentence(s1)[idx_]);
        double val2 = stringToDouble(getWordsFromSentence(s2)[idx_]);
        return val1 < val2;
    }
private:
    int idx_;
};

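Then, to sort your lines, you would call

std::sort(vn.begin(), vn.end(), CompareStringByValue(columnIndex));

Now, there is one problem. This will be slow, because stringToDouble and getWordsFromSentence are called multiple times on the same string. You would probably want to generate a separate vector that holds the precomputed value for each string, and then have CompareStringByValue use that vector as a lookup table.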
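
The lookup-table idea could look roughly like the following. This is a minimal sketch, not the code from the question: it assumes whitespace-separated columns, uses std::stod in place of stringToDouble (whose definition is not shown), and the names extractKey and sortLinesByColumn are made up for illustration. Each line is parsed exactly once, and the sort then compares only the cached doubles.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: parse the columnIndex-th whitespace-separated field
// of a line as a double (stands in for stringToDouble + getWordsFromSentence).
double extractKey(const std::string& line, std::size_t columnIndex)
{
    std::istringstream ss(line);
    std::string field;
    for (std::size_t i = 0; i <= columnIndex; ++i) {
        if (!(ss >> field))
            throw std::out_of_range("line has too few columns");
    }
    return std::stod(field);  // throws if the field is not numeric
}

// Sort lines in ascending order of the given column, parsing each line once.
void sortLinesByColumn(std::vector<std::string>& lines, std::size_t columnIndex)
{
    // Pair every line with its precomputed key.
    std::vector<std::pair<double, std::string>> keyed;
    keyed.reserve(lines.size());
    for (const std::string& line : lines)
        keyed.emplace_back(extractKey(line, columnIndex), line);

    // Compare only the cached keys; stable_sort keeps ties in input order.
    std::stable_sort(keyed.begin(), keyed.end(),
        [](const std::pair<double, std::string>& a,
           const std::pair<double, std::string>& b) { return a.first < b.first; });

    // Write the lines back in sorted order.
    for (std::size_t i = 0; i < lines.size(); ++i)
        lines[i] = std::move(keyed[i].second);
}
```

Calling sortLinesByColumn(AllLinesInFile, 3) would then order the four-column lines by their last field, smallest first. The question's sort2D emits the largest value first; flipping the comparison to a.first > b.first gives that order.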

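Another way you can do this is to insert the strings into a std::multimap<double, std::string>. Just insert the entries as (value, str), then read them back out in order. This is simpler but slower (though it has the same big-O complexity).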
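
As a rough sketch of the multimap approach, assuming whitespace-separated columns and std::stod for the numeric conversion (the function name sortedViaMultimap is hypothetical):

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Return the lines ordered by the numeric value of one column, ascending.
// Assumption: every line has at least columnIndex + 1 numeric-parsable fields.
std::vector<std::string> sortedViaMultimap(const std::vector<std::string>& lines,
                                           std::size_t columnIndex)
{
    std::multimap<double, std::string> byValue;
    for (const std::string& line : lines) {
        std::istringstream ss(line);
        std::string field;
        for (std::size_t i = 0; i <= columnIndex; ++i)
            ss >> field;  // after the loop, field holds the wanted column
        byValue.insert(std::make_pair(std::stod(field), line));
    }

    // A multimap iterates in key order, so reading it out yields sorted lines.
    std::vector<std::string> sorted;
    sorted.reserve(byValue.size());
    for (const auto& entry : byValue)
        sorted.push_back(entry.second);
    return sorted;
}
```

The multimap (rather than a map) matters here because several rows can share the same value. For largest-first output, as sort2D produces, iterate with rbegin() and rend() instead.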

Edit: Cleaned up some incorrect code and derived from binary_function.

score 1 · Accepted Answer

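You could try a method that doesn't involve recursion. If your program crashes when the sort2D function is given large inputs, then you are probably overflowing the stack (a danger of using recursion with a large number of function calls). Try another sorting method, perhaps one using a loop.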
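
One way to read this suggestion is a loop-based counterpart of sort2D that still picks the largest remaining row each time but never calls itself. The sketch below assumes whitespace-separated columns and std::stod for parsing; the name sortDescendingWithLoop is made up for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Selection sort, largest key first, with no recursion.
// Assumption: every row has at least columnIndex + 1 numeric-parsable fields.
std::vector<std::string> sortDescendingWithLoop(std::vector<std::string> rows,
                                                std::size_t columnIndex)
{
    // Parse each key once up front.
    std::vector<double> keys;
    keys.reserve(rows.size());
    for (const std::string& row : rows) {
        std::istringstream ss(row);
        std::string field;
        for (std::size_t i = 0; i <= columnIndex; ++i)
            ss >> field;
        keys.push_back(std::stod(field));
    }

    // Position i receives the row with the largest key found in [i, end).
    for (std::size_t i = 0; i + 1 < rows.size(); ++i) {
        std::size_t best =
            std::max_element(keys.begin() + i, keys.end()) - keys.begin();
        std::swap(keys[i], keys[best]);
        std::swap(rows[i], rows[best]);
    }
    return rows;
}
```

This removes the deep call chain and the repeated copies, but the work is still quadratic in the number of rows, so std::sort remains the better choice for 55K rows.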

score 0 · Accepted Answer

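sort2D crashes because you keep allocating a new array of the strings to sort and then pass it by value, in effect using O(2*N^2) memory. If you really want to keep your recursive function, simply pass vn by reference and don't bother with vn2. If you don't want to modify the original vn, move the body of sort2D into another function (say, sort2Drecursive) and call that from sort2D.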
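
A sketch of that split might look like this. It assumes whitespace-separated columns and std::stod for parsing; keyOf is a hypothetical helper standing in for the question's stringToDouble and getWordsFromSentence. Only the outer sort2D copies the input, once; the recursive helper shrinks that single copy in place.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: numeric value of one whitespace-separated column.
// Assumption: the row has at least columnIndex + 1 numeric-parsable fields.
double keyOf(const std::string& row, std::size_t columnIndex)
{
    std::istringstream ss(row);
    std::string field;
    for (std::size_t i = 0; i <= columnIndex; ++i)
        ss >> field;
    return std::stod(field);
}

// Works on vn by reference: moves the largest remaining row to srt, then recurses.
void sort2Drecursive(std::vector<std::string>& vn,
                     std::vector<std::string>& srt,
                     std::size_t columnIndex)
{
    if (vn.empty()) return;

    // Find the first row holding the maximum key.
    std::size_t best = 0;
    double bestKey = keyOf(vn[0], columnIndex);
    for (std::size_t i = 1; i < vn.size(); ++i) {
        double k = keyOf(vn[i], columnIndex);
        if (k > bestKey) { bestKey = k; best = i; }
    }

    srt.push_back(vn[best]);
    vn.erase(vn.begin() + best);
    sort2Drecursive(vn, srt, columnIndex);  // no copy: same vector, one row shorter
}

// Same interface as before: vn is taken by value, so the caller's data is untouched.
void sort2D(std::vector<std::string> vn,
            std::vector<std::string>& srt,
            std::size_t columnIndex)
{
    sort2Drecursive(vn, srt, columnIndex);
}
```

This addresses the memory blow-up only. The recursion is still one level deep per row, and each level rescans the remaining rows, so the running time stays quadratic.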

In general, you might want to take another look at sort2D, because you are doing O(N^2) work for something that should take O(N + N*log(N)).

score 0 · Accepted Answer

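The problem is less your code than the tool you chose for the job. This is purely a text-processing problem, so choose a tool that is good at that. In this case, the best tool for the job on Unix is Bash with the GNU coreutils. On Windows you can use PowerShell, Python or Ruby. Python and Ruby will also work on any Unix-flavoured machine, but nearly all Unix machines already have Bash and coreutils installed.

Let $FILES hold the list of files to process, delimited by whitespace. Here is the code for Bash: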

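for FILE in $FILES; do
  echo "Processing file $FILE ..."
  # Keep the header line, then sort the remaining rows (tail --lines=+2 starts at line 2).
  # Plain sort compares whole lines as text; use sort's -k option to sort on one column.
  { head --lines=1 "$FILE"; tail --lines=+2 "$FILE" | sort; } >"$FILE.tmp"
  mv "$FILE.tmp" "$FILE"
done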
