c - How to make this sorting program in C faster for large input sets

Question

For very large input files, this sorting code fails because it takes far too long to finish.

rewind(ptr);
j=0;
while(( fread(&temp,sizeof(temp),1,ptr)==1) &&( j!=lines-1)) //read object by object
{
  i=j+1;
  while(fread(&temp1,sizeof(temp),1,ptr)==1)  //read next object , to compare previous object with next object 
   {
       if(temp.key > temp1.key)   //compare key value of object 
           {
            temp2=temp; //if you don't want to change records and just want to change keys use three statements temp2.key =temp.key;
            temp=temp1;
            temp1=temp2;
            fseek(ptr,j*sizeof(temp),0);        //move stream to overwrite 
            fwrite(&temp,sizeof(temp),1,ptr);   //you can avoid above swap by changing &temp to &temp1 
            fseek(ptr,i*sizeof(temp),0);        //move stream to overwrite
            fwrite(&temp1,sizeof(temp),1,ptr);  //you can avoid above swap by changing &temp1 to &temp
           }
    i++; 
   }
  j++; 
  fseek(ptr,j*sizeof(temp),0);  
}

Any ideas on how to make this C code faster? Also, would using qsort() (predefined in C) be much faster, and how should it be applied to the code above?

score 4 · Accepted Answer

You asked the question Sorting based on key from a file and were given various answers about how to sort in memory. You added a supplemental question as an answer, and then created this question instead (which was correct).

Your code here is basically a disk-based bubble sort, with O(N²) complexity, and its running time is made worse because every comparison and every swap goes through file buffers and the disk. A bubble sort is a bad choice at the best of times: simple, yes, but slow.

The basic ways to speed up sorting programs are:

If possible, read all the data into memory, sort in memory, and write the result out.
If it won't all fit into memory, read as much into memory as possible, sort it, and write the sorted data to a temporary file. Repeat as often as necessary to sort all the data. Then merge the temporary files into one file. If the data set is truly astronomical (or the memory truly minuscule), you may have to create intermediate merge files. These days, though, you have to be sorting many hundreds of gigabytes for that to be an issue at all, even on a 32-bit computer.
Make sure you choose a good sorting algorithm. Quick sort with appropriate pivot selection is very good. You could look up 'introsort' too.
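
The second approach above (sort memory-sized chunks, then merge) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from any of the linked answers: the record layout `rec_t`, the cap `MAX_RUNS`, and the function names `make_runs`, `merge_runs` and `external_sort` are all made up for the example, and the merge picks the smallest head record with a simple linear scan rather than a heap.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct { int key; char data[12]; } rec_t;  /* assumed record layout */
enum { MAX_RUNS = 64 };          /* assumed cap on temporary run files */

static int cmp_rec(const void *x, const void *y)
{
    const rec_t *a = x, *b = y;
    return (a->key > b->key) - (a->key < b->key);
}

/* Phase 1: read chunks of up to `chunk` records, sort each chunk in memory
   and store it in its own temporary file. Returns the run count or -1. */
static int make_runs(FILE *in, FILE *runs[], size_t chunk)
{
    rec_t *buf = malloc(chunk * sizeof *buf);
    int nruns = 0, ok = (buf != NULL);
    size_t n;

    while (ok && (n = fread(buf, sizeof *buf, chunk, in)) > 0) {
        ok = nruns < MAX_RUNS && (runs[nruns] = tmpfile()) != NULL;
        if (!ok)
            break;
        nruns++;                 /* the file is open from here on */
        qsort(buf, n, sizeof *buf, cmp_rec);
        ok = fwrite(buf, sizeof *buf, n, runs[nruns - 1]) == n;
        rewind(runs[nruns - 1]); /* ready to be read back by the merge */
    }
    free(buf);
    if (!ok) {
        while (nruns > 0)
            fclose(runs[--nruns]);
        return -1;
    }
    return nruns;
}

/* Phase 2: k-way merge. Keeps one "head" record per run and repeatedly
   writes out the smallest head, refilling it from the same run. */
static int merge_runs(FILE *runs[], int nruns, FILE *out)
{
    rec_t head[MAX_RUNS];
    int live[MAX_RUNS];
    int i;

    for (i = 0; i < nruns; i++)
        live[i] = fread(&head[i], sizeof head[i], 1, runs[i]) == 1;
    for (;;) {
        int best = -1;
        for (i = 0; i < nruns; i++)
            if (live[i] && (best < 0 || head[i].key < head[best].key))
                best = i;
        if (best < 0)
            break;               /* every run is exhausted */
        if (fwrite(&head[best], sizeof head[best], 1, out) != 1)
            return -1;
        live[best] = fread(&head[best], sizeof head[best], 1, runs[best]) == 1;
    }
    return 0;
}

/* Sorts the records in `in` by key and writes them to `out`.
   Returns 0 on success, -1 on failure. */
int external_sort(FILE *in, FILE *out, size_t chunk)
{
    FILE *runs[MAX_RUNS];
    int nruns, rc, i;

    assert(chunk > 0);
    nruns = make_runs(in, runs, chunk);
    if (nruns < 0)
        return -1;
    rc = merge_runs(runs, nruns, out);
    for (i = 0; i < nruns; i++)
        fclose(runs[i]);
    return rc;
}
```

Calling `external_sort(in, out, chunk)` on two streams opened in binary mode sorts the whole input while holding at most `chunk` records in memory at a time. With more than `MAX_RUNS` chunks this sketch simply fails; a fuller version would merge in several passes through intermediate files, as described above.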

You'll find example in-memory sorting code in the answers to the cross-referenced question (your original question). If you choose to write your own sort, you can consider whether to base the interface on the standard C qsort() function. If you write a Quick Sort, you should look at Quicksort — Choosing the pivot where the answers have copious references.

You'll find example merging code in the answer to Merging multiple sorted files into one file. The merging code out-performs the system sort program in its merge mode, which is intriguing since it is not highly polished code (but it is reasonably workmanlike).

You could look at the external sort program described in Software Tools (Kernighan and Plauger), though it is a bit esoteric in that it is written in 'RatFor' or Rational Fortran. The design, though, is readily transferable to other languages.

score 2 · Accepted Answer

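Yes, by all means, use qsort(). Use it either as SpiderPig suggests, by reading the whole file into memory, or as the in-memory sort for runs that fit into memory, in preparation for a merge sort. Don't worry about worst-case performance. A decent implementation will take the median of (first, last, middle) as the pivot, which gives a fast sort for the pathological cases of already-sorted and reverse-sorted input, and better average performance in the random case.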
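
To make the median-of-three idea concrete, here is a small sketch of a quicksort over an int array that picks its pivot that way. It is not the source of any particular library's qsort(); the names `median_of_three`, `quicksort_range` and `sort_ints_m3` are invented for the example, and it sorts plain ints rather than records to keep the pivot logic visible.

```c
#include <assert.h>
#include <stddef.h>

static void swap_int(int *a, int *b)
{
    int t = *a;
    *a = *b;
    *b = t;
}

/* Orders v[lo], v[mid], v[hi] so that v[lo] <= v[mid] <= v[hi] and
   returns the middle value, which is then used as the pivot. */
static int median_of_three(int *v, size_t lo, size_t hi)
{
    size_t mid = lo + (hi - lo) / 2;
    if (v[mid] < v[lo]) swap_int(&v[mid], &v[lo]);
    if (v[hi] < v[lo])  swap_int(&v[hi], &v[lo]);
    if (v[hi] < v[mid]) swap_int(&v[hi], &v[mid]);
    return v[mid];
}

/* Sorts v[lo..hi] (inclusive bounds) using a Hoare-style partition. */
static void quicksort_range(int *v, size_t lo, size_t hi)
{
    while (lo < hi) {
        int pivot = median_of_three(v, lo, hi);
        size_t i = lo, j = hi;

        for (;;) {
            while (v[i] < pivot) i++;
            while (v[j] > pivot) j--;
            if (i >= j)
                break;
            swap_int(&v[i], &v[j]);
            i++;
            j--;
        }
        /* Now v[lo..j] <= pivot <= v[j+1..hi]. Recurse into the smaller
           part and loop on the larger one to keep the stack shallow. */
        if (j - lo < hi - (j + 1)) {
            quicksort_range(v, lo, j);
            lo = j + 1;
        } else {
            quicksort_range(v, j + 1, hi);
            hi = j;
        }
    }
}

/* Sorts n ints in ascending order. */
void sort_ints_m3(int *v, size_t n)
{
    assert(v != NULL || n == 0);
    if (n > 1)
        quicksort_range(v, 0, n - 1);
}
```

On already-sorted or reverse-sorted input the median of the first, middle and last elements is the true median of the range, so each partition splits it roughly in half instead of peeling off one element at a time.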

This all-in-memory example shows you how to use qsort:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef struct record_tag
{
    int     key;
    char    data[12];

} record_type, *record_ptr;
typedef const record_type *record_cptr;

void create_file(const char *filename, int n)
{
    record_type buf;
    int i;
    FILE *fptr = fopen(filename, "wb");
    for (i=0; i<n; ++i)
    {
        buf.key = rand();
        snprintf(buf.data, sizeof buf.data, "%d", buf.key);
        fwrite(&buf, sizeof buf, 1, fptr);
    }
    fclose(fptr);
}

/* Key comparison function used by qsort(): */

int compare_records(const void *x, const void *y)
{
    /* Pointers to const records: the comparator must not modify them. */
    const record_type *a = x;
    const record_type *b = y;
    return (a->key > b->key) - (a->key < b->key);
}

/* Read an input file of (record_type) records, sort by key field, and write to the output file */

void sort_file(const char *ifname, const char *ofname)
{
    const size_t MAXREC = 10000;
    int n;
    FILE    *ifile, *ofile;
    record_ptr buffer;

    ifile = fopen(ifname, "rb");
    buffer = (record_ptr) malloc(MAXREC*sizeof *buffer);
    n = fread(buffer, sizeof *buffer, MAXREC, ifile);
    fclose(ifile);

    qsort(buffer, n, sizeof *buffer, compare_records);

    ofile = fopen(ofname, "wb");
    fwrite(buffer, sizeof *buffer, n, ofile);
    fclose(ofile);
}

void show_file(const char *fname)
{
    record_type buf;
    int n = 0;
    FILE *fptr = fopen(fname, "rb");
    while (1 == fread(&buf, sizeof buf, 1, fptr))
    {
        printf("%9d : %-12s\n", buf.key, buf.data);
        ++n;
    }
    printf("%d records read\n", n);
    fclose(fptr);
}

int main(void)
{
    srand(time(NULL));

    create_file("test.dat", 99);
    sort_file("test.dat", "test.out");
    show_file("test.out");

    return 0;
}

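Note the compare_records function. The qsort() function needs a function that accepts void pointers, so those pointers must be converted to the correct type. Then there is the pattern: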

(left > right) - (left < right)

...which returns 1 if the left argument is greater, 0 if they are equal, and -1 if the right argument is greater.
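
One reason to prefer this pattern over the shorter `left - right` is that subtracting two ints can overflow (for example `INT_MIN - INT_MAX`), which is undefined behavior for signed integers. A minimal sketch, with `cmp_int_safe` and `sort_ints` as names made up for the illustration:

```c
#include <assert.h>
#include <limits.h>   /* INT_MIN and INT_MAX, for trying out extreme keys */
#include <stdlib.h>

/* Three-way comparison of two ints for qsort(). Each relational
   expression yields 0 or 1, so the result is -1, 0 or 1 and the
   computation cannot overflow. */
int cmp_int_safe(const void *x, const void *y)
{
    int left = *(const int *)x;
    int right = *(const int *)y;
    return (left > right) - (left < right);
}

/* Sorts n ints in ascending order with the comparator above. */
void sort_ints(int *v, size_t n)
{
    assert(v != NULL || n == 0);
    qsort(v, n, sizeof *v, cmp_int_safe);
}
```

Comparing `INT_MIN` against `INT_MAX` with `cmp_int_safe` gives -1 as expected, whereas the subtraction form has no defined result for that pair.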

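There is room for improvement. First, there is absolutely no error checking, which is unwise in production code. Second, you could examine the input file to get its size rather than guessing that it is smaller than some MAXxxx value. One way to do that is with ftell. (Follow the link for a file-size example.) Then use that value to allocate a buffer large enough to qsort the data.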
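
As a rough sketch of that improvement, the following sizes the buffer from the file length and checks each step. The helper names `file_size` and `load_sorted` and the type name `rec_t` are assumptions for the example; the record layout copies the one in the program above. Note that `ftell` returns a `long`, which on some platforms cannot represent sizes of 2 GB or more (POSIX offers `fseeko`/`ftello` for that case), and that the C standard does not strictly require binary streams to support seeking to `SEEK_END`, although common platforms do.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct { int key; char data[12]; } rec_t;  /* same layout as above */

static int cmp_rec(const void *x, const void *y)
{
    const rec_t *a = x, *b = y;
    return (a->key > b->key) - (a->key < b->key);
}

/* Returns the size in bytes of a stream opened in binary mode, or -1 on
   error. The stream position is left at the start of the file. */
long file_size(FILE *fp)
{
    long size;

    if (fseek(fp, 0L, SEEK_END) != 0)
        return -1;
    size = ftell(fp);
    if (fseek(fp, 0L, SEEK_SET) != 0)
        return -1;
    return size;
}

/* Reads every record from fp into a buffer sized from the file length and
   sorts it by key. Returns NULL on failure; the caller frees the result. */
rec_t *load_sorted(FILE *fp, size_t *count)
{
    long size = file_size(fp);
    size_t n;
    rec_t *buf;

    assert(count != NULL);
    if (size <= 0)
        return NULL;
    n = (size_t)size / sizeof *buf;
    if (n == 0)
        return NULL;
    buf = malloc(n * sizeof *buf);
    if (buf == NULL)
        return NULL;             /* too big for memory: sort in chunks instead */
    if (fread(buf, sizeof *buf, n, fp) != n) {
        free(buf);
        return NULL;
    }
    qsort(buf, n, sizeof *buf, cmp_rec);
    *count = n;
    return buf;
}
```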

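If there is not enough space (if malloc returns NULL), you can fall back on sorting chunks that fit into memory (using qsort, as in the snippet), writing them to separate temporary files, and then merging them into a single output file. That is more complicated and is rarely done, because there are sort/merge utilities designed specifically for sorting large files.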
