c - LZW encoding for large file

Question

I am building an LZW encoding algorithm that uses a dictionary with hashing, so that looking up working words already stored in the dictionary is fast.

The algorithm gives correct results on smaller files (roughly a few hundred symbols), but it crashes on larger files, especially files that contain few distinct symbols. The worst case is a file that consists of a single repeated symbol, say 'y': there it crashes when the dictionary is not even close to being full. When the large input file contains more than one distinct symbol, the dictionary gets to approximately 90% full, and then it crashes as well.

Given the structure of my algorithm, I am not sure what causes it to crash in general, or why it crashes so early on a large file of just one symbol. It must be something about the hashing (this is my first time doing it, so it may have bugs).

The hash function I am using can be found here, and in my tests it gives good results: oat_hash

The LZW encoding algorithm is based on this link, with one slight change: it runs only until the dictionary is full: LZW encoder

Let's get into the code:

Note: oat_hash is changed so that it returns value % CAPACITY, so every index it produces is a valid index into DICTIONARY.

// Globals
#define CAPACITY 100000
char *DICTIONARY[CAPACITY];
unsigned short CODES[CAPACITY]; // CODES and DICTIONARY are linked via index: word from dictionary on index i, has its code in CODES on index i
int position = 0;
int code_counter = 0;

void encode(FILE *input, FILE *output){

int succ1 = fseek(input, 0, SEEK_SET);
if(succ1 != 0) printf("Error: file not open!");

int succ2 = fseek(output, 0, SEEK_SET);
if(succ2 != 0) printf("Error: file not open!");

//1. Working word = next symbol from the input
char *working_word = malloc(2048*sizeof(char));
char new_symbol = getc(input);
working_word[0] = new_symbol;
working_word[1] = '\0';



//2. WHILE(there are more symbols on the input) DO
//3. NewSymbol = next symbol from the input
while((new_symbol = getc(input)) != EOF){

    char *workingWord_and_newSymbol= NULL;
    char newSymbol[2];
    newSymbol[0] = new_symbol;
    newSymbol[1] = '\0';

    workingWord_and_newSymbol = working_word_and_new_symbol(working_word, newSymbol);

    int index = oat_hash(workingWord_and_newSymbol, strlen(workingWord_and_newSymbol));

    //4. IF(WorkingWord + NewSymbol) is already in the dictionary THEN
    if(DICTIONARY[index] != NULL){
        // 5. WorkingWord += NewSymbol
        working_word = working_word_and_new_symbol(working_word, newSymbol);

    }
    //6. ELSE
    else{
        //7. OUTPUT: code for WorkingWord
        int idx = oat_hash(working_word, strlen(working_word));

        fprintf(output, "%u", CODES[idx]);

        //8. Add (WorkingWord + NewSymbol) into a dictionary and assign it a new code
        if(!dictionary_full()){
            DICTIONARY[index] = workingWord_and_newSymbol;
            CODES[index] = code_counter + 1;
            code_counter += 1;
            working_word = strdup(newSymbol);
        }else break;

    }
    //10. END IF
}
//11. END WHILE

//12. OUTPUT: code for WorkingWord
int index = oat_hash(working_word, strlen(working_word));
fprintf(output, "%u", CODES[index]);

free(working_word);

}

score 1 · Accepted Answer

 int index = oat_hash(workingWord_and_newSymbol, strlen(workingWord_and_newSymbol));

and then

    int idx = oat_hash(working_word, strlen(working_word));

    fprintf(output, "%u", CODES[idx]);

    //8. Add (WorkingWord + NewSymbol) into a dictionary and assign it a new code
    if(!dictionary_full()){
        DICTIONARY[index] = workingWord_and_newSymbol;
        CODES[index] = code_counter + 1;
        code_counter += 1;
        working_word = strdup(newSymbol);
    }else break;

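idx and index are unbounded, yet you use them to index bounded arrays, so you are accessing memory out of range. Below is a suggestion, though it may skew the distribution. If the range of your hash is much larger than CAPACITY, that should not be a problem. You also mention another issue, collisions, which you need to handle, but that is a separate problem.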

int index = oat_hash(workingWord_and_newSymbol, strlen(workingWord_and_newSymbol)) % CAPACITY;
// and
int idx = oat_hash(working_word, strlen(working_word)) % CAPACITY;
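
The collision handling mentioned above is not shown in the question. One common approach is open addressing with linear probing, where each slot stores the key itself so that a lookup can tell a real match from a different string that happened to hash to the same slot. A minimal sketch, assuming a small hypothetical TABLE_SIZE and a stand-in hash_bytes (Jenkins one-at-a-time) in place of the question's oat_hash; find_slot, dict_lookup and dict_insert are illustrative names, not part of the original code:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 1024  /* hypothetical capacity, kept small for the sketch */

/* Each slot keeps the key bytes, so a hit can be verified with memcmp. */
struct entry {
    unsigned char *key;  /* NULL means the slot is empty */
    size_t len;
    unsigned code;
};

static struct entry table[TABLE_SIZE];

/* Stand-in hash (Jenkins one-at-a-time); the question's oat_hash is not shown. */
uint32_t hash_bytes(const unsigned char *p, size_t len)
{
    uint32_t h = 0;
    for (size_t i = 0; i < len; i++) {
        h += p[i];
        h += h << 10;
        h ^= h >> 6;
    }
    h += h << 3;
    h ^= h >> 11;
    h += h << 15;
    return h;
}

/* Returns the slot that holds key, or the empty slot where it would go.
   Returns -1 if every slot is occupied by some other key. */
long find_slot(const unsigned char *key, size_t len)
{
    assert(key != NULL);
    size_t start = hash_bytes(key, len) % TABLE_SIZE;  /* always in range */
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        size_t s = (start + i) % TABLE_SIZE;           /* linear probing */
        if (table[s].key == NULL)
            return (long)s;
        if (table[s].len == len && memcmp(table[s].key, key, len) == 0)
            return (long)s;
    }
    return -1;
}

/* Returns 1 and stores the code in *code if key is present, otherwise 0. */
int dict_lookup(const unsigned char *key, size_t len, unsigned *code)
{
    long s = find_slot(key, len);
    if (s < 0 || table[s].key == NULL)
        return 0;
    *code = table[s].code;
    return 1;
}

/* Returns 1 on success; 0 if the key exists, the table is full,
   or the allocation fails. */
int dict_insert(const unsigned char *key, size_t len, unsigned code)
{
    long s = find_slot(key, len);
    if (s < 0 || table[s].key != NULL)
        return 0;
    table[s].key = malloc(len ? len : 1);
    if (table[s].key == NULL)
        return 0;
    memcpy(table[s].key, key, len);
    table[s].len = len;
    table[s].code = code;
    return 1;
}
```

With this shape, the encoder's test "is WorkingWord + NewSymbol in the dictionary" becomes a call to dict_lookup rather than a check that the hashed slot is non-NULL, which on its own cannot distinguish a match from a collision.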

score 0 · Accepted Answer

LZW compression is certainly used to produce binary files, and it is usually expected to be able to read binary files as well.

The following code is problematic because it relies on new_symbol never being \0.

newSymbol[0] = new_symbol; newSymbol[1] = '\0';
strlen(workingWord_and_newSymbol)
strdup(newSymbol)

It needs to be rewritten to work with byte arrays rather than strings.
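
As a rough illustration of what "byte arrays rather than strings" could look like, the sketch below keeps an explicit length next to the bytes, so a \0 byte is ordinary data and nothing depends on strlen or strdup. The names struct word, word_append, word_reset and word_equal are hypothetical; WORD_MAX simply reuses the 2048 that the question allocates for working_word.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WORD_MAX 2048  /* same size the question allocates for working_word */

/* A working word as raw bytes plus an explicit length (no terminator). */
struct word {
    unsigned char bytes[WORD_MAX];
    size_t len;
};

/* Appends one byte; returns 0 instead of writing past the buffer. */
int word_append(struct word *w, unsigned char b)
{
    assert(w != NULL);
    if (w->len >= WORD_MAX)
        return 0;
    w->bytes[w->len++] = b;
    return 1;
}

/* Restarts the word with a single byte, as LZW does after emitting a code. */
void word_reset(struct word *w, unsigned char b)
{
    assert(w != NULL);
    w->bytes[0] = b;
    w->len = 1;
}

/* Compares by length and content, so embedded zero bytes are handled. */
int word_equal(const struct word *a, const struct word *b)
{
    return a->len == b->len && memcmp(a->bytes, b->bytes, a->len) == 0;
}
```

The length check in word_append also makes the fixed buffer size explicit: on an input of one repeated symbol the working word keeps growing, so a bound like this is worth checking wherever the word is extended.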

fopen() is not shown. Make sure the file is opened in binary mode: input = fopen(..., "rb");

@Wumpus Q. Wumbley is correct: use int newSymbol.
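
The reason for int is that getc returns either a byte value converted from unsigned char or the negative value EOF, and a char cannot represent both ranges at once. Where char is signed, a 0xFF byte typically compares equal to EOF and ends the loop early; where it is unsigned, the comparison never succeeds. A small sketch of the reading loop, with count_bytes as a hypothetical helper used only to demonstrate the pattern:

```c
#include <assert.h>
#include <stdio.h>

/* Counts the bytes remaining in a stream, reading until end-of-file. */
long count_bytes(FILE *in)
{
    assert(in != NULL);
    long n = 0;
    int c;  /* int, not char: must hold every byte value and EOF */
    while ((c = getc(in)) != EOF) {
        n++;  /* c is a byte value here, including 0x00 and 0xFF */
    }
    return n;
}
```

Inside the loop, c can be converted to unsigned char before it is stored in a byte buffer.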

Minor:

Having both new_symbol and newSymbol is confusing.

Consider:

// char *working_word = malloc(2048*sizeof(char));
#define WORKING_WORD_N (2048)
char *working_word = malloc(WORKING_WORD_N*sizeof(*working_word));
// or 
char *working_word = malloc(WORKING_WORD_N);
