c - Bucketing strings in C

Question

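I have an assignment to complete. It says I have to read a file that contains 3 million strings.
I have to read the file and build a structure to hold the strings. The system must be able to answer the question "Is this new string present?"

I am also expected to break the list down into "buckets" of strings, so that the "string to match" can (quickly) choose the correct bucket to search in, and that bucket should contain no more than total/hashMask strings or so (i.e. 3,000,000 / 0xFFF == 732 objects per bucket).

So far I have created a hash table structure, a list, a function to read the file, and add and remove functions. But I don't understand the text typed in bold (the bucket requirement above). Do I need to implement something in the hash function to satisfy it?

Below is my sample code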

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_NAME 100

    /* Linked list structure */
    typedef struct list
    {
        char *string;
        int index;
        struct list *next;
    } list_t;

    /* Hash table structure */
    typedef struct hashTable
    {
        int size;        /* size of the table */
        list_t **table;  /* the table elements */
    } hash_table_t;

    hash_table_t *createHashTable(size_t size)
    {
        /* allocate hash table .. I know how to do it */
    }

    unsigned int hash(hash_table_t *hashTable, const char *str)
    {
        uint64_t hashVal = 0;

        while (*str != '\0')
        {
            hashVal = *str + (hashVal << 5) - hashVal;
            str++;
        }
        return hashVal % hashTable->size;
    }

    void addToHashList(hash_table_t *list, void *obj, uint64_t hash)
    {
        /* add item of new list to table --> have an idea how to do it */
    }

    void removeFromHashList(hash_table_t *list, void *criterion, uint64_t hash)
    {
        /* got an idea how to do it */
    }

    /*
       This function will read the file (assume one string per line)
       and create the list of lists (list of buckets), adding one object per string.
    */
    hash_table_t *loadDataSet(char *filename, int hashMask)
    {
        /* to read a file */
        char readString[MAX_NAME];
        FILE *fp;

        if ((fp = fopen(filename, "r")) == NULL)
        {
            printf("failed to open the file\n");
            exit(EXIT_FAILURE);
        }
        while (fgets(readString, sizeof readString, fp) != NULL)
        {
            /* need to break the list down into "buckets" of strings so the 'string to match'
               is able to choose the correct bucket to search in (quickly),
               and that bucket should contain no more than total/hashMask strings
               or so (i.e. 3,000,000 / 0xFFF == 732 objects per bucket). */
        }
        fclose(fp);
    }

score 2 · Accepted Answer

I believe you have chosen the wrong data structure for your hash table:

    typedef struct hashTable
    {
        char key[MAX_NAME];
        int index;
        struct hashTable *next;
        struct hashTable *prev;
    };

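One of the main benefits of a hash table is the ability to jump directly to the bucket that contains the element you are searching for. This struct is a node in a linked list of hash buckets, which means you must traverse an average of 4098/2 buckets on every lookup or insertion. That will not give you the performance you need.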

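Your hash table should be an array of structs; each struct should have a pointer to a string (or store short strings directly) and a pointer to the next struct in the bucket. (This struct hashTable could also be the in-bucket structure, but it is a rare hash table that needs both next and prev links within a bucket. That is why I guessed this data structure was intended for the table itself.)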
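A minimal sketch of that layout, assuming hypothetical names (`node_t`, `table_t`, `table_create`, `table_add`, `table_contains`) that do not appear in the question, and a placeholder multiply-by-31 hash that could be swapped for any other string hash. The point it illustrates is that a lookup indexes straight into the array and then walks only one short chain:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One entry in a bucket's singly linked chain (hypothetical type name). */
typedef struct node {
    char *string;        /* heap-allocated copy of the key */
    struct node *next;   /* next entry in the same bucket, or NULL */
} node_t;

/* The table itself: an array of chain heads, indexed by hash % size. */
typedef struct {
    size_t size;
    node_t **buckets;
} table_t;

/* Placeholder hash (multiply by 31, add the character code). */
static size_t bucket_of(const table_t *t, const char *s)
{
    unsigned long h = 0;
    while (*s != '\0')
        h = h * 31 + (unsigned char)*s++;
    return h % t->size;
}

table_t *table_create(size_t size)
{
    assert(size > 0);
    table_t *t = malloc(sizeof *t);
    if (t == NULL)
        return NULL;
    t->size = size;
    t->buckets = calloc(size, sizeof *t->buckets);  /* every chain starts empty */
    if (t->buckets == NULL) {
        free(t);
        return NULL;
    }
    return t;
}

/* Returns 1 if s is stored, 0 otherwise. Only one bucket is scanned. */
int table_contains(const table_t *t, const char *s)
{
    for (const node_t *n = t->buckets[bucket_of(t, s)]; n != NULL; n = n->next)
        if (strcmp(n->string, s) == 0)
            return 1;
    return 0;
}

/* Returns 0 on success (or if s is already present), -1 if allocation fails. */
int table_add(table_t *t, const char *s)
{
    if (table_contains(t, s))
        return 0;
    node_t *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->string = malloc(strlen(s) + 1);
    if (n->string == NULL) {
        free(n);
        return -1;
    }
    strcpy(n->string, s);
    size_t i = bucket_of(t, s);
    n->next = t->buckets[i];   /* push onto the front of the chain */
    t->buckets[i] = n;
    return 0;
}
```

Only a `next` link is kept per node, since removal from a chain this short can simply remember the previous node while scanning.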

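You also need to pick a good hash function. There is a ton of research on good hash functions, but for a homework assignment you really just need something better than horrible. The input to the hash function is your string and the output should be an integer. You will need to % the output with the size of your array (pick a prime near 5000) to determine which bucket to use.

Here is a hash function from the stb.h library of convenience functions: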

    unsigned int stb_hash(char *str)
    {
       unsigned int hash = 0;
       while (*str)
          hash = (hash << 7) + (hash >> 25) + *str++;
       return hash + (hash >> 16);
    }
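As a rough sketch of how that hash might be turned into a bucket number, the block below repeats `stb_hash` (with a `const` parameter, so it compiles on its own) and reduces it modulo an assumed table size of 5003, a prime near 5000. `NBUCKETS`, `bucket_index` and `fill_histogram` are hypothetical names introduced for illustration; the histogram helper is one way to check how evenly a data set spreads over the buckets:

```c
#include <assert.h>
#include <stddef.h>

#define NBUCKETS 5003u  /* assumed table size: a prime near 5000 */

/* The stb.h hash quoted above, repeated so this sketch is self-contained. */
unsigned int stb_hash(const char *str)
{
    unsigned int hash = 0;
    while (*str)
        hash = (hash << 7) + (hash >> 25) + *str++;
    return hash + (hash >> 16);
}

/* Map a string to a bucket index in the range [0, nbuckets). */
size_t bucket_index(const char *str, size_t nbuckets)
{
    assert(nbuckets > 0);
    return stb_hash(str) % nbuckets;
}

/* Count how many of the n strings land in each bucket. counts must hold
   nbuckets slots, all zero on entry. Returns the size of the fullest bucket. */
size_t fill_histogram(const char *const strings[], size_t n,
                      size_t *counts, size_t nbuckets)
{
    size_t largest = 0;
    for (size_t i = 0; i < n; i++) {
        size_t b = bucket_index(strings[i], nbuckets);
        if (++counts[b] > largest)
            largest = counts[b];
    }
    return largest;
}
```

With 3,000,000 strings and 5003 buckets, an even spread would put roughly 600 strings in each bucket, so a fullest-bucket count far above that would suggest the hash suits the data poorly.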

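A quick tip: although the stb.h code is in the public domain, it is very wise to cite the source in your program. Professors, lawyers, and, in the future, your coworkers will thank you for crediting the source of things you did not write yourself.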

score 1 · Accepted Answer

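Hash functions can be defined not only for integers but also for characters or strings (hint: character encoding). Write a hash function for strings. When you submit, it must be submitted or run together with the output file.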
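To make the hint concrete, here is one possible sketch: each character is read as its integer code and folded into a running value (a polynomial hash with multiplier 31, chosen here only as an example). The names `string_hash` and `bucket_for` are hypothetical, and the mask-based reduction assumes the assignment's `hashMask` has the form 2^k - 1, such as 0xFFF:

```c
#include <assert.h>
#include <stdint.h>

/* Treat every character as its integer code and fold it into the hash. */
uint32_t string_hash(const char *s)
{
    uint32_t h = 0;
    while (*s != '\0')
        h = h * 31u + (unsigned char)*s++;
    return h;
}

/* Reduce the hash to a bucket number in [0, mask] by keeping the low bits.
   Assumes mask is one less than a power of two (for example 0xFFF). */
uint32_t bucket_for(const char *s, uint32_t mask)
{
    assert((mask & (mask + 1u)) == 0);
    return string_hash(s) & mask;
}
```

Note that a mask of 0xFFF selects one of 4096 buckets (0 through 0xFFF), which for 3,000,000 strings gives about 732 strings per bucket on average.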

score 0 · Accepted Answer

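Note: this answer depends on how strict your assignment text is about using "buckets", since I interpret your question a bit more liberally than your sample code does.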

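The best data structure for this task is undoubtedly a Trie or a generalization of it. You can build a tree in which each node contains a "tiny" hash table that stores one atom of a string. For example, an atom of a string could be a single character. You can parameterize the data structure to change the size of the atom (e.g. each node has a fixed array of 16 sub-tries, so your atom is 4 bits long). This array approach allows constant-time descent but needs a relatively large amount of memory. But as I said, instead of fast lookup arrays you could use tiny hash tables (which would fit your assignment better).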
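A minimal sketch of the fixed-array variant with 4-bit atoms, under the assumption that each byte of the string is split into a high and a low nibble. The names `trie_t`, `trie_new`, `trie_add` and `trie_contains` are hypothetical and chosen only for this illustration:

```c
#include <assert.h>
#include <stdlib.h>

/* A node of a 16-way trie: each step down consumes one 4-bit atom. */
typedef struct trie {
    struct trie *child[16];
    int is_end;              /* non-zero if a stored string ends here */
} trie_t;

trie_t *trie_new(void)
{
    return calloc(1, sizeof(trie_t));  /* no children, is_end == 0 */
}

/* Follow the path for s, creating missing nodes if create is non-zero.
   Returns the final node, or NULL if the path is absent or allocation fails. */
static trie_t *trie_walk(trie_t *t, const char *s, int create)
{
    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;
        unsigned nibbles[2] = { c >> 4, c & 0x0F };  /* high nibble first */
        for (int i = 0; i < 2; i++) {
            trie_t **slot = &t->child[nibbles[i]];
            if (*slot == NULL) {
                if (!create)
                    return NULL;
                *slot = trie_new();
                if (*slot == NULL)
                    return NULL;
            }
            t = *slot;
        }
    }
    return t;
}

/* Returns 0 on success, -1 if a node could not be allocated. */
int trie_add(trie_t *root, const char *s)
{
    assert(root != NULL);
    trie_t *end = trie_walk(root, s, 1);
    if (end == NULL)
        return -1;
    end->is_end = 1;
    return 0;
}

/* Returns 1 if s was added earlier, 0 otherwise. */
int trie_contains(trie_t *root, const char *s)
{
    assert(root != NULL);
    trie_t *end = trie_walk(root, s, 0);
    return end != NULL && end->is_end;
}
```

A lookup costs two array steps per character regardless of how many strings are stored, which is the constant-time descent mentioned above; the price is 16 child pointers in every node.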
