string - 搜索关键字列表

Question

我正在构建一个消息传递反垃圾邮件解决方案，我必须将收到的每条短信与关键字列表进行比较，如果短信包含列表中的一个关键字，我必须删除它。

问题是搜索关键字列表的最佳算法是什么？示例如下

text message received is "hi how are you, visit us at www.xyz.com"

列表示例如下

www.abc.com
www.xyz.com
...
...

score 1 · Accepted Answer

如果有很多关键字，尤其是具有公共前缀的关键字，则trie在这里可能会很好用。

我假设你想要子字符串，而不仅仅是单词，即给定一个关键字bah，它会bah在bahama. 修改它以防止这种情况应该不难。

我还假设您没有关键字，并且它是作为关键字的子字符串（即bah，bahama不能都是关键字）。迎合这个也应该不会太难。

只是，对于字符串中的每个字符，从树的顶部开始搜索并继续搜索树中的每个现有指针。一旦其中一个指针到达一个有效的单词，就随心所欲地使用它，并可能删除树中的所有指针。

复杂：

O(max(n², mn))其中m是树中的节点数，在最坏的情况下，尽管平均情况下的性能应该要好得多。

例子：

所以，假设我们有关键字：

ab
b
caa

我们可能会得到一棵树：

      o
     /|\
  a / | \ c
   /  |b \
  o   o   o
  | b     | a
  o       o
          | a
          o

（o只是一个节点）

现在，对于输入字符串caab，我们首先看c：（x表示树中的指针）

      o
     /|\
  a / | \ c
   /  |b \
  o   o   x
  | b     | a
  o       o
          | a
          o

注意右边的新指针。

然后a：

      o
     /|\
  a / | \ c
   /  |b \
  x   o   o
  | b     | a
  o       x
          | a
          o

注意左边的新指针和右边的指针前进。

然后a：

      o
     /|\
  a / | \ c
   /  |b \
  o   o   o
  | b     | a
  o       o
          | a
          x

注意左边的指针消失了，右边的指针前进了。

现在我们删除右边的那个，因为我们找到了一个有效的单词。

然后b：

      o
     /|\
  a / | \ c
   /  |b \
  o   x   o
  | b     | a
  o       o
          | a
          o

请注意中间的新指针，然后我们也将其删除，因为我们找到了一个有效的单词。

score 0 · Accepted Answer

你在说多少个关键词？研究 Boyer-Moore 字符串搜索算法，它可能适合您的目的并且实现起来并不难。这是取自维基百科文章的 java 实现：

 /**
   * Returns the index within this string of the first occurrence of the
   * specified substring. If it is not a substring, return -1.
   *
   * @param haystack The string to be scanned
   * @param needle The target string to search
   * @return The start index of the substring
   */
  public static int indexOf(char[] haystack, char[] needle) {
    if (needle.length == 0) {
      return 0;
    }
    int charTable[] = makeCharTable(needle);
    int offsetTable[] = makeOffsetTable(needle);
    for (int i = needle.length - 1, j; i < haystack.length;) {
      for (j = needle.length - 1; needle[j] == haystack[i]; --i, --j) {
        if (j == 0) {
          return i;
        }
      }
      // i += needle.length - j; // For naive method
      i += Math.max(offsetTable[needle.length - 1 - j], charTable[haystack[i]]);
    }
    return -1;
  }

  /**
   * Makes the jump table based on the mismatched character information.
   */
  private static int[] makeCharTable(char[] needle) {
    final int ALPHABET_SIZE = 256;
    int[] table = new int[ALPHABET_SIZE];
    for (int i = 0; i < table.length; ++i) {
      table[i] = needle.length;
    }
    for (int i = 0; i < needle.length - 1; ++i) {
      table[needle[i]] = needle.length - 1 - i;
    }
    return table;
  }

  /**
   * Makes the jump table based on the scan offset which mismatch occurs.
   */
  private static int[] makeOffsetTable(char[] needle) {
    int[] table = new int[needle.length];
    int lastPrefixPosition = needle.length;
    for (int i = needle.length - 1; i >= 0; --i) {
      if (isPrefix(needle, i + 1)) {
        lastPrefixPosition = i + 1;
      }
      table[needle.length - 1 - i] = lastPrefixPosition - i + needle.length - 1;
    }
    for (int i = 0; i < needle.length - 1; ++i) {
      int slen = suffixLength(needle, i);
      table[slen] = needle.length - 1 - i + slen;
    }
    return table;
  }

  /**
   * Is needle[p:end] a prefix of needle?
   */
  private static boolean isPrefix(char[] needle, int p) {
    for (int i = p, j = 0; i < needle.length; ++i, ++j) {
      if (needle[i] != needle[j]) {
        return false;
      }
    }
    return true;
  }

  /**
   * Returns the maximum length of the substring ends at p and is a suffix.
   */
  private static int suffixLength(char[] needle, int p) {
    int len = 0;
    for (int i = p, j = needle.length - 1;
         i >= 0 && needle[i] == needle[j]; --i, --j) {
      len += 1;
    }
    return len;
  }

string - 搜索关键字列表

2 回答 2

Related

Reference