0

到目前为止,我的数据库中有 27 个表。1个单词表(拼字游戏单词表),26个关联表。

Table  Fields
================
word   [id,word]
a      [word_id,count]
b      [word_id,count]
...
z      [word_id,count]

我试图找出给定字符串的匹配单词。

例如,如果给定的数组是a,n,t我想知道的:ant, tan, at, ta, an, na.

我目前的策略是分解字符串中的每个字母并找到与所有字母匹配的相关单词。

例如:

SELECT word.word
FROM word, a, n, t
WHERE
    word.id = a.word_id OR
    word.id = n.word_id OR
    word.id = t.word_id

但这最终会打印出所有包含 aa,n or t的单词。

如果我将所有运算符切换为 AND,我只会遇到一个匹配项:ant.

你能帮我解开这个谜吗?

我还关心如何处理字符串中的重复字母。我认为count字母关联表中的字段可以在这里提供帮助。如果单词是,则关联表app中的计数将为 2 。p

我在关联表的正确轨道上还是有更好的方法?

我试图在 php/mysql 中相当有效地处理这个问题。我知道之前有其他人在 C、perl、java 等语言中解决了这个谜题。

4

1 回答 1

1

If you want a normalised approach, that would be:

wordLetters{
  INT wordID,
  CHAR[1] letter,
  INT count,
  PK(wordID, letter)
}

words{
  INT wordID PK,
  VARCHAR(255) word UNIQUE
}

but this approach has a serious problem in terms of performance - namely it needs a full table scan on the word table. I am going to assume that there are not too many letters and suggest this approach:

words{
  INT wordID PK,
  VARCHAR(255) word UNIQUE,
  INT cA KEY,
  INT cB KEY,
  ...
  INT cZ KEY,
  KEY (cE, cT, cA, cO, cI, cN),
  ...
}

A lookup query will be long but it will use indexes efficiently and it is generated by the PHP code anyways:

If the user has [a,n,t], fetch the available words as:

SELECT word FROM words WHERE
   cA <= 1 AND cB = 0 AND cC = 0 AND ... AND cY = 0 AND cZ = 0

This query will (probably) use the "ETAOIN" index as not many words that don't need an 'E' exist.

At this point, performance depends on the choice of indexes available for the database only and you can always add more indexes as deemed useful (even at runtime).


On database indexes:

An ordinary index is just a sorted list of items with an appropriate tree built over the list, enabling efficient range lookup (get all elements from x to y).

An ordinary index is defined by its sorting order. The sorting order is: order by some column first, then on another one, then on another one... .

For example, the [E,T,A,O,I,N] index will have all words sorted: first all words that don't need an E, then all words that need one E, then all words that need two E... . The words that need the same amount of Es will be sorted: first all words that don't need a T, then all words that need it once, then all words that need two Ts ... . Of the words that need the same number of Es and Ts, those that don't need an A come first.

If the database is asked to fetch all words that don't need an E or a T and at most one 'X', it can use this index to fulfill the first two requirements, then check all words within the range E=0, T=0.

The particular choice, ETAOIN is based on the phrase ETAOIN SHRDLU which orders the twelve most frequent letters in the english language by their frequency - this means that if this index is used, it should filter out the largest possible amount of words.

You use the example RSTLNE. This index will/may get used when the player has no Rs, or Ss. Benchmarking the lookup may tell you how much time was saved by using each particular index.

You can use an EXPLAIN EXTENDED query to see which indexes are considered and subsequently used for each particular query and how many rows are expected to be filtered out. Ex.:

EXPLAIN EXTENDED
  SELECT word FROM words
  WHERE cA=0 AND cB<=1 AND cC=0 AND ...
于 2012-10-28T20:02:18.290 回答