algorithm - 排除数字的算法

Question

你得到一个整数 N，它适合 long（小于 2^63-1）和 50 个其他整数。你的任务是找出从 1 到 N 有多少个数字不包含这 50 个数字的子串？

这个问题来自采访。

score 22 · Accepted Answer

This is only an explanation of what larsmans already wrote. If you like this answer, please vote him up in addition.

A finite automaton, FA, is just a set of states, and rules saying that if you are in state S and the next character you are fed is c then you transition to state T. Two of the states are special. One means, "start here" and the other means "I successfully matched". One of the characters is special, and means "the string just ended". So you take a string and a finite automaton, start in the starting state, keep feeding characters into the machine and changing states. You fail to match if you give any state unexpected input. You succeed in matching if you ever reach the state "I successfully matched".

Now there is a well-known algorithm for converting a regular expression into a finite automaton that matches a string if and only if that regular expression matches. (If you've read about regular expressions, this is how DFA engines work.) To illustrate I'll use the pattern ^.*(44|3).*$ which means "start of the string, any number of characters, followed by either 44 or 3, followed by any number of characters, followed by the end of the string."

First let's label all of the positions we can be in in the regular expression when we're looking for the next character: ^A.*(4B4|3)C.*$

The states of our regular expression engine will be subsets of those positions, and the special state matched. The result of a state transition will be the set of states we could get to if we were at that position, and saw a particular character. Our starting position is at the start of the RE, which is {A}. Here are the states that can be reached:

S1: {A}   # start
S2: {A, B}
S3: {A, C}
S4: {A, B, C}
S5: matched

Here are the transition rules:

S1:
  3: S3
  4: S2
  end of string: FAIL
  any other char: S1
S2:
  3: S3
  4: S3
  end of string: FAIL
  any other char: S1
S3:
  4: S4
  end of string: S5 (match)
  any other char: S3
S4:
  end of string: S5 (match)
  any other char: S4

Now if you take any string, start that in state S1, and follow the rules, you'll match that pattern. The process can be long and tedious, but luckily can be automated. My guess is that larsmans has automated it for his own use. (Technical note, the expansion from "positions in the RE" to "sets of positions you might possibly be in" can be done either up front, as here, or at run time. For most REs it is better to do it up front, as here. But a tiny fraction of pathological examples will wind up with a very large number of states, and it can be better to do those at run-time.)

We can do this with any regular expression. For instance ^([1-9]|1[0-9]|2[0-7])$ can get labeled: ^A([1-9]|1B[0-9]|2C[0-7])D$ and we get the states:

T1: {A}
T2: {D}
T3: {B, D}
T4: {C, D}

and transitions:

T1:
  1: T3
  2: T4
  3-9: T2
  any other char: FAIL
T2:
  end of string: MATCH
  any other char: FAIL
T3:
  0-9: T2
  end of string: MATCH
  any other char: FAIL
T4:
  0-7: T2
  end of string: MATCH
  any other char: FAIL

OK, so we know what a regular expression is, what a finite automaton, and how they relate. What is the intersection of two finite automata? It is just a finite automaton that matches when both finite automata individually match, and otherwise fails to match. It is easy to construct, its set of states is just the set of pairs of a state in the one, and a state in the other. Its transition rule is to just apply the transition rule for each member independently, if either fails the whole does, if both match they both do.

For the above pair, let's actually execute the intersection on the number 13. We start in state (S1, T1)

state: (S1, T1)  next char: 1
state: (S1, T3)  next char: 3
state: (S3, T2)  next char: end of string
state: (matched, matched) -> matched

And then on the number 14.

state: (S1, T1)  next char: 1
state: (S1, T3)  next char: 4
state: (S2, T2)  next char: end of string
state: (FAIL, matched) -> FAIL

Now we come to the whole point of this. Given that final finite automata, we can use dynamic programming to figure out how many strings there are that match it. Here is that calculation:

0 chars:
  (S1, T1): 1
    -> (S1, T3): 1 # 1
    -> (S1, T4): 1 # 2
    -> (S3, T2): 1 # 3
    -> (S2, T2): 1 # 4
    -> (S1, T2): 5 # 5-9
1 chars:
  (S1: T2): 5      # dead end
  (S1, T3): 1
    -> (S1, T2): 8 # 0-2, 5-9
    -> (S2, T2): 1 # 3
    -> (S3, T2): 1 # 4
  (S1, T4): 1
    -> (S1, T2): 6 # 0-2, 5-7
    -> (S2, T2): 1 # 3
    -> (S3, T2): 1 # 4
  (S2, T2): 1      # dead end
  (S3, T2): 1
    -> match:    1 # end of string
2 chars:
  (S1, T2): 14     # dead end
  (S2, T2): 2      # dead end
  (S3, T2): 2
    -> match     2 # end of string
  match:    1
    -> match     1 # carry through the count
3 chars:
  match:    3

OK, that's a lot of work, but we found that there are 3 strings that match both of those rules simultaneously. And we did it in a way that is automatable and scaleable to much larger numbers.

Of course the question we were originally posed was how many matched the second but not the first. Well we know that 27 match the second rule, 3 match both, so 24 must match the second rule but not the first.

As I said before, this is just larsmans solution elucidated. If you learned something, upvote him, vote for his answer. If this material sounds interesting, go buy a book like Progamming Language Pragmatics and learn a lot more about finite automata, parsing, compilation, and the like. It is a very good skillset to have, and far too many programmers don't.

score 22 · Accepted Answer

你没有指定一个基数，但我会假设十进制而不失一般性。

首先，认识到这是一个字符串问题，而不是数字问题。

构建一个有限自动机A以将 50 个整数识别为其他字符串的子字符串。例如，两个整数 44 和 3 被 RE 识别为子字符串

^.*(44|3).*$

构建一个有限自动机B以识别所有小于N的数字。例如，要识别十进制的 1 到 27（包括），这可以通过编译 RE

^([1-9]|1[0-9]|2[0-7])$

计算自动机A和B的交集C，这又是一个 FA。

使用动态编程算法来计算C识别的语言的大小。从A识别的语言的大小中减去它，由相同的算法计算。

（我并不是说这是渐近最优的，但它的速度足以解决许多欧拉计划问题：）

algorithm - 排除数字的算法

2 回答 2

Related

Reference