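The Burrows-Wheeler transform usually uses a $ character to mark the end of the string, but in many cases this $ is omitted.

How can I invert the transform without knowing the position of the last character?

For example, I have this BWT: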

[[[[[1[[11endgnad1234245ndbnbbb]]]]]]]nnnngnabbbdiaaaaaaii

Following the algorithm, I can easily construct the first column of the BWT matrix, which I chose to represent in compressed form as follows:

Character : Occurrences
1         : 4
2         : 2
3         : 1
4         : 2
5         : 1
[         : 7
]         : 7
a         : 7
b         : 7
d         : 4
e         : 1
g         : 2
i         : 4
n         : 9
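
A minimal sketch of that step, in Python: the first column is just the sorted BWT, so it can be stored as one run per character. The helper name `first_column_runs` and the extra "start row" field are assumptions made for illustration; the start row is the cumulative count that an LF-mapping walk would later need.

```python
from collections import Counter

def first_column_runs(bwt):
    """Return (character, occurrences, start_row) for each run of the first column.

    The first column of the BWT matrix is the sorted BWT, so each distinct
    character forms one contiguous run. start_row is the row where that run begins.
    """
    counts = Counter(bwt)
    runs = []
    start = 0
    for ch in sorted(counts):          # code point order: digits, '[', ']', letters
        runs.append((ch, counts[ch], start))
        start += counts[ch]
    return runs

if __name__ == "__main__":
    for ch, occurrences, start in first_column_runs("nnbaaa"):
        print(f"{ch} : {occurrences} (rows {start}..{start + occurrences - 1})")
```

Sorting by code point is an assumption about the ordering used to build the matrix; a different collation would need a matching `key` in `sorted`.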

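Without knowing the last character of the original string, I can't see how to reconstruct it.

Any help is greatly appreciated. Don

P.S. In case you want to know what the original string is:

[1]ban[2]banana[3]band[4]bandage[12]bin[14]bind[15]binding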

1 Answer

You can't (but you can try ;-). Assuming $ sorts before every other symbol, the first BWT symbol is the last symbol of the original string S. From there you unroll the original string backward through the LF mapping: the next position is bin[sym] + rank(sym, i) + 1, starting with i = 0. You can easily derive the bin[] array from the occurrence counts. The problem is that once i is past the position of the omitted $, you should no longer add that final 1, so the string breaks and things get nasty. You can detect the error if you also reconstruct sa[] and find yourself overwriting an index that is already set. So you can set the $ position arbitrarily to 0 and try to recover; if that fails, set it to 1, and so on, until the reconstruction succeeds. I don't know whether this can be optimized.
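
The trial-and-error idea can be sketched roughly as follows, in Python. This is not the exact index arithmetic described above: instead of dropping the +1 past the guessed position, the sketch re-inserts a placeholder sentinel (`"\0"`, assumed not to occur in the text and to sort first) at each guessed position and runs an ordinary LF walk. A guess is rejected when the walk reaches the sentinel too early or too late, which plays the role of the "overwriting an already set index" check. All function names are hypothetical.

```python
def bwt_with_sentinel(text, sentinel="\0"):
    """Forward BWT of text + sentinel via sorted rotations (only used for checking)."""
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def lf_mapping(last):
    """lf[i] is the first-column row of the symbol occurrence stored at last[i]."""
    # sorted() is stable, so equal symbols keep their relative order.
    order = sorted(range(len(last)), key=lambda i: last[i])
    lf = [0] * len(last)
    for row, i in enumerate(order):
        lf[i] = row
    return lf

def invert_with_sentinel(last, sentinel="\0"):
    """Invert a BWT that contains the sentinel; return None if the walk is inconsistent."""
    lf = lf_mapping(last)
    out = []
    i = 0                                  # row 0 starts with the sentinel
    for _ in range(len(last) - 1):
        if last[i] == sentinel:            # reached the sentinel too early
            return None
        out.append(last[i])
        i = lf[i]
    if last[i] != sentinel:                # walk did not close on the sentinel
        return None
    return "".join(reversed(out))

def candidate_originals(stripped, sentinel="\0"):
    """Try every position for the omitted sentinel; keep the guesses that invert cleanly."""
    results = []
    for pos in range(len(stripped) + 1):
        guess = stripped[:pos] + sentinel + stripped[pos:]
        text = invert_with_sentinel(guess, sentinel)
        if text is not None:
            results.append((pos, text))
    return results

if __name__ == "__main__":
    stripped = bwt_with_sentinel("banana").replace("\0", "")
    print(stripped)                        # annbaa
    print(candidate_originals(stripped))   # [(4, 'banana'), (6, 'nabana')]
```

For `banana` the stripped BWT is `annbaa`, and two guesses survive the check: `banana` and `nabana`. So even when a reconstruction succeeds, the result may not be unique, which fits the "you can't" above. The sketch rebuilds the LF mapping for every guess, so it is quadratic or worse and makes no attempt at optimization.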

Cheers,

D.

Answered 2016-09-15T13:57:02.700