java - Finding multiple needles in a haystack - String search

Question

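Problem: how to determine, with sublinear performance, whether strings from a list exist in the body of a document, where the strings must be looked up in the order of their associated id rather than in alphabetical order.

Ideally we would solve this in PHP and/or Java.

Can a trie, Knuth-Morris-Pratt, Boyer-Moore or another similar algorithm help find these matches in sublinear time, and if so, can you show me how?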

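More details

The list could be millions of lines long. Each string can contain the characters (a-z0-9) and spaces, e.g. "stack overflow", "stackoverflow". Each string has a unique identifier (id), which is an integer: {"s":"stackoverflow", "#":"920001"}. Strings should be matched in the order of their unique identifier. Also worth noting: the list of strings will not change often. The content can.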

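*Example

An array of strings (920001 unique strings) and 2 sample documents. Check the content for the presence of the strings in our list. Keep looking for matches until 3 strings have been found or the list is exhausted. When a string is found in the content, it is added to a new array, matches[].

As you can see, the string "stackoverflow" sits far down at the end of the list, but in example 2 we would only match a small number of strings, one of which is stackoverflow. A simple loop-and-match would take a considerable amount of time to work through the whole array of strings.

For this purpose, treat the list below as having 920001 lines, and assume that the strings on the lines between 12 and 920000 do not produce any matches.

** Sample list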

"strings":[
    {"s":"Disney World", "#":"1"},
    {"s":"Universal Studios", "#":"2"},
    {"s":"Disneyland", "#":"3"},
    {"s":"Slide", "#":"4"},
    {"s":"Disneyland", "#":"5"},
    {"s":"Plane", "#":"6"},
    {"s":"Walt Disney World", "#":"7"},
    {"s":"Florida", "#":"8"},
    {"s":"Puerto Rico", "#":"9"},
    {"s":"Dominican Republic", "#":"10"},
    {"s":"Las Vegas", "#":"11"},
    {"s":"Mexico", "#":"12"},
    ....
    ....
    {"s":"United States", "#":"920000"},
    {"s":"stackoverflow", "#":"920001"}
]

** Content examples

content = "Bordered on the west by the Gulf of Mexico and on the east by the Atlantic Ocean, Florida has the longest coastline in the contiguous United States and its geography is dominated by water and the threat of frequent hurricanes. Whether you’re a native or just visiting stackoverflow"

content = "tourist attractions and amusement parks. Slide to the seaside hot spots and abundant nightlife, what you need to stay on top of all of the new developments in the Panhandle State today stackoverflow"
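For reference, the "simple loop and match" baseline described above could look roughly like the following Java sketch. The class and method names (`NaiveScan`, `firstMatches`) are made up for illustration, the `TreeMap` is only one assumed way of keeping the strings in id order, and matching is case-sensitive. Every string costs a full scan of the content until 3 matches are found, which is why a match at id 920001 is so expensive to reach.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical baseline: walk the strings in id order and scan the content for each one.
class NaiveScan {

    // Returns the ids of the first `limit` strings (in ascending id order) found in content.
    static List<Integer> firstMatches(TreeMap<Integer, String> stringsById, String content, int limit) {
        List<Integer> matches = new ArrayList<>();
        for (Map.Entry<Integer, String> entry : stringsById.entrySet()) {
            if (content.contains(entry.getValue())) { // one scan of the content per string
                matches.add(entry.getKey());
                if (matches.size() == limit) {
                    break; // stop as soon as enough matches were found
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // A handful of entries from the sample list; a TreeMap iterates in ascending id order.
        TreeMap<Integer, String> strings = new TreeMap<>();
        strings.put(4, "Slide");
        strings.put(8, "Florida");
        strings.put(12, "Mexico");
        strings.put(920001, "stackoverflow");

        String content = "tourist attractions and amusement parks. Slide to the seaside hot spots "
                + "and abundant nightlife, what you need to stay on top of all of the new "
                + "developments in the Panhandle State today stackoverflow";

        System.out.println(firstMatches(strings, content, 3)); // prints [4, 920001]
    }
}
```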

That is the problem as I see it.

score 2 · Accepted Answer

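Build a suffix tree for your content (merging the suffix trees of all your content items into one), then search for your strings in this suffix tree.

If you use Ukkonen's algorithm, this is linear: O(n+m), where n is the size of the content and m is the size of the strings. Building the tree takes time proportional to the content, and each lookup takes time proportional to the length of the string being looked up.

You cannot achieve sublinear performance, because to find out whether there is a match you need to read all of the content at least once.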
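As a rough illustration of the "index the content once, then look the strings up" idea, here is a minimal Java sketch. It deliberately swaps in a naively built suffix array in place of a suffix tree, because Ukkonen's algorithm does not fit in a short example. Construction here is therefore not linear; only the lookups are cheap, at O(m log n) per string. The names `ContentIndex`, `contains` and `firstMatches` are assumptions made for this example, and matching is case-sensitive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical content index: a sorted array of suffix start offsets (a suffix array),
// used here as a simpler stand-in for the suffix tree suggested in the answer.
class ContentIndex {
    private final String text;
    private final Integer[] suffixes; // suffix start offsets, sorted lexicographically

    ContentIndex(String text) {
        this.text = text;
        this.suffixes = new Integer[text.length()];
        for (int i = 0; i < suffixes.length; i++) suffixes[i] = i;
        // Naive construction with a comparison sort: simple, but not linear time.
        Arrays.sort(suffixes, (a, b) -> compareSuffixes(a, b));
    }

    private int compareSuffixes(int a, int b) {
        int n = text.length();
        while (a < n && b < n) {
            int d = text.charAt(a) - text.charAt(b);
            if (d != 0) return d;
            a++;
            b++;
        }
        return (a == n ? 0 : 1) - (b == n ? 0 : 1); // the suffix that ran out first sorts first
    }

    // Compares the first pattern.length() characters of the suffix at `start` with pattern.
    private int comparePrefix(int start, String pattern) {
        for (int k = 0; k < pattern.length(); k++) {
            if (start + k >= text.length()) return -1; // suffix is a proper prefix of pattern
            int d = text.charAt(start + k) - pattern.charAt(k);
            if (d != 0) return d;
        }
        return 0; // the suffix starts with pattern
    }

    // Binary search for the first suffix that is not smaller than pattern.
    boolean contains(String pattern) {
        int lo = 0, hi = suffixes.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (comparePrefix(suffixes[mid], pattern) < 0) lo = mid + 1; else hi = mid;
        }
        return lo < suffixes.length && text.startsWith(pattern, suffixes[lo]);
    }

    // Looks the strings up in ascending id order and stops after `limit` matches.
    static List<Integer> firstMatches(TreeMap<Integer, String> stringsById, ContentIndex index, int limit) {
        List<Integer> matches = new ArrayList<>();
        for (Map.Entry<Integer, String> e : stringsById.entrySet()) {
            if (index.contains(e.getValue())) {
                matches.add(e.getKey());
                if (matches.size() == limit) break;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> strings = new TreeMap<>();
        strings.put(8, "Florida");
        strings.put(12, "Mexico");
        strings.put(920000, "United States");
        strings.put(920001, "stackoverflow");
        ContentIndex index = new ContentIndex("Bordered on the west by the Gulf of Mexico, "
                + "Florida has the longest coastline in the contiguous United States stackoverflow");
        System.out.println(firstMatches(strings, index, 3)); // prints [8, 12, 920000]
    }
}
```

With such an index each lookup no longer rescans the content, but the loop still walks the string list in id order, so a list with millions of non-matching entries still costs one lookup per entry.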
