Suppose I have a text file containing a list of URLs (millions of them), and another text file containing a list of blacklisted words.
I would like to process the URL list as follows.
- Parse the URLs and store them in a suitable data structure.
- Process the URLs and blacklist every URL that contains at least one of the
blacklisted words.
- If a URL exists in which 50% or more of the words are blacklisted, add the
remaining words of that URL to the list of blacklisted words.
- Since the blacklisted-words list has now grown, URLs that were not
blacklisted earlier may become blacklisted. The algorithm should handle this
case as well, marking previously whitelisted URLs as blacklisted if they
contain any of the newly added blacklisted words.
In the end, I should be left with a list of whitelisted URLs.
Any suggestions for the best algorithm and data structures to solve this with the most efficient time and space complexity?
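Not a definitive answer, but here is a minimal sketch of the fixed-point loop described above. It assumes "words" means lowercase alphanumeric tokens extracted from the URL path (the question doesn't specify the tokenization), and it simply re-scans until no URL's status and no blacklist entry changes:

```python
import re
from urllib.parse import urlparse


def tokenize(url):
    """Split a URL's path into lowercase alphanumeric word tokens (an assumption)."""
    path = urlparse(url).path
    return set(re.findall(r"[a-z0-9]+", path.lower()))


def classify(urls, blacklist):
    """Return (whitelisted_urls, final_blacklist) after iterating to a fixed point."""
    blacklist = {w.lower() for w in blacklist}
    tokens = {u: tokenize(u) for u in urls}  # tokenize each URL once
    blacklisted = set()

    changed = True
    while changed:          # repeat until neither the blacklist nor any URL changes
        changed = False
        for url, words in tokens.items():
            hits = words & blacklist
            # Rule 1: any blacklisted word blacklists the URL.
            if hits and url not in blacklisted:
                blacklisted.add(url)
                changed = True
            # Rule 2: 50% or more blacklisted words -> absorb the rest.
            if words and len(hits) * 2 >= len(words):
                new_words = words - blacklist
                if new_words:
                    blacklist |= new_words
                    changed = True

    whitelisted = [u for u in urls if u not in blacklisted]
    return whitelisted, blacklist


# Hypothetical example: "spam" and "viagra" make up 2/3 of the first URL's
# words, so "offer" is absorbed into the blacklist and the second URL gets
# blacklisted on a later check; the third URL stays whitelisted.
urls = [
    "http://a.test/spam-viagra-offer",
    "http://b.test/offer-today",
    "http://c.test/daily-news",
]
whitelisted, final_blacklist = classify(urls, {"spam", "viagra"})
print(whitelisted)
```

Each pass is O(total tokens) with set lookups, and the loop terminates because the blacklist and the blacklisted set only ever grow. For millions of URLs you could speed this up with an inverted index (word -> URLs containing it), so that adding a new blacklisted word only touches the URLs that actually contain it instead of re-scanning everything.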