solr - Solr 4: Disabling compression of stored fields: how do I actually configure a custom codec?

Question

The short question is:

I want to disable stored field compression on a Solr 4.3.0 index. After reading:

http://blog.jpountz.net/post/35667727458/stored-fields-compression-in-lucene-4-1

http://wiki.apache.org/solr/SimpleTextCodecExample

http://www.opensourceconnections.com/2013/06/05/build-your-own-lucene-codec/

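I decided to follow the approach described there and build my own codec. I'm fairly sure I completed every step, but when I actually try to use my codec (affectionately named "UncompressedStorageCodec"), I get the following error in the Solr log: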

java.lang.IllegalArgumentException: A SPI class of type org.apache.lucene.codecs.PostingsFormat with name 'UncompressedStorageCodec' does not exist. You need to add the corresponding JAR file supporting this SPI to your classpath.
The current classpath supports the following names: [Pulsing41, SimpleText, Memory, BloomFilter, Direct, Lucene40, Lucene41]
        at org.apache.lucene.util.NamedSPILoader.lookup(NamedSPILoader.java:109)

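From the output I gather that Solr is not picking up the jar containing my custom codec, and I don't understand why.

Here are all the gory details:

I created a class like this: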

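import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.StoredFieldsFormat;
import org.apache.lucene.codecs.lucene40.Lucene40StoredFieldsFormat;
import org.apache.lucene.codecs.lucene42.Lucene42Codec;

public class UncompressedStorageCodec extends FilterCodec {
    // The Lucene 4.0 stored fields format writes stored fields uncompressed.
    private final StoredFieldsFormat fieldsFormat = new Lucene40StoredFieldsFormat();

    // The SPI loader instantiates the codec reflectively, so the no-arg constructor must be public.
    public UncompressedStorageCodec() {
        super("UncompressedStorageCodec", new Lucene42Codec());
    }

    @Override
    public StoredFieldsFormat storedFieldsFormat() {
        return fieldsFormat;
    }
}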

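It lives in the package "fr.company.project.solr.transformers.utils".

The fully qualified name of "FilterCodec" is "org.apache.lucene.codecs.FilterCodec".

I built a basic jar file (exported as a jar from Eclipse).

The Solr installation I'm testing with is a plain Solr 4.3.0 download, unzipped and started through its embedded Jetty server with the example core.

I placed the jar containing the codec in [solrDir]\dist

In: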

[solrDir]\example\solr\myCore\conf\solrconfig.xml

I added this line:

<lib dir="../../../dist/" regex="myJarWithCodec-1.10.1.jar" />

Then, in schema.xml, I declared some fieldTypes that should use this codec, like so:

<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true" postingsFormat="UncompressedStorageCodec"/>
<fieldType name="string_lowercase" class="solr.TextField" positionIncrementGap="100" omitNorms="true" postingsFormat="UncompressedStorageCodec">
<!--...-->
</fieldType>

Now, if I import some data into Solr with the DataImportHandler component, it tells me on commit:

java.lang.IllegalArgumentException: A SPI class of type org.apache.lucene.codecs.PostingsFormat with name 'UncompressedStorageCodec' does not exist. You need to add the corresponding JAR file supporting this SPI to your classpath.
The current classpath supports the following names: [Pulsing41, SimpleText, Memory, BloomFilter, Direct, Lucene40, Lucene41]
        at org.apache.lucene.util.NamedSPILoader.lookup(NamedSPILoader.java:109)

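What I find odd is that the codec jar mentioned above also contains some Transformers for the DataImportHandler component, and those are picked up fine. Other jars placed in the dist folder (and declared the same way in solrconfig.xml), such as the JDBC driver, are also loaded correctly. My guess is that codecs go through this SPI mechanism, which loads things differently, and that something is missing for it...

I also tried putting the codec jar in: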

[solrDir]\example\solr-webapp\webapp\WEB-INF\lib\

as well as in the WEB-INF\lib folder of the solr.war file, which is located in:

[solrDir]\example\webapps\

But I still get the same error.

So basically, my question is: what is missing for Solr to pick up my codec jar?

Thanks

score 3 · Accepted Answer

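I'll answer this question myself, because some benchmarks I ran have made it moot. Long story short, I had reached the (wrong) conclusion that, for very large stored fields, Solr 3.x and 4.0 (without field compression) were faster than Solr 4.1 and later (with field compression). That was mostly due to mistakes in my benchmarks. After repeating them, I found that going from uncompressed to compressed fields makes indexing 0% to 15% slower, even for very large stored fields. That is really not bad at all, considering that querying the compressed index afterwards was 10-20% faster (for the document fetching part).

Also, here are a few remarks on how to speed up indexing:

Use the DataImportHandler plugin. It bypasses Solr's REST (HTTP-based) API and writes directly to the Lucene index.
Look at the source of that plugin to see how it achieves this, and write your own plugin if DataImportHandler doesn't meet your needs.
If for some reason you want to stick with the Solr REST API, use ConcurrentUpdateSolrServer and experiment with the queue size and thread count parameters. It will usually be much faster than the basic HttpSolrServer (up to 200% in my case).
Don't forget to enable javabin data serialization, like this: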

ConcurrentUpdateSolrServer solrServer = new ConcurrentUpdateSolrServer("http://some.solr.host:8983/solr", 100, 4);
solrServer.setRequestWriter(new BinaryRequestWriter());

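I'm showing the code explicitly because I believe there may be a small bug here:

If you look at the ConcurrentUpdateSolrServer constructor, you'll see that by default it already seems to set the request writer to binary: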

  //the ConcurrentUpdateSolrServer initializes HttpSolrServer objects using this constructor:
  public HttpSolrServer(String baseURL, HttpClient client) {
    this(baseURL, client, new BinaryResponseParser());
  }

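But after debugging I noticed that if you don't explicitly call setRequestWriter with a binary writer, it still serializes requests as XML. The constructor above only configures a BinaryResponseParser, which handles responses, so that may be why.

Switching from XML to binary serialization made my documents about 3 times smaller on the wire. In my case this made indexing about 150-200% faster.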

score 0 · Accepted Answer

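I recently tried to get something very similar working, and succeeded. The only difference is that I wanted to enable the best compression rather than no compression; Solr defaults to the fastest compression. At some point I also got the "SPI class [...] does not exist" error, and here is what I pieced together from various articles, including the ones you linked to.

Lucene uses SPI to find the codec classes to load. It requires the list of codec classes to be declared in a file named "org.apache.lucene.codecs.Codec", and that file must be on the classpath. To get Solr to load the file: when you create the JAR file "myJarWithCodec-1.10.1.jar", make sure it contains a file at "META-INF/services/org.apache.lucene.codecs.Codec". The file should have one fully qualified class name per line, like this: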

org.apache.lucene.codecs.lucene3x.Lucene3xCodec
org.apache.lucene.codecs.lucene40.Lucene40Codec
org.apache.lucene.codecs.lucene41.Lucene41Codec
org.apache.lucene.codecs.lucene42.Lucene42Codec
fr.company.project.solr.transformers.utils.UncompressedStorageCodec
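
Before deploying, it can help to confirm that the exported jar really contains this service file, since depending on how the jar is built the META-INF/services entry may be left out. Below is a minimal sketch using only the JDK; the class name ServiceFileCheck and the method declaredProviders are hypothetical helpers for illustration, not part of Lucene or Solr.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Hypothetical helper: lists the provider classes a jar declares for one SPI type.
public class ServiceFileCheck {

    /**
     * Returns the class names declared in META-INF/services/<serviceType>
     * inside the given jar, or an empty list if the entry does not exist.
     */
    public static List<String> declaredProviders(Path jar, String serviceType) throws IOException {
        List<String> names = new ArrayList<>();
        try (JarFile jarFile = new JarFile(jar.toFile())) {
            JarEntry entry = jarFile.getJarEntry("META-INF/services/" + serviceType);
            if (entry == null) {
                // No service file for this type: nothing in the jar is declared for it.
                return names;
            }
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(jarFile.getInputStream(entry), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Drop '#' comments and surrounding whitespace; skip blank lines.
                    int hash = line.indexOf('#');
                    if (hash >= 0) {
                        line = line.substring(0, hash);
                    }
                    line = line.trim();
                    if (!line.isEmpty()) {
                        names.add(line);
                    }
                }
            }
        }
        return names;
    }
}
```

Calling declaredProviders with the path to myJarWithCodec-1.10.1.jar and "org.apache.lucene.codecs.Codec" should return a list that includes the custom codec class. An empty list means the jar has no service file for that type, which would be consistent with the "does not exist" error.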

In solrconfig.xml, replace:

<codecFactory class="solr.SchemaCodecFactory" />

with:

<codecFactory class="fr.company.project.solr.transformers.utils.UncompressedStorageCodec" />

If Solr complains, you may also need to remove postingsFormat="UncompressedStorageCodec" from schema.xml. I believe that attribute specifies the postings format, not the codec. Hope this helps.
