solr - Why does Solr for Windows need so much memory?

Question

Why does Solr for Windows need so much memory?

My Solr data consists of SEO keywords (1-10 words each, up to 120 characters long, 800 million rows) plus some other data. The schema is:

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="suggests" version="1.5">
<copyField source="suggest" dest="suggest_exact"/>

<types>
    <fieldType name="text_stem" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.SnowballPorterFilterFactory" language="Russian" />
        </analyzer>
    </fieldType>
    <fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
    </fieldType>
    <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
</types>
<fields>
    <field name="suggest" type="text_stem" indexed="true" stored="true"/>
    <field name="suggest_exact" type="text_exact" indexed="true" stored="false"/>
    <field name="length" type="int" indexed="true" stored="true"/>
    <field name="position" type="int" indexed="true" stored="true"/>
    <field name="wordstat1" type="int" indexed="true" stored="true"/>
    <field name="wordstat3" type="int" indexed="true" stored="true"/>
    <field name="ln" type="int" indexed="true" stored="true"/>
    <field name="wc" type="int" indexed="true" stored="true"/>
</fields>
</schema>

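Solr for Windows uses about 10 GB of RAM, and sometimes needs more (up to 16 GB). I currently have it configured with SOLR_JAVA_MEM=-Xms8192m -Xmx16384m and it works, but with 4 GB or less, Java crashes with an OutOfMemory error.

So, what am I doing wrong? How can I configure Solr to use less RAM? I can provide any part of solrconfig.xml.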

solrconfig.xml

<query>
    <maxBooleanClauses>1024</maxBooleanClauses>
    <filterCache class="solr.FastLRUCache"
                 size="512"
                 initialSize="512"
                 autowarmCount="0"/>
    <queryResultCache class="solr.LRUCache"
                     size="512"
                     initialSize="512"
                     autowarmCount="0"/>
    <documentCache class="solr.LRUCache"
                   size="512"
                   initialSize="512"
                   autowarmCount="0"/>
    <cache name="perSegFilter"
      class="solr.search.LRUCache"
      size="10"
      initialSize="0"
      autowarmCount="10"
      regenerator="solr.NoOpRegenerator" />

    <enableLazyFieldLoading>true</enableLazyFieldLoading>

    <queryResultWindowSize>20</queryResultWindowSize>

    <queryResultMaxDocsCached>200</queryResultMaxDocsCached>

    <useColdSearcher>false</useColdSearcher>

    <maxWarmingSearchers>2</maxWarmingSearchers>

</query>

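So, here is what I am actually doing and what I want.

I have added 800 million rows to Solr. That is not everything: my full dataset has 3 billion rows. The rows are SEO keywords such as "job search", "find a job in New York", and so on. The "suggest" field contains many of the same common words, such as "job", "download", etc. I think the word "download" appears in about 10% of all rows.

I am building a service where users can run a query such as "download" and get all documents that contain the word "download".

I wrote desktop software (.NET) that communicates between the web service interface (PHP+MySQL) and Solr. The software takes tasks from the web service, queries Solr, downloads the Solr results, and delivers them to the user.

To get all the results, I send a GET query to Solr, for example: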

http://localhost:8983/solr/suggests2/select?q=suggest:(job%20AND%20new%20AND%20york)&fq=length:[1%20TO%2032]&fq=position:[1%20TO%2010]&fq=wc:[1%20TO%2032]&fq=ln:[1%20TO%20256]&fq=wordstat1:[0%20TO%20*]&fq=wordstat3:[1%20TO%20100000000]&sort=wordstat3%20desc&start=0&rows=100000&fl=suggest%2Clength%2Cposition%2Cwordstat1%2Cwordstat3&wt=csv&csv.separator=;

As you can see, I use fq and sorting, and no grouping. If anyone sees a mistake in my Solr query or in my approach, please feel free to tell me. Thanks.

score 1 · Accepted Answer

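You are sorting on a TrieIntField that does not have DocValues enabled. This means Solr keeps a copy of the values on the heap. For 800M values at 4 bytes per int, that is 3.2 GB of heap for this purpose alone. Setting docValues="true" on the wordstat3 field and reindexing should lower that requirement considerably, at the cost of some performance.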
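
As a rough sketch of that change, assuming the field definitions from the schema in the question are otherwise left as they are, the wordstat3 line could look like this (only the docValues attribute is new, and the existing index has to be rebuilt for it to take effect):

```xml
<!-- schema.xml, inside <fields>: sort field with DocValues enabled.
     Assumption: all other attributes stay as in the question's schema.
     A full reindex is needed after this change. -->
<field name="wordstat3" type="int" indexed="true" stored="true" docValues="true"/>
```

The same attribute could be added to any other int field that ends up being used for sorting.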

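Note that Solr (really Lucene) does not support more than 2 billion documents in a single shard. This is a hard limit. If you plan to index 3 billion documents into the same logical index, you will have to use a multi-shard SolrCloud.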
