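My question
How can I export (up to about 30'000'000) Solr documents to CSV in a sharded setup, using distributed queries (SolrJ)?
My strategy is to query in batches (of n days), but I currently hit a limit of about 200'000 documents per batch.
I would like to be able to fetch 1'000'000 per batch.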
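To make the batching strategy concrete, here is a minimal sketch of how the n-day windows could be turned into range queries. The class and method names (BatchWindows, rangeQueries) are hypothetical and only for illustration; the field name and the date format are taken from the queries further down. It uses the same inclusive [.. TO ..] range form as those queries, so a document stamped exactly on a window boundary would show up in two batches.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of SolrJ: builds one range query per n-day window.
final class BatchWindows {

    // Splits the period from start to end into consecutive windows of at most
    // `days` days and returns one Solr range query string per window.
    static List<String> rangeQueries(String field, LocalDate start, LocalDate end, int days) {
        if (days <= 0) {
            throw new IllegalArgumentException("days must be positive");
        }
        List<String> queries = new ArrayList<>();
        LocalDate from = start;
        while (from.isBefore(end)) {
            LocalDate to = from.plusDays(days);
            if (to.isAfter(end)) {
                to = end; // the last window may be shorter than `days`
            }
            // Same inclusive range syntax as in the question's queries.
            queries.add(field + ":[" + from + "T00:00:00Z TO " + to + "T00:00:00Z]");
            from = to;
        }
        return queries;
    }

    public static void main(String[] args) {
        // March 2013 in 10-day batches.
        for (String q : rangeQueries("firstTimestamp_dis",
                LocalDate.of(2013, 3, 1), LocalDate.of(2013, 4, 1), 10)) {
            System.out.println(q);
        }
    }
}
```

Each returned string would be used as the q parameter of one batch request. The window length is what I am trying to tune so that a batch holds about 1'000'000 documents.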
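My setup is a Solr index with multiple shards. Each shard holds one month of documents. Documents are added to a shard based on a timestamp field. I query with the shards parameter set, which generally works well.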
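For illustration, the shards parameter for a range of months could be assembled roughly like this. This is only a sketch: the class and method names (ShardsParam, shardsFor) are hypothetical, and the in.part.yyyyMM core naming and the localhost:8080/index base are assumed from the URLs shown below.

```java
import java.time.YearMonth;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds the comma-separated value of the shards parameter.
final class ShardsParam {

    private static final DateTimeFormatter YYYYMM = DateTimeFormatter.ofPattern("yyyyMM");

    // Returns one shard address per month from `first` to `last` (both inclusive),
    // assuming one core per month named in.part.yyyyMM under baseUrl.
    static String shardsFor(String baseUrl, YearMonth first, YearMonth last) {
        List<String> shards = new ArrayList<>();
        for (YearMonth m = first; !m.isAfter(last); m = m.plusMonths(1)) {
            shards.add(baseUrl + "/in.part." + m.format(YYYYMM));
        }
        return String.join(",", shards);
    }

    public static void main(String[] args) {
        System.out.println(shardsFor("localhost:8080/index",
                YearMonth.of(2013, 1), YearMonth.of(2013, 3)));
    }
}
```

For a batch that lies within a single month, such as the March 2013 queries below, only that month's shard needs to be listed.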
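Now I want to export the documents, or just some of their fields, to a CSV file. But when there are many documents, my request fails. I stripped down my URL, but this request still fails: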
// query I) query march 2013 sharded -> does not work
http://localhost:8080/index/in.part.201301/select/?rows=1000000&
shards=localhost:8080/index/in.part.201303&
wt=csv&
fl=id&
q=firstTimestamp_dis:[2013-03-01T00:00:00Z+TO+2013-04-01T00:00:00Z]&
version=2.2
The exception on the index server:
14:18:55,726 SEVERE [SolrCore] java.lang.NullPointerException
at java.io.StringReader.<init>(StringReader.java:33)
at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:203)
at org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:80)
at org.apache.solr.search.QParser.getQuery(QParser.java:142)
at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:101)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:173)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
at com.company.InitializerDispatchFilter.doFilter(InitializerDispatchFilter.java:93)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:235)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:190)
at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:92)
at org.jboss.web.tomcat.security.SecurityContextEstablishmentValve.process(SecurityContextEstablishmentValve.java:126)
at org.jboss.web.tomcat.security.SecurityContextEstablishmentValve.invoke(SecurityContextEstablishmentValve.java:70)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:158)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:330)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:829)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:598)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
at java.lang.Thread.run(Thread.java:662)
19:16:44,638 INFO [SolrCore] [in1.part.201303] webapp=/index path=/select params={} status=500 QTime=2
19:16:44,647 SEVERE [SolrCore] org.apache.solr.common.SolrException: Internal Server Error
Internal Server Error
request: http://localhost:8080/ipc-index/in1.part.201303/select
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:432)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:246)
at org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:421)
at org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:393)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662)
The non-sharded query works:
// query II) query march 2013 non sharded --> works
http://localhost:8080/index/in.part.201303/select/?rows=1000000&
fl=id&
q=firstTimestamp_dis:[2013-03-01T00:00:00Z+TO+2013-04-01T00:00:00Z]&
version=2.2&
wt=csv
and
// query III) sharded query with rows=200000 --> works as well, (rows=210000 does fail like query I)
http://localhost:8080/index/in.part.201301/select/?rows=200000&
fl=id&
q=firstTimestamp_dis:[2013-03-01T00:00:00Z+TO+2013-04-01T00:00:00Z]&
version=2.2&
wt=csv&
shards=localhost:8080/index/in.part.201303
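Memory
I don't think the problem is memory-related: if I reduce the memory to 256 MB and run query III) (my index server VM has 1 GB of memory), it runs very slowly and aborts with an out-of-memory error. If I increase the memory, query I) still fails.
Moreover, if I add more fields to the field list in query III), it still always succeeds.
On my client (SolrJ), I send the query using Method.POST.
Can anyone help?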