I am using the ExtractingRequestHandler to index HTML and PDF files. In addition to what Tika extracts, I want to attach my own metadata to each document. To do this I use the literal.<fieldname>=<value> parameter support. Unless I target dynamic fields ("*_s"), the data is not saved; only the id field seems to work as advertised. I'm sure that I'm doing something wrong.

My schema.xml field definitions:
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<!-- The following fields don't work, need to use dynamic fields for some reason -->
<field name="region" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="href" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="services" type="text_general" indexed="false" stored="true" multiValued="true" />
My Solrj code:
ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
ContentStream contentStream = new ContentStreamBase.StringStream(contentBean.getContent());
req.addContentStream(contentStream);
req.setParam("literal.region", region);
req.setParam("literal.href", contentBean.getHref());
req.setParam("literal.id", getDocId(url));
for (Map.Entry<String,String> entry : getFacetsFromURL(url).entrySet()) {
    logger.info("Setting facet field {} to {}", entry.getKey(), entry.getValue());
    req.setParam("literal." + entry.getKey(), entry.getValue());
}
// index h1 tag
req.setParam("fmap.tags_h1", "h1");
req.setParam("capture", "h1");
// index img tag
req.setParam("fmap.img", "tags_img");
req.setParam("capture", "img");
// lowercase tag names
req.setParam("lowernames", "true");
/*
* Passing commitWithin as a parameter seems
* to be the only way to get it to work with
* this request handler
*/
req.setParam("commitWithin", "10000");
/*
* Now do the work!
*/
req.process(server);
Changing region to region_s and href to href_s, and appending _s to each key in the map, works. I don't understand why region and the other fields are not saved unless the name matches the *_s dynamic field in the schema.

I also noticed a few other issues. I tried to use a copyField to copy one of the literal fields into a field for faceting, but I never see any data in the facet field. Here is one of the ways I tried this:
<field name="services_facet" type="string" indexed="true" stored="false" multiValued="true" />
<copyField source="services_s" dest="services_facet"/>
There is never anything in services_facet. I can facet on services_s, but shouldn't the copyField work as well? Is Solr Cell broken, or just poorly documented?
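
For reference, the workaround mentioned above (appending _s so that each literal lands in the *_s dynamic field) boils down to something like the sketch below. This is plain Java with no Solr dependency; the class name LiteralParams, the method toLiteralParams, and the sample values are made up for illustration and are not part of SolrJ or of my real code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of SolrJ: builds the literal.* request
// parameter names with a dynamic-field suffix appended to each field name.
class LiteralParams {

    static Map<String, String> toLiteralParams(Map<String, String> fields, String suffix) {
        Map<String, String> params = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : fields.entrySet()) {
            // e.g. "region" becomes "literal.region_s", assuming a *_s dynamic field exists
            params.put("literal." + entry.getKey() + suffix, entry.getValue());
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("region", "emea");       // sample values, made up
        fields.put("services", "billing");
        // In the real code each entry would be passed to req.setParam(name, value)
        for (Map.Entry<String, String> p : toLiteralParams(fields, "_s").entrySet()) {
            System.out.println(p.getKey() + "=" + p.getValue());
        }
    }
}
```

With the suffix the values are stored; with an empty suffix (the plain field names from the schema above) they are not, which is the behaviour I can't explain.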