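I've confirmed that there is no real, explicit mechanism for doing this, so I invented my own.
To summarize how it works: I customized the PartialUpdate beanshell script so that, right after the last-mile crawl runs, it calls a custom component I created called DGIDXTransformer (i.e. it extends CustomComponent). This class unzips and parses the file created by the last-mile crawl, which is meant to be fed to DGIDX, and writes out a modified version of that file. Specifically, it rewrites all of the update information so that records are updated with the new properties instead of being replaced. This is a bit hacky, since the format of the DGIDX input file is undocumented, but based on my research the format is unlikely to change much in future versions of Endeca.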
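To make the rewrite concrete, here is a minimal standalone sketch of the same line-by-line transformation applied to an in-memory string. The class name `UpdateRewriteSketch`, the `SAMPLE` record and the `price` property are hypothetical: the record layout is only inferred from the markers the transformer looks for (`RECORD_ADD_OR_REPLACE`, `<PROP NAME=...>`, `<PVAL>`), since the real DGIDX input format is undocumented and may differ.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class UpdateRewriteSketch {
    // Hypothetical input: an assumed layout with one element per line,
    // not a verified sample of a real DGIDX input file.
    static final String SAMPLE = String.join("\n",
            "<RECORD_ADD_OR_REPLACE>",
            "<PROP NAME=\"record.spec\">",
            "<PVAL>1001</PVAL>",
            "</PROP>",
            "<PROP NAME=\"price\">",
            "<PVAL>9.99</PVAL>",
            "</PROP>",
            "</RECORD_ADD_OR_REPLACE>");

    // Rewrites "add or replace" records into "update" records, line by line.
    static String rewrite(String input) {
        StringBuilder out = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new StringReader(input))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains("RECORD_ADD_OR_REPLACE")) {
                    // Rename the record element (opening and closing lines)
                    out.append(line.replace("RECORD_ADD_OR_REPLACE", "RECORD_UPDATE")).append('\n');
                } else if (line.contains("<PROP NAME=")) {
                    String name = line.split("\"")[1];
                    if (name.equals("record.spec")) {
                        // The record spec identifies the record, so it is left untouched
                        out.append(line).append('\n');
                    } else {
                        // Assumes the value sits on the next line and </PROP> on the one after
                        String value = reader.readLine()
                                .replace("<PVAL>", "").replace("</PVAL>", "").trim();
                        reader.readLine(); // discard the closing </PROP> line
                        out.append("<PVAL_DELETE><PROPERTY_NAME NAME=\"").append(name)
                           .append("\"/></PVAL_DELETE>\n");
                        out.append("<PVAL_ADD><PROP NAME=\"").append(name).append("\"><PVAL>")
                           .append(value).append("</PVAL></PROP></PVAL_ADD>\n");
                    }
                } else {
                    out.append(line).append('\n');
                }
            }
        } catch (IOException e) {
            // Not expected for an in-memory reader
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(rewrite(SAMPLE));
    }
}
```

In this sketch the `record.spec` property passes through unchanged, while every other property becomes a `PVAL_DELETE` followed by a `PVAL_ADD`, so the existing value is dropped and the new one added without replacing the whole record.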
Here is DGIDXTransformer:
// Package matches the class attribute used in the DataIngest.xml component definition
package com.chairbender;

import com.endeca.soleng.eac.toolkit.component.*;
import java.io.*;
import java.nio.file.Files;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/**
 * Custom component which runs during the PartialUpdate beanshell script. It transforms the DGIDX-compatible input file
 * that CAS produces so that records will be updated instead of replaced.
 *
 * Expects only one property entry called "dgidxInputFileDirectory", specifying the directory to look in to
 * find the file to transform (relative to the config directory).
 *
 * @author chairbender
 */
public class DGIDXTransformer extends CustomComponent {
    private static final String DGIDX_INPUT_FILE_DIRECTORY_PROPERTY_NAME = "dgidxInputFileDirectory";
    private static final String RECORD_SPEC_PROPERTY_NAME = "record.spec";

    /**
     * Does the transformation as specified in the class javadoc.
     */
    public void transformDGIDXInputFileToUpdateInsteadOfReplace() throws Exception {
        // Find the file in the directory
        Map<String, String> properties = getProperties();
        if (null == properties || !properties.containsKey(DGIDX_INPUT_FILE_DIRECTORY_PROPERTY_NAME)) {
            throw new Exception("Missing required property: " + DGIDX_INPUT_FILE_DIRECTORY_PROPERTY_NAME);
        } else {
            File directory = new File(properties.get(DGIDX_INPUT_FILE_DIRECTORY_PROPERTY_NAME));
            File[] gzipFiles = directory.listFiles(new FilenameFilter() {
                @Override
                public boolean accept(File dir, String name) {
                    return name.endsWith(".xml.gz");
                }
            });
            if (gzipFiles == null || gzipFiles.length == 0) {
                throw new Exception("No .xml.gz file found in " + directory.getAbsolutePath());
            } else {
                // Only the first matching file is transformed
                File gzipFile = gzipFiles[0];
                File unzippedFile = unzipFile(gzipFile);
                transformInputFile(unzippedFile, unzippedFile.getAbsolutePath().replace(".xml", "transformed.xml"));
                // Delete the extra files in a way that throws an exception if deletion fails
                Files.delete(gzipFile.toPath());
                Files.delete(unzippedFile.toPath());
            }
        }
    }

    /**
     * Gzips the passed file and saves it at the specified location.
     *
     * @param toGzip file to gzip
     * @param outputPath where to output the gzipped file
     */
    private void gzipFile(File toGzip, String outputPath) throws IOException {
        byte[] buffer = new byte[1024];
        // try-with-resources closes both streams exactly once, even if a read or write fails
        try (FileInputStream inputStream = new FileInputStream(toGzip);
             GZIPOutputStream gzipOutputStream =
                     new GZIPOutputStream(new FileOutputStream(outputPath, false))) {
            int len;
            while ((len = inputStream.read(buffer)) > 0) {
                gzipOutputStream.write(buffer, 0, len);
            }
            gzipOutputStream.finish();
        }
    }

    /**
     * Rewrites the unzipped DGIDX input file so that records are updated instead of replaced.
     *
     * @param unzippedFile file representing DGIDX input data to transform
     * @param transformedFilePath path where transformed file should go.
     * @return the transformed file
     */
    private File transformInputFile(File unzippedFile, String transformedFilePath) throws IOException {
        File outputFile = new File(transformedFilePath);
        // Since the XML and the transformation aren't very complicated, we just write the output
        // as we read through the unzipped file line by line.
        try (BufferedReader unzippedFileReader = new BufferedReader(new FileReader(unzippedFile));
             BufferedWriter outputFileWriter = new BufferedWriter(new FileWriter(outputFile))) {
            String nextLine;
            while ((nextLine = unzippedFileReader.readLine()) != null) {
                if (nextLine.contains("RECORD_ADD_OR_REPLACE")) {
                    // If the line contains RECORD_ADD_OR_REPLACE, need to change it to RECORD_UPDATE
                    outputFileWriter.write(nextLine.replace("RECORD_ADD_OR_REPLACE", "RECORD_UPDATE"));
                } else if (nextLine.contains("<PROP NAME=")) {
                    // This line contains <PROP NAME="...">; the element is transformed
                    // only if the property is not the record spec.
                    String propertyName = nextLine.split("\"")[1];
                    if (!propertyName.equals(RECORD_SPEC_PROPERTY_NAME)) {
                        // Read the property value from the next line
                        String propertyValueLine = unzippedFileReader.readLine();
                        String propertyValue = propertyValueLine.replace("<PVAL>", "").replace("</PVAL>", "").trim();
                        // Now write the PVAL_DELETE and PVAL_ADD entries
                        outputFileWriter.write("<PVAL_DELETE><PROPERTY_NAME NAME=\"" + propertyName + "\"/></PVAL_DELETE>");
                        outputFileWriter.newLine();
                        outputFileWriter.write("<PVAL_ADD><PROP NAME=\"" + propertyName + "\"><PVAL>" + propertyValue + "</PVAL></PROP></PVAL_ADD>");
                        // Discard the closing element line of the input file
                        unzippedFileReader.readLine();
                    } else {
                        // It's the record spec, so don't transform it.
                        outputFileWriter.write(nextLine);
                    }
                } else {
                    // Just output the line
                    outputFileWriter.write(nextLine);
                }
                // readLine() strips the line terminator, so restore it after each written line
                outputFileWriter.newLine();
            }
        }
        return outputFile;
    }

    /**
     * Un-gzips the passed file.
     *
     * @param gzipFile file to un-gzip. Will create the un-gzipped version in the same directory as gzipFile,
     * but without the ".gz" ending.
     * @return the unzipped version of the file.
     */
    private File unzipFile(File gzipFile) throws IOException {
        // Strip only the trailing ".gz" (callers pass files ending in ".xml.gz"),
        // so a ".gz" elsewhere in the path is left alone
        String gzipPath = gzipFile.getAbsolutePath();
        File outputFile = new File(gzipPath.substring(0, gzipPath.length() - ".gz".length()));
        // Un-gzip the file in one pass
        try (GZIPInputStream gzipInputStream = new GZIPInputStream(new FileInputStream(gzipFile));
             FileOutputStream outputStream = new FileOutputStream(outputFile)) {
            int len;
            byte[] buffer = new byte[1024];
            while ((len = gzipInputStream.read(buffer)) > 0) {
                outputStream.write(buffer, 0, len);
            }
        }
        return outputFile;
    }
}
The DGIDXTransformer class is compiled into a JAR, which lives in config/lib/java.
Here is the custom component definition in DataIngest.xml:
<custom-component id="DGIDXTransformer" host-id="ITLHost" class="com.chairbender.DGIDXTransformer">
    <properties>
        <property name="dgidxInputFileDirectory" value="../data/cas_output" />
    </properties>
</custom-component>
Here is the relevant part of the custom PartialUpdate script:
CAS.runIncrementalCasCrawl("${lastMileCrawlName}");
DGIDXTransformer.transformDGIDXInputFileToUpdateInsteadOfReplace();
CAS.archiveDvalIdMappingsForCrawlIfChanged("${lastMileCrawlName}");