I'm trying to combine the Quartz scheduler with crawler4j.
The problem is that when I run the crawler4j code from a main method it works fine, but inside the Quartz Job's execute() method I get an HTTP connection error.
We work behind a proxy, but it is already configured within crawler4j, and we even tried configuring it in Quartz as well.
Do you know whether Quartz could be blocking the HTTP connection?
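For reference, the job is scheduled roughly as follows (simplified sketch against the Quartz 2.x API; CrawlJob and the identity strings are placeholders for the real names, and the Job implementation's execute() method is shown further down):

import org.quartz.JobBuilder;
import org.quartz.JobDataMap;
import org.quartz.JobDetail;
import org.quartz.Scheduler;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

public class CrawlScheduler {

    public static void main(String[] args) throws Exception {
        // Parameters handed to the job through the JobDataMap
        // ("sites" is a ";"-separated list of seed URLs)
        JobDataMap dataMap = new JobDataMap();
        dataMap.put("sites", "http://www.....fr/");

        // CrawlJob (placeholder name) is the Job implementation
        // whose execute() method is shown below
        JobDetail job = JobBuilder.newJob(CrawlJob.class)
                .withIdentity("crawlJob", "crawlGroup")
                .usingJobData(dataMap)
                .build();

        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("crawlTrigger", "crawlGroup")
                .startNow()
                .build();

        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}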
Error stack trace:
Exception in thread "Crawler 1" java.lang.NoSuchFieldError: DEF_PROTOCOL_CHARSET
at org.apache.http.auth.params.AuthParams.getCredentialCharset(AuthParams.java:64)
at org.apache.http.impl.auth.BasicScheme.authenticate(BasicScheme.java:157)
at org.apache.http.client.protocol.RequestAuthenticationBase.authenticate(RequestAuthenticationBase.java:125)
at org.apache.http.client.protocol.RequestAuthenticationBase.process(RequestAuthenticationBase.java:83)
at org.apache.http.client.protocol.RequestProxyAuthentication.process(RequestProxyAuthentication.java:89)
at org.apache.http.protocol.ImmutableHttpProcessor.process(ImmutableHttpProcessor.java:108)
at org.apache.http.protocol.HttpRequestExecutor.preProcess(HttpRequestExecutor.java:174)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:515)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
at edu.uci.ics.crawler4j.fetcher.PageFetcher.fetchHeader(PageFetcher.java:156)
at edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:232)
at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:189)
at java.lang.Thread.run(Thread.java:662)
The execute() method:
@Override
public void execute(JobExecutionContext context) throws JobExecutionException {
    JobKey key = context.getJobDetail().getKey();
    JobDataMap dataMap = context.getJobDetail().getJobDataMap();
    String[] sitesTab = dataMap.getString("sites").split(";");

    int numberOfCrawlers = 2;
    String storageFolder = "C:\\...";

    // crawler4j configuration, including the proxy settings
    CrawlConfig config = new CrawlConfig();
    config.setProxyHost("...");
    config.setProxyPort(3128);
    config.setProxyUsername("...");
    config.setProxyPassword("...");
    config.setMaxDepthOfCrawling(2);
    config.setCrawlStorageFolder(storageFolder);
    config.setIncludeBinaryContentInCrawling(true);

    String[] crawlDomains = new String[] { "http://www.....fr/" };

    PageFetcher pageFetcher = new PageFetcher(config);
    RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
    robotstxtConfig.setEnabled(false);
    RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

    CrawlController controller;
    try {
        controller = new CrawlController(config, pageFetcher, robotstxtServer);
        for (String domain : crawlDomains) {
            controller.addSeed(domain);
        }

        int minWidth = 150;
        int minHeight = 150;
        Pattern p = Pattern.compile(".*(\\.(bmp|gif|jpe?g|png))$");

        SportifsWebCrawler.configure(crawlDomains, storageFolder, p, minWidth, minHeight);
        controller.start(SportifsWebCrawler.class, numberOfCrawlers);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Thanks for your help :)