java - Processing a large file for HTTP calls in Java

Question

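I have a file with millions of lines that I need to process. Each line of the file results in an HTTP call. I'm trying to figure out the best way to attack the problem.

I could obviously just read the file and make the calls sequentially, but that would be incredibly slow. I'd like to parallelize the calls, but I'm not sure whether I should read the whole file into memory (something I'm not a big fan of) or try to parallelize the reading of the file as well (which I'm not sure would make sense).

I'm just looking for some thoughts on the best way to attack the problem. If there is an existing framework or library that does something similar, I'm happy to use that as well.

Thanks.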

score 5 · Accepted Answer

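I'd like to parallelize the calls, but I'm not sure whether I should read the whole file into memory

You should use an ExecutorService with a bounded BlockingQueue. As you read in your million lines, you submit jobs to the thread pool until the BlockingQueue is full. This way you can keep 100 (or whatever number is optimal) HTTP requests in flight without having to read all of the lines of the file beforehand. In the code below, nThreads controls how many requests actually run at the same time, and the queue capacity controls how many more are waiting.

You'll need to set a RejectedExecutionHandler that blocks if the queue is full. This is better than a caller-runs handler: with caller-runs, the thread reading the file would be tied up making an HTTP call itself and could not refill the queue while the workers drain it.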

// needs: import java.util.concurrent.*;
BlockingQueue<Runnable> queue = new ArrayBlockingQueue<>(100);
// NOTE: you want the min and max thread numbers here to be the same value
ThreadPoolExecutor threadPool =
    new ThreadPoolExecutor(nThreads, nThreads, 0L, TimeUnit.MILLISECONDS, queue);
// we need our RejectedExecutionHandler to block if the queue is full
threadPool.setRejectedExecutionHandler(new RejectedExecutionHandler() {
    @Override
    public void rejectedExecution(Runnable r, ThreadPoolExecutor executor) {
        try {
            // this will block the producer until there's room in the queue
            executor.getQueue().put(r);
        } catch (InterruptedException e) {
            // restore the interrupt flag before giving up on this task
            Thread.currentThread().interrupt();
            throw new RejectedExecutionException(
                "Unexpected InterruptedException", e);
        }
    }
});

// now read in the urls (the variable must be declared outside the condition)
String url;
while ((url = urlReader.readLine()) != null) {
    // submit them to the thread-pool.  this may block.
    threadPool.submit(new DownloadUrlRunnable(url));
}
// after we submit we have to shutdown the pool
threadPool.shutdown();
// wait for them to complete
threadPool.awaitTermination(Long.MAX_VALUE, TimeUnit.MILLISECONDS);

...
private class DownloadUrlRunnable implements Runnable {
    private final String url;
    public DownloadUrlRunnable(String url) {
       this.url = url;
    }
    @Override
    public void run() {
       // download the URL
    }
}
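
For comparison, here is a minimal sketch of a different way to get the same backpressure: a Semaphore that limits how many tasks may be queued or running, instead of a blocking RejectedExecutionHandler. This is a swapped-in technique, not the code above. The names BoundedSubmitter, submitAll, maxInFlight and handler are made up for illustration, and handler stands in for whatever makes the HTTP call.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical helper, not part of any library.
class BoundedSubmitter {

    // Submits one task per line while allowing at most maxInFlight tasks
    // to be queued or running. Returns the number of lines submitted.
    static int submitAll(BufferedReader reader, int nThreads, int maxInFlight,
                         Consumer<String> handler)
            throws IOException, InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        Semaphore permits = new Semaphore(maxInFlight);
        int submitted = 0;
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // Blocks the reading thread while maxInFlight tasks are pending.
                permits.acquire();
                final String current = line;
                pool.execute(() -> {
                    try {
                        handler.accept(current); // e.g. make the HTTP call
                    } finally {
                        permits.release();       // free a slot even on failure
                    }
                });
                submitted++;
            }
        } finally {
            pool.shutdown();
        }
        // Wait for the tasks that are still queued or running.
        pool.awaitTermination(Long.MAX_VALUE, TimeUnit.MILLISECONDS);
        return submitted;
    }
}
```

Only maxInFlight lines are held in memory as pending tasks at any moment, so the file is still streamed rather than loaded up front.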

score 0 · Accepted Answer

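Gray's approach looks good. Another approach I'd suggest is to split the file into chunks (you'll have to write that logic yourself) and process those chunks with multiple threads.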
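
One possible reading of the chunking idea, as a rough sketch only: read a fixed number of lines at a time, hand each chunk to a worker, and wait for the current wave of chunks to finish before reading more, so memory stays bounded. ChunkedProcessor, readChunk, processInChunks, chunkSize and handler are hypothetical names, and handler stands in for the HTTP call.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

// Hypothetical helper, not part of any library.
class ChunkedProcessor {

    // Reads up to chunkSize lines; returns an empty list at end of input.
    static List<String> readChunk(BufferedReader reader, int chunkSize)
            throws IOException {
        List<String> chunk = new ArrayList<>(chunkSize);
        String line;
        while (chunk.size() < chunkSize && (line = reader.readLine()) != null) {
            chunk.add(line);
        }
        return chunk;
    }

    // Processes the input in waves of up to nThreads chunks.
    // Returns the total number of lines read.
    static int processInChunks(BufferedReader reader, int chunkSize, int nThreads,
                               Consumer<String> handler)
            throws IOException, InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        int total = 0;
        try {
            while (true) {
                List<Callable<Void>> wave = new ArrayList<>();
                for (int i = 0; i < nThreads; i++) {
                    List<String> chunk = readChunk(reader, chunkSize);
                    if (chunk.isEmpty()) {
                        break; // end of input
                    }
                    total += chunk.size();
                    wave.add(() -> {
                        chunk.forEach(handler); // one worker handles the whole chunk
                        return null;
                    });
                }
                if (wave.isEmpty()) {
                    break;
                }
                // Blocks until every chunk in this wave has finished.
                pool.invokeAll(wave);
            }
        } finally {
            pool.shutdown();
        }
        return total;
    }
}
```

The trade-off is that a slow chunk holds up the next wave, and exceptions thrown by handler end up in the Futures returned by invokeAll, which this sketch ignores. The bounded-queue approach keeps the workers busier.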
