java - Efficient method to check for matching files in Java

Question

I'm no Java expert but the program I'm making is going to be dealing with high throughput. So I thought I'd do a little crowd sourcing for opinions. Here's the situation.

A Java process will be watching a directory for files to process. These files come in pairs: a data file to be stored and an XML file with meta information to be cataloged. So I need to get the list of current files, check for the required twins, and then process them.

Files will always have matching filenames and differ only by file extension, e.g. filename1.jpg, filename1.xml, filename2.jpg, filename2.xml.

I have three options I've thought of so far.

Use a FilenameFilter with a File.list(FilenameFilter) call to check whether the total number of files with a given filename is greater than 1.
Use two FilenameFilters to generate a list of files with .xml and a list without .xml, convert the non-XML file list to an ArrayList, and call Collections.binarySearch().
Generate a list of all files without the .xml extension and use this list as the keys of a HashMap whose values are the assumed .xml filenames. Then run through the map and check for the existence of each .xml twin before processing.

Any thoughts?

EDITS/COMMENTS

After looking at the suggestions and tinkering, I'm going for now with two FilenameFilters: one that lists XML files and one that lists everything else. The XML file names are stripped of their .xml extension and put into a HashSet. Then the list of data files is iterated through, calling hashlist.contains() to see whether a match exists in the set before proceeding.
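For anyone reading later, here is a minimal sketch of roughly what that looks like. The class and method names (TwinFinder, baseName, findPairedDataFiles) are made up for illustration, and it assumes "extension" means everything after the last dot in the filename.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class TwinFinder {
    // Strip the final extension: "filename1.jpg" -> "filename1".
    // Assumption: the extension is whatever follows the last dot.
    static String baseName(String name) {
        int dot = name.lastIndexOf('.');
        return dot < 0 ? name : name.substring(0, dot);
    }

    // Returns the data files in dir that have a matching .xml twin.
    static List<File> findPairedDataFiles(File dir) {
        // Two FilenameFilters: one for XML files, one for everything else.
        String[] xmlNames = dir.list((d, n) -> n.endsWith(".xml"));
        File[] dataFiles = dir.listFiles((d, n) -> !n.endsWith(".xml"));

        List<File> paired = new ArrayList<>();
        if (xmlNames == null || dataFiles == null) {
            return paired; // dir is not a directory, or an I/O error occurred
        }

        // Put the XML base names into a set for constant-time lookups.
        Set<String> xmlBases = new HashSet<>();
        for (String name : xmlNames) {
            xmlBases.add(baseName(name));
        }

        // Keep only the data files whose base name has an XML twin.
        for (File f : dataFiles) {
            if (f.isFile() && xmlBases.contains(baseName(f.getName()))) {
                paired.add(f);
            }
        }
        return paired;
    }
}
```

Calling TwinFinder.findPairedDataFiles(new File("incoming")) (the directory name is just an example) returns only the data files that are ready to process. Unpaired files are left for a later pass.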

There is the concern, as mentioned below, of processing incomplete files. As I said in the comments, I assume that a newly written file is not visible to non-writing processes until that write is complete (new files, not files opened for editing).

score 3 · Accepted Answer

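Get all the files, sort them, and then make a linear pass over the file names to see which ones share a prefix. They should clearly be adjacent to each other in the sorted list.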

This should be simpler and faster than filters and hash maps!
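A rough sketch of the sort-and-scan idea, under a couple of assumptions: the names SortedPairScan, baseName, and pairedBaseNames are invented for the example, and the array is sorted by base name (the part before the last dot) rather than by full name, so that a name with an extra dot such as a.k.xml cannot land between a.jpg and a.xml.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SortedPairScan {
    // Strip the final extension: "filename1.jpg" -> "filename1".
    static String baseName(String name) {
        int dot = name.lastIndexOf('.');
        return dot < 0 ? name : name.substring(0, dot);
    }

    // Returns the base names that have both an .xml file and at least one other file.
    static List<String> pairedBaseNames(String[] fileNames) {
        String[] sorted = fileNames.clone();
        // Sort by base name first so twins are guaranteed to be adjacent.
        Arrays.sort(sorted, (a, b) -> {
            int c = baseName(a).compareTo(baseName(b));
            return c != 0 ? c : a.compareTo(b);
        });

        List<String> paired = new ArrayList<>();
        int i = 0;
        while (i < sorted.length) {
            String base = baseName(sorted[i]);
            boolean hasXml = false;
            boolean hasData = false;
            // Consume the run of adjacent names that share this base name.
            while (i < sorted.length && baseName(sorted[i]).equals(base)) {
                if (sorted[i].endsWith(".xml")) {
                    hasXml = true;
                } else {
                    hasData = true;
                }
                i++;
            }
            if (hasXml && hasData) {
                paired.add(base);
            }
        }
        return paired;
    }
}
```

The input would typically be the result of File.list() on the watched directory. The sort is O(n log n) and the pass is linear, so whether this actually beats a hash set depends on the directory size and is worth measuring.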

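To watch the directory, you may want to use a notification-based API, such as inotify where available. The operating system will then signal you whenever the contents of the folder change.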
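In Java, one way to get such notifications is the java.nio.file.WatchService API, which is backed by inotify on Linux. The sketch below is only an illustration: the class name DirectoryWatcher, the method awaitCreatedFiles, and the timeout parameter are made up for the example.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

class DirectoryWatcher {
    // Waits up to timeoutMillis for entries to be created in dir.
    // Returns the paths of the new entries, or an empty list on timeout.
    static List<Path> awaitCreatedFiles(Path dir, long timeoutMillis)
            throws IOException, InterruptedException {
        List<Path> created = new ArrayList<>();
        try (WatchService watcher = dir.getFileSystem().newWatchService()) {
            dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
            WatchKey key = watcher.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            if (key == null) {
                return created; // timed out, nothing arrived
            }
            for (WatchEvent<?> event : key.pollEvents()) {
                if (event.kind() == StandardWatchEventKinds.OVERFLOW) {
                    continue; // events were lost; a full rescan would be needed
                }
                // For ENTRY_CREATE the context is the new entry's name, relative to dir.
                created.add(dir.resolve((Path) event.context()));
            }
        }
        return created;
    }
}
```

A long-running process would keep a single WatchService open and loop on take(), calling reset() on each key after handling it. Note that ENTRY_CREATE fires when the entry appears, not when the writer has finished writing it.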

score 0 · Accepted Answer

This is slightly off topic, but given the stated intent, I hope it is relevant enough to post here.

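The question doesn't say how the files arrive in the directory. If they come in over a network or the internet, or are streamed from another process, delivery may not be instantaneous, which creates a risk of picking up and processing a file that has not yet been fully delivered (half of a JPEG file, for example).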

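If you have high throughput, this is what will happen if you allow it. Even if you delay briefly before processing, it will probably still happen sooner or later, one way or another.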

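A common strategy for handling this is to deliver to an intermediate file name (or, better, an adjacent folder). Once delivery is complete, the delivering process renames or moves the file to the correct name and location. That move is effectively instantaneous (atomic). In the case of FTP, at least one well-known tool performs these steps automatically.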
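If the delivering side is also Java, the write-then-rename step might look something like the sketch below. The class name AtomicDelivery, the deliver method, and the ".part" suffix are all hypothetical. It uses Files.move with StandardCopyOption.ATOMIC_MOVE, which throws AtomicMoveNotSupportedException if the file system cannot do the move atomically (for example, across file systems).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class AtomicDelivery {
    // Writes content under a temporary name, then renames it into place,
    // so a reader never sees a half-written file under the final name.
    static Path deliver(Path targetDir, String finalName, byte[] content) throws IOException {
        Path partial = targetDir.resolve(finalName + ".part"); // hypothetical temporary suffix
        Path target = targetDir.resolve(finalName);

        // Step 1: write the whole file under the intermediate name.
        Files.write(partial, content);

        // Step 2: rename in a single step; fails rather than falling back to copy.
        return Files.move(partial, target, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

With the temporary file in the same folder, the watching process has to skip anything ending in the temporary suffix. Delivering into an adjacent folder on the same file system avoids that filter entirely.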

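This is arguably on topic if your partially delivered files sit in the same folder and are only renamed from an alternate file extension, since that could affect some of the options mentioned in the question.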
