java - Java SAX parser progress monitoring

Question

I'm writing a SAX parser in Java to parse a 2.5GB XML file of Wikipedia articles. Is there a way to monitor the progress of the parsing in Java?

score 11 · Accepted Answer

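Thanks to EJP's suggestion of ProgressMonitorInputStream, in the end I extended FilterInputStream so that a ChangeListener can be used to monitor the current read position in bytes.

With this you have finer control, for example to display multiple progress bars for parallel reading of big XML files. Which is exactly what I did.

So, a simplified version of the monitorable stream: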

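import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;
import javax.swing.event.ChangeEvent;
import javax.swing.event.ChangeListener;

/**
 * A class that monitors the read progress of an input stream.
 *
 * @author Hermia Yeung "Sheepy"
 * @since 2012-04-05 18:42
 */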
public class MonitoredInputStream extends FilterInputStream {
   private volatile long mark = 0;
   private volatile long lastTriggeredLocation = 0;
   private volatile long location = 0;
   private final int threshold;
   private final List<ChangeListener> listeners = new ArrayList<>(4);


   /**
    * Creates a MonitoredInputStream over an underlying input stream.
    * @param in Underlying input stream, should be non-null because of no public setter
    * @param threshold Min. position change (in byte) to trigger change event.
    */
   public MonitoredInputStream(InputStream in, int threshold) {
      super(in);
      this.threshold = threshold;
   }

   /**
    * Creates a MonitoredInputStream over an underlying input stream.
    * Default threshold is 16KB; a small threshold may hurt performance on larger streams.
    * @param in Underlying input stream, should be non-null because of no public setter
    */
   public MonitoredInputStream(InputStream in) {
      super(in);
      this.threshold = 1024*16;
   }

   public void addChangeListener(ChangeListener l) { if (!listeners.contains(l)) listeners.add(l); }
   public void removeChangeListener(ChangeListener l) { listeners.remove(l); }
   public long getProgress() { return location; }

   protected void triggerChanged( final long location ) {
      if ( threshold > 0 && Math.abs( location-lastTriggeredLocation ) < threshold ) return;
      if (listeners.isEmpty()) { lastTriggeredLocation = location; return; }
      try {
         final ChangeEvent evt = new ChangeEvent(this);
         for (ChangeListener l : listeners) l.stateChanged(evt);
         // Record the position only after every listener was notified, so that
         // the retry below gets past the threshold check above.
         lastTriggeredLocation = location;
      } catch (ConcurrentModificationException e) {
         triggerChanged(location);  // List changed? Let's re-try.
      }
   }


   @Override public int read() throws IOException {
      final int i = super.read();
      if ( i != -1 ) triggerChanged( ++location ); // pre-increment: report the position after this byte
      return i;
   }

   @Override public int read(byte[] b, int off, int len) throws IOException {
      final int i = super.read(b, off, len);
      if ( i > 0 ) triggerChanged( location += i );
      return i;
   }

   @Override public long skip(long n) throws IOException {
      final long i = super.skip(n);
      if ( i > 0 ) triggerChanged( location += i );
      return i;
   }

   @Override public void mark(int readlimit) {
      super.mark(readlimit);
      mark = location;
   }

   @Override public void reset() throws IOException {
      super.reset();
      if ( location != mark ) triggerChanged( location = mark );
   }
}

It doesn't know, or care, how big the underlying stream is, so you need to get that some other way, for example from the file itself.
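One way to get the total is to ask the file system and then scale the byte position into the int range a progress bar expects. This is only a minimal sketch, assuming the input is a regular file; `ProgressScale`, `totalBytes` and `toBarValue` are hypothetical names and not part of the class above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class ProgressScale {
    private ProgressScale() {}

    /** Total size in bytes of the file about to be parsed. */
    static long totalBytes(Path file) throws IOException {
        return Files.size(file);
    }

    /**
     * Maps a byte position to a bar value in 0..barMax.
     * Works on a double so that positions beyond Integer.MAX_VALUE
     * (files over 2GB) do not overflow.
     */
    static int toBarValue(long bytesRead, long totalBytes, int barMax) {
        if (totalBytes <= 0) return 0;                      // unknown or empty input
        double fraction = (double) bytesRead / totalBytes;
        fraction = Math.max(0.0, Math.min(1.0, fraction));  // clamp in case the size was stale
        return (int) Math.round(fraction * barMax);
    }
}
```

With a bar whose maximum is 1000, the listener would then pass `ProgressScale.toBarValue(mis.getProgress(), total, 1000)` to the bar.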

So, here is a simplified sample usage:

try (
   MonitoredInputStream mis = new MonitoredInputStream(new FileInputStream(file), 65536*4)
) {

   // Setup max progress and listener to monitor read progress.
   // The bar counts in KB because the byte count of a 2.5GB file does not fit in an int.
   // (On a JProgressBar the equivalent calls are setMaximum / setValue.)
   progressBar.setMaxProgress( (int) (file.length() / 1024) ); // Swing thread or before display please
   mis.addChangeListener( new ChangeListener() { @Override public void stateChanged(ChangeEvent e) {
      SwingUtilities.invokeLater( new Runnable() { @Override public void run() {
         progressBar.setProgress( (int) (mis.getProgress() / 1024) ); // Promise me you WILL use MVC instead of this anonymous class mess!
      }});
   }});
   // Start parsing. Listener would call Swing event thread to do the update.
   SAXParserFactory.newInstance().newSAXParser().parse(mis, this);

} catch ( IOException | ParserConfigurationException | SAXException e) {

   e.printStackTrace();

} finally {

   progressBar.setVisible(false); // Again please call this in swing event thread

}

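In my case the progress rises nicely from left to right without abnormal jumps. Adjust the threshold for the best balance between performance and responsiveness. Too small and the reading time can more than double on small devices; too big and the progress will not be smooth.

Hope it helps. Feel free to edit if you find mistakes or typos, or vote up to send me some encouragement! :D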

score 10 · Accepted Answer

Use a javax.swing.ProgressMonitorInputStream.

Answered 2010-06-23T10:48:21.040
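As a rough illustration of that suggestion, the stream can be wrapped before it is handed to the SAX parser. This is a sketch rather than a definitive implementation: `ProgressDialogParsing`, `wrap` and `parse` are hypothetical names, and the one-second delay is an arbitrary choice.

```java
import java.awt.Component;
import java.io.IOException;
import java.io.InputStream;
import javax.swing.ProgressMonitorInputStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class ProgressDialogParsing {
    private ProgressDialogParsing() {}

    /** Wraps raw so that Swing can show a progress dialog if reading turns out to be slow. */
    static ProgressMonitorInputStream wrap(Component parent, String message, InputStream raw) {
        ProgressMonitorInputStream in = new ProgressMonitorInputStream(parent, message, raw);
        // Wait one second before deciding whether a dialog is worth showing.
        in.getProgressMonitor().setMillisToDecideToPopup(1000);
        return in;
    }

    /** Parses the stream with SAX; meant to be called off the event dispatch thread. */
    static void parse(InputStream in, DefaultHandler handler)
            throws IOException, ParserConfigurationException, SAXException {
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
    }
}
```

A call could look like `ProgressDialogParsing.parse(ProgressDialogParsing.wrap(frame, "Parsing", new FileInputStream(file)), handler)`. As far as I can tell the monitor takes its maximum from the stream's `available()`, which returns an int, so for a file over 2GB the dialog may not reflect the true total.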

score 2 · Accepted Answer

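You can get an estimate of the current line/column in your file by overriding the setDocumentLocator method of org.xml.sax.helpers.DefaultHandler/BaseHandler. This method is called with an object from which you can get an approximation of the current line/column when needed.

Edit: To the best of my knowledge, there is no standard way to get the absolute position. However, I am sure some SAX implementations do offer this kind of information.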
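A minimal sketch of the locator approach, assuming the parser supplies a locator at all (SAX encourages this but does not require it). `LineTrackingHandler` and `lastLine` are hypothetical names.

```java
import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

class LineTrackingHandler extends DefaultHandler {
    private Locator locator;    // handed over by the parser before parsing starts
    private int lastLine = -1;  // -1 until a start tag has been seen

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if (locator != null) {
            // Approximate position of the end of the current start tag.
            lastLine = locator.getLineNumber();
        }
    }

    /** Line of the most recent start tag, or -1 if none was seen or no locator was supplied. */
    public int lastLine() {
        return lastLine;
    }
}
```

Turning the line number into a percentage still needs the total number of lines, which has to come from somewhere else, for example a quick first pass over the file.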

score 1 · Accepted Answer

Assuming you know how many articles you have, can't you just keep a counter in the handler? E.g.

public void startElement (String uri, String localName, 
                          String qName, Attributes attributes) 
                          throws SAXException {
    if(qName.equals("article")){
        counter++;
    }
    ...
}

(I don't know whether you are parsing "article", it's just an example.)

If you don't know the number of articles in advance, you will need to count them first. Then you can print the status as nb tags read/total nb of tags, say every 100 tags (counter % 100 == 0).
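The two passes could be sketched with one handler class, used first with an unknown total just to count and then again with the total filled in. `CountingHandler` and its parameters are hypothetical names, and "article" is still only a placeholder element name.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class CountingHandler extends DefaultHandler {
    private final String elementName;  // element to count, e.g. "article"
    private final long total;          // result of the first pass, or 0 if unknown
    private final int reportEvery;     // print a status line every N matches
    private long counter = 0;

    CountingHandler(String elementName, long total, int reportEvery) {
        if (reportEvery <= 0) throw new IllegalArgumentException("reportEvery must be positive");
        this.elementName = elementName;
        this.total = total;
        this.reportEvery = reportEvery;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if (!elementName.equals(qName)) return;
        counter++;
        if (counter % reportEvery == 0) {
            // "read/total" when the total is known, otherwise just the running count.
            System.out.println(total > 0 ? counter + "/" + total : String.valueOf(counter));
        }
    }

    public long count() {
        return counter;
    }
}
```

The first pass would use something like `new CountingHandler("article", 0, Integer.MAX_VALUE)` so that it stays quiet, and the second `new CountingHandler("article", first.count(), 100)`. The price is reading the whole file twice.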

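Or even have another thread monitor the progress. In that case you might want to synchronize access to the counter, but it isn't necessary given that it doesn't need to be really accurate.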
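If a second thread does the reporting, an AtomicLong is a cheap way to share the counter safely. The following is only a sketch under that assumption; `SharedCounterHandler` and `startReporter` are hypothetical names and "article" is a placeholder.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SharedCounterHandler extends DefaultHandler {
    private final AtomicLong counter = new AtomicLong();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if ("article".equals(qName)) counter.incrementAndGet();
    }

    public long count() {
        return counter.get();
    }

    /** Starts a daemon thread that prints the count once per period; shut it down when parsing ends. */
    public ScheduledExecutorService startReporter(long periodMillis) {
        ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "parse-progress");
            t.setDaemon(true);  // never keeps the JVM alive on its own
            return t;
        });
        exec.scheduleAtFixedRate(
                () -> System.out.println("articles read: " + count()),
                periodMillis, periodMillis, TimeUnit.MILLISECONDS);
        return exec;
    }
}
```

The parsing thread calls `startReporter(1000)` before `parse(...)` and `shutdownNow()` on the returned executor in a finally block.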

My 2 cents

score 0 · Accepted Answer

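I'd use the input stream position. Make your own trivial stream class that delegates to or inherits from the "real" one and keeps track of the bytes read. As you say, getting the total file size is easy. I wouldn't worry about buffering, lookahead, etc. - for a large file like this it's chickenfeed. On the other hand, I'd cap the position at "99%".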
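Such a trivial stream class might look like the sketch below. `CountingInputStream` and `percentOf` are hypothetical names; the 99% cap is presumably there because the parser reads ahead, so the last byte can be consumed before the last element has been handled.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

class CountingInputStream extends FilterInputStream {
    private volatile long bytesRead = 0;  // written by the parsing thread only

    CountingInputStream(InputStream in) {
        super(in);
    }

    @Override public int read() throws IOException {
        int b = super.read();
        if (b != -1) bytesRead++;
        return b;
    }

    @Override public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) bytesRead += n;
        return n;
    }

    @Override public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        if (skipped > 0) bytesRead += skipped;
        return skipped;
    }

    // mark/reset would make the count run backwards, so this sketch opts out of them.
    @Override public boolean markSupported() {
        return false;
    }

    /** Percentage read so far, capped at 99 until the caller knows parsing has really finished. */
    public int percentOf(long totalBytes) {
        if (totalBytes <= 0) return 0;
        return (int) Math.min(99, bytesRead * 100 / totalBytes);
    }
}
```

The caller shows 100% itself once `parse(...)` has returned.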
