You really don't want to copy the data at all.  Allocating storage for and copying a large chunk of data can take long enough to kill your frame rate.  This generally rules out byte[] and ByteBuffer[] solutions, even if you didn't have to do a U/V plane swap.
The most efficient way to move data through the system is with a Surface.  The trick is that a Surface isn't a buffer; it's an interface to a queue of buffers.  The buffers are passed around by reference, so when you call unlockCanvasAndPost() you're actually placing the current buffer onto a queue for the consumer, which is often in a different process.
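To make the "queue of buffers passed by reference" idea concrete, here is a rough plain-Java model of it.  This is only a sketch and not how Android implements it: the class name, the method names, and the pool size of 3 are all made up for illustration, and a real Surface hands buffers across processes rather than between two threads.  The point is just that a fixed set of buffers circulates between producer and consumer, and no pixel data is ever copied.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Plain-Java model of a producer/consumer buffer queue (illustrative only). */
class BufferQueueSketch {
    // Hypothetical pool size; a real Surface chooses its own and does not expose it.
    static final int POOL_SIZE = 3;

    private final BlockingQueue<byte[]> free = new ArrayBlockingQueue<>(POOL_SIZE);
    private final BlockingQueue<byte[]> filled = new ArrayBlockingQueue<>(POOL_SIZE);

    BufferQueueSketch(int bufferBytes) {
        // All buffers are allocated once, up front; nothing is allocated per frame.
        for (int i = 0; i < POOL_SIZE; i++) {
            free.add(new byte[bufferBytes]);
        }
    }

    // Producer side: get an empty buffer, then hand the filled one to the consumer.
    byte[] dequeueFree() throws InterruptedException { return free.take(); }
    void queueFilled(byte[] buf) throws InterruptedException { filled.put(buf); }

    // Consumer side: take a filled buffer, then return it to the pool.
    byte[] acquireFilled() throws InterruptedException { return filled.take(); }
    void release(byte[] buf) throws InterruptedException { free.put(buf); }

    /** Pushes frames through the queue; returns how many distinct buffer objects were used. */
    static int run(final int frames) {
        final BufferQueueSketch q = new BufferQueueSketch(1024);
        // Identity-based set: counts buffer objects, not buffer contents.
        final Set<byte[]> seen =
                Collections.newSetFromMap(new IdentityHashMap<byte[], Boolean>());
        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < frames; i++) {
                    byte[] buf = q.acquireFilled();   // same object the producer filled
                    seen.add(buf);
                    q.release(buf);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();
        try {
            for (int i = 0; i < frames; i++) {
                byte[] buf = q.dequeueFree();         // blocks if the consumer is behind
                buf[0] = (byte) i;                    // "draw" into the buffer
                q.queueFilled(buf);                   // hand over the reference, no copy
            }
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println("distinct buffers used: " + run(100));
    }
}
```

However many frames you push through, the count never exceeds the pool size, and the producer simply blocks when the consumer falls behind.  That blocking is the behavior you'd see as a stall if the consumer side of a real Surface is slow.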
There is no public mechanism for creating a new buffer and adding it to the set used by the queue, or for extracting buffers from the queue, so you can't implement a DIY buffering scheme on the side.  There's no public interface to change the number of buffers in the pool.
It'd be useful to know what's causing the hiccups.  The Android tool for analyzing such issues is systrace, available in Android 4.1+ (see the systrace docs and the bigflake example).  If you can identify the source of the CPU load, or determine that the problem isn't CPU load but rather some bit of code getting tangled up, you'll likely have a solution that's much easier than adding more buffers to Surface.