However, the NES seems to have either 2 or 3 byte opcodes depending on what addressing mode the CPU is in. I can't think of any easy way to figure out how many bytes I need to read for each opcode.
The opcode is still only one byte. The extra bytes specify the operands for those instructions that have explicit operands.
To do the decoding, you can create a switch
-block with 256 cases (actually it won't be 256 cases, because some opcodes are illegal). It could look something like this:
opcode = ReadByte(PC++);
switch (opcode) {
...
case 0x4C: // JMP abs
address = ReadByte(PC++);
address |= (uint16_t)ReadByte(PC) << 8;
PC = address;
cycles += 3;
break;
...
}
The compiler will typically create a jump table for the cases, so you'll end up with fairly efficient (albeit slightly bloated) code.
Another alternative is to create an array with one entry per opcode. This could simply be an array of function pointers, with one function per opcode - or the table could contain a pointer to one function for fetching the operands, one for performing the actual operation, plus information about the number of cycles that the instruction requires. This way you can share a lot of code. An example:
const Instruction INSTRUCTIONS[] =
{
...
// 0x4C: JMP abs
{&jmp, &abs_operand, 3},
...
};
I'm also having trouble with figuring how to count cycles. How do I create a clock within a programming language so that everything is in sync?
Counting CPU cycles is just a matter of incrementing a counter, like I showed in my code examples above.
To sync video with the CPU, the easiest way would be to run the CPU for the amount of cycles corresponding to the active display period of a single scanline, then draw one scanline, then run the CPU for the amount of cycles correspond to the horizontal blanking period, and start over again.
When you start involving audio, how you sync things can depend a bit on the audio API you're using. For example, some APIs might send you a callback to which you respond by filling a buffer with samples and returning the number of samples generated. In this case you could calculate the number of CPU cycles that have been emulated since the previous callback and determine how many samples to generate based on that.
On an unrelated side note, since the NES is little-endian, do I need to read programCounter + 1 and then read programCounter to get the correct opcode?
Since the opcode is a single byte and instructions on the 6502 aren't packed into a word like on some other CPU architectures, endianness doesn't really matter. It does become relevant for 16-bit operands, but on the other hand PCs and most mobile phones are also based on little-endian CPUs.