push and pop are pseudo instructions to the assembler, they are not real instructions. You get either a single store with the base register updated (an str with writeback) or an stm.
push {r11}
stmdb r13!,{r11}
push {r10-r12}
stmdb r13!,{r10-r12}
I prefer stmdb to stmfd, they are just different syntax for the same instruction (stmdb and ldmia make sense to me: decrement before and increment after).
assemble then disassemble.
0: e52db004 push {fp} ; (str fp, [sp, #-4]!)
4: e92d0800 stmfd sp!, {fp}
8: e92d1c00 push {sl, fp, ip}
c: e92d1c00 push {sl, fp, ip}
If you look up the stm encoding, or even just look at the bits and think about it, the upper bits of the instruction, 0xe92d, are stmdb/stmfd with r13 as the base register and writeback (the !). The lower 16 bits are flags indicating which registers are to be saved. Notice that at address 4 (0x0800) only bit 11 is set, that is the push of r11. Then at 8 and c (0x1c00) you have that r11 bit set, plus the one below it, r10, and the one above it, r12.
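As a quick check of the bit layout described above, here is a minimal sketch that reads the register list out of the low 16 bits of the encodings from the disassembly. The function name `stm_register_list` is just a hypothetical helper for illustration, and it only looks at the register list, it does not validate the rest of the instruction.

```python
def stm_register_list(encoding):
    """Return the register numbers flagged in the low 16 bits of an ldm/stm encoding."""
    return [n for n in range(16) if encoding & (1 << n)]


# The two stm encodings from the disassembly above.
for word in (0xE92D0800, 0xE92D1C00):
    upper = word >> 16  # 0xe92d: stmdb with r13 as base and writeback
    names = ["r%d" % n for n in stm_register_list(word)]
    print(hex(upper), names)
```

Bit n set means register n is in the list, which is why adding r10 and r12 to the push of r11 just sets the bits on either side of bit 11.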
push and pop are easier to read than trying to remember to use sp and use the ! after the register and remember the ia/db/fd, etc suffix and all that.
Thumb does have an actual push/pop instruction.
The single register variant for arm turned into a single store. It doesnt matter if you use an stm with one register or an str, the operations are functionally equivalent.
So long as you update r13 as part of the operation (the !) and you use db or fd for the stm, you can use either the pseudo instruction or the real instructions.
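To illustrate why the two forms are interchangeable for one register, here is a small model of both instructions acting on a register file and a word-addressed memory. This is a sketch under simplifying assumptions: the function names are made up for illustration, memory is a plain dict keyed by address, and corner cases such as storing the base register itself are ignored.

```python
def str_preindexed_writeback(regs, mem, rt, rn, offset):
    """Model of `str rt, [rn, #offset]!`: update the base, then store at the new address."""
    regs[rn] = (regs[rn] + offset) & 0xFFFFFFFF
    mem[regs[rn]] = regs[rt]


def stmdb_writeback(regs, mem, rn, reglist):
    """Model of `stmdb rn!, {reglist}`: lowest-numbered register goes to the lowest address."""
    start = (regs[rn] - 4 * len(reglist)) & 0xFFFFFFFF
    for i, r in enumerate(sorted(reglist)):
        mem[start + 4 * i] = regs[r]
    regs[rn] = start


def run(op):
    """Run one operation on a fresh machine state and return (sp, memory)."""
    regs = {n: 0 for n in range(16)}
    regs[13] = 0x1000       # arbitrary starting stack pointer
    regs[11] = 0xCAFEF00D   # arbitrary value to save
    mem = {}
    op(regs, mem)
    return regs[13], mem


via_str = run(lambda regs, mem: str_preindexed_writeback(regs, mem, 11, 13, -4))
via_stm = run(lambda regs, mem: stmdb_writeback(regs, mem, 13, [11]))
print(via_str == via_stm)
```

Both paths leave r13 four bytes lower with the saved value at the new top of stack, which is all a push of one register has to do.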
If you are going to store/restore more than one register then definitely list them in a single instruction, dont make a list of several pushes or pops.
no:
push {r10}
push {r11}
push {r12}
yes:
push {r10-r12}
Unless you are on thumb, then you might not have a choice, as you can only push r0-r7 plus r14 and pop r0-r7 plus r15. To save higher registers you have to copy them down into lower registers and then use push. And you have to use push, the thumb stm wont let you use r13 as the base. (thumb2, depending on what extensions are available on your architecture, gives you more of an arm-like experience.)
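The thumb restriction falls straight out of the 16 bit encoding: there are only eight register-list bits (r0-r7) plus one extra bit, which means lr for push and pc for pop. The sketch below builds those encodings; the helper names are hypothetical and it only covers the original 16 bit thumb forms, not thumb2.

```python
def _low_reg_mask(regs):
    """Build the 8-bit register list; anything above r7 has no bit to go in."""
    mask = 0
    for r in regs:
        if not 0 <= r <= 7:
            raise ValueError("r%d cannot be encoded, copy it to a low register first" % r)
        mask |= 1 << r
    return mask


def thumb_push(regs, lr=False):
    """Encode 16-bit thumb `push {regs[, lr]}`; bit 8 adds lr (r14)."""
    return 0xB400 | (int(lr) << 8) | _low_reg_mask(regs)


def thumb_pop(regs, pc=False):
    """Encode 16-bit thumb `pop {regs[, pc]}`; bit 8 adds pc (r15)."""
    return 0xBC00 | (int(pc) << 8) | _low_reg_mask(regs)


print(hex(thumb_push([4, 5, 6, 7], lr=True)))
print(hex(thumb_pop([4, 5, 6, 7], pc=True)))
```

Asking for `thumb_push([10])` raises, which mirrors what the assembler has to tell you: there is nowhere in the instruction to put r10.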
re-reading your question
sp is r13, the stack pointer. The pseudo instruction chooses the right instructions so you dont need to worry about stm vs str.
When you store more than one register you "can" get an optimization on modern arm systems, but it is not guaranteed. If your amba/axi bus is 64 bits wide it can be more than 2 times faster to write 64 bits at a time rather than 32 bits at a time, because on a 64 bit memory system a 32 bit write may take a read-modify-write, but a 64 bit write does not (lets ignore the cache behavior). If the stm is on an aligned address (when using the stack it would take too much code to figure that out, dont worry about it) then a push of 2 registers would be noticeably faster than two separate pushes (unless the core optimizes those into one bus cycle).
If you push say 4 registers one of three things happens. If unaligned you get three transfers: a 32 bit transfer on the unaligned address (lets say 0x1004), then a 64 bit transfer on the aligned address after that (0x1008), then a 32 bit transfer of the last register (0x1010). If that four register push had been on an aligned address then one of two things happens: either two separate 64 bit transfers, two registers to 0x2010 lets say and two to 0x2018, or a length of 2 transfer (two 64 bit items in a single transfer) at the aligned base address, say 0x2010.
You wont get the worst case though, which is four individual 32 bit transfers, so it is worth using the stm/push.
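The transfer counts above can be reproduced with a toy model. This is only a sketch of the reasoning, assuming a 64 bit bus, word-aligned addresses, and that the bus splits a multi-register store into the widest aligned pieces it can; `bus_transfers` is a made-up name, `address` is the lowest address written, and real cores and interconnects are free to behave differently (bursts, write merging, caches).

```python
def bus_transfers(address, count, bus_bytes=8):
    """Split `count` consecutive 32-bit stores starting at `address` into bus transfers.

    Returns a list of (address, width_in_bits) tuples.
    """
    transfers = []
    end = address + 4 * count
    while address < end:
        if address % bus_bytes == 0 and end - address >= bus_bytes:
            size = bus_bytes   # aligned and enough data left: full-width transfer
        else:
            size = 4           # leading or trailing odd word: 32-bit transfer
        transfers.append((address, size * 8))
        address += size
    return transfers


unaligned = bus_transfers(0x1004, 4)   # 32 bit, 64 bit, 32 bit
aligned = bus_transfers(0x2010, 4)     # two 64 bit transfers
separate = [t for a in (0x2010, 0x2014, 0x2018, 0x201C) for t in bus_transfers(a, 1)]
print(len(unaligned), len(aligned), len(separate))
```

In this model four separate single-register pushes always cost four 32 bit transfers, while the single four-register push costs three at worst and two at best.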