If the shift count can be calculated far in advance then I have two ideas that might work
Using self-modifying code
Just modify the shift amount immediate in the instruction. Alternatively generate code dynamically for the functions with variable shift
Group the values with the same shift count together if possible, and do the operation all at once using Duff's device or function pointer to minimize branch misprediction
// shift by constant functions
typedef int (*shiftFunc)(int); // the shift function
#define SHL(n) int shl##n(int x) { return x << (n); }
SHL(1)
SHL(2)
SHL(3)
...
shiftFunc shiftLeft[] = { shl1, shl2, shl3... };
int arr[MAX]; // all the values that need to be shifted with the same amount
shiftFunc shl = shiftLeft[3]; // when you want to shift by 3
for (int i = 0; i < MAX; i++)
arr[i] = shl(arr[i]);
This method might also be done in combination with self-modifying or run-time code generation to remove the need for a function pointer.
Edit: As commented, unfortunately there's no branch prediction on jump to register at all, so the only way this could work is generating code as I said above, or using SIMD
If the range of the values is small, lookup table is another possible solution
#define S(x, n) ((x) + 0) << (n), ((x) + 1) << (n), ((x) + 2) << (n), ((x) + 3) << (n), \
((x) + 4) << (n), ((x) + 5) << (n), ((x) + 6) << (n), ((x) + 7 << (n)
#define S2(x, n) S((x + 0)*8, n), S((x + 1)*8, n), S((x + 2)*8, n), S((x + 3)*8, n), \
S((x + 4)*8, n), S((x + 5)*8, n), S((x + 6)*8, n), S((x + 7)*8, n)
uint8_t shl[256][8] = {
{ S2(0U, 0), S2(8U, 0), S2(16U, 0), S2(24U, 0) },
{ S2(0U, 1), S2(8U, 1), S2(16U, 1), S2(24U, 1) },
...
{ S2(0U, 7), S2(8U, 7), S2(16U, 7), S2(24U, 7) },
}
Now x << n
is simply shl[x][n]
with x being an uint8_t
. The table costs 2KB (8 × 256 B) of memory. However for 16-bit values you'll need a 1MB table (16 × 64 KB), which may still be viable and you can do a 32-bit shift by combining two 16-bit shifts together