In a CUDA device, each SM has 64KB of on-chip memory. By default, this is partitioned into 48KB of shared memory and 16KB of L1 cache. For kernels whose memory access patterns are hard to predict (and which therefore benefit more from caching than from explicitly managed shared memory), the partitioning can be changed to 16KB of shared memory and 48KB of L1 cache.
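For context, this is the kind of configuration I mean, selected per device or per kernel through the runtime API. A minimal sketch; the kernel name `mykernel` and its trivial body are just placeholders:

```
#include <cuda_runtime.h>

// Placeholder kernel; any __global__ function would do here.
__global__ void mykernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main()
{
    // Device-wide preference: 48KB L1 / 16KB shared memory.
    cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);

    // Or per kernel: overrides the device-wide setting for mykernel only.
    cudaFuncSetCacheConfig(mykernel, cudaFuncCachePreferL1);

    // ... allocate memory, launch mykernel, etc.
    return 0;
}
```

Neither setting lets the full 64KB be used as L1, which is what my question is about.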
Why doesn't CUDA allow all of the 64KB per-SM on-chip memory to be used as L1 cache?
There are many kernels that have no use for shared memory at all but could benefit from that extra 16KB being available as L1 cache.