Here's an algorithm based on your improvements that runs only a constant factor longer than is necessary to write out all of the output.
Before beginning, sort within each column from greatest to least and compute suffix sums of the first row (i.e., scan right to left to determine, for each j, the max contribution of columns j..n).
Now do a recursive search with your pruning rule: for the current column, try the values from greatest to least and recurse, breaking out of the row loop if the maximum contribution of the remaining columns plus the sum so far is less than the threshold.
For the analysis, look at the recursion tree. Each leaf is a solution, which, being of size n, pays for every ancestor. There is constant overhead at each node.
If somehow you can consume each solution in sublinear time, perhaps by intertwining producer and consumer, there's another optimization that drops the computation to O(1) per instead of O(n) by detecting when it's necessary to take all of the maximums remaining. This is accomplished by sorting the columns by the gap between the greatest and second-greatest element; then, as soon as there is only one choice for a column going left to right, the others are forced too.