From the Go FAQ:
Why doesn't my multi-goroutine program use multiple CPUs?
You must set the GOMAXPROCS shell environment variable or use the similarly-named function of the runtime package to allow the run-time support to utilize more than one OS thread.
Programs that perform parallel computation should benefit from an increase in GOMAXPROCS. However, be aware that concurrency is not parallelism.
(UPDATE 8/28/2015: Go 1.5 is set to make the default value of GOMAXPROCS the same as the number of CPUs on your machine, so this shouldn't be a problem anymore)
And
Why does using GOMAXPROCS > 1 sometimes make my program slower?
It depends on the nature of your program. Problems that are intrinsically sequential cannot be sped up by adding more goroutines. Concurrency only becomes parallelism when the problem is intrinsically parallel.
In practical terms, programs that spend more time communicating on channels than doing computation will experience performance degradation when using multiple OS threads. This is because sending data between threads involves switching contexts, which has significant cost. For instance, the prime sieve example from the Go specification has no significant parallelism although it launches many goroutines; increasing GOMAXPROCS is more likely to slow it down than to speed it up.
Go's goroutine scheduler is not as good as it needs to be. In future, it should recognize such cases and optimize its use of OS threads. For now, GOMAXPROCS should be set on a per-application basis.
In short: it is very difficult to make Go use "efficient use of all your cores". Simply spawning a billion goroutines and increasing GOMAXPROCS is just as likely to degrade your performance as speed it up because it will be switching thread contexts all the time. If you have a large program that is parallelizable, then increasing GOMAXPROCS to the number of parallel components works fine. If you have a parallel problem embedded in a largely non-parallel program, it may speed up, or you may have to make creative use of functions like runtime.LockOSThread() to ensure the runtime distributes everything correctly (generally speaking Go just dumbly spreads currently non-blocking Goroutines haphazardly and evenly among all active threads).
Also, GOMAXPROCS is the number of CPU cores to use, if it's greater than NumCPU I'm fairly sure that it simply clamps to NumCPU. GOMAXPROCS isn't strictly equal to the number of threads. I'm not 100% sure of exactly when the runtime decides to spawn new threads, but one instance is when the number of blocking goroutines using runtime.LockOSThread() is greater than or equal to GOMAXPROCs -- it will spawn more threads than cores so it can keep the rest of the program running sanely.
Basically, it's quite simple to increase GOMAXPROCS and make go use all cores of your CPU. It's quite another thing at this point in Go's development to actually get it to smartly and efficiently use all cores of your CPU, requiring a lot of program design and finagling to get right.