The return address is stored in a register on some architectures, like ARM. However, a function that calls another level deeper has to spill that register to the stack first, because the call overwrites it with the new return address. So it doesn't change much. It does make it easier to implement return-address control-flow integrity that isn't vulnerable to a race window between the CFI check and the return.
There have been machines with a separate return address stack in on-chip hardware. Forth CPUs were built that way, as was a National Semiconductor part used for running embedded BASIC. Running out of return stack space was a problem, since those 1980s machines were transistor-limited and came with small return stacks.
PIC microcontrollers are still popular and have hardware return stacks.
Modern high-end CPUs have a hardware return stack too (the return stack buffer), but only as a hint to the branch predictor about where a ret instruction will jump.
Separately, there are exploit mitigations that keep return addresses on a stack separate from overflow-prone buffers, putting them out of reach of linear stack buffer overflows. For a recent implementation, see Clang's SafeStack.
There are serious limitations to this approach, though: there's a lot of important data on the stack other than return addresses, and overwriting it is often enough for an attacker to redirect control flow eventually, just more indirectly.
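To make the separate-stack idea concrete, here is a toy model in C. It is only a sketch under loud assumptions: the frame layout, the `toy_*` names, and the `shadow_stack` array are all invented for illustration, and this is not how SafeStack or any real compiler lays out or protects memory. It only shows that an overflow running past a buffer clobbers an adjacent saved return address while a copy kept in a separate region survives.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { BUF_SIZE = 8, SHADOW_DEPTH = 16 };

/* Hypothetical frame, modelled as one byte array so the "overflow" stays
   inside an object we own: [ buf (8 bytes) | saved return address (8 bytes) ]. */
typedef struct {
    unsigned char bytes[BUF_SIZE + sizeof(uint64_t)];
} toy_frame;

uint64_t shadow_stack[SHADOW_DEPTH];  /* separate region: return addresses only */
int shadow_top = 0;

/* On "call", keep the return address both in the frame and on the shadow stack. */
void toy_call(toy_frame *f, uint64_t return_address)
{
    assert(shadow_top < SHADOW_DEPTH);
    memcpy(f->bytes + BUF_SIZE, &return_address, sizeof return_address);
    shadow_stack[shadow_top++] = return_address;
}

/* Unchecked copy into buf; any len > BUF_SIZE spills into the saved address. */
void toy_unsafe_copy(toy_frame *f, const unsigned char *src, size_t len)
{
    assert(len <= sizeof f->bytes);   /* keeps the demo itself well-defined */
    memcpy(f->bytes, src, len);
}

uint64_t toy_in_band_return(const toy_frame *f)
{
    uint64_t ra;
    memcpy(&ra, f->bytes + BUF_SIZE, sizeof ra);
    return ra;
}

uint64_t toy_shadow_return(void)
{
    assert(shadow_top > 0);
    return shadow_stack[--shadow_top];
}

/* Returns 1 if the overflow corrupted the in-band copy but not the shadow copy. */
int toy_overflow_demo(void)
{
    toy_frame f;
    unsigned char attack[sizeof f.bytes];
    memset(attack, 0x41, sizeof attack);

    toy_call(&f, 0x1000);
    toy_unsafe_copy(&f, attack, sizeof attack);

    return toy_in_band_return(&f) == 0x4141414141414141ULL
        && toy_shadow_return() == 0x1000;
}
```

The same overflow would just as happily clobber any other data sitting next to the buffer, such as a function pointer or a length field, which is the limitation described above.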
It has to get put on the stack at some point so you can call more than one function deep. So why not always put it on the stack so that you don't waste a valuable register?
The answer to "why not always put it on the stack" is that a lot of functions are leaf functions, so always writing it to the stack makes every function pay for the memory access rather than just the ones that need it. RISC-ish architectures tend to have enough registers that dedicating one to a link register isn't a big deal (and once you do spill it to the stack you can use the link register as a temporary register anyway).
Some very early CPU architectures supported neither a link register nor a stack for return addresses. For instance, on the PDP-8 (https://en.wikipedia.org/wiki/PDP-8#Subroutines) the JMS instruction writes the return address to the first word of the subroutine it's about to call (the actual subroutine entry point is the word just after that), which meant it didn't conveniently support recursion. It wasn't alone in that, either. I think it just wasn't quite appreciated how important recursion and reentrancy were back in the early 1960s when these ISAs were designed.
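A small C simulation may make the recursion problem easier to see. This is a sketch, not an emulator: the `mem` array, the function names, and the octal addresses 0100, 0200 and 0205 are made up for illustration, and only the return-address slot is modelled.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_WORDS 4096            /* toy memory of 4K words */

uint16_t mem[MEM_WORDS];          /* only return-address slots are used here */

/* JMS executed at address pc, targeting `target`: store the return address
   in the subroutine's first word and resume at the word after it.
   Returns the new program counter. */
uint16_t jms(uint16_t pc, uint16_t target)
{
    assert(target + 1 < MEM_WORDS);
    mem[target] = (uint16_t)(pc + 1);   /* address of the instruction after the JMS */
    return (uint16_t)(target + 1);
}

/* Return with an indirect jump through the subroutine's first word. */
uint16_t jmp_indirect(uint16_t target)
{
    assert(target < MEM_WORDS);
    return mem[target];
}

/* A subroutine at 0200 is called from 0100 and then calls itself from 0205.
   Returns the address the OUTER invocation ends up returning to. */
uint16_t outer_return_after_self_call(void)
{
    const uint16_t sub = 0200;
    jms(0100, sub);               /* mem[0200] = 0101 */
    jms(0205, sub);               /* mem[0200] = 0206, overwriting 0101 */
    (void)jmp_indirect(sub);      /* inner return goes to 0206, as intended */
    return jmp_indirect(sub);     /* outer return also reads 0206 */
}
```

In this model the outer invocation returns to 0206 instead of 0101, because there is exactly one return slot per subroutine and the nested call overwrote it.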
Registers aren't all that valuable on architectures with reasonable numbers of them, and a lot of architectures do "branch and link" instead of an x86-style call. Branch and link generally means that control flow jumps elsewhere and the address of the next instruction is stored in a register. To return, you jump to the address held in that register. Functions are responsible for saving the link register if they clobber it.
This has at least one benefit over x86-style calls: a function like this:
    void some_leaf_function(void);  /* defined elsewhere; makes no calls itself */

    void foo(void)
    {
        for (int i = 0; i < 10; i++)
            some_leaf_function();
    }
has to save its own return address to the stack, but it only needs to save it once, so all ten leaf calls can happen without stack access for the return address.
Of course, architectures like x86 have specialized hardware to optimize calls, so it's probably a wash in the end.
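As a rough illustration of the loop example above, the following C sketch counts only the memory accesses made to save and restore return addresses under the two conventions. The `ra_traffic` type and both model functions are hypothetical, and the model deliberately ignores caches, store forwarding, and return prediction, so it says nothing about actual speed.

```c
#include <assert.h>

/* Memory accesses made purely to save or restore return addresses. */
typedef struct {
    unsigned stores;
    unsigned loads;
} ra_traffic;

/* x86-style model: every call pushes the return address, every ret pops it. */
ra_traffic stack_call_model(unsigned leaf_calls)
{
    ra_traffic t = {0, 0};
    t.stores++;                       /* call foo: push return address */
    for (unsigned i = 0; i < leaf_calls; i++) {
        t.stores++;                   /* call leaf: push */
        t.loads++;                    /* ret from leaf: pop */
    }
    t.loads++;                        /* ret from foo: pop */
    return t;
}

/* Branch-and-link model: calls write the link register; a non-leaf function
   spills it once in its prologue and reloads it once in its epilogue. */
ra_traffic link_register_model(unsigned leaf_calls)
{
    ra_traffic t = {0, 0};
    if (leaf_calls > 0) {
        t.stores++;                   /* foo's prologue: spill link register */
        t.loads++;                    /* foo's epilogue: reload it */
    }
    /* The leaf calls and their returns touch only the link register. */
    return t;
}
```

For ten leaf calls this model gives 11 stores and 11 loads for the push-on-every-call convention against 1 store and 1 load for branch and link. Those are access counts, not timings, so the "probably a wash" caveat still applies.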