Recurrence will help; see the empirical results in this paper: https://arxiv.org/abs/2207.02098. LSTMs can solve the PARITY problem, generalising to longer problem lengths without issue.
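To make the point concrete, here's a minimal sketch (mine, not from the linked paper): parity is computable by a constant-size recurrent update applied once per input bit, so a recurrent model only needs to learn a single XOR step and then generalises to any sequence length, at the cost of O(n) sequential steps.

```python
def parity_recurrent(bits):
    """Parity via one bit of hidden state, updated per token.

    This is the 'recurrent' view of the problem: the same tiny
    update rule is applied n times, so length generalisation is
    free -- exactly what a bounded-depth model cannot replicate
    once the sequence exceeds what its fixed depth can express.
    """
    state = 0
    for b in bits:
        state ^= b  # the entire recurrence an LSTM has to learn
    return state

print(parity_recurrent([1, 0, 1, 1]))  # 1
print(parity_recurrent([1] * 1000))    # 0
```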
The issue with re-introducing recurrence is that no one wants to bear the O(n) growth in computation time. I wonder if we are settling into a local optimum of bounded-depth architectures for a while due to the obsession with rapid scaling.
I guess I should say it's recurrence growing as a function of the input size that people don't seem willing to afford, not the linear dependence specifically.
Not to mention that parity here is a toy problem chosen to highlight the issue; there can be other problems that require recurrence linear or quadratic in the size of the input.