
Recurrence will help, see the empirical results in this paper: https://arxiv.org/abs/2207.02098. LSTMs can solve the PARITY problem, generalising to longer instances of the problem without issue.
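A toy sketch (plain Python, not the paper's LSTM) of why one bit of recurrent state solves PARITY at any length, while a bounded-depth feedforward net can't:

```python
def parity(bits):
    """Recurrent solution to PARITY: carry one bit of state and
    XOR in each input token. O(n) sequential steps, but it
    generalises to any input length by construction."""
    state = 0
    for b in bits:
        state ^= b  # the entire "recurrent cell" is a single XOR
    return state

# Works identically at lengths never trained on:
assert parity([1, 0, 1]) == 0
assert parity([1] * 1001) == 1
```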

The issue with re-introducing recurrence is that no one wants to bear the O(N) growth in computation time. I wonder if we are settling into a local optimum of bounded-depth architectures for a while due to the obsession with rapid scaling.



You don't necessarily need to move to depth N, just because you went recurrent.

You can probably design a modular network that can implement the logarithmic version.


I should have said that recurrence growing as a function of the input size is what people don't seem willing to afford, not the linear dependence specifically.

Not to mention that parity here is a toy problem chosen to highlight the issue; other problems can require recurrence depth linear or even quadratic in the size of the input.


Sure, but it's a toy problem that stands in for many other functions.

In this case you could substitute the XOR function for any other merge function.

As far as variable depth is concerned, utilisation is an issue that introduces additional complexity, but otherwise I don't see the problem.

(No one's expecting fixed compute time for differently sized data elements.)

You can get around that one by building a log(n) threaded pipeline for your log(n) depth network (something like this: https://imgur.com/a/xL4rhFu).
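A sketch of the logarithmic version: pairwise-merge the sequence in a balanced tree, so the depth is ceil(log2(n)) rather than n. XOR stands in for any associative merge function, as mentioned above (the function names are my own illustration, not from the thread or the linked image):

```python
from math import ceil, log2
from operator import xor

def tree_reduce(xs, merge):
    """Reduce xs with an associative merge in ceil(log2(n)) rounds.
    Each round halves the sequence; all merges within a round are
    independent, so they could run in parallel on separate units."""
    depth = 0
    while len(xs) > 1:
        pairs = [merge(xs[i], xs[i + 1]) for i in range(0, len(xs) - 1, 2)]
        if len(xs) % 2:            # odd element carries over unmerged
            pairs.append(xs[-1])
        xs = pairs
        depth += 1
    return xs[0], depth

# Parity of 7 bits in 3 rounds instead of 7 sequential steps:
value, depth = tree_reduce([1, 0, 1, 1, 0, 1, 1], xor)
assert value == 1 and depth == ceil(log2(7))
```

Swapping `xor` for `max`, addition, or any other associative merge gives the same log-depth schedule.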


The RWKV people seem to have adapted a recurrent model so it can be trained in a special parallel mode like attention.

https://github.com/BlinkDL/RWKV-LM
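The trick, as I understand it (details in the repo), is that the recurrence is linear in the hidden state, and linear recurrences h_t = a_t*h_{t-1} + b_t compose associatively, so training can use a parallel prefix scan instead of a sequential loop. A scalar toy sketch of that algebra (my own illustration, not RWKV's actual kernel):

```python
def scan_sequential(coeffs):
    """h_t = a_t * h_{t-1} + b_t, computed step by step: O(T) depth."""
    h, out = 0.0, []
    for a, b in coeffs:
        h = a * h + b
        out.append(h)
    return out

def scan_parallel_form(coeffs):
    """Same recurrence via the associative combine
    (a1, b1) then (a2, b2)  ->  (a1*a2, a2*b1 + b2).
    Because the combine is associative, a real implementation can
    evaluate it as a parallel prefix scan in O(log T) depth; here we
    just verify the algebra with a cumulative fold."""
    out, acc = [], (1.0, 0.0)       # identity element: h -> 1*h + 0
    for a, b in coeffs:
        acc = (acc[0] * a, a * acc[1] + b)
        out.append(acc[1])
    return out

steps = [(0.5, 1.0), (2.0, -1.0), (0.25, 3.0)]
assert scan_sequential(steps) == scan_parallel_form(steps)
```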


Yeah I really like this project, and I've been meaning to dive into it.



