Recurrence will help; see the empirical results in this paper: https://arxiv.org/abs/2207.02098. LSTMs can solve the PARITY problem, generalising to longer problem lengths without issue.
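To make the point concrete, here's a minimal sketch (mine, not from the linked paper): parity is computable by a constant-size recurrent update applied once per input bit, so a recurrent model only needs to learn a single XOR step and then generalises to any sequence length, at the cost of O(n) sequential steps.

```python
def parity_recurrent(bits):
    """Parity via one bit of hidden state, updated per token.

    This is the 'recurrent' view of the problem: the same tiny
    update rule is applied n times, so length generalisation is
    free -- exactly what a bounded-depth model cannot replicate
    once the sequence exceeds what its fixed depth can express.
    """
    state = 0
    for b in bits:
        state ^= b  # the entire recurrence an LSTM has to learn
    return state

print(parity_recurrent([1, 0, 1, 1]))  # 1
print(parity_recurrent([1] * 1000))    # 0
```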
The issue with re-introducing recurrence is that no one wants to bear the O(n) growth in computation time. I wonder if we are settling into a local optimum of bounded-depth architectures for a while due to the obsession with rapid scaling.
I guess I should say it's recurrence growing as a function of the input size that people don't seem willing to afford, not the linear dependence specifically.
Not to mention that parity here is a toy problem chosen to highlight the issue; there can be other problems that require recurrence linear or quadratic in the size of the input.