RNN vs Transformers - who knows...?

Transformers are really popular at the moment - with good reason, it must be said - but their memory depth is not arbitrarily long.

A limitation of existing Transformer models and their derivatives, however, is that the full self-attention mechanism has computational and memory requirements that are quadratic with the input sequence length. With commonly available current hardware and model sizes, this typically limits the input sequence to roughly 512 tokens, and prevents Transformers from being directly applicable to tasks that require larger context, like question answering, document summarization or genome fragment classification.

Citation: Google AI Blog. ‘Constructing Transformers For Longer Sequences with Sparse Attention Methods’. Accessed 15 August 2021. http://ai.googleblog.com/2021/03/constructing-transformers-for-longer.html.
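To put rough numbers on the "quadratic" part of that quote, here is a back-of-envelope sketch. The figures of 12 heads and 4 bytes per value are my own illustrative assumptions, not taken from the blog post, and the function name is made up for this example. It only counts the attention score matrix for a single layer and a single sequence.

```python
def attention_matrix_bytes(seq_len, num_heads=12, bytes_per_value=4):
    """Bytes needed to hold the seq_len x seq_len attention scores for one layer.

    num_heads and bytes_per_value are illustrative assumptions.
    """
    return num_heads * seq_len * seq_len * bytes_per_value


for n in (512, 1024, 2048, 4096):
    mib = attention_matrix_bytes(n) / 2**20
    print(f"{n:>5} tokens -> {mib:8.1f} MiB per layer")
```

Doubling the sequence length quadruples the figure, which is the scaling the quote is talking about.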

NB Long documents can now be summarised by “recursive summarization” (summarise by parts, then summarise the summaries, repeat); I suggested this in another context, nice to see others had the same (fairly obvious) idea and run with it!
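For anyone who wants the idea spelled out, a minimal sketch of the summarise-by-parts loop follows. Everything here is hypothetical: `summarise_chunk` is a stand-in that just keeps the first few words, where a real system would call a summarisation model, and the chunk sizes are arbitrary.

```python
def summarise_chunk(text, max_words=20):
    """Placeholder summariser: keeps the first max_words words.

    A real system would call a summarisation model here.
    """
    return " ".join(text.split()[:max_words])


def split_into_chunks(text, chunk_words=100):
    """Split text into consecutive pieces of at most chunk_words words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words])
            for i in range(0, len(words), chunk_words)]


def recursive_summarise(text, chunk_words=100, max_words=20):
    """Summarise by parts, then summarise the summaries, until one chunk is left."""
    if max_words >= chunk_words:
        raise ValueError("max_words must be smaller than chunk_words")
    # Base case: the text already fits into a single chunk.
    if len(text.split()) <= chunk_words:
        return summarise_chunk(text, max_words)
    # Recursive case: summarise each part, then recurse on the joined summaries.
    partial = [summarise_chunk(chunk, max_words)
               for chunk in split_into_chunks(text, chunk_words)]
    return recursive_summarise(" ".join(partial), chunk_words, max_words)
```

The point is only that each pass shrinks the text, so a model with a fixed input limit never sees more than one chunk at a time.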

The memory length of e.g. an LSTM is, however, arbitrary. The limitation for LSTMs (IF I have understood their modus operandi correctly) is that only a finite number of things can be remembered over that arbitrary period - though that number increases ~exponentially with the number of indexes used (distinct cell states = 2^bits).

The key advantage of transformers, as far as I can tell, is the fact that they can be trained with parallel processing, whereas RNNs are limited to sequential training.
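A toy illustration of that difference, as I understand it. Both functions are made-up scalar simplifications (no learned projections, no masking), so treat this as a sketch of the dependency structure only, not of either architecture.

```python
import math


def rnn_states(xs, w=0.5, u=1.0):
    """Toy scalar RNN: h_t = tanh(w * h_{t-1} + u * x_t)."""
    h, states = 0.0, []
    for x in xs:
        # Each step needs the previous h, so the time steps must run in order.
        h = math.tanh(w * h + u * x)
        states.append(h)
    return states


def attention_output(xs, t):
    """Toy scalar self-attention for position t: softmax(x_t * x_s) weighted sum of xs."""
    scores = [xs[t] * x for x in xs]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(wt * x for wt, x in zip(weights, xs)) / total


xs = [0.1, -0.4, 0.7, 0.2]
hidden = rnn_states(xs)
# No position depends on another position's output, so any order works
# (or all positions at once on parallel hardware).
outputs = {t: attention_output(xs, t) for t in reversed(range(len(xs)))}
```

The RNN loop cannot be reordered, whereas the attention outputs can be computed in any order, which is what makes training over a whole sequence parallelisable.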

Maybe the next thing will be to use LSTMs on the queries of the transformer (to give arbitrary depth on relevance lookups). However, the general question of this post is...

Apart from the training advantages/disadvantages of Transformers vs RNNs, what are the pros and cons of each?

Just to add to your post @JulianSMoore - I also think it's interesting to look into inference performance and speed, especially in the vision domain.

Here's a paper touching on this area: