> Our approach requires an expensive pre-training step - 1 month on 8 GPUs.
Wow! I enjoy playing with neural networks but this kind of thing reminds me that I'm not really doing deep learning...
I have no idea how researchers have the patience and confidence to wait that long for a result. In my own (small-data) work, I get frustrated if it doesn't converge in half an hour. I constantly end up Ctrl-C'ing and tweaking things if it doesn't behave as expected or doesn't appear to still be improving.
You pretty much always need at least 2 GPUs: one for short jobs of 30 minutes or so and for debugging, and the other for longer jobs. It also takes a lot of patience to make only ONE change at a time. Changes that feel intuitive often actually hurt performance, so it's important to verify that each new change really is an improvement.
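For what it's worth, here's a rough sketch of how that two-GPU split can look in practice, pinning each job to its own device with `CUDA_VISIBLE_DEVICES`. The `train.py` script and its flags are just placeholders, not anyone's actual code:

```python
# Hypothetical sketch: run a short debug job and a long job on separate GPUs.
# "train.py" and its flags are placeholders for your own training script.
import os
import subprocess

# GPU 0: quick sanity-check run you can freely Ctrl-C and tweak.
subprocess.Popen(
    ["python", "train.py", "--epochs", "1", "--subset", "0.05"],
    env={**os.environ, "CUDA_VISIBLE_DEVICES": "0"},
)

# GPU 1: the long job you leave running untouched while you iterate on GPU 0.
subprocess.Popen(
    ["python", "train.py", "--epochs", "100"],
    env={**os.environ, "CUDA_VISIBLE_DEVICES": "1"},
)
```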
I'm guessing they were running a bunch of experiments in parallel and killed ones that weren't yielding good results in the first few hours/days/weeks - I doubt they are wanting for GPU resources unlike me!
The critical thing to note is that although the initial language model takes a month to train on 8 GPUs, fine-tuning the language model for a specific task is much much cheaper (a few hours on a single GPU).
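To make that concrete, here's roughly what the cheap fine-tuning step can look like. This uses the Hugging Face transformers library purely as an illustration (not necessarily what the paper's authors did), and the model name, task, and hyperparameters are stand-ins:

```python
# Minimal sketch: fine-tune an already pre-trained language model for one task.
# The expensive part (pre-training) is skipped -- we just download those weights.
# Library, model name, and dataset are illustrative placeholders.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # pre-trained weights, not trained here

dataset = load_dataset("imdb")  # placeholder downstream task: sentiment

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="finetuned",
    num_train_epochs=2,              # a couple of epochs is typically enough
    per_device_train_batch_size=16,
)
Trainer(model=model, args=args, train_dataset=dataset["train"]).train()
```

On a single GPU this kind of run finishes in hours, not weeks, which is the whole point of the pre-train-once, fine-tune-many-times approach.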