> Our approach requires an expensive pre-training step - 1 month on 8 GPUs.
Wow! I enjoy playing with neural networks but this kind of thing reminds me that I'm not really doing deep learning...
I have no idea how researchers have the patience and confidence to wait that long for a result. In my own (small-data) work, I get frustrated if it doesn't converge in half an hour. I constantly end up Ctrl-C'ing and tweaking things if it doesn't behave as expected or doesn't appear to still be improving.
You pretty much always need at least 2 GPUs: one for short jobs of 30 minutes or so and for debugging, and the other for longer jobs. It also takes a lot of patience to make only ONE change at a time. Changes that feel intuitive often actually hurt performance, so it's important to verify that each new change really is an improvement.
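For what it's worth, here's a rough sketch of how that two-GPU split can look in practice, pinning each job to its own device with `CUDA_VISIBLE_DEVICES`. The `train.py` script and its flags are just placeholders, not anyone's actual code:

```python
# Hypothetical sketch: run a short debug job and a long job on separate GPUs.
# "train.py" and its flags are placeholders for your own training script.
import os
import subprocess

# GPU 0: quick sanity-check run you can freely Ctrl-C and tweak.
subprocess.Popen(
    ["python", "train.py", "--epochs", "1", "--subset", "0.05"],
    env={**os.environ, "CUDA_VISIBLE_DEVICES": "0"},
)

# GPU 1: the long job you leave running untouched while you iterate on GPU 0.
subprocess.Popen(
    ["python", "train.py", "--epochs", "100"],
    env={**os.environ, "CUDA_VISIBLE_DEVICES": "1"},
)
```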
I'm guessing they were running a bunch of experiments in parallel and killed ones that weren't yielding good results in the first few hours/days/weeks - I doubt they are wanting for GPU resources unlike me!
The critical thing to note is that although the initial language model takes a month to train on 8 GPUs, fine-tuning the language model for a specific task is much much cheaper (a few hours on a single GPU).
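To make that concrete, here's roughly what the cheap fine-tuning step can look like. This uses the Hugging Face transformers library purely as an illustration (not necessarily what the paper's authors did), and the model name, task, and hyperparameters are stand-ins:

```python
# Minimal sketch: fine-tune an already pre-trained language model for one task.
# The expensive part (pre-training) is skipped -- we just download those weights.
# Library, model name, and dataset are illustrative placeholders.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # pre-trained weights, not trained here

dataset = load_dataset("imdb")  # placeholder downstream task: sentiment

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="finetuned",
    num_train_epochs=2,              # a couple of epochs is typically enough
    per_device_train_batch_size=16,
)
Trainer(model=model, args=args, train_dataset=dataset["train"]).train()
```

On a single GPU this kind of run finishes in hours, not weeks, which is the whole point of the pre-train-once, fine-tune-many-times approach.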