Estimated costs to train GPT-4 range from $60 million to $100 million USD. But will it always take millions of dollars to train a new foundation language model? There are (at least) three different forces in play, and they are pulling in different directions.
(1) The predominant factor is “bigger is better”. The more parameters in a model, the better it performs, and so far there doesn’t seem to be an upper limit. GPT-4 has been reported to have as many as 100 trillion parameters (OpenAI has not confirmed a figure), which would take, in aggregate, centuries of compute time (highly parallelizable, thankfully) to crunch. The effective limit may be extant human knowledge: once the training set includes the entire internet, every book ever written, etc., it may be difficult to find useful and novel training data. Legal challenges, regulations, and copyright infringement concerns may also put a damper on the availability of training material.
(2) A secondary factor is optimization. Work to date has largely focused on figuring out what’s possible, not on slimming down the resources needed. I’m confident there are many orders-of-magnitude improvements waiting to be discovered, though even if things get 1000x easier, working with 100 trillion of anything is still a heavy lift. Another branch of optimization will be delta training: instead of rebuilding huge models from scratch, it will be possible to train only on differential or new data, which will be much more practicable, including for public/open source models. I wouldn’t be surprised to see the Apache Foundation, Mozilla, or other groups sponsor freely available foundation models.
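To put the “heavy lift” in rough numbers, here is a back-of-the-envelope sketch. It is an illustration only: it takes the 100 trillion parameter figure mentioned above at face value, assumes 16-bit weights, and reads “1000x easier” as “1000x fewer parameters”, which is just one of many ways things could get easier. The names in the code are made up for this example.

```python
# Back-of-the-envelope only: every figure here is an illustrative assumption.
REPORTED_PARAMS = 100e12  # the reported (unconfirmed) 100 trillion figure
BYTES_PER_PARAM = 2       # assumes 16-bit weights
SPEEDUP = 1000            # the hypothetical "1000x easier" scenario


def weight_storage_tb(params: float, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Return terabytes needed just to hold the weights (1 TB = 1e12 bytes)."""
    return params * bytes_per_param / 1e12


full = weight_storage_tb(REPORTED_PARAMS)
reduced = weight_storage_tb(REPORTED_PARAMS / SPEEDUP)

print(f"as reported:   {full:,.1f} TB of weights")
print(f"1000x smaller: {reduced:,.1f} TB of weights")
```

Under these assumptions, even the slimmed-down case needs about 200 GB just to store the weights, before any training overhead is counted.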
(3) The third factor, Moore’s Law, is the least important here, though in the world of general computing it has been perhaps the most significant. Despite repeated predictions of its demise, we haven’t quite reached the end of improvements in transistor speed/cost. Expect to see (even more) specialized hardware optimized for generative AI; GPUs are not the be-all and end-all.
Taken as a whole, expect to see models getting bigger and more expensive in the short term until we somehow top out on “bigger is better.” After that, there will be impressive wave after wave of hardware and software optimizations that bring the cost down. A decade out will be wild. Will we be carrying around GPT-level models in our pockets?
This post was originally published on LinkedIn. 100% free-range human written.