I honestly can't see why LLMs should be good at this sort of thing. I am convinced you need a completely different approach. At the very least, you usually want exactly one completely correct result. Good luck getting current models to do that.
Mathematical reasoning isn't totally out of scope for LLMs. LLMs roughly do two things: move data around and recognize patterns. Reasoning leans heavily on moving data around according to context-sensitive rules, which is well within the scope of LLMs. The problem is that general problem solving can require an arbitrary amount of data movement, while current LLM architectures can only perform a fixed number of translation/rewrite steps before they must produce output. That puts most complex reasoning problems out of bounds, so the models learn to lean heavily on pattern matching. But this isn't an intrinsic limitation of LLMs as a class of computing device, just a limit of current architectures.
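To make the fixed-budget point concrete, here is a toy sketch. It is only an analogy and says nothing about how a transformer actually computes; the function name and the `BUDGET` constant are made up for illustration. A "rewriter" sums a list one step at a time, and any input that needs more steps than the budget allows simply never reaches an answer.

```python
def reduce_with_budget(values, max_steps):
    """Sum a non-empty list by repeated rewriting, within a step budget.

    Each step rewrites the first two entries into their sum. Returns the
    result if the list collapses to one number in time, otherwise None.
    """
    work = list(values)
    steps = 0
    while len(work) > 1:
        if steps == max_steps:
            return None  # budget spent before an answer was reached
        work[:2] = [work[0] + work[1]]
        steps += 1
    return work[0]


BUDGET = 4  # hypothetical fixed number of rewrite steps

short_result = reduce_with_budget([1, 2, 3, 4, 5], BUDGET)     # needs 4 steps
long_result = reduce_with_budget([1, 2, 3, 4, 5, 6], BUDGET)   # needs 5 steps
print(short_result, long_result)
```

The rule being applied is the same in both cases; only the number of steps differs, which is the sense in which the limit belongs to the budget rather than to the rewriting itself.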
One core issue is that we need to convert spoken/written languages (e.g. English) into more formal mathematical languages, since the underlying mathematical problem is sometimes written in prose. The example in the paper:
> When Sophie watches her nephew, she gets out a variety of toys for him. The bag of building blocks has 31 blocks in it. The bin of stuffed animals has 8 stuffed animals inside. The tower of stacking rings has 9 multicolored rings on it. Sophie recently bought a tube of bouncy balls, bringing her total number of toys for her nephew up to 62. How many bouncy balls came in the tube?
So I would argue it's critical that LLMs know how to convert text to math and then perform those calculations. This extends beyond math to the underlying logic as well.
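For the quoted problem, the text-to-math conversion might look like the sketch below. The variable names are my own; only the numbers come from the problem statement.

```python
# Quantities stated in the prose problem.
blocks = 31
stuffed_animals = 8
rings = 9
total_toys = 62

# "bringing her total number of toys ... up to 62" translates to:
#   blocks + stuffed_animals + rings + bouncy_balls = total_toys
known_toys = blocks + stuffed_animals + rings
bouncy_balls = total_toys - known_toys

print(bouncy_balls)  # 14
```

The arithmetic is trivial once the equation is written down; the hard part for a model is the step in the comment, turning "bringing her total up to" into an equation.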
We just need to figure out how to teach LLMs to read, write, and understand formal languages. My guess is attention heads could probably work in this context, but we might want something a little more rigid, naturally extending from the rigidity of logic and formal languages. Alternatively, we might simply not have figured out how to properly train LLMs on formal languages so that they preserve the underlying logic and axioms needed to perform math calculations correctly.
Recurrent and transformer models are Turing complete, or at least close to being Turing complete (apologies, I'm not sure of the precise terminology here).
As a result, they can at least simulate a brain and are capable of exhibiting human-like intelligence. The "program" is the trained dataset, and we have seen significant improvements in smaller models simply by enhancing the dataset.
We still don’t know what the optimal "program" looks like or what level of scaling is truly necessary. But in theory, achieving the goal of AGI with LLMs is possible.
I'm a math PhD student at the moment and I regularly use o1 to try some quick calculations I don't feel like doing. While I feel like GPT-4o is so distilled that it just tries to recall the answer from memory, o1 actually works with what you gave it and tries to calculate. It can be quite useful.
Just earlier today I wanted to check if exp(inx) is an orthonormal basis on L^2((0, 1)) or if it needs normalization. This is an extremely trivial one though. Less trivially I had an issue where a paper claimed that a certain white noise, a random series which diverges in a certain Hilbert space, is actually convergent in some L^infinity type space. I had tried to use a Sobolev embedding but that was too crude so it didn't work. o1 correctly realized that you have to use the decay of the L^infinity norm of the eigenbasis, a technique which I had used before but just didn't think of in the moment. It also gave me the eigenbasis and checked that everything works (again, standard but takes a while to find in YOUR setting). I wasn't sure about the normalization so again I asked it to calculate the integral.
This kind of adaptation to your specific setting, instead of just spitting out memorized answers for common settings, is what makes o1 useful for me. Now again, it is often wrong, but if I am completely clueless I like to watch it attempt things and I can get inspiration from that. That's much more useful than seeing a confident wrong answer like 4o would give.
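For the exp(inx) question mentioned above, a quick numerical sanity check is easy to write by hand. This is not what o1 produced, just a sketch with made-up helper names that approximates the L^2((0, 1)) inner product with a midpoint rule. On (0, 1) each exp(inx) has norm 1, but different n are not orthogonal to each other; rescaling to exp(2πinx) gives an orthonormal family.

```python
import cmath


def inner_product(f, g, a=0.0, b=1.0, steps=20000):
    """Midpoint-rule approximation of the L^2((a, b)) inner product <f, g>."""
    h = (b - a) / steps
    total = 0j
    for k in range(steps):
        x = a + (k + 0.5) * h
        total += f(x) * g(x).conjugate()
    return total * h


def e_plain(n):
    """x -> exp(i n x)."""
    return lambda x: cmath.exp(1j * n * x)


def e_scaled(n):
    """x -> exp(2 pi i n x)."""
    return lambda x: cmath.exp(2j * cmath.pi * n * x)


norm_sq = inner_product(e_plain(1), e_plain(1))         # squared norm of exp(ix)
cross_plain = inner_product(e_plain(1), e_plain(2))     # exp(ix) against exp(2ix)
cross_scaled = inner_product(e_scaled(1), e_scaled(2))  # same, rescaled family

print(abs(norm_sq), abs(cross_plain), abs(cross_scaled))
```

The second number comes out clearly nonzero and the third is zero up to rounding, so on (0, 1) the issue is the frequency scaling, not a normalization constant.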