To what extent is an LLM necessary here? As far as I can tell (and perhaps I haven't looked closely enough yet) the purpose of the LLM here is to generate things that look plausibly like python functions conforming to a given type signature.
But it should be possible to generate random, correct python functions conforming to a given type signature without an LLM. This would be an exercise like [1], just with a substantially more complex language. But might a restricted language be more ergonomic? Something like PushGP [2]?
So I guess my questions would be:
(1) What's the value add of the LLM here? Does it substantially reduce the number of evaluations necessary to converge? If so, how?
(2) Are other genetic programming techniques less competitive on the same problems? Do they produce less fit solutions?
(3) If a more "traditional" genetic programming approach can achieve similar fitness, is there a difference in compute cost, including the cost to train the LLM?
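For reference, the kind of non-LLM generator I have in mind is something like the strongly typed GP of [1]. A toy sketch (all primitives and names here are my own illustration, not from any paper): grow a random expression tree in which every node respects the type signature, then emit a Python function.

```python
import random

# Toy strongly typed GP generator: every generated function is
# syntactically valid and conforms to the signature int -> int.
PRIMITIVES = {
    int: [
        ("({0} + {1})", [int, int]),
        ("({0} * {1})", [int, int]),
        ("({0} - {1})", [int, int]),
    ],
}
TERMINALS = {int: ["x", "1", "2"]}

def random_expr(typ, depth, rng):
    """Return a random expression string of the given type."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS[typ])
    template, arg_types = rng.choice(PRIMITIVES[typ])
    args = [random_expr(t, depth - 1, rng) for t in arg_types]
    return template.format(*args)

def random_function(rng):
    """Generate a random, always-valid function with signature int -> int."""
    body = random_expr(int, depth=3, rng=rng)
    src = f"def f(x):\n    return {body}\n"
    namespace = {}
    exec(src, namespace)
    return namespace["f"], src

rng = random.Random(0)
f, src = random_function(rng)
print(src)
print(f(5))
```

Every candidate produced this way is correct by construction; the open question is whether such candidates are *useful*, which is what my questions above are about.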
> it should be possible to generate random, correct python functions conforming to a given type signature without an LLM
The space of syntactically viable programs is far, far larger than the space of useful ones. You need more than monkeys and typewriters. The point of using PaLM 2 here is that you don’t want your candidates to be random, you want them to be plausible, so you’re not wasting time on nonsensical programs.
Further, a genetic algorithm with random program generation will have a massive cold start problem. You’re not going to make any progress in the beginning, and probably ever, if the fitness of all of your candidates is zero.
The fact that LLMs generate text which is better than random is not something the authors need to spell out. Reducing the cross entropy between predicted and target distributions is the entire purpose of training.
Put another way: a flat probability distribution from an untrained network is equivalent to the random decoding you’re skeptical an LLM is better than. That’s not a criticism the authors’ peers would likely put forward.
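To spell that equivalence out with a toy calculation (the five-token vocabulary is made up for illustration): zero logits from an untrained network softmax to exactly the uniform distribution, which is what random decoding samples from, and the cross-entropy of that distribution against any target token is the log of the vocabulary size.

```python
import math

# An untrained network's flat logits give exactly the uniform
# distribution that random decoding samples from.
vocab = ["def", "return", "x", "+", "1"]
flat_logits = [0.0] * len(vocab)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

probs = softmax(flat_logits)
print(probs)  # uniform: every token gets 1/5

# Cross-entropy against a one-hot "next token is 'return'" target:
uniform_ce = -math.log(probs[1])
print(uniform_ce, math.log(len(vocab)))  # both equal log(5)
```

Training exists precisely to push that cross-entropy below the uniform baseline, so a trained model beating random decoding is the expected case, not a finding.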
I’m not sure why you felt the need to correct me. “Target” is the correct terminology. As a sanity check I even looked up the PyTorch docs, which explicitly use “target” in their example.
Because the target distribution is not known, whatever the PyTorch docs say. The test distribution is known, and nobody knows what its relation to the target distribution is, beyond a hope and a prayer.
Same paper you are reading. This is comparing apples to oranges. Their "random" strategy is only capable of assembling very constrained python programs, whereas the LLM is coming up with things that look like the sort of stuff a person might write, utilizing the whole language (modulo hallucinatory nonsense). So it's not a reasonable comparison at all.
Instead, for the purpose of making something resembling a reasonable comparison, it would make a lot more sense to train an LLM on programs written in the same language they allow their "random" strategy to use. I wouldn't recommend trying this in python, as it's way too big a language. Something smaller and more fit for purpose would likely be much more tractable. There are multiple examples from the literature to choose from.
EDIT: To be clear, it's obvious the LLM solution outperformed the hand-rolled solution, by a large margin. What would actually be an interesting scientific question, though, is: to what extent is the success attributable to the LLM? That's not possible to answer given the results, because the hand-rolled solution and the LLM solution are literally speaking different languages.
The purpose of the ablation is to demonstrate the value-add of using an LLM. Comparing to another type of candidate generator wouldn’t be ablation anymore, it would just be another study.
As I have mentioned previously, using an LLM addresses the cold start problem by immediately generating plausible-looking programs. The reason random token selection is limited to constrained programs is that valid programs are statistically unlikely to occur.
And the ablation directly demonstrates that the LLM is better than random.
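As a quick sanity check on the "valid programs are statistically unlikely" point, you can sample random token sequences and count how many even parse (the token pool below is a small illustrative sample, not the model's real vocabulary):

```python
import ast
import random

# Uniformly random token sequences almost never form valid Python.
TOKENS = ["def", "return", "if", "else", "(", ")", ":", "+", "x", "f",
          "1", "0", "=", "for", "in", "range", ",", "lambda", "*", "-"]

rng = random.Random(0)
valid = 0
trials = 10_000
for _ in range(trials):
    src = " ".join(rng.choice(TOKENS) for _ in range(10))
    try:
        ast.parse(src)
        valid += 1
    except SyntaxError:
        pass

print(f"{valid}/{trials} random 10-token strings parse as Python")
```

And parsing is the easy part: the fraction that additionally type-checks, runs, and scores above zero fitness is smaller still, which is the cold start problem in miniature.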
Genetic programming algorithms don't start cold, just as FunSearch doesn't start cold. It starts with a partial Python program with "missing" lines that the system has to fill in. That's pretty much schema-guided program synthesis to a T, down to the generate-and-test approach.
Unfortunately you won't find any references to that kind of earlier work in DeepMind's paper (as in most of their papers), so here's a paper I found on the internet with a six-second search that explains what a program schema is:
> Intuitively, a program schema is an abstraction of a class of actual programs, in the sense that it represents their data-flow and control-flow, but neither contains all their actual computations nor all their actual data structures. Program schemas have been shown to be useful in a variety of applications. In synthesis, the main idea is to simplify the proof obligations by taking the difficult ones offline, so that they are proven once and for all at schema design time. Also, the reuse of existing programs is made the main synthesis mechanism.
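To make the schema idea concrete, here's a toy FunSearch-style setup where a fixed, hand-written skeleton is kept and only the body of one hole is searched. The skeleton, candidates, and scoring below are all made up for illustration; the real system's skeletons and evaluator are far more involved.

```python
# Program schema: everything is fixed except the {body} hole.
SKELETON = """
def priority(item):
    return {body}

def solve(items):
    # The fixed, hand-written part: rank items by the evolved
    # priority function and greedily keep the top half.
    ranked = sorted(items, key=priority, reverse=True)
    return ranked[: len(items) // 2]
"""

def evaluate(body, items, target):
    """Fitness of one candidate body: does the greedy selection hit target?"""
    namespace = {}
    exec(SKELETON.format(body=body), namespace)
    return sum(namespace["solve"](items)) == target

# Candidate fill-ins for the hole; in FunSearch these come from the LLM.
candidates = ["item", "-item", "item % 3"]
items = [5, 1, 4, 2, 3, 6]
best = [b for b in candidates if evaluate(b, items, target=15)]
print(best)
```

The generate-and-test loop only ever touches the hole, which is exactly the schema-guided setup described in the quoted passage.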
The original question was “why would they use an LLM here”. Whether candidates are an entire program or just a portion of the program is immaterial to that question. In fact, it is an example of a situation where random decoding is especially poorly matched against an autoregressive model, since there’s already context to use.
Anyway, you can see there is a cold start issue in the ablation study, since the random strategy didn’t even begin to register until about halfway through the test.
Of course if you start without any inductive biases (i.e. a program schema in the case of the FunSearch approach) there will be a "cold start" problem, but FunSearch doesn't start cold, as genetic programming doesn't start cold.
The question is the "cold start issue". That's what I'm addressing. Maybe you're not used to being corrected? Get used to it.
The found function is in here: https://github.com/google-deepmind/funsearch/blob/main/cap_s.... I'm not very familiar with genetic algorithms but it's pretty hard for me to imagine that they couldn't come up with this. I'd be surprised if too many people (if any) have tried.
On the other hand, as observed in Appendix A.2 of this paper, the non-LLM genetic approach would have to be engineered by hand more than the LLM approach.
I guess what I'm really trying to get at is this seems like a huge missed opportunity to show their LLM is actually doing something cool. Like if it somehow significantly outperforms established genetic programming update strategies that should be pretty easy to demonstrate given their experimental setup, but no attempt was made. That's kinda bizarre... like, in what sense is this an advancement of the field?
They need to show something to keep the excitement around Google's upcoming AI alive. Expect a slew of (non-)discoveries to continue unabated going forward.
Genetic algorithms, even with programmed constraints, will come up with many nonsensical programs, even if those programs can be made mostly syntactically correct (given sufficient effort).
The difference LLMs make here is to limit the space of possible mutations to largely semantically plausible programs.
The above should answer your (1) and (2).
On 3), a trained LLM is useful for many, many purposes, so the cost of training it from scratch after amortization wouldn't be significant. It may cost additionally to fine-tune an LLM to work in the FunSearch framework but fine-tuning cost is quite minimal. Using it in the framework is likely a win over genetic programming alone.
> The difference LLMs make here is to limit the space of possible mutations to largely semantically plausible programs.
Yet, sadly, the work under question here didn't make any attempt to show this, AFAICT. It's an interesting hypothesis, but as yet untested.
> a trained LLM is useful for many, many purposes, so the cost of training it from scratch after amortization wouldn't be significant. It may cost additionally to fine-tune an LLM to work in the FunSearch framework but fine-tuning cost is quite minimal. Using it in the framework is likely a win over genetic programming alone.
Again, an interesting hypothesis that probably could be investigated with the experimental apparatus described in the paper. But it wasn't. Have you?
>> The difference LLMs make here is to limit the space of possible mutations to largely semantically plausible programs.
> Yet, sadly, the work under question here didn't make any attempt to show this, AFAICT. It's an interesting hypothesis, but as yet untested.
It would be ideal to do that. But given the way LLMs work, it is almost a given. That could be why it didn't occur to the researchers, who are very familiar with LLMs.
>> a trained LLM is useful for many, many purposes, so the cost of training it from scratch after amortization wouldn't be significant. It may cost additionally to fine-tune an LLM to work in the FunSearch framework but fine-tuning cost is quite minimal. Using it in the framework is likely a win over genetic programming alone.
> Again, an interesting hypothesis that probably could be investigated with the experimental apparatus described in the paper. But it wasn't. Have you?
Perhaps it's quite evident to people in the trench, as the hypothesis seems very plausible to me. A numerical confirmation would be ideal as you suggested.
Given my management experience, I'd say it's smart of them to focus their limited resources on something more innovative in this exploratory work. Details can be left for future work or worked out by other teams.
Limited resources: time of highly competent researchers, management time
Science takes time. A single paper need not answer everything completely. Also, in many fields, it’s common to find out what works before finding out why it works.
Inductive program synthesis stood still for decades because the search space is far too large to get anything beyond absolutely trivial programs; LLMs vastly trim the search space (in many wrong ways too, of course), and then you can apply IPS to fine-tune and test. You cannot do this without LLMs (not in any way we know of), because otherwise you just test billions of programs that make absolutely no sense, even for trivial cases.
Quantify "vastly trim" for me please. All I see is a firehose of compute that spams and spams millions of samples until it hits a lucky guess. That's not trimming anything, it's the computing equivalent of the human wave attack.
But, please quantify "vastly trim".
Edit: the subject of my PhD research was an inductive program synthesis approach, and while I agree with the bit about the size of program search spaces, LLMs are very clearly not a solution to that. DeepMind's "solution" is instead their massive computational capacity, which makes it possible to search deeper and wider in the space of Python programs.
FunSearch is like one of those rocket cars that people make once in a while to break land speed records. Extremely expensive, extremely impractical and terminally over-specialised to do one thing, and do that thing only. And, ultimately, a bit of a show.
Well, the approach we use (and one that was used at MS Research before GPT was a thing) is to ask the LLM to provide guesses for instructions to get to a (sub-)solution and then let IPS (in Prolog, in our case) do the rest. It seems to work for what we are doing.
Disclaimer: did my dissertation on IPS (with genetic programming), but in the 90s, when we basically had no compute power. The results were horrible outside trivialities; now they aren't anymore.
Well, if you're willing to use Prolog I commend you, but then you don't need an LLM because you don't need to do a dumb search of a large space. You can instead use second-order Resolution for induction. Here's how it's done (a.k.a. shameless plug of my PhD work):
>> Disclaimer: did my dissertation on IPS (with genetic programming), but in the 90s, when we basically had no compute power. The results were horrible outside trivialities; now they aren't anymore.
But that's because the training data and the model parameters keep increasing. Performance keeps improving, not because capabilities improve, but only because more and more resources are spent on the same old problems. There are better ways than that. All the classical AI problems, planning, verification, SAT, now have fast solutions thanks to better heuristics, not thanks to more data (which would be useless to them anyway). That's a real advance.
I still would like a quantification of the "vast trim" btw. Numbers, or it didn't happen.
Interesting. Do you have a transpiler from Prolog to Python-subset? I would have preferred to use the LLM to do the translation, instead, and let the Prolog-based IPS (is it based on bottom clause construction?) do what it does best.
I'm working on a paper about this but I guess that won't be "layman terms". Let me try to give a simple explanation. I'm assuming "layman" means "layman programmer" :)
In the simplest of terms that I think can be understood by a layman programmer then, second-order SLD-Resolution in Meta-Interpretive Learning (the method I discuss above) doesn't need to search for a program in a large program space because it is given a program, a higher-order logic program that is specialised into a first-order program during execution.
Without having to know anything about the execution of Prolog programs you can think of how a program in any programming language is executed. The execution doesn't need to search any program space, it only needs to calculate the output of the program given its input. The same goes for the execution of a Prolog program, but here the result of the execution is a substitution of the variables in the Prolog program. In the case of a higher-order program, the substitution of variables turns it into a first-order program (that's what I mean when I say that the higher-order program is "specialised").
Here's an example (using Louise, linked above). The following are the training examples and first- and second-order background knowledge to learn the concept of "even". The terminology of "background knowledge" is peculiar to Inductive Logic Programming, what it really refers to is the higher-order program I discuss above:
If you squint a bit you'll see that the clauses that make up this first-order program are the clauses of the second-order "background knowledge", with their existentially quantified variables substituted for predicate symbols: 'even', 'zero' and 'prev'. Those symbols are taken from the first-order background knowledge (i.e. the first-order subset of the higher-order program) and the examples. There is an extra symbol, '$1', which is an invented predicate symbol, not found in the background knowledge or examples and instead automatically generated during execution. If you squint a bit harder you'll see that the clause '$1'(A):-prev(A,B),even(B). with that symbol in its head is a definition of the concept of "odd". So Louise learned the concept of "even" by inventing the concept of "odd".
The natural question to ask here is, I suspect, couldn't you do the same thing just by filling in the blanks in the second-order clauses in the background knowledge, as if they were dumb templates? You could, and in fact that's the done thing in inductive program synthesis. But that's when you end up searching the space of all logic programs (i.e. the space of sets of first-order clauses). There's a whole bunch of programs you could construct that way, and you'd have to decide which to keep. To do that you'd have to construct each of them, which means constructing the powerset of all constructible clauses. Powersets grow exponentially with the size of their base set, leading to combinatorial explosion. The win of using SLD-Resolution is that it only needs to find substitutions of variables, which can be done efficiently, and it only returns those substitutions of the variables in the second-order program that prove the positive examples (and disprove the negative ones). So it's always right (SLD-Resolution is sound and complete).
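The "dumb template" enumeration I'm contrasting against can be sketched in a few lines of Python (a deliberately simplified toy, with no predicate invention and the metarules hard-coded as higher-order functions; the real thing works over clauses, not lambdas):

```python
from itertools import product

# Naive metarule instantiation: fill the second-order clauses
#   P(x) :- Q(x)   and   P(x) :- R(x, y), P(y)
# with predicate symbols, then keep the instantiation that proves all
# positive examples and no negatives. Unlike second-order SLD-Resolution,
# this enumerates the hypothesis space explicitly.

def zero(x):
    return x == 0

positives = [0, 2, 4, 6]
negatives = [1, 3, 5]

def instantiate(base_pred, step_pred):
    """One candidate program built from the two metarules above."""
    def target(x, depth=10):
        if depth == 0:
            return False
        if base_pred(x):                 # P(x) :- Q(x)
            return True
        y = step_pred(x)                 # P(x) :- R(x, y), P(y)
        return y is not None and target(y, depth - 1)
    return target

base_choices = {"zero": zero}
step_choices = {"prev": lambda x: x - 1 if x > 0 else None,
                "prev2": lambda x: x - 2 if x > 1 else None}

programs = {}
for (bn, b), (sn, s) in product(base_choices.items(), step_choices.items()):
    h = instantiate(b, s)
    if all(h(p) for p in positives) and not any(h(n) for n in negatives):
        programs[(bn, sn)] = h

print(sorted(programs))  # the surviving instantiation(s)
```

Even in this toy you can see the problem: the loop must build and test every instantiation, and with realistic numbers of metarules and predicate symbols (and hypotheses that are *sets* of clauses) that enumeration explodes, which is precisely what resolution-based specialisation avoids.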
>> I’m also curious if this scales to real world problems.
Depends on what you mean by "real world problems". The largest program I've learned with Louise is a 2,500-clause program, but that's for a dumb problem that only serves as a benchmark (the problem is to find every way to go from point A to point B on an empty grid world; it's surprisingly combinatorially hard, and other methods die before they get anywhere near the 2,500-clause mark, because the program search space blows up immediately).
In truth, real-world Prolog programs rarely need to be much bigger than a handful of clauses, like five or six. At that point, like in any programming language, you break your program up into smaller sub-programs, and those can be learned easily enough (and not just by second-order SLD-Resolution, to be fair). The challenge is to figure out how to do this breaking-up automatically. Meta-Interpretive Learning with Second-Order SLD-Resolution can do it up to a point, with predicate invention, as in the example above and also because any program learned can go directly into the "background knowledge" to be reused in a new learning session. But that still hasn't been done in a systematic manner.
Right now I'm working on a project to grant autonomous behaviour to a robot boat used in search-and-rescue missions. This is a fairly large problem and there are certainly challenges to do with efficiency. In fact efficiency is the main challenge. But from where I'm standing that's an engineering challenge.
This is not correct about the current state of the art in program synthesis, which is a field that's made a lot of progress recently. Unfortunately people coming from the AI world tend to ignore that work entirely and imagine they're the only people with new ideas since the 80s.
> (1) What's the value add of the LLM here? Does it substantially reduce the number of evaluations necessary to converge? If so, how?
I thought that LLMs, trained with stochastic (random) gradient descent, converge much more quickly than genetic programming, and definitely much more quickly than random search.
[1] http://www.davidmontana.net/papers/stgp.pdf [2] https://faculty.hampshire.edu/lspector/push.html