A few notes on machine intelligence

Posted on Sep 27, 2024

“Are LLMs intelligent? Are they on the way to being intelligent?” have long been strange questions to me. My usual answer was “What even is intelligence? Until we define intelligence, we cannot have this conversation.” Unsatisfied with my own ignorance, I went hunting for answers about the reasoning and abstraction abilities of LLMs, and about definitions and measures of intelligence.

Take, for example, the “How Good Is ChatGPT at Coding, Really?” study published by IEEE (arXiv paper):

Paraphrased: “On coding problems that existed before 2021, ChatGPT was able to produce functional code for easy, medium, and hard problems with success rates of about 89, 71, and 40 percent, respectively. … However, for algorithm problems published after 2021, its ability to produce functional code for ‘easy’ problems dropped from 89 percent to 52 percent, and for ‘hard’ problems from 40 percent to 0.66 percent.”
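To make that before/after comparison concrete, here is a minimal sketch of how such a cutoff split could be evaluated. The records, the field layout, and the cutoff date are all hypothetical; the point is simply bucketing problems by publication date relative to the model's training cutoff and comparing pass rates.

```python
from datetime import date

# Hypothetical records: (publication date, difficulty, did the generated code pass the tests?)
results = [
    (date(2020, 5, 1), "easy", True),
    (date(2020, 8, 3), "hard", False),
    (date(2022, 2, 9), "easy", False),
    (date(2022, 7, 21), "hard", False),
]

TRAINING_CUTOFF = date(2021, 9, 1)  # assumed cutoff, for illustration only

def pass_rate(bucket):
    # Fraction of problems in the bucket whose generated code passed the tests.
    return sum(passed for _, _, passed in bucket) / len(bucket) if bucket else float("nan")

before = [r for r in results if r[0] < TRAINING_CUTOFF]
after = [r for r in results if r[0] >= TRAINING_CUTOFF]

print(f"pass rate before cutoff: {pass_rate(before):.0%}")
print(f"pass rate after cutoff:  {pass_rate(after):.0%}")
```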

Another study, “Task Contamination: Language Models May Not Be Few-Shot Anymore”, examined closed and open models and their performance on datasets released before and after each model’s release date: “… We find evidence that some LLMs have seen task examples during pre-training for a range of tasks, and are therefore no longer zero or few-shot for these tasks. Additionally, we find that for classification tasks with no possibility of task contamination, LLMs rarely demonstrate statistically significant improvements over simple majority baselines, in both zero and few-shot settings. …”
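As a reminder of what a “simple majority baseline” means here: it always predicts the most frequent label in the dataset, and a model has to beat that number before its accuracy tells you anything. A minimal sketch with made-up labels and made-up predictions:

```python
from collections import Counter

# Hypothetical gold labels and hypothetical model predictions for a classification task.
gold = ["pos", "pos", "neg", "pos", "neg", "pos"]
model_predictions = ["pos", "neg", "neg", "pos", "pos", "pos"]

# Majority baseline: always predict the most common gold label.
majority_label, _ = Counter(gold).most_common(1)[0]
baseline_acc = sum(label == majority_label for label in gold) / len(gold)

model_acc = sum(p == g for p, g in zip(model_predictions, gold)) / len(gold)

print(f"majority baseline accuracy: {baseline_acc:.0%}")
print(f"model accuracy:             {model_acc:.0%}")
```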

As I understand transformers and the results of studies like these, current approaches are not able to generalize well. Can we define reasoning? Can we define intelligence? I found Francois Chollet’s paper “On the Measure of Intelligence” (I know him as the main developer of the Keras framework), which shed a lot of light on the topic. Go ahead and read it : ). Important highlights:

  • Formal definition of intelligence: “The intelligence of a system is a measure of its skill-acquisition efficiency over a scope of tasks, with respect to priors, experience, and generalization difficulty” (a rough schematic of this definition follows this list)
  • There is some amount of core knowledge (priors), “…it includes basic concepts like what makes an object, counting, geometry, topology, symmetries, etc. …”
  • Skill is “basically” the result of intelligence
  • Introduction of the ARC (Abstraction and Reasoning Corpus) dataset, which tries to defeat memorizing models and highlight learning ones (a minimal sketch of the task format also follows this list). On the webpage, you can try demo puzzles yourself. There is even a $1M prize pool
  • Efficiency is a crucial part of intelligence; it is not just about acquiring the skill at whatever cost
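To make the definition bullet a bit more tangible, here is my own informal schematic of it. This is a loose paraphrase of the verbal definition, not the notation from the paper:

```latex
\[
  \text{Intelligence} \;\propto\;
  \underset{\text{tasks in scope}}{\operatorname{avg}}\;
  \frac{\text{skill achieved} \times \text{generalization difficulty}}
       {\text{priors} + \text{experience}}
\]
```

And on the ARC bullet: each ARC task is a small JSON file with a handful of train input/output grid pairs and one or more test pairs; a solver sees the train pairs and must produce the test outputs exactly. A minimal sketch of loading a task and scoring a predicted grid (the file path is a placeholder, and the solver is a stand-in):

```python
import json

# Placeholder path; ARC tasks ship as JSON files with "train" and "test" pair lists.
with open("arc_task.json") as f:
    task = json.load(f)

# Each pair is {"input": grid, "output": grid}; grids are lists of lists of ints (colors 0-9).
for pair in task["train"]:
    print("train input :", pair["input"])
    print("train output:", pair["output"])

def solve(train_pairs, test_input):
    # Stand-in solver: just echoes the input grid back (a real solver must infer the rule).
    return test_input

correct = 0
for pair in task["test"]:
    prediction = solve(task["train"], pair["input"])
    correct += prediction == pair["output"]  # ARC scores exact grid matches

print(f"solved {correct}/{len(task['test'])} test pairs")
```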

The process of training a machine-learning model is the display of intelligence; the responses during inference are displays of the learned skills. If academic materials are not your thing, Francois was also on Dwarkesh Patel’s podcast. My highlights from the podcast:

  • Great description of what LLMs are under the hood: “Huge database of learned templates that tries to match onto the prompt we ask. Cannot learn at inference time, and learns inefficiently.”
  • OpenAI killed open research even though it benefited from it. Google’s Gemini paper likewise disclosed next to nothing. The closed race is on.
  • The goal of the ARC dataset is to accelerate the path towards AGI with work that is public. No sharing, no human progress.
  • Francois’s note, “I don’t use coding helpers that much, because I solve problems, and writing solutions into code is just a small portion of what I do,” feels valid for academic/frontier work, but less so in an industry setting.

If you are interested in how the latest models perform on ARC, here is the latest update.

What do you think? Are LLMs on the path to AGI? Is it “just a scaling issue”? Can they reason and abstract?