
The Nature of Logical Reasoning Models

Recommended post: 【Logic】 Table of Contents for Logic


a. Large Language Model

b. AI Scientist



Recently, AI that uses LLMs to generate hypotheses or perform logical reasoning has become popular. Collectively, these seem to be referred to as AI Scientists. Examples include, starting with Google DeepMind's FunSearch: OpenAI o1/o3, Gemini Thinking, Claude Sonnet Thinking, DeepSeek-R1, AI-Descartes, Theorizer, and more. More recently, LLM-agent systems such as AutoDiscovery and The Virtual Lab have also been announced. 2026 is even being called the first year of AI agents, and almost the entirety of Stanford Computer Science is immersed in LLM-agent research. Here, however, I would like to examine whether LLM agents are actually suitable for logical reasoning.

From the logic problems I have collected since 2015, I curated 23 for which every proposition obtainable from the given problem situation can be derived accurately and without omission. The problem set contains the questions, answers, and difficulty levels.


[Screenshot: problem set, 2026-03-10 3:42 PM]


After deriving as many answers as possible with 8 solvers, I checked them with 3 verifiers, in an attempt to derive the propositions extractable from the given information exhaustively and accurately. All of these were implemented with ChatGPT 4o. From this, TP (true positive), FP (false positive), and FN (false negative) counts were extracted with ChatGPT 5.4 Thinking (TN (true negative), naturally, cannot be extracted), and precision and recall were calculated. The results were as follows.
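For reference, precision and recall follow directly from these three counts, and TN is indeed not needed for either. A minimal sketch (the counts in the usage line are hypothetical, for illustration only):

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN).
    TN never appears, so not being able to extract it is no obstacle."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical counts: 12 correct, 8 spurious, 10 missed propositions.
print(precision_recall(12, 8, 10))  # precision 0.6, recall ~0.55
```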

Originally, I expected scores of around 80 to 90, but the results came out closer to the 40-to-60 range. However, on the particularly difficult goldfish problem (solution), ChatGPT 5.4 Thinking derived the correct answer in under a minute. (For personnel at the PhD level, this problem usually takes about 10 to 30 minutes.)


Under the following conditions, who is the person who raises goldfish?

[Condition 1]
1. There are 5 houses of different colors.
2. In each house lives a person of a different nationality.
3. Each house owner drinks a different kind of beverage.
4. Each house owner smokes a different kind of cigarette.
5. Each house owner keeps a different kind of pet.

[Condition 2]
1. The Brit lives in the red house.
2. The Swede keeps dogs.
3. The Dane drinks tea.
4. The green house is located to the left of the white house.
5. The person in the green house drinks coffee.
6. The person who smokes Pall Mall keeps birds.
7. The person in the yellow house smokes Dunhill.
8. The person living in the center drinks milk.
9. The Norwegian lives in the first house.
10. The person who smokes Blend lives next door to the person who keeps cats.
11. The person who keeps horses lives next door to the person who smokes Dunhill.
12. The person who smokes Blue Master drinks beer.
13. The German smokes Prince.
14. The Norwegian lives next door to the blue house.
15. The person who smokes Blend lives next door to the person who drinks water.
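A puzzle of this form is also mechanically checkable. As a sanity check, here is a brute-force sketch of my own (not the post's linked solution), assuming, as in the standard reading, that "to the left of" in condition 2-4 means directly to the left:

```python
from itertools import permutations

POS = range(5)  # house positions 0..4, left to right

def solve():
    """Assign each attribute value a position, pruning with Condition 2."""
    for yellow, blue, red, green, white in permutations(POS):
        if green != white - 1:                       # 2-4 (directly left)
            continue
        for brit, swede, dane, norwegian, german in permutations(POS):
            if brit != red or norwegian != 0 or abs(norwegian - blue) != 1:
                continue                             # 2-1, 2-9, 2-14
            for tea, coffee, milk, beer, water in permutations(POS):
                if dane != tea or green != coffee or milk != 2:
                    continue                         # 2-3, 2-5, 2-8
                for pallmall, dunhill, blend, bluemaster, prince in permutations(POS):
                    if (yellow != dunhill or bluemaster != beer
                            or german != prince or abs(blend - water) != 1):
                        continue                     # 2-7, 2-12, 2-13, 2-15
                    for dogs, birds, cats, horses, fish in permutations(POS):
                        if (swede != dogs or pallmall != birds
                                or abs(blend - cats) != 1
                                or abs(horses - dunhill) != 1):
                            continue                 # 2-2, 2-6, 2-10, 2-11
                        names = {brit: "Brit", swede: "Swede", dane: "Dane",
                                 norwegian: "Norwegian", german: "German"}
                        return names[fish]

print(solve())  # the classic answer: the German keeps the goldfish
```

The early `continue` pruning in each loop is what keeps this tractable; the raw search space of five independent permutations would be 120^5.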


In the end, I reached the conclusion that logical-reasoning AI must be implemented not as a debate among multiple less capable agents but as a single smarter AI. This is an insight I also raised earlier, in the post AI That Generates Hypotheses, The Beginning of That Idea, by highlighting the following problem.


There are six balls of the same size and shape.
Among them, three are heavy and the other three are light.
The three heavy balls all weigh the same, and the three light balls all weigh the same.
Using a balance scale three times, distinguish the heavy balls from the light balls.


The point is that, when solving this problem, instead of treating each weighing as independent and taking only the intersection of the information each one yields, the information obtained from one weighing can be carried into the next; as the weighings proceed in this way, information accumulates, and far more varied conclusions can be drawn. In the end, logical reasoning is not a simultaneous game but a sequential game, not an open loop but a feedback loop, and this implies that it must be implemented through chain-of-thought. This also connects to reinforcement learning, in which an action changes the environment and the agent keeps receiving information from the changed environment, leading to better learning (and relates to the fact that state-of-the-art LLMs are all trained with reinforcement learning); frontier labs such as OpenAI are devoting effort to exactly this kind of post-training for better reasoning. The superiority of sequential games, which we take to be intuitive, seems to have been studied less mathematically, so looking into this more deeply may also be meaningful.
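This accumulation of information can be made concrete with a small exhaustive search. The sketch below (my own illustration, not the post's solution) checks whether an adaptive strategy, one in which each weighing is chosen only after seeing the previous outcome, can pin down the three heavy balls within three weighings; the recursion on each outcome group is exactly the feedback loop described above:

```python
from functools import lru_cache
from itertools import combinations

BALLS = tuple(range(6))
# A "world" = the set of the three heavy balls; C(6,3) = 20 candidate worlds.
WORLDS = frozenset(frozenset(c) for c in combinations(BALLS, 3))

# All balanced weighings: k balls per pan, disjoint pans. (Unequal pans give no
# well-defined comparison, since the actual weight values are unknown.)
WEIGHINGS = [
    (left, right)
    for k in (1, 2, 3)
    for left in combinations(BALLS, k)
    for right in combinations([b for b in BALLS if b not in left], k)
]

def outcome(world, left, right):
    # With equal pan sizes, only the number of heavy balls per pan matters.
    kl = sum(b in world for b in left)
    kr = sum(b in world for b in right)
    return (kl > kr) - (kl < kr)  # +1 left heavier, 0 balanced, -1 right heavier

@lru_cache(maxsize=None)
def solvable(worlds, n):
    """Can an adaptive strategy identify the true world within n weighings?"""
    if len(worlds) <= 1:
        return True
    if len(worlds) > 3 ** n:      # information bound: 3 outcomes per weighing
        return False
    for left, right in WEIGHINGS:
        groups = {}
        for w in worlds:
            groups.setdefault(outcome(w, left, right), []).append(w)
        # The NEXT weighing may differ per outcome group -- the feedback loop.
        if len(groups) > 1 and all(
                solvable(frozenset(g), n - 1) for g in groups.values()):
            return True
    return False

print(solvable(WORLDS, 3))  # prints True: three adaptive weighings suffice
```

Note that two weighings cannot suffice even in principle: 20 candidate worlds exceed the 3^2 = 9 distinguishable outcome sequences, which the information-bound check catches immediately.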



Entered: 2026.03.10 14:31
