
How AI Now Enables Multiple Experts to Collaborate in a Latent Space

2025-12-05 journalism

Amsterdam, Friday, 5 December 2025.
Imagine an AI response not emerging from a single model, but from a collaboration of multiple specialized experts—each excelling in their own domain. A new method, Mixture of Thoughts (MoT), enables these experts to work together through a shared latent space without modifying their core models. The result? More comprehensive and accurate answers—even to unfamiliar questions. The most striking achievement: MoT surpasses the best individual models and existing systems, improving performance by nearly 3% on out-of-distribution tasks. This open-source, efficient technique offers a promising path toward reliable, combined intelligence—without repeated interactions or complex intermediate steps.

A New Generation of AI: From One Model to a Team of Experts

Instead of relying on a single large language model (LLM) as a generalist expert, the new approach, Mixture of Thoughts (MoT), introduces a system in which multiple specialized models, each focused on a specific domain such as mathematics, coding, or general reasoning, collaborate via a shared latent space. Developed by researchers from the University of California and other institutions, the method uses a lightweight router to select the most suitable experts for each question, without modifying the core models. The experts communicate through interaction layers that project their hidden states into a common space, where the primary expert performs cross-attention over its selected colleagues. This happens in a single inference pass, eliminating the time-consuming iterative exchanges required by earlier systems [1]. The results are compelling: on five in-distribution (ID) benchmarks, MoT exceeds the prior state of the art, Avengers, by +0.38%, and on three out-of-distribution (OOD) benchmarks by +2.92% [1]. This indicates that the system not only outperforms existing models on familiar tasks but also excels markedly on unfamiliar, complex questions, which is crucial for reliable information delivery in realistic settings. The technology is open source, and the code is available on GitHub [1].
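
For readers who want to see the routing idea in code, the sketch below shows, under our own simplifying assumptions, how a lightweight router might score a pool of frozen domain experts and activate the top few for a given question. The class name LightweightRouter, the dimensions, and the top-k choice are illustrative and are not taken from the released MoT code.

```python
import torch
import torch.nn as nn

class LightweightRouter(nn.Module):
    """Illustrative sketch of a router over frozen experts (names assumed)."""

    def __init__(self, embed_dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.scorer = nn.Linear(embed_dim, num_experts)  # one logit per expert
        self.top_k = top_k

    def forward(self, query_embedding: torch.Tensor) -> torch.Tensor:
        # query_embedding: (batch, embed_dim), e.g. a pooled question embedding
        logits = self.scorer(query_embedding)            # (batch, num_experts)
        _, top_indices = logits.topk(self.top_k, dim=-1)
        return top_indices                               # indices of experts to activate

# Example: route 4 questions over 3 domain experts (math, coding, general reasoning)
router = LightweightRouter(embed_dim=768, num_experts=3, top_k=2)
print(router(torch.randn(4, 768)))  # e.g. tensor([[0, 2], [1, 0], ...])
```

Because the router is a single linear layer, its cost is negligible next to the experts themselves, which is consistent with the paper's emphasis on leaving the core models untouched.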

From Theory to Practice: Applications in Robotics and Journalism

The power of MoT extends beyond language processing. In a recent study by Peking University, The Chinese University of Hong Kong, and Simplexity Robotics, an application named ManualVLA is demonstrated: a Vision–Language–Action (VLA) model built on a Mixture-of-Transformers (MoT) architecture. The system pairs a planning expert, which generates multimodal instructions including text, images, and positional coordinates, with an action expert that executes these commands for robotic manipulation. The planning expert, trained on a digital twin based on 3D Gaussian Splatting, produces realistic intermediate steps without requiring physical data collection [2]. In experiments with a dual-arm Franka Research 3 robot, ManualVLA achieves an average success rate of 95% for 2D LEGO assembly, 90% for 3D LEGO assembly, and 90% for object restoration, a 32% improvement over previous state-of-the-art (SOTA) hierarchical baselines [2]. This performance is enabled by a Manual Chain-of-Thought (ManualCoT) strategy, in which each sub-goal is interpreted using both explicit instructions and implicit information embedded in the latent space [2]. For journalists, this means AI systems can not only summarize or analyze more effectively but also generate complex process narratives in which each step is logically and visually grounded, such as a report on the manufacturing of a robot or a technological advance in industry.
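
As a rough, hypothetical illustration of this division of labour, the toy loop below has a stand-in planning expert emit sub-goals (an explicit instruction plus a target position) that a stand-in action expert then executes. All names and interfaces here are our own assumptions; the real ManualVLA components are full transformer experts, not the toy classes shown.

```python
from dataclasses import dataclass

@dataclass
class SubGoal:
    """One ManualCoT-style step; the fields are illustrative assumptions."""
    instruction: str                              # explicit textual instruction
    target_position: tuple[float, float, float]  # where the gripper should act

class ToyPlanner:
    """Stand-in for the planning expert: emits sub-goals from a fixed 'manual'."""
    def __init__(self):
        self.steps = [
            SubGoal("pick red 2x4 brick", (0.30, 0.10, 0.02)),
            SubGoal("place brick on baseplate", (0.45, 0.05, 0.04)),
        ]
    def next_subgoal(self):
        return self.steps.pop(0) if self.steps else None

class ToyActor:
    """Stand-in for the action expert: 'executes' each sub-goal."""
    def act(self, subgoal: SubGoal) -> None:
        print(f"executing: {subgoal.instruction} at {subgoal.target_position}")

# ManualCoT-style loop: plan a sub-goal, then execute it, until the manual ends
planner, actor = ToyPlanner(), ToyActor()
while (subgoal := planner.next_subgoal()) is not None:
    actor.act(subgoal)
```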

A Powerful System, Yet with Limitations and Ethical Risks

Although MoT represents a significant advance in multi-model AI systems, important limitations and risks remain. The technology depends on the quality of both the tuned experts and the router, which itself requires training. Without proper training, the router may select inappropriate experts or produce unexpected combinations, leading to incorrect or illogical responses. In the ManualVLA study, success rates dropped by 23% under background changes and by 29% under object shape variations, indicating vulnerability to unforeseen perturbations [2]. Similarly, in GR-RL, a related system for autonomous shoelace tying, the human demonstrations used for training, described by the researchers as 'noisy and suboptimal', yielded a behavior-cloning baseline success rate of only 45.7% [3]. Only after data filtering, the addition of symmetry augmentation, and online reinforcement learning did performance rise to 83.3% [3]. This highlights that even advanced systems depend critically on input quality, and their outputs cannot be assumed inherently reliable. Ethically, this raises accountability questions: who is responsible for erroneous or misleading summaries arising from a faulty collaboration of 'experts'? Furthermore, the open-source availability of the code [1] is a double-edged sword: it promotes transparency but also exposes the system to misuse or the creation of dangerous variants without adequate oversight.

The Secret of the Latent Space: How Hidden Communication Works

The core of Mixture of Thoughts lies in the 'latent space', a virtual space in which hidden representations from different LLMs are integrated without altering their structure or weights. Each model retains its original pretraining, meaning it preserves its unique expertise [1]. The router, itself trained for this task, selects the most relevant experts, and lightweight interaction layers perform a cross-attention operation that lets the primary expert attend to the representations of the active specialists and refine its response. The technique is efficient: inference time is comparable to that of simple routing systems, avoiding the time-consuming iterative aggregation of answers required by other approaches [1]. The +2.92% improvement on OOD benchmarks [1] is particularly notable, as the approach requires no additional training for new domains, a feature critical for journalists dealing with unexpected, complex challenges such as international crises or technological breakthroughs. By combining explicit and implicit reasoning, as seen in ManualVLA, where both text and latent representations are used, the AI can not only provide answers but also explain the 'how' of a process, marking a significant leap in generating in-depth, context-rich narratives [2].
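
To make this hidden communication concrete, the sketch below, again under assumed names and dimensions, projects each expert's hidden states into a shared latent space and lets the primary expert cross-attend to its selected peers in a single pass. It mirrors the mechanism described above but is not the authors' implementation.

```python
import torch
import torch.nn as nn

class InteractionLayer(nn.Module):
    """Sketch of a shared-latent-space interaction layer (names assumed).

    Each expert's hidden states are projected into a common latent space;
    the primary expert then cross-attends to its selected peers once,
    so no iterative exchange of generated text is needed.
    """

    def __init__(self, expert_dims: list[int], latent_dim: int, num_heads: int = 8):
        super().__init__()
        # One projection per expert: expert-specific width -> shared latent width
        self.projections = nn.ModuleList(
            [nn.Linear(d, latent_dim) for d in expert_dims]
        )
        self.cross_attn = nn.MultiheadAttention(latent_dim, num_heads, batch_first=True)

    def forward(self, primary_idx: int, hidden_states: list[torch.Tensor]) -> torch.Tensor:
        # Project every expert's hidden states into the shared latent space
        latents = [proj(h) for proj, h in zip(self.projections, hidden_states)]
        query = latents[primary_idx]                      # primary expert's tokens
        peers = torch.cat(
            [z for i, z in enumerate(latents) if i != primary_idx], dim=1
        )                                                 # concatenated peer tokens
        fused, _ = self.cross_attn(query, peers, peers)   # single cross-attention pass
        return fused                                      # refined primary representation

# Example: a 4096-wide primary expert attends to two 2048-wide specialists
layer = InteractionLayer(expert_dims=[4096, 2048, 2048], latent_dim=1024)
states = [torch.randn(1, 16, d) for d in (4096, 2048, 2048)]
print(layer(primary_idx=0, hidden_states=states).shape)  # torch.Size([1, 16, 1024])
```

Note that only the small projection and attention layers carry new parameters; the experts' own weights stay frozen, which is what allows the approach to generalize to new domains without retraining the underlying models.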

Sources