Neural Computers

Ernie Chang
2026.04.12
arXiv · by 이호민/AI
#AI #Computer Architecture #Machine Learning #Neural Computer #Runtime

Key Points

  1. Neural Computers (NCs) are proposed as a novel machine form that unifies computation, memory, and I/O within a single learned runtime state, distinct from conventional computers or agents, with the long-term goal of a Completely Neural Computer (CNC).
  2. Initial prototypes, instantiated as video models for CLI (NCCLIGen) and GUI (NCGUIWorld) interfaces, demonstrate the ability to learn fundamental interface primitives, including I/O alignment and short-horizon control.
  3. Despite these early successes, the paper highlights significant challenges for current NCs, particularly in achieving robust long-horizon reasoning, reliable symbolic processing, and stable capability reuse, outlining these as key areas for future research toward CNCs.

Neural Computers (NCs) are proposed as a novel machine form aiming to unify computation, memory, and I/O within a learned runtime state. Unlike conventional computers executing explicit programs, agents interacting with external environments, or world models learning environment dynamics, NCs endeavor to make the model itself the running computer, learned solely from I/O traces without instrumented program state. The long-term objective is the Completely Neural Computer (CNC), a mature, general-purpose realization with stable execution, explicit reprogramming, and durable capability reuse.

The core methodology instantiates NCs as video models, treating them as learned latent-state systems. Formally, an NC updates its runtime state $h_t$ and samples the next frame $x_{t+1}$ from an initial state $h_0$, an update function $F_\theta$, and a decoder $G_\theta$, following the equations:

$$h_t = F_\theta(h_{t-1}, x_t, u_t)$$
$$x_{t+1} \sim G_\theta(h_t)$$

Here, $x_t$ represents observations (screen frames), $u_t$ is the time-indexed conditioning input (user actions), $h_t$ serves as persistent runtime memory, $F_\theta$ performs the state-update computation, and $(x_t, u_t, G_\theta)$ define the I/O pathway. In the video-based prototype, $h_t$ is realized by the model's time-indexed video latents $z_t$, and a diffusion transformer acts as $F_\theta$, consuming prior latents, current observations, and conditioning inputs to produce the updated state. The decoder $G_\theta$ parameterizes the distribution over the next frame. Auxiliary heads encode and decode conditioning streams such as text prompts and action traces.
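The update-and-decode loop above can be sketched as a toy latent-state system. This is an illustrative stand-in, not the paper's model: the linear maps and Gaussian decoder below replace the learned diffusion transformer and video decoder, and all dimensions are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, OBS_DIM, COND_DIM = 8, 4, 2  # hypothetical toy sizes

# Stand-ins for the learned F_theta and G_theta: a tanh state update
# and a Gaussian decoder over the next frame.
W_h = rng.normal(scale=0.1, size=(STATE_DIM, STATE_DIM))
W_x = rng.normal(scale=0.1, size=(STATE_DIM, OBS_DIM))
W_u = rng.normal(scale=0.1, size=(STATE_DIM, COND_DIM))
W_g = rng.normal(scale=0.1, size=(OBS_DIM, STATE_DIM))

def F_theta(h_prev, x_t, u_t):
    """State update h_t = F(h_{t-1}, x_t, u_t): the runtime-memory write."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + W_u @ u_t)

def G_theta(h_t):
    """Decoder: sample the next frame x_{t+1} ~ G(h_t)."""
    return W_g @ h_t + rng.normal(scale=0.01, size=OBS_DIM)

def rollout(x0, actions, h0=None):
    """Run the NC loop: fold observations and actions into the state,
    decode a frame at each step."""
    h = np.zeros(STATE_DIM) if h0 is None else h0
    x, frames = x0, []
    for u in actions:
        h = F_theta(h, x, u)   # update persistent runtime state
        x = G_theta(h)         # emit the next observation
        frames.append(x)
    return frames

frames = rollout(np.ones(OBS_DIM), [np.ones(COND_DIM)] * 5)
```

The point of the sketch is the dataflow: every step reads the previous state, the current frame, and the conditioning input, so computation, memory, and I/O all live in the one recurrence.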

Two interface-specific prototypes are developed: NCCLIGen for Command-Line Interface (CLI) interaction and NCGUIWorld for Graphical User Interface (GUI) interaction.

NCCLIGen Implementation Details:
NCCLIGen models terminal interaction where $x_t$ are terminal frames and $u_t$ is a user prompt plus metadata. The architecture is based on the Wan2.1 model. The first terminal frame is encoded by a Variational Autoencoder (VAE) into a conditioning latent. Simultaneously, a CLIP image encoder extracts visual features from the same frame, and a text encoder (e.g., T5) embeds the caption. These features are concatenated with diffusion noise, projected through a zero-initialized linear layer, and processed by a Diffusion Transformer (DiT) stack. Decoupled cross-attention injects the joint caption and first-frame context. The VAE encodes and decodes terminal frames. During generation, the diffusion transformer advances the latent state $z_t$ (which realizes $h_t$) under the original Wan2.1 image-to-video (I2V) sampling schedule.
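The conditioning path can be sketched as follows. The encoder functions and all dimensions here are placeholders for the real VAE/CLIP/T5 modules; the one concrete detail carried over from the text is the zero-initialized projection, which makes the new conditioning branch contribute nothing at the start of training so optimization begins from the pretrained backbone's behavior.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical feature sizes; the real Wan2.1-based model is far larger.
LATENT, CLIP_D, TEXT_D, MODEL_D = 16, 8, 8, 32

def vae_encode(frame):     # stand-in for the VAE first-frame encoder
    return rng.normal(size=LATENT)

def clip_encode(frame):    # stand-in for the CLIP image encoder
    return rng.normal(size=CLIP_D)

def t5_encode(caption):    # stand-in for the text encoder
    return rng.normal(size=TEXT_D)

# Zero-initialized linear layer: at step 0 the concatenated conditioning
# projects to zeros, leaving the pretrained DiT behavior untouched.
W_proj = np.zeros((MODEL_D, LATENT + CLIP_D + TEXT_D + LATENT))

def build_conditioning(frame, caption, noise):
    """Concatenate first-frame latent, CLIP features, caption embedding,
    and diffusion noise, then project into the DiT's model dimension."""
    feats = np.concatenate([vae_encode(frame), clip_encode(frame),
                            t5_encode(caption), noise])
    return W_proj @ feats

cond = build_conditioning("frame0", "ls -la", rng.normal(size=LATENT))
# cond is all zeros until W_proj is trained away from its zero init.
```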

Data Pipelines for NCCLIGen:

  1. CLIGen (General) dataset: Built from public asciinema .cast trajectories. Sessions are replayed, rendered into terminal frames (preserving palette, cursor, geometry), and normalized. Segmented into 5-second clips and resampled to 15 FPS. Underlying buffers and logs are used to generate aligned textual descriptions in three styles (semantic, regular, detailed) using Llama 3.1 70B.
  2. CLIGen (Clean) dataset: Collected using the open-source vhs toolkit. Deterministic scripts drive Dockerized environments to capture cleaner, better-paced traces (approx. 250k scripts, 51.21% retained). Subsets include regular traces (package installation, log filtering, REPL usage) and Python math validation traces. Captions are derived directly from raw vhs scripts. Frame rendering is standardized (monospace font/size, consistent palette, locked resolution/theme).

Training Details:
Training utilizes gradient checkpointing, 0.1 dropout on the prompt encoder/CLIP/VAE modules, the AdamW optimizer (learning rate $5 \times 10^{-5}$, weight decay $10^{-2}$), bfloat16 precision, and gradient clipping at 1.0. NCCLIGen on CLIGen (General) requires ~15,000 H100 GPU hours, and on CLIGen (Clean) ~7,000 H100 GPU hours.
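The recipe above is summarized below as a plain config dict plus a minimal global-norm gradient-clipping sketch (clip at 1.0). This is a framework-agnostic illustration, not the authors' training code.

```python
import math

# Hypothetical config object mirroring the reported hyperparameters.
OPT = {"optimizer": "AdamW", "lr": 5e-5, "weight_decay": 1e-2,
       "precision": "bfloat16", "grad_clip": 1.0, "dropout": 0.1}

def clip_grad_norm(grads, max_norm=1.0):
    """Global-norm clipping: if the joint L2 norm of all gradients exceeds
    max_norm, rescale every gradient by max_norm / norm."""
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total <= max_norm:
        return grads
    scale = max_norm / total
    return [[g * scale for g in vec] for vec in grads]

clipped = clip_grad_norm([[3.0, 4.0]])  # norm 5.0, rescaled to norm 1.0
```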

Key Findings:

  1. High-fidelity Rendering: The NC maintains high-fidelity terminal rendering at practical font sizes (e.g., 13 px), achieving good reconstruction quality (PSNR 40.77 dB, SSIM 0.989), with the Wan2.1 VAE proving adequate for CLIGen usage.
  2. Performance Plateau: On clean, domain-specific data, global reconstruction metrics (PSNR/SSIM) improve rapidly early on but plateau around 25k training steps, suggesting early saturation in reconstruction metrics rather than a complete halt in learning.
  3. Prompt Specificity: Detailed, literal captions significantly improve text-to-pixel alignment and reconstruction fidelity. PSNR increased from 21.90 dB (semantic) to 26.89 dB (detailed), a gain of nearly 5 dB, highlighting their effectiveness as a control channel.
  4. Character-level Accuracy: The models achieve substantial character-level text rendering accuracy. Character accuracy improved from 0.03 at initialization to 0.54 at 60k steps, with exact-line matches reaching 0.31, indicating the ability to model text structure, font rendering, and spatial relationships.
  5. Symbolic Computation Bottleneck: Native symbolic reasoning remains a significant challenge. On arithmetic probe tasks (100 problems sampled from a 1,000-problem held-out set), NCCLIGen achieved only 4% accuracy, comparable to Wan2.1 (0%) and Veo3.1 (2%) and far below Sora2 (71%).
  6. Impact of Reprompting: While native symbolic reasoning is limited, system-level conditioning (reprompting) significantly improved NCCLIGen's accuracy on arithmetic tasks from 4% to 83%. This suggests that current models are strong renderers and conditionable interfaces, and much of the apparent "reasoning" gain can come from better specification and instruction-following rather than new native computation.
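The two kinds of metric behind these findings, pixel-level PSNR and character-level accuracy, can be sketched as below. The exact transcript alignment the paper uses is not specified; the positional character match here is a simplifying assumption.

```python
import numpy as np

def psnr(ref, out, max_val=255.0):
    """Peak signal-to-noise ratio between a reference frame and a rendering;
    higher is better, infinite for a pixel-perfect match."""
    mse = np.mean((ref.astype(np.float64) - out.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def char_accuracy(ref_text, out_text):
    """Fraction of character positions that match between two transcripts
    (simplified: positional comparison, no edit-distance alignment)."""
    n = max(len(ref_text), len(out_text))
    hits = sum(a == b for a, b in zip(ref_text, out_text))
    return hits / n if n else 1.0

ref = np.zeros((8, 8), dtype=np.uint8)
out = ref.copy()
out[0, 0] = 16                      # perturb a single pixel
score = psnr(ref, out)
acc = char_accuracy("ls -la", "ls -1a")  # one wrong character out of six
```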

The paper outlines an engineering roadmap toward CNCs, focusing on challenges such as robust long-horizon reasoning, reliable symbolic processing, stable capability reuse, and explicit runtime governance.