DolphinGemma: How Google AI is helping decode dolphin communication
Key Points
- DolphinGemma, a ~400M-parameter AI model developed by Google in collaboration with the Wild Dolphin Project and Georgia Tech, analyzes and generates dolphin vocalizations to help decode their complex communication.
- Trained on decades of acoustic data from wild Atlantic spotted dolphins, this audio-in, audio-out model identifies patterns and predicts sound sequences, functioning similarly to large language models for human speech.
- Integrated with the CHAT system and running on Pixel phones, DolphinGemma aims to facilitate two-way human-dolphin interaction by recognizing and anticipating vocalizations, with plans to be open-sourced for wider research use.
The article details the DolphinGemma project, a collaborative effort by Google, the Wild Dolphin Project (WDP), and Georgia Tech, aimed at decoding dolphin communication and facilitating interspecies interaction. The core objective is to move beyond merely listening to dolphins to understanding the patterns of their complex vocalizations and, eventually, generating responsive, dolphin-like sound sequences.
The foundational data for this initiative comes from the Wild Dolphin Project, which has conducted the world's longest-running underwater dolphin research since 1985, focusing on Atlantic spotted dolphins (Stenella frontalis) in the Bahamas. This research yields a rich, unique dataset of decades of meticulously paired underwater video and audio, correlating individual dolphin identities, life histories, and observed behaviors with specific sound types. Examples include signature whistles for individual identification and reunion, burst-pulse "squawks" during conflicts, and click "buzzes" for courtship or chasing sharks. This extensive, labeled acoustic database forms the bedrock for AI analysis.
DolphinGemma Methodology:
DolphinGemma is a large language model developed by Google, specifically tailored for analyzing and generating dolphin vocalizations. It leverages Google audio technologies, including the SoundStream tokenizer, to efficiently represent raw dolphin sound waveforms; these tokenized representations are then processed by a model architecture designed to handle complex acoustic sequences. At approximately 400 million parameters, the model is deliberately sized to run directly on Google Pixel phones for field research applications.
The architecture of DolphinGemma is inspired by and built upon insights from Google's Gemma collection of open, state-of-the-art models, which share research and technology with the Gemini models. Trained extensively on WDP’s acoustic database of wild Atlantic spotted dolphins, DolphinGemma functions as an audio-in, audio-out model. Its primary function is to process sequences of natural dolphin sounds, identify underlying patterns and structures, and then predict the likely subsequent sounds in a sequence. This mechanism is analogous to how human large language models predict the next word or token in a sentence, enabling the model to learn the "grammar" or "syntax" of dolphin vocalizations. By identifying recurring sound patterns, clusters, and reliable sequences, DolphinGemma aims to uncover hidden structures and potential meanings within the dolphins' natural communication, a task previously requiring immense human effort. Future applications include the generation of synthetic sounds to establish a shared vocabulary for interactive communication with dolphins.
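The tokenize-then-predict loop described above can be illustrated with a toy sketch. Everything here is an assumption for illustration: the energy-based quantizer below stands in for a learned neural tokenizer like SoundStream, and the bigram counter stands in for the model's actual next-token prediction; neither reflects DolphinGemma's real internals.

```python
import numpy as np

def quantize_frames(waveform, frame_size=160, n_tokens=64):
    """Toy stand-in for a neural audio tokenizer: split the waveform
    into fixed-size frames and map each frame's RMS energy to one of
    n_tokens discrete codes. (A real tokenizer learns its codebook.)"""
    n_frames = len(waveform) // frame_size
    frames = waveform[: n_frames * frame_size].reshape(n_frames, frame_size)
    energy = np.sqrt((frames ** 2).mean(axis=1))
    edges = np.linspace(energy.min(), energy.max() + 1e-9, n_tokens + 1)
    return np.clip(np.digitize(energy, edges) - 1, 0, n_tokens - 1)

class BigramPredictor:
    """Minimal next-token model: count token transitions and predict
    the most frequent successor -- a miniature analogue of how an
    audio LLM predicts the next sound token in a sequence."""
    def __init__(self, n_tokens=64):
        self.counts = np.zeros((n_tokens, n_tokens), dtype=np.int64)

    def fit(self, tokens):
        for a, b in zip(tokens[:-1], tokens[1:]):
            self.counts[a, b] += 1
        return self

    def predict_next(self, token):
        # Most frequent successor observed in training.
        return int(self.counts[token].argmax())
```

In this miniature form, "learning the grammar" of a sound stream reduces to counting which token tends to follow which; the real model captures far longer-range structure, but the predict-the-next-token objective is the same.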
CHAT System (Cetacean Hearing Augmentation Telemetry):
In parallel to DolphinGemma's analytical capabilities, the CHAT system, developed by WDP and Georgia Tech, focuses on establishing a two-way interactive communication channel. CHAT is an underwater computer designed not to decipher natural dolphin language directly, but to create a simpler, shared vocabulary. This is achieved by associating novel, synthetic whistles (distinct from natural dolphin sounds and generated by CHAT) with specific objects that dolphins enjoy, such as sargassum or scarves. The goal is for curious dolphins to learn to mimic these synthetic whistles to "request" the associated items.
The CHAT system requires real-time processing capabilities:
- Acoustic input: Accurately hearing dolphin mimicry amidst ocean noise.
- Identification: Real-time identification of which specific synthetic whistle was mimicked.
- Feedback to researcher: Informing the researcher (via bone-conducting headphones underwater) of the dolphin's "request."
- Reinforcement: Enabling the researcher to quickly offer the correct object, reinforcing the connection.
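The first three steps of that loop can be sketched as a single scoring pass. This is a minimal sketch under stated assumptions: `normalized_similarity`, `process_chunk`, the template dictionary, and the 0.6 threshold are all hypothetical names and values chosen for illustration, not the CHAT system's actual API.

```python
import numpy as np

def normalized_similarity(chunk, template):
    """Cosine similarity between equal-length waveforms -- a crude
    stand-in for CHAT's real-time whistle matcher."""
    n = min(len(chunk), len(template))
    a, b = chunk[:n], template[:n]
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def process_chunk(chunk, templates, notify, threshold=0.6):
    """One pass of the loop: score incoming audio against each
    synthetic-whistle template; if the best match clears the
    threshold, report the associated object label to the researcher
    (step 3 -- e.g. via bone-conducting headphones). Reinforcement
    (step 4, handing over the object) happens off-device."""
    scores = {label: normalized_similarity(chunk, t)
              for label, t in templates.items()}
    label, best = max(scores.items(), key=lambda kv: kv[1])
    if best >= threshold:
        notify(label)
        return label
    return None
```

The threshold trades off missed mimics against false alarms in ocean noise; tuning it (and the similarity measure itself) is exactly the hard part the text attributes to the real system.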
Initially, a Google Pixel 6 handled high-fidelity real-time sound analysis. The upcoming generation, centered around a Google Pixel 9 (slated for summer 2025), will integrate speaker/microphone functions and leverage the phone's advanced processing to simultaneously run both deep learning models (like DolphinGemma) and template matching algorithms. The use of Pixel smartphones significantly reduces the need for custom hardware, improving system maintainability, lowering power consumption, and shrinking device cost and size—critical advantages for field research in the open ocean. DolphinGemma's predictive power is expected to enhance CHAT by anticipating and identifying potential mimics earlier in a vocalization sequence, thereby increasing researcher reaction speed and making interactions more fluid.
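The "template matching" the text mentions alongside the deep-learning model can be illustrated with normalized cross-correlation: slide a short reference whistle across a longer audio buffer and report where it matches best. This is a bare-bones sketch; the function name and the brute-force loop are illustrative assumptions, not the production algorithm.

```python
import numpy as np

def sliding_match(buffer, template):
    """Slide a mean-removed, unit-norm template across the buffer and
    return (best_score, offset) of the strongest normalized
    cross-correlation. Scores near 1.0 indicate a close match."""
    t = template - template.mean()
    t = t / (np.linalg.norm(t) + 1e-12)
    best, where = -1.0, 0
    for i in range(len(buffer) - len(t) + 1):
        w = buffer[i:i + len(t)]
        w = w - w.mean()
        score = float(w @ t / (np.linalg.norm(w) + 1e-12))
        if score > best:
            best, where = score, i
    return best, where
```

Because each whistle template is short and the codebook of synthetic whistles is small, this kind of matcher is cheap enough to run in real time on a phone next to a neural model, which is the division of labor the text describes for the Pixel 9 generation.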
Future Directions:
Google plans to release DolphinGemma as an open model in summer 2025, fostering broader scientific collaboration. While trained on Atlantic spotted dolphin sounds, its utility is anticipated for researchers studying other cetacean species, requiring fine-tuning for different vocalizations. This open-source approach aims to provide researchers worldwide with tools to analyze their acoustic datasets, accelerate pattern discovery, and collectively deepen the understanding of marine mammal communication. The overarching vision is to narrow the communication gap between humans and dolphins.