Vibe Password Generation: Predictable by Design - Irregular

2026.02.23
Web · by 권준호
#AI Agents · #Cryptography · #LLM · #Password Security · #Vulnerability

Key Points

  • LLM-generated passwords, despite appearing strong, are fundamentally insecure: because LLMs predict tokens, their output shows predictable patterns, repetitions, and significantly lower actual entropy than perceived.
  • Extensive testing across major LLMs (Claude, GPT, Gemini) revealed severe biases in character selection and frequent password repetitions; quantitative analysis shows that passwords that appear to have 100 bits of entropy often have fewer than 30 bits of real entropy.
  • The paper concludes that LLM-generated passwords are easily crackable and recommends that users avoid them, while developers and AI labs should ensure coding agents and models default to traditional, cryptographically secure password generation.

Large Language Model (LLM)-generated passwords are fundamentally insecure despite their apparent strength. The flaw is structural: LLMs are designed to predict likely tokens, which is the opposite of the uniform random sampling that secure password generation requires. The problem is compounded by the growing accessibility of AI tools to non-technical users and by coding agents silently embedding LLM-generated passwords during software development.

The paper meticulously analyzes the strength of passwords generated by state-of-the-art LLMs, including Claude, GPT, and Gemini. Initial assessment of individual LLM-generated passwords, using standard entropy calculators, often suggests high strength (e.g., 100 bits of entropy for a 16-character password, implying centuries to crack). However, generating multiple passwords from these LLMs reveals significant underlying weaknesses.

The methodology involves prompting various LLMs with "Please generate a password" multiple times in fresh, independent conversations. Qualitative observations from these tests consistently demonstrate predictable patterns and biases:

  • Claude Opus 4.6: Among 50 generated passwords, only 30 were unique. A single password, "G7kL9#mQ2&xP4!w", repeated 18 times (36% of samples). Passwords consistently started with a letter (often 'G') followed by '7'. Character choices were highly uneven (e.g., 'L', '9', 'm', '2', and '#' appeared in all 50 passwords, while others were rare or absent). No repeating characters were observed within a single password, indicating a bias against patterns that "look less random."
  • GPT-5.2: Similarly, nearly all passwords began with 'v', and almost half continued with 'Q'. Character selection was narrow and uneven.
  • Gemini 3 Flash: Almost half of the passwords started with 'K' or 'k', usually followed by '#', 'P', or '9', exhibiting few and uneven character choices. Gemini 3 Pro even displayed a security warning, though for an incorrect reason (data processing rather than inherent weakness).
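Repetition statistics like the ones above are easy to reproduce once a batch of passwords has been collected. A minimal sketch in Python, using an illustrative sample (one dominant password plus near-duplicates) rather than the post's actual dataset:

```python
from collections import Counter

# Illustrative stand-in for 50 passwords collected from fresh, independent
# conversations with the same prompt (NOT the post's real data): one
# dominant password repeated 18 times, plus 32 single-occurrence variants.
passwords = ["G7kL9#mQ2&xP4!w"] * 18 + [
    f"G7kL9#mQ2&xP4!{c}" for c in "0123456789ABCDEFGHIJKLMNOPQRSTUV"
]

counts = Counter(passwords)
unique = len(counts)
top_pw, top_n = counts.most_common(1)[0]

print(f"unique: {unique}/{len(passwords)}")
print(f"most frequent: {top_pw!r} × {top_n} ({top_n / len(passwords):.0%})")
```

Running the same counting over real LLM output is what surfaces the 30-of-50 uniqueness and 36% repetition figures reported above.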

The quantitative analysis measures password strength in bits of entropy. A truly random password drawn from a 70-character set (26 uppercase, 26 lowercase, 10 digits, 8 symbols) yields log₂(70) ≈ 6.13 bits of entropy per character.
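That per-character figure, and the ~98-bit total for a 16-character password cited below, follow directly from the character-set size:

```python
import math

# Ideal entropy for uniform random sampling from the 70-character set
# described in the post (26 upper + 26 lower + 10 digits + 8 symbols).
CHARSET_SIZE = 26 + 26 + 10 + 8          # = 70
bits_per_char = math.log2(CHARSET_SIZE)  # ≈ 6.13 bits per character
total_bits = 16 * bits_per_char          # ≈ 98 bits for 16 characters

print(f"{bits_per_char:.2f} bits/char, {total_bits:.1f} bits total")
```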

Two methods were employed to estimate entropy:

  1. Entropy Estimation via Character Statistics:
This method analyzes the character distribution across many generated passwords using the Shannon entropy formula H(X) = −Σᵢ P(xᵢ) log₂ P(xᵢ), where P(xᵢ) is the probability of a specific character or sequence.
  • For Claude Opus 4.6, the overall character distribution was highly skewed. For instance, analyzing the first character across 50 passwords, 'G' appeared with a probability of 0.52. Calculating the entropy for the first character using observed probabilities (e.g., 'G': 0.52, 'g': 0.18, etc.) yielded approximately 2.08 bits, significantly lower than the expected 6.13 bits for a uniform distribution.
  • Extending this analysis to all character positions in a 16-character password, the estimated total entropy was about 27 bits, far below the expected 98 bits for a truly random password of that length. This 27-bit entropy implies a password crackable in seconds, as opposed to billions of years.
  2. Entropy Estimation via Log-probabilities (Logprobs):
This more direct method uses the LLM's internal token probability vector, accessible in some models such as GPT-5.2 (via logprobs=True). The log-probabilities are converted to actual probabilities with P(xᵢ) = e^(logprob).
  • For a GPT-5.2 generated password like "V7m#Qp4!zL9@tR2^xN6$", the analysis of the first token's probabilities (e.g., 'v' at 46.3%, 'V' at 18.7%, 'm' at 18.7%, etc.) showed a highly biased distribution. Applying Shannon's formula to these probabilities yielded an estimated entropy of about 2.19 bits for the first character, again far less than the ideal 6.13 bits.
  • This analysis revealed that most characters within the LLM-generated passwords had less than 1 bit of entropy, meaning guessing the next character was easier than a coin flip.
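Both estimation methods reduce to the same Shannon formula; they differ only in where the probabilities come from. A sketch with illustrative numbers — only the 'G' (0.52) and 'g' (0.18) frequencies and the rough shape of the logprob values are taken from the post, the rest is filled in for the example:

```python
import math

def shannon_entropy(probs):
    """H(X) = -sum(p * log2(p)) over outcomes with nonzero probability."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Method 1: observed first-character frequencies across many passwords.
# The remaining probability mass here is split over a few characters
# purely for illustration.
first_char_probs = [0.52, 0.18, 0.10, 0.08, 0.06, 0.04, 0.02]
h_stats = shannon_entropy(first_char_probs)

# Method 2: top-k logprobs for the first token (hypothetical values,
# roughly matching the 46.3% / 18.7% figures quoted above), converted
# back via p = e**logprob and renormalized, since top-k logprobs do not
# cover the full vocabulary.
logprobs = [-0.77, -1.68, -1.68, -3.2, -4.1]
raw = [math.exp(lp) for lp in logprobs]
probs = [p / sum(raw) for p in raw]
h_logprobs = shannon_entropy(probs)

print(f"method 1: {h_stats:.2f} bits, method 2: {h_logprobs:.2f} bits")
print(f"uniform over 70 chars: {math.log2(70):.2f} bits")
```

With either input, the estimate lands around 2 bits for the first character — a third of the 6.13 bits a uniform draw would give.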

The paper also highlights that LLM-generated passwords are prevalent in the real world due to:

  • User accessibility: Less tech-savvy users may default to ubiquitous AI tools for password generation, unaware of or not prioritizing secure methods.
  • Coding Agents: Popular coding agents (e.g., Claude Code, Codex, Gemini-CLI) are prone to generating passwords internally via the LLM rather than using standard secure generation tools (like openssl rand). This "vibe-password-generation" often occurs without developer knowledge or review, especially when tasks don't explicitly request secure generation or if prompt phrasing is slightly altered (e.g., "suggest a password" vs. "generate a password"). Examples include passwords embedded in docker-compose files, FastAPI API keys, and bash scripts for database setup.
  • Agentic Browsers: LLM-based agentic browsers (e.g., ChatGPT Atlas) also exhibit this behavior when performing tasks like user registration.
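For contrast, the cryptographically secure route the post recommends takes only a few lines with Python's standard secrets module. The 8-symbol set below is an assumption, chosen to match the 70-character alphabet discussed above:

```python
import secrets
import string

# 52 letters + 10 digits + 8 symbols = 70 characters.
# The symbol choice is an assumption for this sketch.
ALPHABET = string.ascii_letters + string.digits + "!@#$%^&*"

def secure_password(length: int = 16) -> str:
    # secrets.choice draws from the OS CSPRNG, so each character carries
    # the full log2(70) ≈ 6.13 bits — ≈ 98 bits for 16 characters.
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

print(secure_password())
```

This is the same guarantee tools like openssl rand provide: uniform sampling from a cryptographic source, with no token-prediction bias.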

Finally, the paper demonstrates that increasing the "temperature" parameter in LLM token sampling, which intuitively should increase randomness, does not significantly improve password strength. Closed-source models typically cap temperature values, and even at maximum settings, the strong patterns and repetitions persist. Conversely, a minimum temperature (e.g., 0.0) leads to the same password being generated repeatedly, further underscoring the predictability.
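Why temperature cannot rescue a biased distribution is visible directly in the softmax: dividing logits by T flattens the distribution, but strongly peaked logits stay far from uniform at any temperature a provider allows. A sketch with hypothetical logits for a single next-character choice:

```python
import math

def temperature_entropy(logits, T):
    """Shannon entropy (bits) of softmax(logits / T)."""
    scaled = [l / T for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    Z = sum(exps)
    return -sum((e / Z) * math.log2(e / Z) for e in exps)

# Hypothetical, heavily biased logits; a uniform choice over these
# 5 tokens would give log2(5) ≈ 2.32 bits.
logits = [8.0, 5.0, 5.0, 2.0, 1.0]
for T in (0.5, 1.0, 2.0):
    print(f"T={T}: {temperature_entropy(logits, T):.2f} bits")
```

Entropy rises with T, but even at T=2.0 (a common provider cap) it remains well below the uniform maximum — and at T→0 sampling collapses to the single most likely password every time.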

In conclusion, LLM-generated passwords, despite their superficial complexity, are fundamentally predictable due to the LLMs' inherent token prediction mechanisms and exhibit low actual entropy, making them highly vulnerable to brute-force attacks. The paper strongly recommends avoiding their use, directing coding agents to use secure methods, and training AI models to prioritize such methods by default.