
ES_Trading_Professional_Analysis | Kaggle
Key Points
- This paper introduces a benchmark to evaluate large language models' (LLMs) ability to generate professional, risk-aware one-day-ahead long/short trading signals for E-mini S&P 500 futures using OHLCV data and chart images.
- The evaluation revealed that while all LLMs outperformed a buy-and-hold strategy in volatility and maximum drawdown, none surpassed it in total return or CAGR, indicating strong risk control but limited monetization of signals.
- The study concludes that LLMs exhibit genuine directional skill, with their return underperformance primarily stemming from conservative exposure management (structural limitation) rather than a lack of informational alpha, suggesting potential for improved performance with better signal scaling.
This paper investigates the capability of Large Language Models (LLMs) to generate professional, risk-aware trading analysis for the E-mini S&P 500 (ES) futures market. The primary objective is to determine if LLMs can produce directional signals, confidence levels, and technical confirmations using only public OHLCV (Open, High, Low, Close, Volume) data and a single price chart image, focusing on risk awareness and regime sensitivity rather than solely raw returns.
The core methodology involves a benchmark task named "ES_Trading_Professional_Analysis," designed to evaluate one-day-ahead long/short trading signal generation for ES futures, coupled with professional-style reasoning based on observable price structure. The experimental setup uses E-mini S&P 500 (ES) Futures as the instrument, a one-day prediction horizon, and a long/short strategy. Inputs to the LLMs comprise historical OHLCV time series data and a single 180-day price chart image. The backtesting period spans from January 1, 2025, to November 25, 2025, encompassing 228 trading days. A passive buy-and-hold strategy for ES futures serves as the baseline for performance comparison. To standardize evaluation, LLM outputs, representing directional signals, were mapped to discrete numerical values: Long positions correspond to signals in the range , Short positions to , and Neutral/Wait signals are represented by .
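The one-day-ahead evaluation described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the function names `backtest_signals` and `equity_curve` are hypothetical, the prices and signals are made up, and the convention that a signal is a signed fraction of full exposure (positive for long, negative for short, zero for neutral) is an assumption, since the exact numeric mapping is not reproduced here.

```python
def backtest_signals(closes, signals):
    """Apply the signal decided on day t to the close-to-close return of day t+1.

    Assumption: each signal is a signed exposure fraction
    (positive = long, negative = short, 0.0 = neutral/wait).
    """
    if len(signals) != len(closes):
        raise ValueError("closes and signals must have the same length")
    daily_returns = []
    for t in range(len(closes) - 1):
        next_day_return = closes[t + 1] / closes[t] - 1.0
        daily_returns.append(signals[t] * next_day_return)
    return daily_returns


def equity_curve(daily_returns, start=1.0):
    """Compound a list of daily returns into an equity curve."""
    equity = [start]
    for r in daily_returns:
        equity.append(equity[-1] * (1.0 + r))
    return equity


# Made-up closes and signals, for illustration only.
closes = [100.0, 102.0, 101.0, 103.0]
signals = [0.5, -0.5, 0.0, 0.5]  # the last signal has no next day to act on

strategy = backtest_signals(closes, signals)
buy_and_hold = backtest_signals(closes, [1.0] * len(closes))

print(equity_curve(strategy)[-1], equity_curve(buy_and_hold)[-1])
```

Modelling buy-and-hold as a constant signal of 1.0 keeps the baseline and the LLM strategies on the same code path, so any difference in results comes from the signals alone.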
Performance assessment employs a comprehensive suite of metrics. Return-based metrics include Total Return and Compound Annual Growth Rate (CAGR). Risk-based metrics encompass the Sharpe Ratio ($\frac{R_p - R_f}{\sigma_p}$, where $R_p$ is portfolio return, $R_f$ is risk-free rate, and $\sigma_p$ is portfolio standard deviation), the Sortino Ratio ($\frac{R_p - R_f}{\sigma_d}$, where $\sigma_d$ is downside deviation), Maximum Drawdown, and Volatility. Additionally, rolling equity and rolling risk diagnostics are used to analyze performance across different market regimes.
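A minimal sketch of how these metrics might be computed from a series of daily strategy returns is shown below. Several details are assumptions rather than the paper's stated choices: the 252-day annualization factor, the zero default risk-free rate, the zero downside target, and the convention of averaging squared downside returns over all observations.

```python
import math

TRADING_DAYS = 252  # assumed annualization factor


def total_return(returns):
    """Compounded return over the whole period."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth - 1.0


def cagr(returns, periods_per_year=TRADING_DAYS):
    """Annualized compound growth rate."""
    years = len(returns) / periods_per_year
    return (1.0 + total_return(returns)) ** (1.0 / years) - 1.0


def volatility(returns, periods_per_year=TRADING_DAYS):
    """Annualized sample standard deviation of returns."""
    mean = sum(returns) / len(returns)
    var = sum((r - mean) ** 2 for r in returns) / (len(returns) - 1)
    return math.sqrt(var * periods_per_year)


def downside_deviation(returns, target=0.0, periods_per_year=TRADING_DAYS):
    """Annualized root-mean-square of returns below the target."""
    squared = [min(r - target, 0.0) ** 2 for r in returns]
    return math.sqrt(sum(squared) / len(returns) * periods_per_year)


def sharpe_ratio(returns, risk_free_annual=0.0, periods_per_year=TRADING_DAYS):
    """(R_p - R_f) / sigma_p on annualized quantities."""
    mean_annual = sum(returns) / len(returns) * periods_per_year
    return (mean_annual - risk_free_annual) / volatility(returns, periods_per_year)


def sortino_ratio(returns, risk_free_annual=0.0, periods_per_year=TRADING_DAYS):
    """(R_p - R_f) / sigma_d on annualized quantities."""
    mean_annual = sum(returns) / len(returns) * periods_per_year
    dd = downside_deviation(returns, 0.0, periods_per_year)
    return math.inf if dd == 0.0 else (mean_annual - risk_free_annual) / dd


def max_drawdown(returns):
    """Largest peak-to-trough decline of the equity curve, as a negative number."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = min(worst, equity / peak - 1.0)
    return worst
```

Note that `sharpe_ratio` divides by the sample volatility, so it is undefined for a constant return series; a production version would need to guard against that case.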
The results indicate that while all evaluated LLMs successfully outperform the buy-and-hold benchmark in terms of volatility and maximum drawdown, none surpass it in total return or CAGR. This suggests LLMs demonstrate robust risk control but fail to fully monetize their directional signals. A regime-based analysis further dissects performance:
- Phase 1 (January–March): The market was choppy, and conservative signal scaling by LLMs limited upside capture, leading to buy-and-hold outperforming all LLMs.
- Phase 2 (March–August): This phase was characterized by corrections and regime transitions, where the adaptive, risk-aware signals of LLMs provided an advantage, with all LLMs outperforming buy-and-hold. Gemini-Flash v2.5, Gemini-Pro v3, and DeepSeek v3.1 were the top performers.
- Phase 3 (August–November): A strong, persistent bullish trend favored buy-and-hold, which decisively outperformed all LLMs. This underperformance was identified as structural, stemming from LLMs operating at approximately 50% effective exposure due to conservative signal magnitudes (e.g., ) compared to the buy-and-hold's near 100% exposure.
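The structural effect described for Phase 3 can be illustrated with a toy calculation. The sketch below assumes a hypothetical signal magnitude of 0.5 (chosen only to match the roughly 50% effective exposure mentioned above) and a made-up steady uptrend; the helper names `effective_exposure` and `compound` are illustrative and not taken from the paper.

```python
def effective_exposure(signals):
    """Average absolute signal, read as the fraction of capital at risk."""
    return sum(abs(s) for s in signals) / len(signals)


def compound(returns):
    """Compounded return of a list of per-period returns."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth - 1.0


# Hypothetical persistent uptrend: +0.2% per day for 60 days.
market = [0.002] * 60

half_signals = [0.5] * 60  # always long and always right, but at half size
full_signals = [1.0] * 60  # buy-and-hold

half_return = compound([s * r for s, r in zip(half_signals, market)])
full_return = compound([s * r for s, r in zip(full_signals, market)])

print(effective_exposure(half_signals), half_return, full_return)
```

Even with a perfect directional call on every day, the half-size strategy captures only about half of the trend, which is the sense in which the shortfall is attributed to exposure rather than to signal quality.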
DeepSeek v3.1 was highlighted for its strong directional characteristics, exhibiting an 81% long bias, a short-side win rate exceeding 61%, and a long-side win rate exceeding 59%. Its risk management was also notable with a Sharpe ratio of approximately and a maximum drawdown of approximately . These results point to genuine directional skill and disciplined risk management, where the alpha generated is not fully monetized due to conservative exposure. The paper suggests that simple transformations, such as scaling signal amplitudes from to , could substantially improve return metrics without materially increasing risk.
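Directional statistics of this kind could be computed along the following lines. The definitions are assumptions, since the paper's exact ones are not given here: long bias is taken as the share of long signals among all non-neutral signals, and a win is a long followed by a positive next-day return or a short followed by a negative one. The function name `directional_stats` and the sample data are hypothetical.

```python
def directional_stats(signals, next_day_returns):
    """Long bias and per-side win rates under the assumed definitions."""
    pairs = list(zip(signals, next_day_returns))
    longs = [r for s, r in pairs if s > 0]
    shorts = [r for s, r in pairs if s < 0]
    active = len(longs) + len(shorts)

    def rate(wins, total):
        # None signals "not defined" when there are no trades on that side.
        return wins / total if total else None

    return {
        "long_bias": rate(len(longs), active),
        "long_win_rate": rate(sum(1 for r in longs if r > 0), len(longs)),
        "short_win_rate": rate(sum(1 for r in shorts if r < 0), len(shorts)),
    }


# Made-up signals and next-day returns, for illustration only.
example_signals = [0.5, 0.5, -0.5, 0.0, 0.5, -0.5]
example_returns = [0.01, -0.01, -0.02, 0.01, 0.03, 0.01]

stats = directional_stats(example_signals, example_returns)
print(stats)
```

Neutral signals are excluded from both the bias and the win rates, so a model that waits often is not penalized or rewarded for days on which it takes no position.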
Key takeaways from the study are that grounded reasoning in LLM outputs is crucial for deployment. LLMs demonstrate a strong capacity to learn risk-aware directional behavior, effectively controlling volatility and drawdown even with public data and a single instrument. The primary limitation to their performance lies not in the quality of their directional insights but in their exposure management, leading to structural underperformance in returns. The paper concludes that alpha exists, but its full potential can only be unlocked through improved monetization strategies, such as regime-aware scaling and adaptive exposure mechanisms. This benchmark suggests LLMs are already capable of professional, risk-aware trading analysis, with their main limitation being the translation of insight into effective exposure.