We gave 5 LLMs $100K to trade stocks for 8 months
Backtesting & Data Leakage
- Many commenters argue 8 months of backtested performance is close to meaningless, especially in a strong bull market.
- There’s persistent skepticism that models may implicitly “know the future” via training data, even with dates chosen after stated cutoffs.
- The project’s “time-segmented APIs” are understood to reveal only day-by-day past data, but critics note this doesn’t prevent recall of patterns or news already baked into model weights, especially for continuously updated models.
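The point-in-time idea behind a "time-segmented API" can be sketched as a feed that hides rows dated after a simulation clock. This is a minimal illustration (class and method names are assumptions, not the project's actual code): it blocks look-ahead in the *data*, but, as commenters note, it cannot remove knowledge a model memorized during training.

```python
import datetime as dt

class PointInTimeFeed:
    """Reveal only rows dated on or before the simulation clock.

    Sketch of point-in-time data access; prevents look-ahead in the
    data stream, but not leakage via a model's training weights.
    """

    def __init__(self, rows):
        # rows: list of (date, payload) tuples, e.g. daily price bars
        self.rows = sorted(rows, key=lambda r: r[0])
        self.clock = dt.date.min  # the simulation's "today"

    def advance_to(self, day):
        self.clock = day

    def visible(self):
        # anything dated after the clock stays hidden from the model
        return [payload for day, payload in self.rows if day <= self.clock]
```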
Paper Trading vs Real Markets
- Strong pushback on using paper money: no market impact, no slippage, no queue/prioritization effects, and no execution frictions.
- Several with trading experience say strategies that backtest or paper-trade well often fail live.
- Emotional risk tolerance with real money is absent in simulations.
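The execution frictions commenters say paper trading omits can be approximated crudely. A minimal sketch, with assumed (not measured) slippage and commission figures; real market impact is nonlinear in order size and queue position:

```python
def frictional_fill(side, mid_price, shares,
                    slippage_bps=5.0, commission_per_share=0.005):
    """Apply simple execution costs to an otherwise frictionless fill.

    Illustrative only: fixed slippage in basis points against the
    trader, plus a flat per-share commission (both assumed values).
    """
    slip = mid_price * slippage_bps / 10_000
    fill = mid_price + slip if side == "buy" else mid_price - slip
    fees = commission_per_share * shares
    # buys consume cash; sells raise cash; fees always hurt the trader
    cash_flow = -(fill * shares + fees) if side == "buy" else fill * shares - fees
    return fill, cash_flow
```

Even this toy model shifts every round trip against the strategy, which is one reason paper results tend to overstate live performance.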
Methodological Limits
- Constraints like one trade per day, 5–15 positions, and fixed position sizes are seen as arbitrary and unrepresentative of real strategies.
- Each model got only one run over a single time interval; models are non-deterministic, so a single realized path reveals little.
- Prompt is very high-level; trade “reasoning” is mostly generic narrative (e.g., “investing in AI”), not evidence of genuine strategy.
Interpretation of Results
- All high performers were heavily concentrated in US tech/semiconductors; Gemini underperformed mainly because it was not similarly concentrated.
- Many argue the experiment mostly rediscovered “going long tech during a tech-led bull run,” not model skill.
- Multiple commenters emphasize missing risk-adjusted metrics: no drawdowns, volatility, Sharpe/Sortino, or comparison to leveraged or sector ETFs.
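The risk-adjusted metrics commenters say are missing are cheap to compute from a daily return series. A self-contained sketch of two of them, annualized Sharpe ratio and maximum drawdown:

```python
import math

def annualized_sharpe(daily_returns, rf_daily=0.0, periods_per_year=252):
    """Annualized Sharpe ratio from daily returns (sample std dev)."""
    excess = [r - rf_daily for r in daily_returns]
    mean = sum(excess) / len(excess)
    var = sum((r - mean) ** 2 for r in excess) / (len(excess) - 1)
    return mean / math.sqrt(var) * math.sqrt(periods_per_year)

def max_drawdown(equity_curve):
    """Worst peak-to-trough decline, returned as a negative fraction."""
    peak, worst = equity_curve[0], 0.0
    for value in equity_curve:
        peak = max(peak, value)
        worst = min(worst, value / peak - 1.0)
    return worst
```

Reporting these alongside raw returns (and against a sector ETF benchmark) would show whether a portfolio merely rode concentrated tech exposure with high volatility.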
Views on LLMs in Trading
- Broad consensus that generic LLMs are not designed to be autonomous trading engines and will likely underperform long term.
- Some see genuine value as research assistants (summarizing news, sentiment, fundamentals) or design helpers for deterministic quant models, not as the model itself.
- Others frame the exercise as interesting AI-behavior observation, but misleading if read as “LLMs can beat the market.”
Suggestions for Better Experiments
- Use live, real-money forward tests over years; include random and human baselines.
- Run many intervals (bull, bear, sideways), Monte Carlo-style, with multiple seeds per model.
- Control sector exposure, test on constrained universes (e.g., non-tech, mid-caps), and report full risk statistics and trade counts.
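The "random baseline" suggestion above can be sketched directly: draw many random equal-weight portfolios from the same universe over the same window, and judge a single model's result against that distribution rather than against 0% or the index alone. The interface below (a ticker-to-return mapping) is a hypothetical simplification:

```python
import random
import statistics

def random_baseline(period_returns, n_positions=10, n_trials=5000, seed=0):
    """Mean and spread of equal-weight random portfolios.

    period_returns maps ticker -> total return over the test window
    (hypothetical interface). A strategy's single observed path should
    be located within this distribution before claiming skill.
    """
    rng = random.Random(seed)
    tickers = list(period_returns)
    outcomes = []
    for _ in range(n_trials):
        picks = rng.sample(tickers, n_positions)
        outcomes.append(sum(period_returns[t] for t in picks) / n_positions)
    return statistics.mean(outcomes), statistics.pstdev(outcomes)
```

In a tech-led bull run, a large share of random portfolios will also post strong returns, which is exactly the commenters' point about the experiment's headline numbers.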