We gave 5 LLMs $100K to trade stocks for 8 months

Backtesting & Data Leakage

  • Many commenters argue 8 months of backtested performance is close to meaningless, especially in a strong bull market.
  • There’s persistent skepticism that models may implicitly “know the future” via training data, even with dates chosen after stated cutoffs.
  • The project’s “time-segmented APIs” are understood to reveal only past data up to each simulated day, but critics note this gates the data, not the model: memorized patterns and news can still surface from the weights, especially for continuously updated models.
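The gating mechanism critics describe can be sketched as follows. This is a hypothetical reconstruction, not the project's actual implementation: a feed that refuses to serve records dated after the simulation's current day. The sketch also makes the objection concrete — nothing here can remove knowledge a model already memorized during training.

```python
from datetime import date

# Hypothetical sketch of a "time-segmented" data feed: callers only see
# records dated on or before the simulation's current day. This gates
# the *data*, but cannot gate knowledge baked into a model's weights.
class TimeGatedFeed:
    def __init__(self, records):
        # records: list of (date, payload) tuples
        self.records = sorted(records)
        self.today = None

    def advance_to(self, day: date):
        # Move the simulation clock forward to `day`
        self.today = day

    def query(self):
        if self.today is None:
            raise RuntimeError("call advance_to() first")
        # Reveal only records at or before the simulated 'today'
        return [payload for d, payload in self.records if d <= self.today]

feed = TimeGatedFeed([
    (date(2024, 1, 2), "AAPL 185.64"),
    (date(2024, 1, 3), "AAPL 184.25"),
    (date(2024, 1, 4), "AAPL 181.91"),
])
feed.advance_to(date(2024, 1, 3))
assert feed.query() == ["AAPL 185.64", "AAPL 184.25"]
```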

Paper Trading vs Real Markets

  • Strong pushback on using paper money: no market impact, no slippage, no queue/prioritization effects, and no execution frictions.
  • Several commenters with trading experience say strategies that backtest or paper-trade well often fail live.
  • Emotional risk tolerance with real money is absent in simulations.
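The frictions the commenters list can be made concrete with a toy cost model. The numbers (spread, impact coefficient) are invented for illustration, and the linear impact term is a deliberate simplification of the square-root impact models more common in practice:

```python
# Toy illustration of execution frictions absent from paper trading.
# All parameters are made-up illustrative values.
def paper_fill(mid_price, shares):
    # Paper trading: assume a perfect fill at the quoted mid price
    return mid_price * shares

def frictional_fill(mid_price, shares, spread=0.02, impact_per_1k=0.01):
    # Cross half the bid-ask spread, plus a price-impact penalty that
    # grows with order size (linear here to keep the sketch simple)
    px = mid_price + spread / 2 + impact_per_1k * (shares / 1000)
    return px * shares

cost_paper = paper_fill(100.00, 5000)
cost_real = frictional_fill(100.00, 5000)
print(cost_real - cost_paper)  # the extra cost the simulation never charges
```

Even these modest assumptions add up over hundreds of trades, which is why live results routinely trail paper results.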

Methodological Limits

  • Constraints like one trade per day, 5–15 positions, and fixed position sizes are seen as arbitrary and unrepresentative of real strategies.
  • Only one run per model over one time interval; models are non-deterministic, so a single path tells little.
  • Prompt is very high-level; trade “reasoning” is mostly generic narrative (e.g., “investing in AI”), not evidence of genuine strategy.

Interpretation of Results

  • All high performers were heavily concentrated in US tech/semiconductors; Gemini underperformed mainly because it wasn’t similarly concentrated.
  • Many argue the experiment mostly rediscovered “going long tech during a tech-led bull run,” not model skill.
  • Multiple commenters emphasize missing risk-adjusted metrics: no drawdowns, volatility, Sharpe/Sortino, or comparison to leveraged or sector ETFs.
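The missing metrics are cheap to compute from a daily-return series. A minimal sketch of the three statistics commenters asked for, using stdlib only and an illustrative return series:

```python
import math

# Minimal risk-adjusted metrics from a list of daily returns.
# Annualization assumes ~252 trading days per year.
def sharpe(returns, periods=252):
    mean = sum(returns) / len(returns)
    var = sum((r - mean) ** 2 for r in returns) / len(returns)
    return mean / math.sqrt(var) * math.sqrt(periods)

def sortino(returns, periods=252):
    # Like Sharpe, but penalizes only downside volatility
    mean = sum(returns) / len(returns)
    downside = sum(min(r, 0.0) ** 2 for r in returns) / len(returns)
    return mean / math.sqrt(downside) * math.sqrt(periods)

def max_drawdown(returns):
    # Worst peak-to-trough decline of the compounded equity curve
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = min(worst, equity / peak - 1.0)
    return worst

rets = [0.01, -0.02, 0.015, 0.003, -0.01, 0.02]  # illustrative only
print(sharpe(rets), sortino(rets), max_drawdown(rets))
```

Reporting these alongside raw returns, plus the same numbers for QQQ or a leveraged sector ETF, would show how much of the performance is simply beta to the tech rally.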

Views on LLMs in Trading

  • Broad consensus that generic LLMs are not designed to be autonomous trading engines and will likely underperform long term.
  • Some see genuine value as research assistants (summarizing news, sentiment, fundamentals) or design helpers for deterministic quant models, not as the model itself.
  • Others frame the exercise as interesting AI-behavior observation, but misleading if read as “LLMs can beat the market.”

Suggestions for Better Experiments

  • Use live, real-money forward tests over years; include random and human baselines.
  • Run many intervals (bull, bear, sideways), Monte Carlo-style, with multiple seeds per model.
  • Control sector exposure, test on constrained universes (e.g., non-tech, mid-caps), and report full risk statistics and trade counts.
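The "many intervals, many seeds" protocol above can be sketched as a loop over windows and seeds that reports a distribution of outcomes rather than a single path. `simulate_run` is a hypothetical stand-in for one full paper-trading episode; the windows and return distribution are invented for illustration:

```python
import random
import statistics

# Sketch of a Monte Carlo-style evaluation: many windows x many seeds.
# `simulate_run` is a placeholder for one full trading episode.
def simulate_run(model_name, window, seed):
    # Deterministic per (model, window, seed) via a string seed
    rng = random.Random(f"{model_name}-{window}-{seed}")
    return rng.gauss(0.05, 0.15)  # placeholder episode return

# Illustrative regime windows: bull, bear, sideways
windows = [("2020-03", "2020-11"), ("2022-01", "2022-09"),
           ("2023-06", "2024-02")]
results = [simulate_run("model-a", w, seed)
           for w in windows for seed in range(30)]
print(f"n={len(results)} mean={statistics.mean(results):.3f} "
      f"sd={statistics.stdev(results):.3f}")
```

A single 8-month run is one draw from this distribution; the spread across seeds and regimes is what distinguishes model skill from a lucky path.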