Show HN: My LLM CLI tool can run tools now, from Python code or plugins

Core Capabilities and CLI Use Cases

  • Single CLI interface to “hundreds” of models, with automatic logging of prompts/responses in SQLite for experiment tracking.
  • Strong shell integration: pipe files and command output into models for transformations and explanations (e.g., add type hints to code, generate commit messages from git diff, explain complex CSS).
  • Supports multimodal input via attachments (e.g., llm 'describe this photo' -a photo.jpg).
  • Tool plugins enable natural-language-to-command workflows (e.g., have the model propose ffmpeg commands, then confirm before they run), and combining multiple input files/URLs in a single prompt provides substantial coding assistance (sketched below).
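
For readers who haven't seen it, a minimal sketch of tool calling through llm's Python API might look like the following; the chain()/tools interface and the model ID are based on the tool-calling release and should be treated as assumptions rather than a definitive recipe.

```python
import llm
from pathlib import Path


def file_size(path: str) -> int:
    """Return the size of a file in bytes (a deliberately harmless, read-only tool)."""
    return Path(path).stat().st_size


# Assumed API: get_model() and chain(..., tools=[...]) as introduced with
# tool support; exact names may differ between versions.
model = llm.get_model("gpt-4.1-mini")
response = model.chain(
    "How large is README.md, in kilobytes?",
    tools=[file_size],
)
print(response.text())
```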

Plugins, Ecosystem, and UIs

  • Rich plugin ecosystem: model backends (Anthropic, Gemini, Ollama, llama.cpp), MCP experiments, QuickJS and SQLite tools, terminal helpers, tmux-based assistants, Zsh/Fish helpers that turn English into shell commands, and an external GTK desktop chat UI that integrates with llm (a minimal plugin sketch follows this list).
  • Streaming Markdown rendering (Streamdown) is highlighted as a nontrivial but important UX component; there’s interest in “semantic routing” of streamed output.
  • Some users maintain shell completion plugins and small wrappers for “quick answer” or “conceptual grep” workflows.
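
As a rough illustration of how small these plugins can be, a tool plugin is essentially a module exposing a plugin hook; the register_tools hook name below follows llm's plugin documentation, but treat the details as an assumption and check the current docs.

```python
import llm


def word_count(text: str) -> int:
    """Count whitespace-separated words in the supplied text."""
    return len(text.split())


@llm.hookimpl
def register_tools(register):
    # Assumed hook: exposes word_count as a tool the model can call
    # once the plugin is installed (e.g., via llm install).
    register(word_count)
```

Model backends, extra CLI commands, and the shell helpers mentioned above follow the same hook-based pattern with different hook names.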

Installation, Upgrades, and Performance

  • Users report plugins disappearing on upgrade (when llm is installed with uv tool or Homebrew); the recommended workaround is llm install -U llm or reinstalling with --with flags. There’s a proposal to auto-restore plugins from a plugins.txt file.
  • Some see slow startup (even for --help), possibly due to heavy plugin imports; profiling and lazy-import guidance are suggested.
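
On the startup point, CPython's -X importtime flag is the standard way to see which imports dominate, and the usual lazy-import fix for a plugin is to defer heavy dependencies into the command body. A generic sketch, assuming llm's register_commands hook (not taken from any specific plugin):

```python
import click
import llm


@llm.hookimpl
def register_commands(cli):
    @cli.command(name="heavy-report")
    def heavy_report():
        """Example subcommand that defers an expensive import until it actually runs."""
        import numpy  # stand-in for any heavy dependency; `llm --help` no longer pays this cost
        click.echo(f"numpy {numpy.__version__} loaded only when needed")
```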

Tool Calling Behavior and Reliability

  • Tool calling is seen as powerful but finicky: some report models “gaslighting” them about tool execution (e.g., claiming a calendar event was created when no tool was actually called).
  • One key insight: high-quality tool use often depends on very detailed system prompts and examples (thousands of tokens), which some find unsettling and brittle.
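
To make that concrete, the instructions that keep tool use honest typically live in the system prompt. A sketch along these lines, assuming the system= and chain() parameters of llm's Python API; a production prompt would reportedly be far longer and include worked examples:

```python
import llm


def create_calendar_event(title: str, start: str, end: str) -> str:
    """Illustrative tool; a real version would call a calendar API."""
    return f"created: {title} ({start}-{end})"


SYSTEM = """Only claim an event was created after create_calendar_event
has been called and returned successfully. If you did not call the tool,
say so explicitly instead of describing the event as booked."""

model = llm.get_model("gpt-4.1-mini")
response = model.chain(
    "Book a 30-minute project sync tomorrow at 10:00",
    system=SYSTEM,
    tools=[create_calendar_event],
)
print(response.text())
```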

Safety, Footguns, and Responsibility

  • Strong concern that tools, especially with authenticated actions (e.g., brokerage accounts, GitHub MCP), massively increase “footgun” risk.
  • Debate over whether this is “just another tool” vs. qualitatively new risk because LLM decisions are non-deterministic and opaque.
  • Extended ethical discussion: who is responsible when an LLM-enabled system causes harm, even if builders followed “best practices”? Opinions range from “clearly the human” to deeper critiques of deploying non-verifiable models in safety-critical contexts.
  • Proposed mitigations: sandboxing, explicit user confirmation for dangerous actions, read-only tools, and designs where tools hold credentials and only expose scoped tokens/symbols to the model.
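
Several of these mitigations can live in the tool layer itself rather than in the model. The following is a generic Python sketch (not an llm API) of a tool that gates a state-changing action behind explicit confirmation and keeps the credential out of the model's context entirely:

```python
import os


def _confirm(action: str) -> bool:
    """Require an explicit human 'y' before any state-changing action."""
    return input(f"Allow the model to {action}? [y/N] ").strip().lower() == "y"


def place_order(symbol: str, quantity: int) -> str:
    """Illustrative brokerage tool: the model only ever sees symbol and quantity;
    the API key lives in the tool's environment and never enters the prompt."""
    api_key = os.environ["BROKER_API_KEY"]  # scoped to the tool, invisible to the model
    if not _confirm(f"place an order for {quantity} x {symbol}"):
        return "Order declined by the user."
    # ... call the brokerage API with api_key here (omitted) ...
    return f"Order submitted: {quantity} x {symbol}"
```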

Models, Local Backends, and Cost

  • GPT‑4.1 mini is praised as very cheap and surprisingly capable; heavier models (e.g., o3/o4) are used selectively for coding.
  • Local tool calling via llama.cpp + llm-llama-server is demonstrated; users note they can also enable tools via extra-openai-models.yaml with flags like supports_tools: true (see the sketch after this list).
  • Some experiment with local multimodal models and ask about latency for real-time UI automation, though actual performance remains unclear in the thread.
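
For the extra-openai-models.yaml route mentioned above, an entry for a local llama.cpp server might look roughly like this; the field names follow llm's documentation for OpenAI-compatible endpoints, but the exact keys and the config path should be treated as assumptions:

```yaml
# extra-openai-models.yaml (in llm's config directory; location varies by OS)
- model_id: llama-server
  model_name: local-model          # whatever name the local server expects
  api_base: http://localhost:8080/v1
  supports_tools: true             # the flag mentioned in the thread
```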

Broader Reflections and Limitations

  • Some see llm turning the terminal into an “AI playground,” simpler than frameworks like LangChain or OpenAI Agents for many use cases.
  • Others are uneasy: long hidden prompts for tools, lack of deterministic behavior, and inability to write strong automated tests make this feel unlike previous abstraction jumps (e.g., assembly → C).
  • There’s philosophical disagreement over whether LLMs “understand” language vs. merely simulate it—but several participants emphasize that even as “language toys,” they’re already extremely useful.
  • Minor critiques: the project name (llm) is too generic, documentation is scattered across multiple sources, and there’s a desire for more canonical, consolidated docs and a web UI.