2024-08-29

OpenAI is good at unminifying code

Capabilities and Use Cases

Many report LLMs are strong at “text transformations”: unminifying JS, renaming identifiers, reformatting, refactoring, and translating code between languages/frameworks.
People successfully use LLMs to:
- Reverse-engineer minified JS and Shopify scripts.
- Clean up and comment messy code, or explain legacy logic and the “why” behind its design decisions.
- Convert code across ecosystems (e.g., Python↔JS, AWS SDKs, CloudFormation↔Terraform↔CDK).
- Extract structured data (CSV/JSON) from text and parse database schemas.
Some use models alongside decompilers (e.g., Ghidra, Binary Ninja) to assist reverse engineering of binaries or assembly, with mixed but promising results.

Minification vs. Decompilation / Obfuscation

Multiple commenters stress that unminifying JS (same language, mostly renames and formatting) is far easier than decompiling binaries or undoing true obfuscation.
LLMs still struggle with heavily obfuscated or “state-of-the-art” JS and complex compiled binaries.
There’s debate on how hard the inverse problem really is; some see minification inversion as relatively easy, others note that lost semantics (names, comments) are nontrivial to reconstruct.

Tooling and Techniques

Several tools are mentioned that combine AST parsing with LLMs:
- In these workflows, a traditional parser keeps the semantics intact while the LLM only suggests better names or comments.
- Some tools offer local-model modes, which are reported as slower and less accurate than API-based modes; the API-based modes are faster but incur token costs.
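The division of labor in such workflows can be sketched roughly as follows. This is a minimal illustration, not how any of the mentioned tools is implemented: Python's standard `ast` module stands in for a JS parser, and the `SUGGESTED` mapping is a hypothetical stand-in for what a model might return. The names `MINIFIED`, `Renamer`, and `apply_renames` are invented for the example. For brevity the mapping is applied globally rather than per scope.

```python
import ast

# Hypothetical "minified" input; single-letter names mimic minifier output.
MINIFIED = "def f(a, b):\n    c = a * b\n    return c + a\n"

# Hypothetical stand-in for an LLM response: old name -> suggested name.
SUGGESTED = {"f": "scaled_sum", "a": "base", "b": "factor", "c": "product"}


class Renamer(ast.NodeTransformer):
    """Apply a fixed rename mapping; the shape of the tree is never changed."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_FunctionDef(self, node):
        node.name = self.mapping.get(node.name, node.name)
        self.generic_visit(node)  # descend into arguments and body
        return node

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node


def apply_renames(source, mapping):
    """Parse, rename deterministically, and print the code back out."""
    tree = Renamer(mapping).visit(ast.parse(source))
    return ast.unparse(tree)  # requires Python 3.9+


readable = apply_renames(MINIFIED, SUGGESTED)
print(readable)
```

Because the model only supplies the mapping, a bad suggestion can produce a misleading name but cannot alter control flow or expressions. A real tool would also need to handle scoping and name collisions, which this sketch ignores.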
Suggested patterns:
- Use LLMs to rename variables per-scope, then apply deterministic renames via AST tooling.
- Validate LLM transformations via unit tests, mutation testing, or AST equivalence checks.
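One way to read the "AST equivalence check" idea is to compare two versions of the code after replacing every identifier with a placeholder numbered by first appearance. The sketch below does that with Python's standard `ast` module as a stand-in for a JS parser. `Canonicalizer` and `same_shape` are hypothetical names for this example. It is a conservative structural check, not a proof of semantic equivalence: it treats every name (including builtins and globals) as renameable and does not canonicalize attribute names.

```python
import ast


class Canonicalizer(ast.NodeTransformer):
    """Replace each identifier with a placeholder numbered by first appearance."""

    def __init__(self):
        self.names = {}

    def _canon(self, name):
        # The same original name always maps to the same placeholder.
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_FunctionDef(self, node):
        node.name = self._canon(node.name)
        self.generic_visit(node)  # descend into arguments and body
        return node

    def visit_arg(self, node):
        node.arg = self._canon(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._canon(node.id)
        return node


def same_shape(source_a, source_b):
    """True if the sources differ only in identifier names and formatting."""
    dump_a = ast.dump(Canonicalizer().visit(ast.parse(source_a)))
    dump_b = ast.dump(Canonicalizer().visit(ast.parse(source_b)))
    return dump_a == dump_b


original = "def f(a, b):\n    c = a * b\n    return c + a\n"
renamed = "def scaled_sum(base, factor):\n    product = base * factor\n    return product + base\n"
altered = "def scaled_sum(base, factor):\n    product = base + factor\n    return product + base\n"

print(same_shape(original, renamed))  # pure rename
print(same_shape(original, altered))  # operator was changed
```

A check like this catches a model that quietly changes an operator or merges two variables into one, while unit tests or mutation testing cover behavior that a purely structural comparison cannot.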

Legal, Ethical, and Licensing Concerns

There is strong disagreement over whether LLM-assisted decompilation could “render all code open source.”
Several point out that having the source is not the same as having rights to it; licenses and copyright still govern use and redistribution.
Clean-room reverse engineering is discussed. Using decompiled or LLM-produced code directly is seen as risky, while using it only to write specs for a separate implementation may be acceptable in some jurisdictions; commenters flag the details as jurisdiction-dependent and unclear.

Broader Implications and Skepticism

Some see this as a big unlock for reverse engineering, refactoring, and legacy software.
Others downplay novelty, noting that beautifiers and decompilers already exist, and LLM hallucinations and correctness remain major concerns.
