Extracting AI models from mobile apps

Role of resize_to_320.tflite and basic ML details

  • Commenters note that the app’s resize_to_320.tflite only performs image resizing via standard TensorFlow ops; it is not an “AI model” for resizing.
  • Its small size (~7.7 KB) implies it contains almost no learned weights.
  • The thread clarifies that TensorFlow is a general compute framework, and that many vision models require fixed low‑resolution inputs (hence the resize step).
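The “almost no learned weights” point can be sanity-checked with back-of-the-envelope arithmetic: even if the entire file were raw float32 weights, ~7.7 KB bounds the parameter count at a couple of thousand. A minimal sketch (the function names are illustrative; the `TFL3` check relies on TFLite files being FlatBuffers whose file identifier sits at byte offset 4):

```python
def looks_like_tflite(data: bytes) -> bool:
    # FlatBuffer files carry a 4-byte file identifier after the root
    # offset; for TensorFlow Lite models it is "TFL3".
    return len(data) >= 8 and data[4:8] == b"TFL3"

def max_float32_params(file_size_bytes: int) -> int:
    # Loose upper bound: pretend the whole file is raw float32 weights,
    # ignoring graph structure, op metadata, and quantization tables.
    return file_size_bytes // 4

# A ~7.7 KB file holds at most ~1900 float32 parameters --
# consistent with a pure resize graph, not a learned model.
print(max_float32_params(7700))
```

Real models are typically megabytes of weights, so file size alone is a quick tell when triaging extracted `.tflite` assets.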

Status of AI models as intellectual property

  • Strong debate over whether model weights are copyrightable or just “facts”/coefficients produced mechanically.
  • Some argue:
    • Copyright generally requires human authorship; automated weights may not qualify.
    • Weights may be better treated as trade secrets, or protected via contracts and licenses.
    • Training-set curation and model implementations are clearly copyrightable; architectures may be patentable.
  • Others counter:
    • Models are licensed (e.g., LLaMA, Stable Diffusion, banknote‑net), implying they’re treated as IP.
    • Compilations and compiled code are copyrighted even if produced by automated tools, suggesting an analogy for weights.
  • Consensus: legal status of model weights is unclear and largely untested in court.

DMCA, circumvention, and legality of extraction

  • DMCA §1201: circumventing effective access controls can be illegal even without redistribution, but only for works protected by copyright.
  • Discussion of broad interpretations (any copy‑prevention scheme) vs case law limiting DMCA to actual copyrighted works.
  • Extracting models with reverse‑engineering tools may itself amount to building or using illegal “circumvention tools” in some cases; legality is unsettled and jurisdiction‑dependent.

Training on copyrighted data vs claiming IP on models

  • Many criticize big AI firms for training on unlicensed copyrighted data while asserting strong IP over resulting models (“rules for thee, not for me”).
  • Disagreement over whether training is fair use:
    • Pro side: highly transformative, analogous to learning; weights are statistics over many works.
    • Con side: models can regurgitate training data, can undermine creators’ livelihoods, and scale content production massively.
  • Some say if models are protected, training on copyrighted data should not simultaneously be fair use; others separate those questions.

Model “laundering” and distillation

  • Techniques like model distillation and training on synthetic/model‑generated data are common; could be used to avoid direct copying of proprietary weights.
  • Legal treatment of such derivative models is unclear.
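A toy sketch of the distillation idea discussed above: the student never touches the teacher’s weights, only its input/output behavior (all names here are illustrative, and the one-parameter “models” are deliberately trivial):

```python
def teacher(x: float) -> float:
    # Stand-in for a black-box proprietary model (illustrative).
    return 3.0 * x

def distill_student(xs: list) -> float:
    # Query the teacher to build a synthetic training set...
    ys = [teacher(x) for x in xs]
    # ...then fit a student y = w * x by least squares (closed form),
    # reproducing the teacher's behavior without copying its weights.
    num = sum(x * y for x, y in zip(xs, ys))
    den = sum(x * x for x in xs)
    return num / den

w = distill_student([1.0, 2.0, 3.0])
print(w)  # the student recovers the teacher's behavior (w = 3.0)
```

This is why the “laundering” framing comes up: the distilled weights are new numbers, yet they encode the original model’s function.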

On‑device models, extraction risk, and DRM

  • General principle: anything shipped to a user device can be extracted with enough effort; mobile apps are not a secure place for “secret sauce”.
  • Frida is highlighted as a powerful dynamic instrumentation tool; the approach extends to recovering tokenizers and pre/post‑processing logic by observing framework calls.
  • Ideas for protection:
    • Encrypt models for specific inference runtimes (e.g., CoreML with public/private keys).
    • Use GPU/TEE/DRM‑style secure hardware so decrypted data never leaves the device’s protected area.
  • Counterpoint: given physical access, skilled attackers can still use hardware attacks (fault injection, power analysis, etc.); any device that must run matrix multiplies on decrypted data is ultimately attackable.
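Frida itself hooks native code on the device, but the observation trick can be illustrated in pure Python: wrap the inference entry point so everything passing through it is recorded. This is a conceptual analogy, not a Frida script; the function names are hypothetical stand-ins for something like a native interpreter-invoke call:

```python
import functools

def hook(fn, captured):
    """Wrap fn so its arguments and return value are recorded --
    conceptually what a Frida interceptor does to a native call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        captured.append({"args": args, "result": result})
        return result
    return wrapper

def run_inference(pixels):
    # Stand-in for an app's preprocessing + model call (illustrative).
    return [p / 255.0 for p in pixels]

captured = []
run_inference = hook(run_inference, captured)
run_inference([0, 128, 255])
# `captured` now holds the exact tensors the app fed the model,
# exposing the preprocessing convention without any source code.
```

The same pattern, applied at the native layer, is how tokenizers and normalization constants get recovered alongside the weights.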

Cloud vs on‑device inference

  • Hosting models remotely (e.g., via Firebase) avoids shipping them but introduces:
    • Ongoing compute costs, latency, and bandwidth use.
    • Loss of offline functionality.
  • Hybrid schemes (partial cloud, partial device) are discussed as possible but technically complex.
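One hybrid pattern consistent with the tradeoffs above is “cloud first, device fallback”: keep the strong model server-side but retain offline functionality with a smaller shipped model. A minimal sketch, with all function names and return values hypothetical:

```python
def cloud_infer(x):
    # Hypothetical remote call; raises when the device is offline.
    raise ConnectionError("no network")

def device_infer(x):
    # Smaller on-device fallback model (illustrative stub) -- note
    # that this one ships with the app and is therefore extractable.
    return {"label": "banknote", "source": "device"}

def classify(x):
    """Prefer the hosted model; fall back locally to stay usable
    offline, accepting latency/cost when online."""
    try:
        return cloud_infer(x)
    except ConnectionError:
        return device_infer(x)

print(classify(b"image-bytes"))
```

The design choice is the point: whatever lives in `device_infer` is subject to every extraction technique the thread describes, so the sensitive model stays in `cloud_infer`.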

Use of open models in the example

  • The banknote‑recognition model extracted in the demo is publicly available, trained on open data, and MIT/CDLA‑licensed; commenters see it as a safe, illustrative target.
  • Some speculate this choice avoids demonstrating the technique on truly proprietary models.

Community reception and educational value

  • Many appreciate the article as an accessible intro to Frida and mobile reverse engineering, especially for newer ML engineers or security‑curious readers.
  • Others downplay novelty but agree it effectively illustrates that “what runs on your device can be recovered.”