Company Updates

New Research: The Data Leaking into GenAI Tools

January 15, 2025

Our new research just dropped today, with some unique insights into the kinds of sensitive data uploaded into GenAI tools in 2024. The report, ‘From Payrolls to Patents: The Spectrum of Data Leaked into GenAI in 2024’, is a mega-analysis of Harmonic’s own dataset: the data we observed leaving businesses into GenAI tools like Copilot, ChatGPT, Perplexity, Claude, and Gemini.

You can snag a copy of the full report here, or continue reading for some highlights. 

Concerns mounting around intellectual property leaking into GenAI

According to research by Enterprise Strategy Group, 96% of organizations are enforcing or developing governance structures for generative AI use, and 82% are concerned about data leakage. Compounding the challenge, many organizations lack visibility into what data employees are entering into GenAI tools. These fears are grounded in real-world incidents at Amazon and Samsung, where the companies reportedly discovered GenAI tools showing signs of having trained on highly sensitive internal data.

When we think of data leakage, our minds often go straight to PII, social security numbers, credit cards, and API keys. However, for many businesses, the concern more often centers on intellectual property, particularly source code.

High use of free tiers elevates risk

As AI adoption grows, many organizations designate a single “approved” GenAI tool, often through enterprise agreements that offer assurances against models training on customer data. However, this protection works only if employees use the paid enterprise accounts.

Free-tier use is significant and exacerbates these risks, since free tiers lack the security features that come with enterprise versions. Many free-tier tools explicitly state that they train on customer data, meaning sensitive information entered into them could be used to improve the models.

In 2024, 63.8% of ChatGPT users were on the free tier, and 53.5% of sensitive prompts were entered there. Free tiers also accounted for 58.62% of Gemini users, 75% of Claude users, and 50.48% of Perplexity users.

Almost 10% of prompts contain sensitive information

In the vast majority of cases, employees are not doing anything out of the ordinary. They want to summarize a piece of text. They want to edit a blog. They want to write documentation for code. This is the mundane reality of GenAI usage across the enterprise. Even so, 8.5% of prompts contained sensitive information.

To better understand the types of data employees shared with GenAI tools, we applied dozens of pre-trained small language models to tens of thousands of prompts.

These types of sensitive data fall into five top-level categories: customer data, employee data, legal and finance data, security data, and sensitive code. 
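
As a simplified illustration of the approach, a single off-the-shelf zero-shot model can route a prompt into these top-level categories. The model choice (facebook/bart-large-mnli) and the extra “not sensitive” label below are placeholder assumptions for this sketch, not our production pipeline, which uses dozens of pre-trained small language models:

# A minimal sketch: route one prompt into one of the report's five
# top-level categories using a single zero-shot model. The model name
# and the catch-all "not sensitive" label are illustrative assumptions.
from transformers import pipeline

CATEGORIES = [
    "customer data",
    "employee data",
    "legal and finance data",
    "security data",
    "sensitive code",
    "not sensitive",  # catch-all for benign prompts (our assumption)
]

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def classify_prompt(prompt: str) -> tuple[str, float]:
    """Return the highest-scoring category and its score for one prompt."""
    result = classifier(prompt, candidate_labels=CATEGORIES)
    return result["labels"][0], result["scores"][0]

label, score = classify_prompt("Summarize this payroll spreadsheet for Q3.")
print(f"{label} ({score:.2f})")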

For each of these categories, the report drills down into specific subcategories. “Legal and finance” data, for example, is broken down into six further subcategories.

Summary

The combination of diverse sensitive data types and heavy reliance on free tiers is alarming. Blocking access to GenAI tools, however, is not a viable solution. Their appeal is strong enough that employees bypass restrictions, often resorting to personal devices and networks, which only expands shadow IT.

Instead, organizations must focus on securely enabling GenAI usage while educating employees about the risks of uploading sensitive data and using free-tier versions.

Check out the full report here to learn more!

Request a demo

Team Harmonic