What is shadow data?
In simple terms, shadow data is your company’s data that has been copied, backed up, or housed in a data store that is not governed, not covered by the same security structure, or not kept up to date by security or IT.
As an example, think about your main production data store. Of course, this is where you have your content, applications, and data accessible to all those who require it, but you are also keenly aware of it, keep it up to date, and have rigid security protocols in place. By contrast, consider the copies of that production data that are not being secured: the copy that exists in a test environment, the unmanaged backup left over from a lift and shift, or the orphaned backups and abandoned databases.

What is the difference between shadow IT and shadow data?
You’ve likely heard the term “shadow IT.” This is the hardware, software, applications, or technology projects that are run outside the governance and oversight of your corporate IT.
At one point, shadow IT was scary, a major threat to the security of an organization’s data. However, as the challenge became more known and companies took it seriously, teams figured out how to manage and contain it.
Since then, major advancements in technology – like the mass migration to the cloud – have brought us data democratization, which in itself is a boon to all organizations and consumers. Your data is important, and allowing greater access to this data for those who need it creates more opportunities, more effectiveness.
However, the cloud also allowed data to be spread around to various places you may not even be tracking. Gone are the days of completely self-contained, on-premises systems. With greater access comes greater risk. And now a new threat has arrived. One that, in comparison, dwarfs the risk of shadow IT. It’s the largest threat to your data security: shadow data.
Do you know where your sensitive data lives? And do you have the tools and resources to manage it? Shadow data is a prominent yet frequently overlooked problem, but there are tools and resources to tackle it and secure your most valuable currency – your data.
Why does shadow data occur?
As more and more companies move to the cloud, the landscape of cloud technologies expands and becomes more complex. Developers use the flexibility of the cloud to spin up new data storage assets with the click of a button, without consulting security or IT, and the data attack surface grows with each one. Add in data democratization and the lack of a perimeter, and shadow data becomes increasingly prevalent. So does the risk of a data breach, as traditional data security strategies fail to keep up.
There are four major factors that have changed cloud data protection and given way to the specter of shadow data:
The proliferation of technology and the associated high complexity: Dozens of technologies are used to store, access, and share data in the cloud. They can be managed by the service provider or developers directly, and often each one is configured differently. This has created multiple architectures that rapidly change and bring new risks. Today, developers can spin up or copy an entire datastore in seconds.
Data protection teams have fallen behind: Today, data protection teams can’t stop developers from making changes. They can try to set guardrails to allow fewer mistakes. They are relegated to a ‘catch up’ mode. Continually kept in the dark, they can no longer assume they know where all the data is. So they spend more time asking questions and hoping that policies are being followed.
Data democratization: As more value is placed on the concept of making data available to all that need it, the risks increase. And manual efforts to categorize and secure all the data stores are ineffective.
No on-premises perimeter: Cloud data is a shared data model. It’s meant to be accessible from anywhere, given the right credentials. There is no longer a single choke point of protection and monitoring.
What are examples of shadow data?
Think about where all of your data might live. And then think about where copies of this data may exist. In a typical example, you likely have the following:
Test environment: Most organizations have a partial copy of their production or RDS database in a development or test environment, where developers are building applications and testing programs. Many times developers are moving quickly and may take a snapshot of the data but fail to properly remove or secure the copied data. Or they simply forget about it.

S3 backups: You’ll also have at least one backup data store, as a means to be prepared for any breaches or damage to your production environment. It’s your contingency plan and it stores exact copies of your production data. But these are often an afterthought and less monitored and, therefore, can mistakenly expose large amounts of data to the public.

Leftover data from cloud migration: As organizations move to the cloud, many run a “lift and shift” data migration project, in which the original database is moved into a modern cloud data store. But more often than not, the original data never gets deleted, so that lingering instance remains unmanaged, unmaintained, and often forgotten.

Toxic data logs: Developers and log frameworks log sensitive data, which creates sensitive files that are not classified as sensitive, lack the proper access control and encryption, and can be easily exposed.

Analytics pipeline: Of course, your data is only useful if you can consistently reference and analyze it, so many companies will store data in some type of analytics pipeline using the likes of Snowflake or others.

All of these are unique data stores in and of themselves and any of them can be a dangling S3 backup, an unlisted embedded data store, or simply a stale data store. The problem is, they all contain sensitive data—customer information, employee information, financial data, applications, intellectual property, etc. And, most likely, they’re not visible to your data protection teams. They’ve become invisible, unmanaged, and unsecured.
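The toxic data logs item above can be made concrete with a small check. The following is a minimal sketch, assuming a few illustrative regular expressions; the pattern names and the helper functions `find_sensitive` and `redact` are hypothetical, and a real classifier would need far broader coverage and fewer false positives.

```python
import re

# Illustrative patterns only (assumptions, not a complete classifier).
SENSITIVE_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # 13 to 16 digits, optionally separated by single spaces or hyphens.
    "card_number": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
    "password_field": re.compile(r"(?i)\bpassword\s*[=:]\s*\S+"),
}


def find_sensitive(line: str) -> list[str]:
    """Return the names of the patterns that match a log line."""
    return [name for name, pattern in SENSITIVE_PATTERNS.items() if pattern.search(line)]


def redact(line: str) -> str:
    """Replace every match with a placeholder naming the pattern."""
    for name, pattern in SENSITIVE_PATTERNS.items():
        line = pattern.sub(f"[REDACTED:{name}]", line)
    return line


if __name__ == "__main__":
    sample = "user=alice@example.com password=hunter2 card=4111 1111 1111 1111"
    print(find_sensitive(sample))
    print(redact(sample))
```

Running a check like this over existing log files is one way to find logs that should be classified as sensitive; applying the redaction before a line is written keeps the sensitive values out of the log in the first place.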
Data breach examples caused by shadow data
Shadow data can be your biggest vulnerability. In many cases, this data is no longer used; it has been forgotten, or is not even visible or accessible to corporate IT teams. On the whole, the people in your organization who should know about these data stores don’t, leaving them open prey for cybercriminals.
In fact, most data breaches occur in shadow data environments.
Take for example the SEGA Europe data breach, where the massive gaming company inadvertently left users’ personal information publicly accessible on an Amazon Web Services S3 bucket.
The mishap could have allowed hackers and cybercriminals to dig into many of SEGA Europe’s cloud services. The exposed files included API keys to the company’s instances of MailChimp and Steam, which provided full access to these services for anyone who found them.
Fortunately for SEGA, the joint efforts of SEGA’s internal security team and a team of external security researchers resulted in early discovery and containment of sensitive data.
How did this happen? Shadow data. Someone inadvertently stored sensitive files in a publicly accessible AWS S3 bucket and didn’t realize the extent of the vulnerability. It is quite easy to misconfigure an S3 bucket, and that little mistake could have caused the company irreparable damage.
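One common form of this misconfiguration is a bucket ACL that grants access to everyone. The sketch below assumes an ACL dictionary shaped like an S3 GetBucketAcl response and looks for grants to the two public group URIs that AWS defines; the function name `public_grants` is hypothetical, and fetching the ACL from AWS is left out so the example stays self-contained.

```python
# Group URIs that S3 ACLs use for "everyone" and "any authenticated AWS account".
PUBLIC_GRANTEE_URIS = {
    "http://acs.amazonaws.com/groups/global/AllUsers",
    "http://acs.amazonaws.com/groups/global/AuthenticatedUsers",
}


def public_grants(acl: dict) -> list[str]:
    """Return the permissions an ACL grants to public groups.

    `acl` is assumed to have the shape of an S3 GetBucketAcl response:
    {"Grants": [{"Grantee": {"Type": ..., "URI": ...}, "Permission": ...}]}
    """
    return [
        grant["Permission"]
        for grant in acl.get("Grants", [])
        if grant.get("Grantee", {}).get("URI") in PUBLIC_GRANTEE_URIS
    ]


if __name__ == "__main__":
    example_acl = {
        "Grants": [
            {"Grantee": {"Type": "CanonicalUser", "ID": "owner-id"}, "Permission": "FULL_CONTROL"},
            {
                "Grantee": {
                    "Type": "Group",
                    "URI": "http://acs.amazonaws.com/groups/global/AllUsers",
                },
                "Permission": "READ",
            },
        ]
    }
    print(public_grants(example_acl))
```

ACLs are only one way a bucket can become public. Bucket policies and the account-level Block Public Access settings also matter, so an empty result here does not by itself prove a bucket is private.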
X/Twitter experienced something quite similar when a ‘glitch’ caused users’ personal information and passwords to be stored in a readable text format on its internal system, rather than disguised by the process known as “hashing.”
The mishap caused embarrassment and scrutiny for X. The major social platform had to publicly urge its more than 330 million users to change their passwords.
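For readers unfamiliar with hashing, here is a minimal sketch of the idea using Python's standard library. It is not a description of X's actual system; the helper names and the iteration count are assumptions chosen for illustration. The point is that only a salt and a derived digest are stored, never the readable password.

```python
import hashlib
import hmac
import os
from typing import Optional

ITERATIONS = 200_000  # illustrative work factor (an assumption); tune for your hardware


def hash_password(password: str, salt: Optional[bytes] = None) -> tuple[bytes, bytes]:
    """Derive a salted PBKDF2-HMAC-SHA256 digest and return (salt, digest)."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    """Re-derive the digest with the stored salt and compare in constant time."""
    _, candidate = hash_password(password, salt)
    return hmac.compare_digest(candidate, digest)


if __name__ == "__main__":
    salt, digest = hash_password("correct horse")
    print(verify_password("correct horse", salt, digest))
    print(verify_password("wrong guess", salt, digest))
```

Hashing protects the primary store, but a password written to a log before it is hashed still lands in readable form in a secondary location, which is exactly the kind of copy that becomes shadow data.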
For many organizations, a simple breach of one of their shadow data environments could be crippling.
How to discover, monitor and minimize the risks of shadow data
Unmanaged data stores inevitably occur. Shadow data occurs. It’s unintentional, and it’s a normal byproduct of an organization moving at the pace of the cloud. But there are ways to ensure you’re protected and have the proper visibility into every place your data may live.
Continuous Monitoring
Catalog Data: Relationships, Flows & Dependencies
Data Hygiene
Proactive Data Protection
Cloud-native monitoring solutions, built in the cloud and for the cloud, now exist to combat shadow data and allow data protection teams to move at the speed of the cloud. These cloud data security solutions must Discover and Classify continuously for complete visibility; Secure and Control to improve risk posture and Detect Leaks; and Remediate without interrupting data flow.
As you evaluate solutions to protect your sensitive cloud data, ensure you have a platform that can scan your entire cloud account and automatically detect all data stores and assets, not just the known ones. Ensure that once data is scanned, the solution can categorize and classify the data, maintaining a cloud datastore framework that allows you to prioritize and manage all of your assets effectively.
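The core of that evaluation, detecting stores beyond the known ones and prioritizing them, can be sketched as a comparison between what a scan discovers and what the catalog already lists. Everything here is hypothetical: the `DataStore` fields, the store names, and the assumption that an earlier classification step has already set `has_sensitive`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataStore:
    name: str            # hypothetical identifier, e.g. a bucket or database name
    kind: str            # e.g. "s3", "rds", "snowflake"
    has_sensitive: bool  # result of an assumed classification step


def find_shadow_stores(discovered: list[DataStore], managed_names: set[str]) -> list[DataStore]:
    """Return discovered stores missing from the managed catalog, sensitive ones first."""
    shadow = [store for store in discovered if store.name not in managed_names]
    return sorted(shadow, key=lambda s: (not s.has_sensitive, s.name))


if __name__ == "__main__":
    discovered = [
        DataStore("prod-db", "rds", True),
        DataStore("test-copy", "rds", True),
        DataStore("old-logs", "s3", False),
        DataStore("tmp-export", "s3", True),
    ]
    for store in find_shadow_stores(discovered, managed_names={"prod-db"}):
        print(store.name, store.kind, store.has_sensitive)
```

Sorting sensitive stores to the front reflects the prioritization described above: an unlisted store holding customer data deserves attention before an unlisted store holding nothing sensitive.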
Having full data observability lets you understand where your shadow data stores are and who owns them. That understanding leads to a more secure environment, faster and smarter decision-making across the enterprise, and the ability to thrive in a fast-moving, cloud-first world.