5 Ways Duplicate and Obsolete Unstructured Content is Killing Your Business

For many organizations it’s been a fifteen-year data governance battle to get users to put unstructured content (documents, records, drawings, images) into a single Enterprise Content Management (ECM) repository.

This battle has been exacerbated by COVID and by remote workers who have created ‘convenience copies’ of documents across personal drives, shared drives, Microsoft Teams, OneDrive, SharePoint and email attachments. Unstructured content for most organizations is growing by terabytes or petabytes (approximately 60% per month), with as much as 70% of that data sitting outside of governance rules and not easily searchable by other users.

Users are not bypassing the ECM just to be contrary. The truth is that keeping all of their unstructured content in the ECM makes them less efficient and less effective in doing their day jobs and driving the outcomes they’re responsible for achieving. Hence the never-ending battle for data governance professionals: achieving the organization’s data governance objectives without significantly compromising user productivity.

Duplicate and obsolete copies of unstructured content have a significant negative impact on your business.

Five Significant Duplicate and Obsolete Unstructured Content Problems

Unstructured data audits conducted by customers in Nuclear Power, Oil & Gas and Financial industries revealed an average of 30% – 50% redundancy of unstructured content, with copies of content distributed across personal drives, shared drives and MS Teams sites.

This creates several challenges:

Problem #1) Wasted Document Search Time: Multiple copies of content make it very difficult for users to know whether they’re accessing the most current version. This can lead to incorrect decisions, time wasted validating data and re-doing work that’s already been completed.

Problem #2) Migration Project Delays: Multiple copies of content make migrations from on-prem to cloud, or from one storage repository to another, much more time-consuming and difficult. If 30% – 50% of unstructured content consists of duplicate or obsolete files, as the audits above suggest, that content could extend a migration project timeline by 2X or 3X while unintentionally perpetuating poor content hygiene in the new content repository.

Problem #3) Inconsistently Applied Document Lifecycle Retention Rules: Many organizations already struggle to apply document lifecycle retention rules uniformly and consistently across terabytes or petabytes of content. Trying to apply retention rules across multiple versions of content in multiple repositories makes a difficult situation exponentially worse.

Problem #4) Difficulty Ensuring PII (Personally Identifiable Information) Security and GDPR and CAA Compliance: Most organizations are very focused on preventing data breaches and on ensuring PII doesn’t escape into the wild. Multiple versions of documents put both data security and regulatory compliance at risk.

Problem #5) Increased Cloud Egress and Storage Costs: Storage is relatively cheap (unless 30% – 50% of your content is redundant), but for those moving to the cloud, egress costs (the charges for downloading content out of the cloud environment) can quickly add up to a sizeable sum when users download convenience copies from cloud repositories to local hard drives or shared drives.

So what’s the solution?

Solution #1) Improve Content Visibility: Use automation to crawl your content and identify how it is distributed across repositories. Confirm whether content is classified correctly and whether sensitive content (or copies of sensitive content) needs to be moved or deleted. Bringing visibility to the distribution of your existing content is the first step to effectively managing it.
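As a rough illustration of what such a crawl might look like at its simplest, the Python sketch below walks a set of folders and tallies file counts and bytes per file extension. It assumes every repository is reachable as a local or mounted file path; the function name summarize_distribution is hypothetical, and a real crawl of Teams, OneDrive or SharePoint would go through those platforms' own interfaces rather than the file system.

```python
import os
from collections import defaultdict
from pathlib import Path


def summarize_distribution(roots):
    """Return {root: {extension: (file_count, total_bytes)}} for each root folder.

    Hypothetical helper for illustration: it assumes each repository is
    mounted as an ordinary directory tree.
    """
    summary = {}
    for root in roots:
        per_ext = defaultdict(lambda: [0, 0])
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = Path(dirpath) / name
                try:
                    size = path.stat().st_size
                except OSError:
                    # Skip files that vanish or cannot be read during the crawl.
                    continue
                # Normalize the extension so ".PDF" and ".pdf" count together.
                ext = path.suffix.lower() or "(none)"
                per_ext[ext][0] += 1
                per_ext[ext][1] += size
        summary[str(root)] = {ext: tuple(counts) for ext, counts in per_ext.items()}
    return summary
```

Calling summarize_distribution with a list of folder paths gives a first, coarse picture of where content lives and how much of it there is. Classification and sensitivity checks would need to be layered on top.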

Solution #2) Identify and Eliminate Duplicates to Accelerate Migrations: Use automation to identify duplicate content (using hashing) and delete the duplicate copies before migrating. With less content to move, migrating what remains, including its links, indexing, classification and meta-tags, goes faster.
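To make the hashing idea concrete, here is a minimal Python sketch that groups byte-identical files by their SHA-256 digest. It assumes the content is reachable as local or mounted folders, and the names file_sha256 and find_duplicates are hypothetical. It only reports duplicate groups and deletes nothing, since deciding which copy to keep is a governance decision.

```python
import hashlib
import os
from collections import defaultdict


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are never loaded fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(roots):
    """Return {sha256_hex: [paths]} for every group of byte-identical files."""
    # Pass 1: group by size. Files with a unique size cannot have a duplicate,
    # so they never need to be hashed.
    by_size = defaultdict(list)
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    by_size[os.path.getsize(path)].append(path)
                except OSError:
                    continue  # unreadable or removed mid-scan

    # Pass 2: hash only the files that share a size with at least one other file.
    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        for path in paths:
            try:
                by_hash[file_sha256(path)].append(path)
            except OSError:
                continue

    return {digest: sorted(paths) for digest, paths in by_hash.items() if len(paths) > 1}
```

Note that a content hash catches exact copies only. An obsolete version that differs by even one byte hashes differently, so finding superseded versions takes additional signals such as file names, metadata or content similarity.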

Solution #3) Consistently Apply Defensible Disposition & Retention Rules: Use automation to correctly and consistently apply lifecycle retention rules, reducing risk and enabling better regulatory compliance.

Solution #4) Enable Enterprise Content Search Across Repositories: Provide users with ‘Google-like’ search to find content no matter which repository it resides in. This by itself will dramatically reduce the need for users to create convenience copies, because they will be able to find the most recent version of the appropriate content without going on an Easter egg hunt through different repositories or making local copies for themselves.

Solution #5) Reduce Egress & Storage Costs: Making content visible, eliminating duplicates and making document search across repositories easy can significantly reduce egress and storage costs. Users can find the content they need on their first try without resorting to making convenience copies and storing them in multiple local repositories.

Unstructured content management technology has advanced significantly in the last five years. Continuing to manually audit terabytes and petabytes of unstructured content simply doesn’t scale.
