Written by Ad Signal Co-Founder, Mike Duffy.
De-Dupe is Easy
De-duplication is a term that is bandied around the media and entertainment tech industry a lot. And rightly so. It’s an increasingly important topic which is appearing more and more in RFPs in the media asset management space. Organisations are looking to reduce operational spend and generally be more efficient, so deleting unnecessary versions or copies of content is a priority.
That being the case, it also seems that everyone can solve the problem, with large percentage figures attached to generic claims of big savings.
But de-duplication is not a one-size-fits-all solution. As an analogy we can think in terms of camera lenses: a wide-angle lens is fantastic for the big picture, but will miss the detail that a zoom lens will find, and vice versa. In much the same way, de-duplication tools come in many shapes and sizes, and it’s critical you choose the right one to solve your ever-growing storage problem.
As the title says, de-dupe is easy when you look at it through the wider lens, but to catch the nuances that help you find those copies or versions of video that are ‘almost’ or ‘near’ duplicates, you need a more macro approach.
Broadly, when it comes to file storage there are two approaches: hash-based and perceptual-based. We’re going to touch on both techniques, see how they work, and most importantly, where they are best suited. At their core, both techniques lean on somewhat esoteric maths, but we’ll stay at the high level here; this is de-dupe 101.
Hash-based de-duplication (the easy bit)
Hash-based de-duplication is the most established form and uses a range of mathematical techniques to take a large amount of data, say a video file, and boil it down to a single string of numbers. The crucial element is that the same file will always produce the same string of numbers when fed into this algorithm. To avoid different files producing the same number (known as a collision), hash lengths of 256 bits are used. This length makes it vanishingly unlikely that two different files will produce a clashing number.
From a de-duplication point of view it means that you can run the algorithm over every file in your file store, and where it finds two files with the same string of numbers (the hash) you can be confident that you’ve found a duplicate. This means that hash-based de-duplication is highly effective for cleaning out huge file stores where people have been copying and pasting files with wild abandon. It is often built right into the file storage devices, meaning that when it sees a duplicate it will store one copy but make it appear in multiple locations.
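To make that concrete, here is a minimal sketch of the idea using Python’s standard-library SHA-256. The directory path is purely illustrative, and real storage systems implement this at a much lower level.

```python
# A minimal sketch of hash-based de-duplication using only the standard
# library. "/media/archive" is an illustrative path.
import hashlib
from collections import defaultdict
from pathlib import Path

def file_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_exact_duplicates(root: Path) -> dict[str, list[Path]]:
    """Group every file under `root` by its hash; any group of 2+ is a set of duplicates."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file():
            groups[file_hash(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}

# Print each set of byte-identical files found under the archive.
for digest, paths in find_exact_duplicates(Path("/media/archive")).items():
    print(digest[:12], [str(p) for p in paths])
```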
Despite being incredibly powerful, hash-based de-duplication isn’t a silver bullet. One crucial limitation is that it cannot deal with near-duplicates or segments of duplication within a file; its guarantee is simply that where two mathematically identical files exist, it can detect them.
For instance, if you decide to chop a couple of seconds from a video, or introduce a single frame, and save it alongside the original, hash-based de-duplication would produce a different checksum, because from the point of view of the hash it’s no longer a duplicate. Likewise, if you encode the same, identical piece of content (an advert, a feature-length film etc.) for different delivery platforms, then in the eyes of hash-based calculations it is no longer a duplicate.
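As a tiny illustration (with byte strings standing in for real video data), trimming even a few bytes from a copy gives it a completely unrelated digest:

```python
# Byte strings standing in for a video's data: trimming the copy changes
# essentially every bit of the digest, so a hash-based system sees an
# unrelated file rather than a near-duplicate.
import hashlib

original = b"frame1 frame2 frame3 frame4"
trimmed = b"frame1 frame2 frame3"

print(hashlib.sha256(original).hexdigest())
print(hashlib.sha256(trimmed).hexdigest())
```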
Which brings us neatly onto the next tool in the de-duplication arsenal.
Perceptual de-duplication
Perceptual de-duplication can detect duplication that hash-based systems cannot. Unlike hash-based de-duplication, perceptual de-duplication is able to look at two files which are mathematically different from the point of view of a file hash and identify them as being duplicates based instead on their content. At first glance this can be a little confusing, so let’s look at an example.
You have two different encodings of the same video, perhaps one has been encoded for YouTube, and the other for distribution on a streaming platform (Netflix UHD for example). The content of the video and audio may be identical in terms of bitrate, resolution and indeed the actual footage. Despite being visually identical, a hash-based file de-duplication would see these as two completely different files and not take any action.
In contrast, a perceptual-based de-duplication approach looks at the content of the files, and produces a mathematical hash based on the actual content; it can then use this to identify that, in the example above, both videos are duplicates. And perceptual-based fingerprinting isn’t limited to exact duplicates; it can identify where videos have been duplicated but slightly altered, for instance where the slate, black and freeze have been removed from one and not the other.
Perceptual de-duplication has the concept of similarity, which allows it to find content that is nearly identical but might differ in picture quality or by trims. This similarity approach allows perceptual de-duplication to find far more duplicates than a file-based hash.
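As a rough illustration of the similarity idea, the sketch below implements the simplest possible perceptual fingerprint, an ‘average hash’ over an 8×8 grayscale frame, and compares fingerprints by Hamming distance. The frames and values are invented purely to show the principle; real video fingerprinting is considerably more sophisticated.

```python
# A toy perceptual fingerprint: an "average hash" over an 8x8 grayscale
# frame, compared by Hamming distance. Frames here are invented grids of
# pixel values; real systems fingerprint real decoded video frames.
from itertools import chain

def average_hash(frame: list[list[int]]) -> int:
    """8x8 grid of grayscale values (0-255) -> 64-bit fingerprint."""
    pixels = list(chain.from_iterable(frame))
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p >= mean else 0)
    return bits

def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints (0 = same content)."""
    return bin(a ^ b).count("1")

# Two "frames" of the same shot; the second carries mild compression noise,
# so its bytes differ even though it looks the same.
frame_a = [[(row * 8 + col) * 4 for col in range(8)] for row in range(8)]
frame_b = [[min(255, value + 3) for value in row] for row in frame_a]

print(hamming_distance(average_hash(frame_a), average_hash(frame_b)))
# Prints 0: despite different pixel values, the fingerprints match.
```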
In addition to being able to identify duplicates where hash-based de-duplication cannot, perceptual de-duplication can find duplicate segments across files. As an example, let’s take a famous scene from the classic sitcom Only Fools and Horses. That scene has been reused countless times, both from the original episode and in clip shows, documentaries and other productions. For some organisations, clips like this could be used in literally hundreds, if not thousands, of different places within their media library.
Using perceptual de-duplication it is possible to identify these duplicate clips where they appear in other videos, as well as in the episode itself, which may have been re-encoded many times. This allows organisations to remove them to save space, to detect whether content has been changed (by design or maliciously), and to identify where they are used for rights management.
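To give a flavour of how segment-level matching can work, here is a toy sketch that slides a clip’s sequence of per-frame fingerprints along a programme’s sequence and reports where every frame in the window sits within a small Hamming distance. The fingerprint values are invented small integers, purely for illustration.

```python
# A toy sketch of segment matching: slide the clip's per-frame fingerprints
# along the programme's and report start positions where every frame in the
# window is within `max_bit_diff` bits. Fingerprints are invented integers.
def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def find_segment(programme: list[int], clip: list[int],
                 max_bit_diff: int) -> list[int]:
    """Return start indices in `programme` where `clip` matches frame-for-frame."""
    matches = []
    for start in range(len(programme) - len(clip) + 1):
        window = programme[start:start + len(clip)]
        if all(hamming(p, c) <= max_bit_diff for p, c in zip(window, clip)):
            matches.append(start)
    return matches

# The three-frame clip appears inside the programme starting at frame 3.
programme = [0b1010, 0b1100, 0b0110, 0b1111, 0b1011, 0b0001, 0b0100]
clip = [0b1111, 0b1011, 0b0001]
print(find_segment(programme, clip, max_bit_diff=1))  # -> [3]
```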
And with the increasing popularity of formats such as IMF it is possible to ‘link’ media within a larger piece, allowing a single piece of content to be re-used in many places without duplication.
File-based hash de-duplication does what it says on the tin: it de-duplicates files. It can’t help with discovering the genealogy of content or protecting an organisation from reputational damage.
It’s simply not designed to offer this kind of flexibility. However, unlike hash-based de-duplication, perceptual de-duplication has to be written, tuned and tested for specific file types. This makes perceptual-based de-duplication a specialist part of the wider de-duplication ecosystem, unlike hash-based de-duplication, which, although coarser, is a universal approach. In addition, perceptual de-duplication is computationally more intensive than file-based hashing.
Which approach to use
Now that we have completed our brief tour of the main techniques, the question becomes: which is right for you? The answer is that both are valuable additions to your arsenal. The speed and relatively low cost of hash-based de-duplication give a quick and easy way to identify files that are exact duplicates of each other. This can quickly whittle down large collections of duplicates at low cost and with minimal effort.
When the initial hash-based de-duplication is complete, you can bring in the more sophisticated perceptual-based de-duplication. This will then allow you to see where files have been duplicated, repurposed for different uses, and possibly abandoned. This shadow duplication is hidden and costly, not only in terms of raw storage but also in the organisational drag it brings.
Over to you
Hopefully this has been helpful in cutting through the fog of de-duplication. As you can tell, we’ve got an incredible depth of enthusiasm and knowledge in this area. Our Match™ application allows organisations to tame shadow duplication and runaway versionitis, cut hidden storage costs, improve compliance and simplify rights management.
Effective de-duplication not only helps with the bottom line; getting clarity on what you own, and have the rights to use, also provides the opportunity to create revenue streams.