De-Duplication Explained by Lexbe Inc.


Lexbe Inc. (“Lexbe”) posted an informative discussion of de-duplication on LinkedIn. A link to Lexbe’s post appears at the foot of this blog.

Lexbe’s post explains that there are two types of de-duplication, global and per-custodian: “Here are the differences between them: Global: Duplicates are identified and suppressed across the entire dataset (i.e., all case documents) so that only the first instance of a document is in the review set…. Per-Custodian: Duplicates are identified and suppressed within a smaller subset of the data (i.e., a single custodian collection).”

Lexbe notes that global de-duplication removes more duplicate documents from the review set, which speeds review, while custodial de-duplication is “[i]deal for cases where custodian-specific documents are identified as a critical factor during custodian interviews.”
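The distinction can be illustrated with a minimal Python sketch. It is not Lexbe’s implementation; it assumes that duplicates are detected by hashing raw file content with SHA-256, and the custodians, file names, and function names are hypothetical.

```python
import hashlib
from collections import defaultdict

def content_hash(data: bytes) -> str:
    """Hash the raw bytes of a document (assumed duplicate test)."""
    return hashlib.sha256(data).hexdigest()

def dedupe_global(docs):
    """Keep only the first instance of each document across all custodians."""
    seen = set()
    kept = []
    for custodian, name, data in docs:
        h = content_hash(data)
        if h not in seen:
            seen.add(h)
            kept.append((custodian, name))
    return kept

def dedupe_per_custodian(docs):
    """Keep the first instance of each document within each custodian's set."""
    seen = defaultdict(set)  # custodian -> hashes already seen for that custodian
    kept = []
    for custodian, name, data in docs:
        h = content_hash(data)
        if h not in seen[custodian]:
            seen[custodian].add(h)
            kept.append((custodian, name))
    return kept

# Hypothetical collection: the same memo appears twice for one custodian
# and once for another.
docs = [
    ("alice", "memo.txt", b"Q3 forecast"),
    ("alice", "memo_copy.txt", b"Q3 forecast"),
    ("bob", "fwd_memo.txt", b"Q3 forecast"),
    ("bob", "notes.txt", b"Call notes"),
]

global_kept = dedupe_global(docs)
custodian_kept = dedupe_per_custodian(docs)
print(len(global_kept), "documents after global de-duplication")
print(len(custodian_kept), "documents after per-custodian de-duplication")
```

In this sketch the global pass leaves a smaller review set, but it no longer shows that the second custodian also held the memo. The per-custodian pass preserves that fact at the cost of reviewing the memo twice.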

Lexbe’s post also illustrates the differences between global and custodial de-duplication with the following diagram:

Lexbe Inc. Diagram

Lexbe adds: “The deduplication process should be defined in your ESI protocol. Once the ESI protocol is agreed to by both parties, clearly document your processes and procedures and provide a guideline for how to handle duplicates to all reviewers.”

Lexbe’s post is at

UPDATE: See Craig Ball, Introducing the EDRM E-Mail Duplicate Identification Specification and Message Identification Hash (MIH) | Ball in your Court ( 16, 2023). “Hash deduplication works well, but stumbles when minor variations prompt inconsistent outcomes for messages reviewers regard as being ‘the same.’ Hash deduplication fails altogether when messages are exchanged in forms other than those native to email communications—a common practice in U.S. electronic discovery where efficient electronic forms are often printed to static page images.” Craig also points to a cross-platform de-duplication issue and writes: “The EDRM has crafted a new load file field called the EDRM Message Identification Hash (MIH), described in the EDRM Email Duplicate Identification Specification…. Any party with the MIH of an email message can readily determine if a copy of the message exists in their collection. Armed with MIH values for emails, parties can flag duplicates even when those duplicates take different forms, enabling native message formats to be compared to productions supplied as TIFF or PDF images.”
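
The “minor variations” problem Craig describes can be seen in a short Python sketch. This is an illustration only, not the EDRM MIH computation: the sample message text and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib

def digest(text: str) -> str:
    """Hash the exact bytes of a message (assumed content-hash approach)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Hypothetical message stored by two systems that differ only in line endings.
original = "Subject: Lunch\r\n\r\nSee you at noon.\r\n"
variant = "Subject: Lunch\n\nSee you at noon.\n"

# A reviewer sees the same words in both copies...
same_words = original.split() == variant.split()

# ...but a byte-level hash treats them as different documents.
hashes_match = digest(original) == digest(variant)

print("Same words:", same_words)
print("Hashes match:", hashes_match)
```

Because the hash covers every byte, even an invisible difference produces a different value, so the two copies would not be suppressed as duplicates.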