Predictive Coding In E-Discovery: The Game Of Convenience

Back in 2012, Magistrate Judge Andrew Peck’s decision in Da Silva Moore v. Publicis Groupe & MSL Group, 287 F.R.D. 182 (S.D.N.Y. 2012), officially gave the green light to the use of TAR in e-discovery. The same judge recently issued an opinion in Rio Tinto PLC v. Vale S.A., 14 Civ. 3042, 2015 WL 872294 (S.D.N.Y. March 2, 2015), titled “Da Silva Moore Revisited”, which addressed the sharing of “seed sets” between parties.

Importantly, the opinion reiterates that “courts leave it to the parties to decide how best to respond to discovery requests” and that courts are “not normally in the business of dictating to parties the process that they should use”.

Judge Peck also instructed that requesting parties can use other means to verify the adequacy of TAR training, even without production of seed sets. For instance, he suggested a statistical estimation of recall toward the end of the review to determine potential gaps in the production of documents.

Yet in other cases, such as In re Biomet M2a Magnum Hip Implant Prods. Liab. Litig., No. 3:12-MD-2391, 2013 WL 6405156 (N.D. Ind. Aug. 21, 2013), the court declined to compel identification of the seed set, though it encouraged cooperation between the parties.

So, where are we going with TAR?

According to the Grossman-Cormack Glossary of Technology-Assisted Review (with a foreword by John M. Facciola, U.S. Magistrate Judge), a seed set is “The initial Training Set provided to the learning Algorithm in an Active Learning process. The Documents in the Seed Set may be selected based on Random Sampling or Judgmental Sampling. Some commentators use the term more restrictively to refer only to Documents chosen using Judgmental Sampling. Other commentators use the term generally to mean any Training Set, including the final Training Set in Iterative Training, or the only Training Set in non-Iterative Training”. The important thing to know about seed sets is that they are how the computer learns, so it is critical that a seed set is representative and reflects expert determinations.

With this in mind, in an article of mine from April 2014 titled “E-Discovery Costs vs. Disseminating Justice – What’s Important?”, I concluded that technology must be used strictly as a tool in aid of the due process of law.

As an attorney, I love a good argument corroborated and substantiated by solid precedent. Yet the use of TAR in e-discovery is increasingly becoming a matter of “convenience” negotiated between the parties in trying to resolve issues. Well, we have arbitration laws for that!



e-Discovery | cloud computing
New Jersey, USA | Lahore, PAK | Dubai, UAE
www.claydesk.com
(855) – 833 – 7775
(703) – 646 – 3043


The trade-off between ‘Recall’ and ‘Precision’ in predictive coding (part 2 of 2)

This is the second part of a two-part series of posts on information retrieval using predictive coding analysis, detailing the trade-off between Recall and Precision. For part 1 of 2, click here.

To clarify further:

Precision (P) is the fraction of retrieved documents that are relevant, where Precision = (number of relevant items retrieved/number of retrieved items) = P (relevant | retrieved)

Recall (R) is the fraction of relevant documents that are retrieved, where Recall = (number of relevant items retrieved/number of relevant items) = P (retrieved | relevant)

Recall and Precision tend to trade off against each other. One solid criticism of these two metrics is bias: a record that is relevant to one person may not be relevant to another.

So how do you achieve optimal values for Recall and Precision on a TAR platform?

Let’s consider a simple scenario:

• A database contains 80 records relevant to a particular topic.

• A search was conducted on that topic and 60 records were retrieved.

• Of the 60 records retrieved, 45 were relevant.

Calculate the precision and recall.

Solution:

Using the designations above:

• A = Number of relevant records retrieved,

• B = Number of relevant records not retrieved, and

• C = Number of irrelevant records retrieved.

In this example A = 45, B = 35 (80-45) and C = 15 (60-45).

Recall = (45 / (45 + 35)) * 100% = 45/80 * 100% ≈ 56%

Precision = (45 / (45 + 15)) * 100% = 45/60 * 100% = 75%
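The arithmetic above can be checked with a short Python snippet; the variable names A, B, and C mirror the designations used in this example:

```python
def recall(relevant_retrieved, relevant_not_retrieved):
    """Fraction of all relevant records that were retrieved."""
    return relevant_retrieved / (relevant_retrieved + relevant_not_retrieved)

def precision(relevant_retrieved, irrelevant_retrieved):
    """Fraction of retrieved records that were relevant."""
    return relevant_retrieved / (relevant_retrieved + irrelevant_retrieved)

A = 45  # relevant records retrieved
B = 35  # relevant records not retrieved (80 - 45)
C = 15  # irrelevant records retrieved (60 - 45)

print(f"Recall:    {recall(A, B):.0%}")     # 45/80 -> 56%
print(f"Precision: {precision(A, C):.0%}")  # 45/60 -> 75%
```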

So, essentially, the optimal result of high Recall together with high Precision is difficult to achieve.

According to Introduction to Information Retrieval (Cambridge University Press):

“The advantage of having the two numbers for precision and recall is that one is more important than the other in many circumstances. Typical web surfers would like every result on the first page to be relevant (high precision) but have not the slightest interest in knowing let alone looking at every document that is relevant. In contrast, various professional searchers such as paralegals and intelligence analysts are very concerned with trying to get as high recall as possible, and will tolerate fairly low precision results in order to get it. Individuals searching their hard disks are also often interested in high recall searches. Nevertheless, the two quantities clearly trade off against one another: you can always get a recall of 1 (but very low precision) by retrieving all documents for all queries! Recall is a non-decreasing function of the number of documents retrieved. On the other hand, in a good system, precision usually decreases as the number of documents retrieved is increased.”





The trade-off between ‘Recall’ and ‘Precision’ in predictive coding (part 1 of 2)

This is the first part of a two-part series of posts on information retrieval using predictive coding analysis, detailing the trade-off between Recall and Precision.

Predictive Coding, sometimes referred to as ‘Technology Assisted Review’ (TAR), is basically the integration of technology into the human document review process. The benefit of using TAR is twofold: speeding up the review process and reducing costs. Sophisticated algorithms are used to produce a relevant set of documents. The underlying process in TAR is based on statistical concepts.

In TAR, a sample set of documents (the seed set) is coded by subject matter experts and acts as the primary reference data that teaches the TAR engine to recognize relevant patterns in the larger data set. In simple terms, a ‘data sample’ is created based on a chosen sampling strategy such as random, stratified, or systematic sampling.
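A random seed-set draw of the kind described above can be sketched in a few lines of Python. The document IDs and sample size here are hypothetical, and real TAR platforms handle this internally; this is only an illustration of a simple random sampling strategy:

```python
import random

def draw_seed_set(doc_ids, sample_size, rng_seed=42):
    """Draw a simple random sample of documents for expert coding."""
    rng = random.Random(rng_seed)  # fixed seed for a reproducible draw
    return rng.sample(doc_ids, sample_size)

# Hypothetical collection of 10,000 document IDs
collection = [f"DOC-{i:05d}" for i in range(10_000)]

# Pull 200 documents for subject matter experts to code
seed_set = draw_seed_set(collection, sample_size=200)
print(len(seed_set))  # 200
```

Stratified or systematic sampling would replace the `rng.sample` call with a draw per stratum or a fixed-interval selection, respectively.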

Remember, it is critical that seed sets are prepared by subject matter experts. Based on the seed set, the algorithm in the TAR platform starts assigning predictions to the documents in the database. Through an iterative process, adjustments can be made on the fly to reach the desired objectives. The two important metrics used to measure the efficacy of TAR are:

  1. Recall
  2. Precision

Recall is the fraction of the relevant documents that are successfully retrieved, whereas Precision is the fraction of retrieved documents that are relevant. If the computer, in trying to identify relevant documents, identifies a set of 100,000 documents, and after human review 75,000 of the 100,000 are found to be relevant, the precision of that set is 75%.

In a given population of 200,000 documents, assume 30,000 documents are selected for review as the result of TAR. If 20,000 documents within the 30,000 are ultimately found to be responsive, the selected set has a precision of roughly 66.7% (20,000 / 30,000). And if another 5,000 relevant documents are found in the remaining 170,000 that were not selected for review, the set selected for review has a recall of 80% (20,000 / 25,000).
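Plugging the numbers from this scenario into the same formulas gives a quick sanity check:

```python
retrieved = 30_000            # documents selected for review by TAR
relevant_retrieved = 20_000   # responsive documents found in the selected set
relevant_missed = 5_000       # responsive documents left in the other 170,000

precision = relevant_retrieved / retrieved
recall = relevant_retrieved / (relevant_retrieved + relevant_missed)

print(f"Precision: {precision:.1%}")  # 66.7%
print(f"Recall:    {recall:.1%}")     # 80.0%
```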

Click here to read part 2 of 2.



Syed Raza, CEO ClayDesk
