Thought Leadership

Addressing Concept Drift

Background:

Vista Analytics was selected to present at the 2017 IEEE International Conference on Big Data. Our submission, entitled “Application of Dynamic Logistic Regression with Unscented Kalman Filter in Predictive Coding,” was selected for presentation by a review committee of highly respected data scientists and Ph.D.s (the conference accepts only about 20% of submissions) and is a testament to Vista’s groundbreaking work in litigation technology research.

The paper addresses the issue of Concept Drift during discovery, which from our perspective has two sources. First, the language and terminology used in documents change continually over the course of years. Second, as attorneys review documents throughout discovery, their understanding and knowledge of the documents also evolve. Thus, if an attorney builds one model at the beginning of the process and does not take this Concept Drift into account, the set of responsive documents can be badly inaccurate by the conclusion. Our approach learns new patterns at a faster rate, delivers better accuracy and recall, and requires lower labeling and computational cost, which together make it a potentially superior alternative for updating predictive coding models. With our ensemble approach to machine learning, we are now able to address the issue of Concept Drift in a manner that our competitors cannot.
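The full dynamic logistic regression with Unscented Kalman Filter formulation is presented in the paper itself. As a simplified illustration of the underlying idea of continually updating the model rather than training it once, the sketch below incrementally refreshes a logistic-regression-style classifier on successive batches of attorney-labeled documents using scikit-learn; the documents, labels, and feature setup are hypothetical and are not the paper's method.

# Minimal sketch of incremental model updating under concept drift.
# This is NOT the Unscented Kalman Filter formulation from the paper; it
# simply shows a model being refreshed on each new batch of attorney-labeled
# documents instead of being trained once at the start of discovery.
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
model = SGDClassifier(loss="log_loss")  # logistic regression trained online

def update_model(docs, labels):
    """Fold a newly reviewed batch (texts plus responsive/non-responsive
    labels) into the existing model so it adapts to drifting language."""
    X = vectorizer.transform(docs)
    model.partial_fit(X, labels, classes=np.array([0, 1]))

def score_documents(docs):
    """Return the probability that each unreviewed document is responsive."""
    return model.predict_proba(vectorizer.transform(docs))[:, 1]

# Hypothetical usage: each review batch updates the model before re-ranking.
update_model(["draft merger agreement ...", "holiday party invite ..."], [1, 0])
print(score_documents(["term sheet for the proposed acquisition ..."]))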

Predictive Coding

Background:

Predictive coding (or technology-assisted review) applies machine learning and natural language processing (NLP) techniques to the enormous datasets encountered in forensics and eDiscovery. Predictive coding is typically used to replace or supplement document review processes for identifying responsive, privileged, confidential, or other document categories. Although the use of predictive coding has grown substantially over the last few years, like any other technology it is not guaranteed to perform well without proper configuration and settings. It must be guided and trained with a deep understanding of the data, the domain, and the technology itself, which can make it quite difficult in some cases; it is not normally a straight out-of-the-box technology.
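To make the idea concrete, here is a minimal, hypothetical sketch of the core predictive coding loop: attorneys label a small seed set, a text classifier learns from it, and unreviewed documents are ranked by predicted responsiveness. This is an illustration using scikit-learn with made-up documents and labels, not a description of any production system.

# Minimal predictive coding sketch: learn from a labeled seed set, then
# rank the rest of the corpus by predicted responsiveness.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

seed_docs = [
    "Please review the attached supply contract amendment.",
    "Lunch menu for the cafeteria next week.",
    "Pricing terms discussed with the distributor are confidential.",
    "Reminder: parking garage closed on Friday.",
]
seed_labels = [1, 0, 1, 0]  # 1 = responsive, 0 = non-responsive

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=1),
                    LogisticRegression(max_iter=1000))
clf.fit(seed_docs, seed_labels)

# Rank unreviewed documents so likely-responsive ones are reviewed first.
unreviewed = ["Draft distributor pricing schedule attached."]
print(clf.predict_proba(unreviewed)[:, 1])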

In the next section, you will see how Vista Analytics opens the black box of predictive coding for our customers and has developed a solid predictive coding solution leveraging both machine learning and cloud computing. Our solution is not only more accurate than many competing offerings but also more cost-effective.

Read More

Intrusion Detection

Background:

Intrusion Detection can be defined as “…the act of detecting actions that attempt to compromise the confidentiality, integrity or availability of a resource.”1 The intrusion detector learning task is to build a predictive model (i.e., a classifier) capable of distinguishing between “bad” connections, called intrusions or attacks, and “good” normal connections. The trained model can then be deployed to flag any potentially unauthorized or illicit connections.
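As a simple illustration of that learning task, the sketch below trains a classifier on a handful of labeled connection records and then scores a new connection. The feature names and values are hypothetical placeholders for whatever connection attributes (duration, bytes transferred, failed logins, and so on) are actually logged in a given environment.

# Minimal intrusion detection sketch: learn from labeled connection
# records, then flag suspicious new connections for review.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Each row: [duration_sec, src_bytes, dst_bytes, failed_logins]
X_train = np.array([
    [0.2, 215, 45076, 0],   # normal
    [0.1, 162, 4528, 0],    # normal
    [12.0, 0, 0, 5],        # attack (repeated failed logins)
    [30.0, 1032, 0, 8],     # attack
])
y_train = np.array([0, 0, 1, 1])  # 0 = good connection, 1 = intrusion

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# The deployed model flags potentially unauthorized connections.
new_connection = np.array([[25.0, 0, 0, 7]])
print(model.predict(new_connection))        # e.g. [1] -> flag as suspicious
print(model.predict_proba(new_connection))  # associated confidence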

Intrusion Detection is only one part of a holistic approach to cyber security, as evidenced by the wide range of threats currently plaguing the world’s computer systems. This analysis is not meant to ignore important components such as insider threats or sophisticated non-intrusion attacks; it addresses just one piece of the broader picture.

Credit Card Default Predictive Modeling

Background:

Predicting credit card payment default is critical to the business model of a credit card company. An accurate predictive model helps the company identify customers who might default on their payments in the future, so the company can intervene earlier to manage risk and reduce loss. It is even better if a model can assist with credit card application approval to minimize risk up front. However, credit card default prediction is never an easy task. It is dynamic: a customer who made his or her payments on time over the last few months may suddenly default on the next one. It is also unbalanced, since defaults are rare compared to non-default payments. Most machine learning techniques will perform poorly on an unbalanced dataset if it is not treated properly.
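To illustrate one common way of treating an unbalanced dataset, the sketch below up-weights the rare default class using scikit-learn’s class_weight="balanced" option on simulated data. The feature names and the simulated default rate are hypothetical, and this is not a production credit risk model.

# Minimal sketch of handling class imbalance: weight the rare default
# class more heavily so the model does not simply predict "no default".
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
n = 5000
# Hypothetical features: [credit_utilization, months_delinquent, payment_ratio]
X = rng.normal(size=(n, 3))
# Simulate a rare default outcome (roughly 5% of customers).
y = (X[:, 0] + 2 * X[:, 1] + rng.normal(size=n) > 4.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" up-weights defaults in proportion to their rarity.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_tr, y_tr)

print(classification_report(y_te, model.predict(X_te)))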

Read More

Setting up Datasets - Sampling

Background:

A crucial step in setting up datasets is sampling subsets of the population that accurately and faithfully represent the huge corpora of documents used for training, validation, and testing.

The representativeness and accuracy of the sample hinge on two factors: 1) the number of documents sampled and 2) the sampling technique used. By the Law of Large Numbers, larger sample sizes yield more accurate estimates of population parameters, such as the prevalence of responsive documents or the error rate of the classifier used. For example, suppose a sample reveals that 5 percent of the documents are pertinent. Without knowing whether the sample size is adequate, one cannot extrapolate that 5 percent prevalence to the entire population with confidence. Likewise, one cannot extrapolate the validation classification error rate to the test dataset without knowing whether the sample size is sufficient. In practice, we balance the cost of reviewing the sampled documents against the degree of uncertainty we are willing to accept.
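For a rough sense of how sample size relates to uncertainty, the following back-of-the-envelope sketch applies the standard normal-approximation formula for estimating a prevalence within a desired margin of error. The prevalence guess, margin, and confidence level shown are illustrative, not recommendations for any particular matter.

# Sample size needed to estimate prevalence within +/- margin_of_error,
# using the standard normal approximation n = z^2 * p * (1 - p) / e^2.
from math import ceil
from statistics import NormalDist

def required_sample_size(prevalence_guess, margin_of_error, confidence=0.95):
    """Documents to sample so the prevalence estimate falls within
    +/- margin_of_error at the given confidence level."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = prevalence_guess
    return ceil(z**2 * p * (1 - p) / margin_of_error**2)

# E.g. expecting ~5% responsive documents, wanting a +/- 2% margin at 95%:
print(required_sample_size(0.05, 0.02))   # roughly 457 documents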

Settlement Prediction

Background:

The ability to predict a case outcome in the earliest phases of litigation (prior to the discovery phase) can provide significant information to attorneys deciding whether to pursue a defense or how to gain a decisive advantage over their opponents in the earliest stages of a dispute. Additionally, and perhaps self-evidently, there is the financial calculation a corporation must undertake in budgeting the costs of the litigation against the probable outcome of the case. Vista believes that machine learning can be used successfully in courtroom prediction, with the resulting benefits.

Read More