Thought Leadership

Predictive Coding


Predictive coding (also called technology-assisted review, or TAR) applies machine learning and natural language processing (NLP) techniques to the very large datasets found in forensics and eDiscovery. It is typically used to replace or supplement manual document review when identifying responsive, privileged, confidential, or other categories of documents. Although the use of predictive coding has grown substantially over the last few years, like any other technology it is not guaranteed to perform well without proper configuration. It must be trained with a deep understanding of the data, the domain, and the technology itself, which can be quite difficult in some cases, so it is rarely a true out-of-the-box technology.

In the next section, you will see how Vista Analytics opens the black box of predictive coding for our customers and how we developed a robust predictive coding solution that leverages both machine learning and cloud computing. Our solution is not only more accurate than many competing solutions but also more cost-effective.

Intrusion Detection


Intrusion detection can be defined as “…the act of detecting actions that attempt to compromise the confidentiality, integrity or availability of a resource.” [1] The intrusion detector's learning task is to build a predictive model (i.e., a classifier) capable of distinguishing between “bad” connections, called intrusions or attacks, and “good” normal connections. The trained model can then be deployed to flag potentially unauthorized or illicit connections.

Intrusion detection is only one part of a holistic approach to cyber security, as the wide range of threats currently plaguing the world’s computer systems makes clear. This analysis is not meant to dismiss other important concerns, such as insider threats or sophisticated non-intrusion attacks; it simply addresses one part of the larger picture.

Credit Card Default Predictive Modeling


Predicting credit card payment default is critical to the business model of a credit card company. An accurate predictive model can help the company identify customers who might default on their payments in the future, so that the company can step in earlier to manage risk and reduce losses. It is even better if the model can also assist with credit card application approval, minimizing risk up front. However, credit card default prediction is never an easy task. It is dynamic: a customer who has paid on time for the last few months may suddenly default on the next payment. It is also imbalanced, because default payments are rare compared to non-default payments. Most machine learning techniques perform poorly on an imbalanced dataset unless the imbalance is treated properly.
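
The imbalance problem described above can be illustrated with a minimal sketch. The figures here are hypothetical (50 defaults among 1,000 accounts), not drawn from any real portfolio, and the function names are made up for illustration. The sketch shows why raw accuracy is misleading on rare-event data, and computes inverse-frequency class weights, one common heuristic for treating imbalance (not necessarily the treatment used in the full analysis).

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives (defaults) that the model catches."""
    actual = sum(1 for t in y_true if t == positive)
    caught = sum(1 for t, p in zip(y_true, y_pred)
                 if t == positive and p == positive)
    return caught / actual if actual else 0.0


def balanced_class_weights(y_true):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = {}
    for label in y_true:
        counts[label] = counts.get(label, 0) + 1
    n, k = len(y_true), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}


# Hypothetical data: 50 defaults (label 1) among 1,000 accounts.
y_true = [1] * 50 + [0] * 950
# A degenerate "model" that always predicts non-default.
y_pred = [0] * 1000

print("accuracy:", accuracy(y_true, y_pred))
print("recall on defaults:", recall(y_true, y_pred))
print("class weights:", balanced_class_weights(y_true))
```

In this sketch, a model that never predicts a default still scores high accuracy while catching none of the defaults, which is why metrics such as recall, and treatments such as class weighting or resampling, matter for this kind of problem.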

Setting up Datasets - Sampling


A crucial step in setting up datasets is sampling subsets of the population that accurately and faithfully represent the huge document corpora used in training, validation, and testing.

The representativeness and accuracy of a sample hinge on two factors: 1) the number of documents sampled, and 2) the sampling technique used. By the Law of Large Numbers, larger samples yield more accurate estimates of population parameters, such as the prevalence of responsive documents or the error rate of the classifier. For example, suppose a sample reveals that 5 percent of the documents are pertinent. Without knowing whether the sample size is adequate, one cannot confidently extrapolate that 5 percent prevalence to the entire population. Likewise, one cannot extrapolate the classification error rate measured on the validation set to the test dataset without knowing whether the sample size is sufficient. In practice, we balance the cost of reviewing the sampled documents against the degree of uncertainty we are willing to accept.
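
To make the cost-versus-uncertainty tradeoff concrete, here is a minimal sketch using the standard normal-approximation confidence interval for a proportion. It assumes simple random sampling and a 95 percent confidence level. The function names, the sample sizes, and the target margin of one percentage point are illustrative assumptions, not settings taken from our solution.

```python
import math

Z_95 = 1.96  # z-score for a two-sided 95% confidence level


def margin_of_error(p_hat, n, z=Z_95):
    """Half-width of the normal-approximation interval for a proportion."""
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)


def required_sample_size(p_guess, margin, z=Z_95):
    """Smallest n whose approximate margin of error is at most `margin`."""
    return math.ceil(z * z * p_guess * (1.0 - p_guess) / (margin * margin))


# Uncertainty around an observed 5% prevalence at several sample sizes.
for n in (100, 400, 1600, 6400):
    moe = margin_of_error(0.05, n)
    print(f"n={n:5d}  prevalence = 5.0% +/- {100 * moe:.2f} points")

# Documents to review for a margin of one percentage point around 5%.
print("required n:", required_sample_size(0.05, 0.01))
```

Because the margin of error shrinks with the square root of the sample size, quadrupling the number of reviewed documents only halves the uncertainty. The normal approximation becomes unreliable for very small samples or very low prevalence, where an exact or Wilson interval is a safer choice.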