To make sense of predictive analytics, it helps to speak the lingo. Below are some basic terms.
Algorithm: Refers to the procedure used to solve a specific problem. These days, the procedure is often encoded into a computer program. In predictive analytics, it tends to refer to a particular approach to attaining a workable prediction.
Confidence: Indicates the probability that something will happen if other specific factors come into play. For example, if a person owns a large car and has a long commute to work, there is a higher probability that he or she will purchase 500 or more gallons of gasoline in a year. This is known as a conditional probability, which is the probability that something will happen when some other specific thing happens. Confidence can also refer to a “confidence interval,” which is the statistical degree of error in an estimate that results from selecting one sample as opposed to a different sample.
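The conditional probability in the example above can be computed directly from counts. The following sketch uses a small invented dataset (the car, commute, and purchase values are illustrative assumptions, not real data):

```python
# Conditional probability sketch: P(heavy gas buyer | large car AND long commute).
# Each tuple is one customer: (large_car, long_commute, heavy_gas_buyer).
# The counts are invented purely for illustration.
customers = [
    (True,  True,  True),
    (True,  True,  True),
    (True,  True,  False),
    (True,  False, False),
    (False, True,  False),
    (False, False, False),
]

# Overall probability of being a heavy gas buyer
p_heavy = sum(c[2] for c in customers) / len(customers)

# Conditional probability, restricted to customers with both factors
matching = [c for c in customers if c[0] and c[1]]
p_heavy_given_both = sum(c[2] for c in matching) / len(matching)

print(round(p_heavy, 2))             # 0.33
print(round(p_heavy_given_both, 2))  # 0.67
```

Note how the probability roughly doubles once we condition on the two factors, which is exactly the pattern the definition describes.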
Customer lifetime value (CLTV) measure: A metric of how much a customer is likely to profit a company over a period of time.
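A very simplified CLTV calculation multiplies average margin per purchase, purchase frequency, and expected customer lifetime. Real CLTV models typically also discount future revenue and model churn; the function name and the input numbers below are illustrative assumptions:

```python
# Simplified CLTV sketch: margin per purchase x purchases per year
# x expected lifetime in years. Production models usually add a
# discount rate and a churn estimate; this is a bare-bones version.
def customer_lifetime_value(margin_per_purchase, purchases_per_year, expected_years):
    return margin_per_purchase * purchases_per_year * expected_years

clv = customer_lifetime_value(margin_per_purchase=20.0,
                              purchases_per_year=6,
                              expected_years=3)
print(clv)  # 360.0
```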
Data: Useful information on factors such as what customers have bought in the past, what they are buying now, the attributes of the products they’ve bought, customer demographics, etc. Data can come in two general flavors:
- Structured data: such as information on age, gender, income, sales, cost of products, product features, etc.
- Unstructured data: such as text data in comments, social media content, call center notes, etc.
Another lens to look at data is in terms of how it is used in the predictive modeling process. From this perspective, data can be used for either training or testing:
- Test data: Data that is used only at the end of the model-development process, when you are testing the model.
- Training data: The data that is used to fit a predictive model.
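The training/test division is usually made by randomly partitioning the available observations. A minimal sketch using only the standard library (the 80/20 ratio is a common convention, not a rule):

```python
import random

# Split observations into training and test sets (illustrative 80/20 split).
data = list(range(100))          # stand-in for 100 observations
random.seed(0)                   # fixed seed so the shuffle is reproducible
random.shuffle(data)

split = int(0.8 * len(data))
training_data = data[:split]     # used to fit the model
test_data = data[split:]         # held back until final evaluation

print(len(training_data), len(test_data))  # 80 20
```

The key discipline is that the model never sees `test_data` during fitting, so its performance on that set estimates how it will behave on genuinely new data.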
Decision Trees: These are a popular method for developing and visualizing predictive models and algorithms. Like flow charts, decision trees start at the top and split off into “branches” from there. You follow different branches depending on which questions you’re trying to answer. For example, the top box in a decision tree (aka, the root) may contain a question such as “Who buys our product?” and then split into two groups, such as “Those who buy one time” and “Those who buy multiple times.” These boxes then split into further categories until you have a detailed picture of your various types of customers. To classify any given customer, you follow the appropriate branches until you arrive at the bottom of the tree.
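A decision tree can be written out as nested if/else logic. The sketch below hand-codes a toy tree along the lines of the description above; the questions, field names, and thresholds are invented for illustration (in practice, tree-learning algorithms choose the splits from data):

```python
# A hand-coded toy decision tree: the root splits on purchase count,
# then each branch asks a follow-up question. All splits are invented.
def classify_customer(purchases, owns_large_car):
    if purchases <= 1:                      # root split: one-time vs repeat
        if owns_large_car:
            return "one-time buyer, upsell candidate"
        return "one-time buyer"
    else:
        if purchases >= 10:
            return "loyal repeat buyer"
        return "occasional repeat buyer"

print(classify_customer(purchases=1, owns_large_car=False))   # one-time buyer
print(classify_customer(purchases=12, owns_large_car=True))   # loyal repeat buyer
```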
Holdout sample: A sample of observations that is set aside and not used when an analyst builds a predictive model. The idea is that the analyst can then check whether the model accurately predicts the data in the holdout sample.
Learning: This tends to come in two varieties:
- Supervised learning is a technique, or group of techniques, that provides an algorithm with data that is explicitly labeled with the kind of output that is desired. For example, if an analyst is “teaching” an algorithm to recognize human faces, then the analyst provides images of faces that are labeled as such. The algorithm then learns how to identify faces in a set of images in which faces are not labeled.
- Unsupervised learning means the algorithm finds patterns in unlabeled data, with no defined response measure. For example, it might cluster images into various groups. In this way, it “learns” that images of faces are different from images without faces.
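The two varieties can be contrasted in a few lines of code. In the sketch below, the supervised part copies the label of the nearest labeled example (a 1-nearest-neighbor classifier), while the unsupervised part groups unlabeled points by similarity alone. The data points, the "face"/"no face" labels, and the clustering boundary are all invented for illustration:

```python
# Supervised: labeled examples train a classifier (here, 1-nearest neighbor
# on an invented one-dimensional "image feature").
labeled = [(1.0, "no face"), (1.5, "no face"), (8.0, "face"), (9.0, "face")]

def predict(x):
    # Copy the label of the closest labeled example.
    return min(labeled, key=lambda pair: abs(pair[0] - x))[1]

# Unsupervised: no labels at all; points are simply grouped by similarity
# (here, an illustrative threshold split into two clusters).
unlabeled = [1.2, 1.7, 8.5, 9.3]

def cluster(points, boundary=5.0):
    low = [p for p in points if p < boundary]
    high = [p for p in points if p >= boundary]
    return low, high

print(predict(8.7))        # face
print(cluster(unlabeled))  # ([1.2, 1.7], [8.5, 9.3])
```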
Lift: A metric of the effectiveness of a predictive model. It is the ratio between the results the analyst gets with and without the use of the predictive model. For example, if only 10% of catalog recipients normally make a purchase, but 30% of the recipients on a mailing list selected by a predictive model make a purchase, then the lift is 30% divided by 10%, or 3. A lift chart plots this ratio across portions of the targeted population.
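The calculation itself is a single division. Using the numbers from the example above:

```python
# Lift = response rate with the model / baseline response rate.
# Rates follow the example above: 10% baseline, 30% with the model.
baseline_rate = 0.10   # purchases per recipient without the model
model_rate = 0.30      # purchases per recipient among model-selected names

lift = model_rate / baseline_rate
print(round(lift, 1))  # 3.0 -- the model-selected list buys at 3x the baseline rate
```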
“Next best offer” or product recommendation capability: A prediction of the product your customer is most likely to buy next.
Observation: The unit of analysis on which the measurements are taken. That is, it is the raw data points that are used to predict a specific outcome. It can consist of a customer characteristic, a transaction, a product feature, etc.
Pattern: Specifically, this can be a set of measurements on someone or something, such as the age, weight and height of a person. More generally, a pattern is something that can be found in the data that helps the predictive analyst make a prediction.
Predictive model: This consists of a number of predictor variables (see variables) that are likely to influence future behavior or results. For example, a person’s income level might help predict how much gasoline they will use in a year’s time.
Predictor: This is a variable that helps predict certain outcomes. See variables.
Regression analysis: This is the statistical tool of choice for predictive analytics. The idea is that the analyst has a number of independent variables (see variables) such as gender, age, income, social media habits, and education, and then sees how well some combination of these factors predicts the purchase of a product.
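The simplest case, one predictor fit by ordinary least squares, can be done by hand. The income and gasoline figures below are invented for illustration, echoing the gasoline example used elsewhere in this list:

```python
# Simple linear regression fit by ordinary least squares (one predictor).
# The income/gasoline numbers are invented for illustration.
incomes = [30, 45, 60, 75, 90]        # income, in $1000s
gallons = [320, 380, 450, 500, 560]   # gallons bought per year

n = len(incomes)
mean_x = sum(incomes) / n
mean_y = sum(gallons) / n

# Slope = covariance of (x, y) divided by variance of x
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(incomes, gallons)) \
        / sum((x - mean_x) ** 2 for x in incomes)
intercept = mean_y - slope * mean_x

# Predict annual gasoline purchases for a new customer earning $70k
prediction = intercept + slope * 70
print(round(slope, 2), round(prediction, 1))  # 4.0 482.0
```

Real analyses would use more predictors and a library such as statsmodels or scikit-learn, but the underlying idea is the same: find the combination of predictor weights that best fits the observed outcomes.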
Score: The term “scoring new data” means applying a fitted model to new data in order to generate predictions.
Variables: These are factors that, as the name implies, can vary. These are the factors that the prediction analyst measures. Variables are the key players in most hypotheses. For example, a research analyst may go into a study hypothesizing that the size of a consumer’s vehicle determines how much gas he or she will buy in a year. There are two kinds of variables:
- Independent variables: also known as predictor or experimental variables, these variables are thought to cause changes in some other variable. In the example above, the size of the consumer’s vehicle is an independent variable that helps predict how much gas he or she will buy in a year.
- Dependent variables: also known as outcome variables, these variables depend on the predictor variables. They represent the “outcomes” the analyst wants to explore. In the example above, the amount of gas that a person buys in a year is the dependent variable.
When doing basic research, the experimenter carefully controls the independent and dependent variables because he or she wants to figure out what factor or factors cause certain things to happen. In predictive analytics, however, causality may be less important than the overall pattern of relationships. That is, the analyst may not care exactly why a customer is buying more gasoline as long as the prediction model can anticipate whether or not he or she does.
The list above is far from all-inclusive, but it does serve as a kind of primer for those who are still relatively new to the subject. If there are other terms you’d like to see added to this list, please contact us to make your recommendations.