How to use Linear Models in Einstein Analytics without any SAQL

Ed Mengel
7 min read · Dec 12, 2018

There’s a feature from the Einstein Analytics Spring ’18 release called “Raw fields formulas”. It’s an incredibly powerful feature, though at first glance you might not see all of the advanced things you can do with it. I’m going to show you how to tackle an incredibly advanced use case, linear modeling, without using ANY SAQL!

In our contrived example, we are going to assume that one measure can provide some insight to estimate another measure. For the first measure, we assume the value is relatively stable across the deal lifecycle; that is to say, it’s easy to know up front. For the second measure, we assume it’s not final until the deal is marked “Closed — Won”, hence the need to predict its actual value. In our case, let’s use the Annual Revenue from the parent account and see if it can help predict the size (Amount) of an individual deal. Our hypothesis is “large accounts should generate large deals”, and although there will be exceptions, digging into the “why” may be very informative. In practice, you could use this wherever a field known up front might predict another field. You could even use it for time-based forecasting by using a date field’s epoch value as the independent variable, but you’re probably better off using the new Timeseries SAQL function for any date-based modeling. To learn more about Timeseries, check out Antonio Scaramuzzino’s blog post and Rikke Hovgaard’s blog post.

What are “raw fields formulas”?

Prior to Spring ’18, formulas built using the Compare Table could only reference other columns. They could do simple math on existing columns, or they could do window functions like a sliding window average, running total, or rank over an existing column. This was a great way to compositionally build up complex logic. The limitation was that you couldn’t reference fields from the original dataset pre-aggregation. The two most common use cases here are percentiles and linear regression, both of which need access to the raw, pre-aggregation data to function properly. To learn more about the old-style formulas, see this post.

Pro tip: A single formula can reference columns OR raw fields, but not both within the same formula. A single table (or “step”) can contain formulas of both types.

Creating a Model

First things first, let’s create a model using the compare table. I’ll open up my dataset in Explorer, and switch to the compare table visualization. Next, I’ll change the first measure to the average of my independent variable, and the second to the average of my dependent variable. You don’t really need to do this, but it may help for sanity checks later.

Now, I’m going to use the formula editor to create three more raw fields formulas. In my case, I’m using ‘Account AnnualRevenue’ to predict ‘Amount’. In your case, you may need to adjust the arguments to the two fields you are using from your dataset.

Pro tip: Click the function picker above the formula text area, then press the “r” key on your keyboard to instantly scroll down to the R functions. Once the formula is inserted, you can highlight the left and right arguments and replace their names by selecting from the field picker on the right (once for each argument).

Column C:

regr_slope(Amount, 'Account.AnnualRevenue')

Column D:

regr_intercept(Amount, 'Account.AnnualRevenue')

Column E (optional):

regr_r2(Amount, 'Account.AnnualRevenue')
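Under the hood, these SQL-style `regr_*` aggregates are just ordinary least squares over the raw (x, y) pairs from the dataset. Here’s a rough Python sketch of what each one computes, with made-up numbers where Amount happens to be perfectly linear in AnnualRevenue:

```python
# Sketch of the SQL-style regr_* aggregates: ordinary least squares
# over raw (x, y) rows, where x is the independent variable
# (Account.AnnualRevenue) and y is the dependent variable (Amount).

def regr_slope(pairs):
    # slope = covariance(x, y) / variance(x)
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var = sum((x - mean_x) ** 2 for x, _ in pairs)
    return cov / var

def regr_intercept(pairs):
    # intercept = mean(y) - slope * mean(x)
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    return mean_y - regr_slope(pairs) * mean_x

def regr_r2(pairs):
    # r2 = 1 - (residual sum of squares / total sum of squares)
    slope, intercept = regr_slope(pairs), regr_intercept(pairs)
    n = len(pairs)
    mean_y = sum(y for _, y in pairs) / n
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in pairs)
    ss_tot = sum((y - mean_y) ** 2 for _, y in pairs)
    return 1 - ss_res / ss_tot

# Made-up deals where Amount = 0.001 * AnnualRevenue + 5000 exactly
deals = [(1_000_000, 6000), (2_000_000, 7000), (4_000_000, 9000)]
print(regr_slope(deals))      # ≈ 0.001
print(regr_intercept(deals))  # ≈ 5000
print(regr_r2(deals))         # ≈ 1.0 (a perfect linear fit)
```

The r2 column is optional in the table above, but it tells you how much of the variance in Amount the line actually explains, which is worth a sanity check before trusting the predictions.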

Also, for my training data, I only want to incorporate deals that I have won, so I’m going to add an additional filter for Stage Name is equal to “Closed Won”. This is to prevent incomplete or “censored” data from affecting my model.

After which my table looks like this:

Table results for our model

So I will add this as a step to my dashboard. I’ll name it “model”. The system will give it a dev name of “model_1”. (This name is important later as you will need it to get your bindings right.)

Creating a target step on which to apply your model

Now create a target step. Let’s start by cloning our model step. First, I’ll add a grouping: I’m going to group by Sales Rep, also known as “Owner Name”. Then I’m going to modify the filter on the step so that the model is only applied to open opportunities, since those are the ones I want the most insight on. I’ll modify the Lens Filter on the compare table as in the image below.

Now I’m going to modify the 3 formulas.

I’ll replace the slope with some placeholder text like “123” (you’ll find out why below).

123

The Intercept will be “456”

456

For the fifth column, I will change the name to “Predicted”, and the formula to:

C*A+D

For clarity I’ve also gone and changed the names of the first two columns to “Independent” and “Actual”. This produces the below table.

Table for our model predictions

Configure your chart

Now let’s find a way to visually present the data. First, I’ll hide the independent variable, the slope, and the intercept columns. This is easy to do by dragging the measures into the “Additional Fields” category on the left-hand side. I’m going to use a horizontal bar chart, and set the Axis Mode to Single Axis, so I can compare my Actual and Predicted values. It may look really bad now, but that’s because we haven’t really applied the model yet. Don’t worry, it will get better!

First stab at our chart

I’m going to add this step to my dashboard as a new step named “target”.

Bindings

Now let’s edit the bindings. Remember the values you entered in the formula editor? First hit cmd-e (or ctrl-e on PC), and search for the string “123” (on Chrome on a Mac you can just hit cmd-f).

Instead of “123” insert the slope from our model (column C):

"{{cell(model_1.result, 0, \"C\").asString()}}"

Instead of “456” insert the intercept from our model (column D):

"{{cell(model_1.result, 0, \"D\").asString()}}"

Now we have the model coefficients dynamically inserted into our new step at run time, calculating a predicted value for each row based on the average of the independent variable for that row. If you grouped by Opportunity Name, this would compare each deal to the standard set across all closed deals. In this case we are grouping by rep, so the thought is that across all their open deals we may expect them to be closer to the average. Having one outlier deal may not matter much, but if the average across all of a rep’s open deals is far off the predicted value, it could be a sign of a systemic issue.
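To make the mechanics concrete, here’s a tiny Python sketch of what the bound-in coefficients do per grouped row. The rep names, averages, and coefficient values are all made up for illustration; in the real step, A is the per-rep average of the independent variable:

```python
# Coefficients the bindings pull from the model step (columns C and D);
# these particular values are hypothetical.
slope, intercept = 0.001, 5000.0

# Hypothetical per-rep averages of the independent variable (AnnualRevenue).
rep_avg_revenue = {"Alice": 3_000_000, "Bob": 500_000}

# The "Predicted" formula C*A+D, applied to each grouped row.
predicted = {rep: slope * avg + intercept
             for rep, avg in rep_avg_revenue.items()}
# Alice ≈ 8000, Bob ≈ 5500
```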

Now, let’s add faceting so that it’s even more dynamic…

Add some filtering or faceting

First off, we want our model to be as smart as possible, so I’m going to disable global filters and faceting, so that the model’s coefficients are calculated using the maximal set of data.

Now, let’s make this more interesting. I’ll just add a Single list picker based on the dimension “Forecast Category”, but feel free to add a global filter bar, other pickers, or other visualizations.

Finish

Now let’s look at the final product.

Final Dashboard

I can compare my predicted vs. actuals for every rep and drill down or explore for where the discrepancies are the highest. I can even pick a particular forecast category to see if the accuracy improves or degrades, and in what direction. This is a great way to evaluate the efficacy of your model.

One additional thing you could do is create another hidden column of “Predicted-Actuals”, “100*(Predicted-Actuals)/Actuals”, or “abs(Predicted-Actuals)/Actuals”, and then sort by that value descending to find the biggest outliers and deal with them first, but I will leave that as an exercise for the reader! In theory, you could also take the residuals (Actuals-Predicted) of the first model and plug them into a second model, using gradient boosting to build a multivariate model, but maybe that’s a future blog post.
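As a quick sketch of that reader exercise (with made-up per-rep numbers), ranking rows by relative error surfaces the biggest misses first:

```python
# (rep, actual, predicted) rows; the values are hypothetical.
rows = [
    ("Alice", 9000.0, 8000.0),
    ("Bob",   4000.0, 5500.0),
    ("Carol", 7600.0, 7500.0),
]

# abs(Predicted-Actuals)/Actuals, sorted descending:
# the biggest relative miss comes first.
ranked = sorted(rows, key=lambda r: abs(r[2] - r[1]) / r[1], reverse=True)
print([rep for rep, _, _ in ranked])  # ['Bob', 'Alice', 'Carol']
```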
