A note on measurement noise and generalizability

To start, there are a few important things that are not captured in the data, but that are good to think about.

First is measurement noise. You can divide this into two parts: repeatability and reproducibility. Repeatability: if we repeat the test on the same machine with the same data, will we get the same timing results? No, the results will be slightly different. This is why the Rust project also measures instruction counts in its benchmarks. It also means we cannot predict better than this timing noise, and with the current data we cannot even estimate how large the noise is.
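
For example, if repeated timings of the same CGUs were available (the reps data frame below is hypothetical; the current dataset has a single run per CGU), the noise could be estimated directly:

#Hypothetical: `reps` has one row per (cgu_name, run) with a t_all__ timing.
#The spread across repeated runs of the same CGU estimates the repeatability noise.
noise_sd   <- tapply(reps$t_all__, reps$cgu_name, sd)
noise_mean <- tapply(reps$t_all__, reps$cgu_name, mean)
summary(noise_sd / noise_mean)  #distribution of relative noise per CGU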

Reproducibility is how the measurement differs in other ways: different systems (e.g. a different CPU), different Rust versions, different LLVM versions. These matter because we want our results to generalize to other systems and to remain valid in the future. Again, it means we cannot predict better than this noise.

Finally, to get realistic results, we need a varied dataset. It need not contain the weirdest and most exceptional Rust code, but it should cover the important scenarios. This also helps in fitting better models, since varied data can rule out models that only look good on simple data. I do not know enough about the dataset to judge whether this is the case.

Loading data

Loading the data, adding a “cons = 1” input, and merging the debug/release data. We extract the test case from cgu_name and then drop the full name, since we cannot use it in the analysis.

library(tidyverse)   #dplyr, stringr, tidyr, ggplot2
library(groupdata2)  #for fold()

data_opt <- read.csv('Opt-Primary.txt') %>%
  mutate(mode="opt", cons=1)
data_dbg <- read.csv('Debug-Primary.txt') %>%
  mutate(mode="dbg", cons=1)

#Combine the data, and split it into "folds" to reduce overfitting.
#   Every test case (the first part of the CGU name) might have its own patterns, so we do not want to overfit on those.
data_all <- rbind(data_opt, data_dbg) %>%
  mutate(testcase=factor(str_split_i(cgu_name, fixed('.'), 1))) %>%
  fold(k=5, id_col="testcase") %>% ungroup() %>%
  rename(foldid=.folds) %>%
  select(-cgu_name)
knitr::kable(head(data_all))
|testcase    | sttic| root_| inlnd| bb___|  ssd__|  stmt_| term_| a_msg| rval_|  place|  proj_| proje| const| ty_cn|  ty___| regn_| args_|  lc_dc| vdi__|  local| argc_|  est__| t_all__| t_gen__| t_opt__| t_lto__|mode | cons| foldid|
|:-----------|-----:|-----:|-----:|-----:|------:|------:|-----:|-----:|-----:|------:|------:|-----:|-----:|-----:|------:|-----:|-----:|------:|-----:|------:|-----:|------:|-------:|-------:|-------:|-------:|:----|----:|------:|
|CGU-bitmaps |     0|    50|    18|   312|    995|   2074|   312|     1|   805|   2710|   2710|   279|   401|     0|   1810|    16|   693|    954|   863|   4001|   111|   2386|    79.2|     4.0|    42.1|    33.1|opt  |    1|      2|
|CGU-bitmaps |     0|    20|    13|   286|    583|   1658|   286|     1|   596|   1928|   1928|   287|   244|     7|   1297|    65|   395|    688|   529|   3022|    38|   1944|   134.1|     2.0|    80.3|    51.8|opt  |    1|      2|
|CGU-bitmaps |     0|   115|     8|   736|   1043|   2269|   736|    23|  1115|   3867|   3867|   500|   568|     7|   2680|    92|   630|   1469|   928|   5126|   173|   3005|    70.5|    13.1|    57.4|     0.0|dbg  |    1|      2|
|CGU-cargo   |     0|  3963|  1504| 38747| 103575| 239139| 38747|    16| 89866| 326403| 326403| 70120| 28120|     0| 214264|  7206| 72391| 116520| 92645| 481037|  9715| 277886|  7784.9|   266.7|  3751.1|  3767.2|opt  |    1|      5|
|CGU-cargo   |     0|   872|   971| 19985|  41887| 116407| 19985|   373| 44386| 152976| 152976| 17498| 19628|    38|  88636|  3004| 24831|  51060| 41100| 226236|  4108| 136392|  7529.6|    96.0|  5039.6|  2393.9|opt  |    1|      5|
|CGU-cargo   |     2|  1067|  2432| 30381|  34232|  95842| 30381|   154| 40208| 138372| 138372| 43011| 19371|    53|  99381|  4091| 23200|  49845| 34376| 197581|  6248| 126225|  9167.5|   208.6|  5278.5|  3680.4|opt  |    1|      5|

Plots of the data

It is always a good idea to make plots of your data.

As you can see, many things have a reasonably strong relation to the LLVM time t_all__. Later on we will make a selection from this list.

[Plots omitted: scatter plots of each input (sttic, root_, inlnd, bb___, ssd__, stmt_, term_, a_msg, rval_, place, proj_, proje, const, ty_cn, ty___, regn_, args_, lc_dc, vdi__, local, argc_) against t_all__, on log scales.]

Modeling our data

The goal is to get a simple function that predicts t_all__. Furthermore, each input is a sum over all functions in the CGU, so the prediction must be linear in the inputs (multiplications, non-linearities, etc. do not make sense). This means we want a function of this form:

\[\hat t = \beta_0 + \beta_1 \cdot \text{input}_1 + \beta_2 \cdot \text{input}_2 + \ldots\]

In addition, we want the relative error to be small, so we transform the data and formula to:

\[\frac{\hat t}{t} = \frac{\beta_0}{t} + \beta_1 \cdot \frac{\text{input}_1}{t} + \beta_2 \cdot \frac{\text{input}_2}{t} + \ldots\]

(Normally, I’d prefer a log transform, since 2x too high is just as bad as 2x too low. Here we cannot use it, because we cannot do a regression on that value: it would imply that the factors multiply instead of add together.)
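
To make that concrete: a linear fit on the log scale would be equivalent to a multiplicative model on the original scale,

\[\log \hat t = \beta_0 + \beta_1 \cdot \text{input}_1 + \ldots \quad \Longleftrightarrow \quad \hat t = e^{\beta_0} \cdot e^{\beta_1 \cdot \text{input}_1} \cdot \ldots\]

which is incompatible with the inputs adding up across the functions of a CGU.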

#Filter out times below 50ms, see below.
data_transformed <- data_all %>%
  filter(t_all__ > 50)
data_transformed <- data_transformed %>%
  mutate_at(vars(-mode,-foldid,-testcase), function (x) { x / data_transformed$t_all__})

“Perfect” model

Here we fit a “perfect” model: allowing negative coefficients, different coefficients for debug/release, overfitting, etc. This is useful because it shows the best we can hope to reach. In this and all later models, we ignore the CGUs below 50 ms, because the measurement error on t_all__ has a large influence on the fit there.

#Fit a model without constant, we have our own constant
fit_perfect <- lm(t_all__ ~ 0 + (cons + sttic + root_ + inlnd + bb___ + ssd__ + stmt_ +
  term_ + a_msg + rval_ + place + proj_ + proje + const + ty_cn + ty___ + regn_ +
  args_ + lc_dc + vdi__ + local + argc_):mode, data_transformed)
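
Because the transformed response is \(t / t = 1\), the residuals of this fit are \(1 - \hat t / t\), so its residual standard error is (roughly, and ignoring the debug/release split) the best achievable relative error. For example:

#Residual standard error: approximately the best achievable relative error
sigma(fit_perfect)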

Real model

Now we look at a more realistic model. We prevent overfitting by using the lasso (a regularized linear regression variant that shrinks coefficients and, as a side effect, often sets them to exactly zero). We also set a lower limit of zero on the coefficients, since negative coefficients do not make sense here.

We will fit it separately for Debug and Release mode, since the LLVM time values have very different scales.
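
For reference, with no intercept and the non-negativity constraint, the lasso here solves (roughly, in glmnet’s parameterization) the penalized least-squares problem

\[\min_{\beta \ge 0} \; \frac{1}{2N} \sum_{i=1}^{N} \Big( y_i - \sum_j \beta_j \cdot x_{ij} \Big)^2 + \lambda \sum_j \beta_j\]

where \(\lambda\) is chosen by cross-validation over the folds we created earlier.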

#Normalize coefficients, since we are interested in which ones have a better predictive power
data_tofit_norm <- data_transformed %>%
  mutate_at(vars(-t_all__, -mode, -testcase, -foldid), function(x) { x / sd(x) })
data_tofit_norm_dbg <- data_tofit_norm %>% filter(mode == "dbg")
data_tofit_norm_opt <- data_tofit_norm %>% filter(mode == "opt")

f = t_all__ ~ 0 + cons + sttic + root_ + inlnd + bb___ + ssd__ + stmt_ +
  term_ + a_msg + rval_ + place + proj_ + proje + const + ty_cn + ty___ + regn_ +
  args_ + lc_dc + vdi__ + local + argc_


fit_dbg <- glmnetUtils::cv.glmnet(f,
                                  data_tofit_norm_dbg,
                                  intercept=FALSE,
                                  standardize=FALSE,
                                  foldid=as.numeric(data_tofit_norm_dbg$foldid),
                                  lower.limits=0)
fit_opt <- glmnetUtils::cv.glmnet(f,
                                  data_tofit_norm_opt,
                                  intercept=FALSE,
                                  standardize=FALSE,
                                  foldid=as.numeric(data_tofit_norm_opt$foldid),
                                  lower.limits=0)
#Debug
print(coef(fit_dbg, s=fit_dbg$lambda.min))
## 23 x 1 sparse Matrix of class "dgCMatrix"
##                     s1
## (Intercept) .         
## cons        0.04310224
## sttic       .         
## root_       .         
## inlnd       .         
## bb___       0.13917455
## ssd__       .         
## stmt_       .         
## term_       0.01016885
## a_msg       .         
## rval_       .         
## place       .         
## proj_       .         
## proje       0.03461849
## const       .         
## ty_cn       .         
## ty___       0.07772235
## regn_       .         
## args_       .         
## lc_dc       .         
## vdi__       .         
## local       0.06164122
## argc_       0.05352078
#Release
print(coef(fit_opt, s=fit_opt$lambda.min))
## 23 x 1 sparse Matrix of class "dgCMatrix"
##                     s1
## (Intercept) .         
## cons        .         
## sttic       .         
## root_       0.17195007
## inlnd       0.17520035
## bb___       0.02759081
## ssd__       .         
## stmt_       .         
## term_       0.19361250
## a_msg       .         
## rval_       .         
## place       .         
## proj_       .         
## proje       0.31055497
## const       0.13832327
## ty_cn       0.01849399
## ty___       .         
## regn_       0.29864183
## args_       .         
## lc_dc       .         
## vdi__       .         
## local       .         
## argc_       .

The lasso made an automatic selection of inputs to use. The absolute scale of the coefficients is not meaningful here (the inputs were normalized); instead, we should look at which ones are relatively high or low, ideally in both debug and release mode.

You can see that “bb___” and “proje” have non-zero coefficients in both modes. Let’s try to make a model with only these two.

Update: with the grouping of CGUs into folds taken into account, “term_” now also has a non-zero coefficient in both modes. Let’s skip it for now, even though it might make a better model.
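
A small helper (purely illustrative, not part of the original analysis) to extract this intersection programmatically:

#Inputs with non-zero coefficients at lambda.min, present in both modes
nonzero <- function(fit) {
  cf <- coef(fit, s = fit$lambda.min)
  rownames(cf)[as.vector(cf) != 0]
}
intersect(nonzero(fit_dbg), nonzero(fit_opt))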

#Now use unnormalized data, since we want to see the real coefficients
data_transformed_dbg <- data_transformed %>% filter(mode == "dbg")
data_transformed_opt <- data_transformed %>% filter(mode == "opt")

f = t_all__ ~ bb___ + proje

fit_dbg <- glmnetUtils::cv.glmnet(f,
                                  data_transformed_dbg,
                                  intercept=FALSE,
                                  foldid=as.numeric(data_transformed_dbg$foldid),
                                  lower.limits=0)
fit_opt <- glmnetUtils::cv.glmnet(f,
                                  data_transformed_opt,
                                  intercept=FALSE,
                                  foldid=as.numeric(data_transformed_opt$foldid),
                                  lower.limits=0)
#Debug
print(coef(fit_dbg, s=fit_dbg$lambda.min))
## 3 x 1 sparse Matrix of class "dgCMatrix"
##                     s1
## (Intercept) .         
## bb___       0.05082782
## proje       0.02219612
#Release
print(coef(fit_opt, s=fit_opt$lambda.min))
## 3 x 1 sparse Matrix of class "dgCMatrix"
##                     s1
## (Intercept) .         
## bb___       0.12139737
## proje       0.09532587

We see that the “bb___” coefficient is ~2x the “proje” coefficient in debug mode, and ~1.25x in release mode. Let’s use something in the middle, and call it the new model:

\[\hat t = 7 \cdot \text{bb\_\_\_} + 4 \cdot \text{proje}\]

Comparing results

We have two new predictions above (the perfect model and our new model), and the old method (est__). Let us compare them. To do so, we first scale each prediction so that it is (on average) equal to the t_all__ value. This is fine because we are aiming for a relative prediction.

data_withpred <- data_all %>%
  mutate(pred_est=est__,
         pred_perfect=predict(fit_perfect, .),
         pred_new=7 * bb___ + 4 * proje
  ) %>%
  pivot_longer(c(pred_est, pred_perfect, pred_new), names_to="model", values_to="pred")
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `pred_perfect = predict(fit_perfect, .)`.
## Caused by warning in `predict.lm()`:
## ! prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
#Adjust for release and debug modes
adjust_factor <- data_withpred %>%
  filter(t_all__ > 50) %>%
  mutate(rel=pred / t_all__) %>%
  group_by(mode, model) %>%
  summarize(adjust=1/mean(rel), .groups="drop")

data_withpred <- data_withpred %>%
  inner_join(adjust_factor, by=join_by(mode, model)) %>%
  mutate(pred=pred*adjust)

print(ggplot(data_withpred, aes(x=pred, y=t_all__, color=model)) + geom_point() +
  scale_x_log10() + scale_y_log10() + geom_abline() + facet_grid(cols=vars(mode)))

A prediction is better if it is vertically closer to the x=y line. Here you see that the predictions generally look quite similar, with the “perfect” and “new” methods a tiny bit closer to the line. Let us also calculate the relative error, as the standard deviation of pred / t_all__ (for t_all__ > 50 ms):

err <- data_withpred %>%
  mutate(rel=pred / t_all__) %>%
  filter(t_all__ > 50) %>%
  group_by(mode, model) %>%
  summarize(relative_error=sd(rel))
## `summarise()` has grouped output by 'mode'. You can override using the
## `.groups` argument.
knitr::kable(err)
mode model relative_error
dbg pred_est 0.3600314
dbg pred_new 0.2273454
dbg pred_perfect 0.1885423
opt pred_est 0.5120801
opt pred_new 0.3690356
opt pred_perfect 0.2510631

This shows that we cannot do better than ~19%/25% relative error for debug/release mode (the standard deviation of the prediction/actual ratio). We also see that the new method does a bit better than the “est__” method.

A few final notes

From the results above, we see that there are many reasonable models for predicting the LLVM time. We chose one here, but it is better to use some domain knowledge to improve on it. E.g., why does the old model use the “sttic” input, even though it has barely any predictive power? The answers to such questions can help make a model more robust, now and in the future.

Note that getting more data is not the answer: the problem in the data is multicollinearity (the inputs are very similar to each other), which makes it difficult to figure out which inputs are important and which are not. The multicollinearity also means that it does not really matter: many models fit the data reasonably well. So why not pick a simple one (preferably with some domain knowledge)? And for a simple model you do not need a large dataset.
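
As a quick illustration of that multicollinearity (a sketch, not part of the analysis above), one could inspect the pairwise correlations between a few of the inputs:

#Pairwise correlations between a few inputs; values close to 1 mean the
#inputs carry nearly the same information.
data_all %>%
  filter(mode == "opt") %>%
  select(bb___, term_, stmt_, place, proje) %>%
  cor() %>%
  round(2)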