To start, there a few important things that are not captured in the data, but are good to think about.
First is measurement noise. You can divide this into two parts: repeatability & reproducibility. Repeatability: what if we repeat the test on the same machine, same data. Will we get the same timing results? No, the results are slightly different. This is why the Rust project also measures instruction counts in their benchmarks. This also means we cannot predict better than this timing noise. With the current data, we cannot even estimate how large this noise is.
Reproducibility is how the measurement differs in other ways: different systems (e.g. CPU), different Rust versions, different LLVM versions. These are important because we want our results to generalize to other systems and also be robust in the future. Similarly, it also means we cannot predict better than this noise.
Finally, to get realistic results, we must have a varied dataset. It need not contains the most weird and exceptional Rust code, but should cover different important scenarios. This also helps in fitting better models, since it can rule out models which seem good on simpler data. I do not know enough about the dataset to see if this is the case.
Loading the data, adding a “cons = 1” input, and merging the debug/release data. We ignore the cgu_name, since we cannot use it in the analysis.
data_opt <- read.csv('Opt-Primary.txt') %>%
mutate(mode="opt", cons=1)
data_dbg <- read.csv('Debug-Primary.txt') %>%
mutate(mode="dbg", cons=1)
#Combine data, and split the data into "folds" for less overfitting.
# Every testcase (first part of CGU) might have its own patterns, so we do not want to overfit on that.
data_all <- rbind(data_opt, data_dbg) %>%
mutate(testcase=factor(str_split_i(cgu_name, fixed('.'), 1))) %>%
fold(k=5, id_col="testcase") %>% ungroup() %>%
rename(foldid=.folds) %>%
select(-cgu_name)
knitr::kable(head(data_all))
testcase | sttic | root_ | inlnd | bb___ | ssd__ | stmt_ | term_ | a_msg | rval_ | place | proj_ | proje | const | ty_cn | ty___ | regn_ | args_ | lc_dc | vdi__ | local | argc_ | est__ | t_all__ | t_gen__ | t_opt__ | t_lto__ | mode | cons | foldid |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CGU-bitmaps | 0 | 50 | 18 | 312 | 995 | 2074 | 312 | 1 | 805 | 2710 | 2710 | 279 | 401 | 0 | 1810 | 16 | 693 | 954 | 863 | 4001 | 111 | 2386 | 79.2 | 4.0 | 42.1 | 33.1 | opt | 1 | 2 |
CGU-bitmaps | 0 | 20 | 13 | 286 | 583 | 1658 | 286 | 1 | 596 | 1928 | 1928 | 287 | 244 | 7 | 1297 | 65 | 395 | 688 | 529 | 3022 | 38 | 1944 | 134.1 | 2.0 | 80.3 | 51.8 | opt | 1 | 2 |
CGU-bitmaps | 0 | 115 | 8 | 736 | 1043 | 2269 | 736 | 23 | 1115 | 3867 | 3867 | 500 | 568 | 7 | 2680 | 92 | 630 | 1469 | 928 | 5126 | 173 | 3005 | 70.5 | 13.1 | 57.4 | 0.0 | dbg | 1 | 2 |
CGU-cargo | 0 | 3963 | 1504 | 38747 | 103575 | 239139 | 38747 | 16 | 89866 | 326403 | 326403 | 70120 | 28120 | 0 | 214264 | 7206 | 72391 | 116520 | 92645 | 481037 | 9715 | 277886 | 7784.9 | 266.7 | 3751.1 | 3767.2 | opt | 1 | 5 |
CGU-cargo | 0 | 872 | 971 | 19985 | 41887 | 116407 | 19985 | 373 | 44386 | 152976 | 152976 | 17498 | 19628 | 38 | 88636 | 3004 | 24831 | 51060 | 41100 | 226236 | 4108 | 136392 | 7529.6 | 96.0 | 5039.6 | 2393.9 | opt | 1 | 5 |
CGU-cargo | 2 | 1067 | 2432 | 30381 | 34232 | 95842 | 30381 | 154 | 40208 | 138372 | 138372 | 43011 | 19371 | 53 | 99381 | 4091 | 23200 | 49845 | 34376 | 197581 | 6248 | 126225 | 9167.5 | 208.6 | 5278.5 | 3680.4 | opt | 1 | 5 |
It is always a good idea to make plot of your data.
As you can see, many thing have a reasonably strong relation to the LLVM time t_all__. Later on we will make a selection in this list.
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous x-axis
The goal is to get a simple function that predicts t_all__. Furthermore, the input data is a sum over all functions in the CGU, so the function must be linear in the inputs (multiplications, non-linearities etc. do not make sense). This means we want a function like this:
\[\hat t = \beta_0 + \beta_1 \cdot \text{input}_1 + \beta_2 \cdot \text{input}_2 + \ldots\]
In addition, we want the relative error to be small, so we transform the data and formula to:
\[\frac{\hat t}{t} = \frac{\beta_0}{t} + \beta_1 \cdot \frac{\text{input}_1}{t} + \beta_2 \cdot \frac{\text{input}_2}{t} + \ldots\]
(Normally, I’d prefer a log transform, since 2x higher is just as bad as 2x lower. Here we cannot do this, because we cannot do regression on that value: it would imply that the factors multiply instead of add together.)
#Filter out times below 50ms, see below.
data_transformed <- data_all %>%
filter(t_all__ > 50)
data_transformed <- data_transformed %>%
mutate_at(vars(-mode,-foldid,-testcase), function (x) { x / data_transformed$t_all__})
Here we fit an “perfect” model: allowing negative coefficients, different coefficients for debug/release, overfitting, etc. This is useful, because it will show what is the best we can reach. In this, and all later models, we ignore the CGU’s below 50 ms, because the measurement error on t_all__ has a large influence on the fit.
#Fit a model without constant, we have our own constant
fit_perfect <- lm(t_all__ ~ 0 + (cons + sttic + root_ + inlnd + bb___ + ssd__ + stmt_ +
term_ + a_msg + rval_ + place + proj_ + proje + const + ty_cn + ty___ + regn_ +
args_ + lc_dc + vdi__ + local + argc_):mode, data_transformed)
Now we look at a more realistic model. We prevent overfitting by using the Lasso method (a linear regression variant which prevents overfitting. As a side effect it often sets coefficients to zero). We also supply the minimum value of zero, since we do not want negative coefficients.
We will fit it separately for Debug and Release mode, since the LLVM time values have very different scales.
#Normalize coefficients, since we are interested in which ones have a better predictive power
data_tofit_norm <- data_transformed %>%
mutate_at(vars(-t_all__, -mode, -testcase, -foldid), function(x) { x / sd(x) })
data_tofit_norm_dbg <- data_tofit_norm %>% filter(mode == "dbg")
data_tofit_norm_opt <- data_tofit_norm %>% filter(mode == "opt")
f = t_all__ ~ 0 + cons + sttic + root_ + inlnd + bb___ + ssd__ + stmt_ +
term_ + a_msg + rval_ + place + proj_ + proje + const + ty_cn + ty___ + regn_ +
args_ + lc_dc + vdi__ + local + argc_
fit_dbg <- glmnetUtils::cv.glmnet(f,
data_tofit_norm_dbg,
intercept=FALSE,
standardize=FALSE,
foldid=as.numeric(data_tofit_norm_dbg$foldid),
lower.limits=0)
fit_opt <- glmnetUtils::cv.glmnet(f,
data_tofit_norm_opt,
intercept=FALSE,
standardize=FALSE,
foldid=as.numeric(data_tofit_norm_opt$foldid),
lower.limits=0)
#Debug
print(coef(fit_dbg, s=fit_dbg$lambda.min))
## 23 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) .
## cons 0.04310224
## sttic .
## root_ .
## inlnd .
## bb___ 0.13917455
## ssd__ .
## stmt_ .
## term_ 0.01016885
## a_msg .
## rval_ .
## place .
## proj_ .
## proje 0.03461849
## const .
## ty_cn .
## ty___ 0.07772235
## regn_ .
## args_ .
## lc_dc .
## vdi__ .
## local 0.06164122
## argc_ 0.05352078
#Release
print(coef(fit_opt, s=fit_opt$lambda.min))
## 23 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) .
## cons .
## sttic .
## root_ 0.17195007
## inlnd 0.17520035
## bb___ 0.02759081
## ssd__ .
## stmt_ .
## term_ 0.19361250
## a_msg .
## rval_ .
## place .
## proj_ .
## proje 0.31055497
## const 0.13832327
## ty_cn 0.01849399
## ty___ .
## regn_ 0.29864183
## args_ .
## lc_dc .
## vdi__ .
## local .
## argc_ .
The Lasso method made an automatic selection of inputs to use. We should ignore the scale of the coefficients, and see which ones are relatively high or low, ideally for both debug and release mode.
You can see that the only ones with non-zero coefficients in both modes are “bb___” and “proje”. Let’s try to make a model with only these two.
Update: now with taking grouping of CGU’s into account, we also have “term_”. Let’s skip that for now, even though it might make a better model.
#Now use unnormalized data, since we want to see the real coefficients
data_transformed_dbg <- data_transformed %>% filter(mode == "dbg")
data_transformed_opt <- data_transformed %>% filter(mode == "opt")
f = t_all__ ~ bb___ + proje
fit_dbg <- glmnetUtils::cv.glmnet(f,
data_transformed_dbg,
intercept=FALSE,
foldid=as.numeric(data_transformed_dbg$foldid),
lower.limits=0)
fit_opt <- glmnetUtils::cv.glmnet(f,
data_transformed_opt,
intercept=FALSE,
foldid=as.numeric(data_transformed_opt$foldid),
lower.limits=0)
#Debug
print(coef(fit_dbg, s=fit_dbg$lambda.min))
## 3 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) .
## bb___ 0.05082782
## proje 0.02219612
#Release
print(coef(fit_opt, s=fit_opt$lambda.min))
## 3 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) .
## bb___ 0.12139737
## proje 0.09532587
We see that “bb___” is ~2x the “proje” coefficient in debug mode, and ~1.25x the “proje” coefficient in release mode. Let’s use something in the middle, and call it the new model:
\[\hat t = 7 \cdot \text{bb___} + 4 \cdot \text{proje}\]
We have two new predictions above (perfect model and our new model), and an old method (est__). Let us compare these. To compare, we first scale all values, so they are (on average) equal to the t_all__ value. This is fine because we are aiming to create a relative prediction.
data_withpred <- data_all %>%
mutate(pred_est=est__,
pred_perfect=predict(fit_perfect, .),
pred_new=7 * bb___ + 4 * proje
) %>%
pivot_longer(c(pred_est, pred_perfect, pred_new), names_to="model", values_to="pred")
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `pred_perfect = predict(fit_perfect, .)`.
## Caused by warning in `predict.lm()`:
## ! prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
#Adjust for release and debug modes
adjust_factor <- data_withpred %>%
filter(t_all__ > 50) %>%
mutate(rel=pred / t_all__) %>%
group_by(mode, model) %>%
summarize(adjust=1/mean(rel), .groups="drop")
data_withpred <- data_withpred %>%
inner_join(adjust_factor, by=join_by(mode, model)) %>%
mutate(pred=pred*adjust)
print(ggplot(data_withpred, aes(x=pred, y=t_all__, color=model)) + geom_point() +
scale_x_log10() + scale_y_log10() + geom_abline() + facet_grid(cols=vars(mode)))
A prediction is better if it is vertically closer to the x=y line. Here you see that the predictions generally look quite similar, with the “perfect” and “new” methods a tiny bit closer to the lines. Let us also calculate the mean relative error (for t_all__ > 50 ms):
err <- data_withpred %>%
mutate(rel=pred / t_all__) %>%
filter(t_all__ > 50) %>%
group_by(mode, model) %>%
summarize(relative_error=sd(rel))
## `summarise()` has grouped output by 'mode'. You can override using the
## `.groups` argument.
knitr::kable(err)
mode | model | relative_error |
---|---|---|
dbg | pred_est | 0.3600314 |
dbg | pred_new | 0.2273454 |
dbg | pred_perfect | 0.1885423 |
opt | pred_est | 0.5120801 |
opt | pred_new | 0.3690356 |
opt | pred_perfect | 0.2510631 |
This shows that we cannot do better than ~18%/25% relative error for debug/release modes (standard errors). We also see that the new method does a bit better than the “est__” method.
From the results above we see that there are many reasonable models to predict the LLVM time. We chose one here, but it’s better to use some domain knowledge to improve on this. E.g. why is the “sttic” input used in the old model, even though it barely has any predictive power? The answer to questions such as these can help make a model more robust now and in the future.
Note that getting more data is not the answer, since the problem in the data is multicollinearity (the inputs are very similar), making it difficult to figure out which inputs are important and which not. The multicollinearity also means that it does not really matter: a lot of models fit the data reasonably well. So why not pick a simple one (preferably with some domain knowledge)? And for a simple model you do not need a large dataset.