A note on measurement noise and generalizability

To start, there are a few important things that are not captured in the data, but that are good to think about.

First is measurement noise. You can divide this into two parts: repeatability and reproducibility. Repeatability: if we repeat the test on the same machine with the same data, will we get the same timing results? No, the results will be slightly different. This is why the Rust project also measures instruction counts in its benchmarks. It also means we cannot predict better than this timing noise, and with the current data we cannot even estimate how large the noise is.
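
For example, if repeated timings of the same CGUs were available (the reps data frame below is hypothetical; the current dataset has a single run per CGU), the noise could be estimated directly:

#Hypothetical: `reps` has one row per (cgu_name, run) with a t_all__ timing.
#The spread across repeated runs of the same CGU estimates the repeatability noise.
noise_sd   <- tapply(reps$t_all__, reps$cgu_name, sd)
noise_mean <- tapply(reps$t_all__, reps$cgu_name, mean)
summary(noise_sd / noise_mean)  #distribution of relative noise per CGU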

Reproducibility is how the measurement differs in other ways: different systems (e.g. a different CPU), different Rust versions, different LLVM versions. These matter because we want our results to generalize to other systems and to remain valid in the future. Again, it means we cannot predict better than this noise.

Finally, to get realistic results, we need a varied dataset. It need not contain the weirdest and most exceptional Rust code, but it should cover the important scenarios. This also helps in fitting better models, since varied data can rule out models that only look good on simple data. I do not know enough about the dataset to judge whether this is the case.

Loading data

Loading the data, adding a “cons = 1” input, and merging the debug/release data. We extract the test case from cgu_name and then drop the full name, since we cannot use it in the analysis.

library(tidyverse)   #dplyr, stringr, tidyr, ggplot2
library(groupdata2)  #for fold()

data_opt <- read.csv('Opt-Primary.txt') %>%
  mutate(mode="opt", cons=1)
data_dbg <- read.csv('Debug-Primary.txt') %>%
  mutate(mode="dbg", cons=1)

#Combine the data, and split it into "folds" to reduce overfitting.
#   Every test case (the first part of the CGU name) might have its own patterns, so we do not want to overfit on those.
data_all <- rbind(data_opt, data_dbg) %>%
  mutate(testcase=factor(str_split_i(cgu_name, fixed('.'), 1))) %>%
  fold(k=5, id_col="testcase") %>% ungroup() %>%
  rename(foldid=.folds) %>%
  select(-cgu_name)
knitr::kable(head(data_all))
|testcase    | sttic| root_| inlnd| bb___|  ssd__|  stmt_| term_| a_msg| rval_|  place|  proj_| proje| const| ty_cn|  ty___| regn_| args_|  lc_dc| vdi__|  local| argc_|  est__| t_all__| t_gen__| t_opt__| t_lto__|mode | cons| foldid|
|:-----------|-----:|-----:|-----:|-----:|------:|------:|-----:|-----:|-----:|------:|------:|-----:|-----:|-----:|------:|-----:|-----:|------:|-----:|------:|-----:|------:|-------:|-------:|-------:|-------:|:----|----:|------:|
|CGU-bitmaps |     0|    50|    18|   312|    995|   2074|   312|     1|   805|   2710|   2710|   279|   401|     0|   1810|    16|   693|    954|   863|   4001|   111|   2386|    79.2|     4.0|    42.1|    33.1|opt  |    1|      2|
|CGU-bitmaps |     0|    20|    13|   286|    583|   1658|   286|     1|   596|   1928|   1928|   287|   244|     7|   1297|    65|   395|    688|   529|   3022|    38|   1944|   134.1|     2.0|    80.3|    51.8|opt  |    1|      2|
|CGU-bitmaps |     0|   115|     8|   736|   1043|   2269|   736|    23|  1115|   3867|   3867|   500|   568|     7|   2680|    92|   630|   1469|   928|   5126|   173|   3005|    70.5|    13.1|    57.4|     0.0|dbg  |    1|      2|
|CGU-cargo   |     0|  3963|  1504| 38747| 103575| 239139| 38747|    16| 89866| 326403| 326403| 70120| 28120|     0| 214264|  7206| 72391| 116520| 92645| 481037|  9715| 277886|  7784.9|   266.7|  3751.1|  3767.2|opt  |    1|      5|
|CGU-cargo   |     0|   872|   971| 19985|  41887| 116407| 19985|   373| 44386| 152976| 152976| 17498| 19628|    38|  88636|  3004| 24831|  51060| 41100| 226236|  4108| 136392|  7529.6|    96.0|  5039.6|  2393.9|opt  |    1|      5|
|CGU-cargo   |     2|  1067|  2432| 30381|  34232|  95842| 30381|   154| 40208| 138372| 138372| 43011| 19371|    53|  99381|  4091| 23200|  49845| 34376| 197581|  6248| 126225|  9167.5|   208.6|  5278.5|  3680.4|opt  |    1|      5|

Plots of the data

It is always a good idea to make plots of your data.

As you can see, many things have a reasonably strong relation to the LLVM time t_all__. Later on we will make a selection from this list.

[Plots omitted: scatter plots of each input (sttic, root_, inlnd, bb___, ssd__, stmt_, term_, a_msg, rval_, place, proj_, proje, const, ty_cn, ty___, regn_, args_, lc_dc, vdi__, local, argc_) against t_all__, on log scales.]

Modeling our data

The goal is to get a simple function that predicts t_all__. Furthermore, each input is a sum over all functions in the CGU, so the prediction must be linear in the inputs (multiplications, non-linearities, etc. do not make sense). This means we want a function of this form:

\[\hat t = \beta_0 + \beta_1 \cdot \text{input}_1 + \beta_2 \cdot \text{input}_2 + \ldots\]

In addition, we want the relative error to be small, so we transform the data and formula to:

\[\frac{\hat t}{t} = \frac{\beta_0}{t} + \beta_1 \cdot \frac{\text{input}_1}{t} + \beta_2 \cdot \frac{\text{input}_2}{t} + \ldots\]

(Normally, I’d prefer a log transform, since 2x too high is just as bad as 2x too low. Here we cannot use it, because we cannot do a regression on that value: it would imply that the factors multiply instead of add together.)
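
To make that concrete: a linear fit on the log scale would be equivalent to a multiplicative model on the original scale,

\[\log \hat t = \beta_0 + \beta_1 \cdot \text{input}_1 + \ldots \quad \Longleftrightarrow \quad \hat t = e^{\beta_0} \cdot e^{\beta_1 \cdot \text{input}_1} \cdot \ldots\]

which is incompatible with the inputs adding up across the functions of a CGU.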

#Filter out times below 50ms, see below.
data_transformed <- data_all %>%
  filter(t_all__ > 50)
data_transformed <- data_transformed %>%
  mutate_at(vars(-mode,-foldid,-testcase), function (x) { x / data_transformed$t_all__})

“Perfect” model

Here we fit a “perfect” model: allowing negative coefficients, different coefficients for debug/release, overfitting, etc. This is useful because it shows the best we can hope to reach. In this and all later models, we ignore the CGUs below 50 ms, because the measurement error on t_all__ has a large influence on the fit there.

#Fit a model without constant, we have our own constant
fit_perfect <- lm(t_all__ ~ 0 + (cons + sttic + root_ + inlnd + bb___ + ssd__ + stmt_ +
  term_ + a_msg + rval_ + place + proj_ + proje + const + ty_cn + ty___ + regn_ +
  args_ + lc_dc + vdi__ + local + argc_):mode, data_transformed)
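
Because the transformed response is \(t / t = 1\), the residuals of this fit are \(1 - \hat t / t\), so its residual standard error is (roughly, and ignoring the debug/release split) the best achievable relative error. For example:

#Residual standard error: approximately the best achievable relative error
sigma(fit_perfect)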

Real model

Now we look at a more realistic model. We prevent overfitting by using the lasso (a regularized linear regression variant that shrinks coefficients and, as a side effect, often sets them to exactly zero). We also set a lower limit of zero on the coefficients, since negative coefficients do not make sense here.

We will fit it separately for Debug and Release mode, since the LLVM time values have very different scales.
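
For reference, with no intercept and the non-negativity constraint, the lasso here solves (roughly, in glmnet’s parameterization) the penalized least-squares problem

\[\min_{\beta \ge 0} \; \frac{1}{2N} \sum_{i=1}^{N} \Big( y_i - \sum_j \beta_j \cdot x_{ij} \Big)^2 + \lambda \sum_j \beta_j\]

where \(\lambda\) is chosen by cross-validation over the folds we created earlier.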

#Normalize coefficients, since we are interested in which ones have a better predictive power
data_tofit_norm <- data_transformed %>%
  mutate_at(vars(-t_all__, -mode, -testcase, -foldid), function(x) { x / sd(x) })
data_tofit_norm_dbg <- data_tofit_norm %>% filter(mode == "dbg")
data_tofit_norm_opt <- data_tofit_norm %>% filter(mode == "opt")

f = t_all__ ~ 0 + cons + sttic + root_ + inlnd + bb___ + ssd__ + stmt_ +
  term_ + a_msg + rval_ + place + proj_ + proje + const + ty_cn + ty___ + regn_ +
  args_ + lc_dc + vdi__ + local + argc_


fit_dbg <- glmnetUtils::cv.glmnet(f,
                                  data_tofit_norm_dbg,
                                  intercept=FALSE,
                                  standardize=FALSE,
                                  foldid=as.numeric(data_tofit_norm_dbg$foldid),
                                  lower.limits=0)
fit_opt <- glmnetUtils::cv.glmnet(f,
                                  data_tofit_norm_opt,
                                  intercept=FALSE,
                                  standardize=FALSE,
                                  foldid=as.numeric(data_tofit_norm_opt$foldid),
                                  lower.limits=0)
#Debug
print(coef(fit_dbg, s=fit_dbg$lambda.min))
## 23 x 1 sparse Matrix of class "dgCMatrix"
##                     s1
## (Intercept) .         
## cons        0.04310224
## sttic       .         
## root_       .         
## inlnd       .         
## bb___       0.13917455
## ssd__       .         
## stmt_       .         
## term_       0.01016885
## a_msg       .         
## rval_       .         
## place       .         
## proj_       .         
## proje       0.03461849
## const       .         
## ty_cn       .         
## ty___       0.07772235
## regn_       .         
## args_       .         
## lc_dc       .         
## vdi__       .         
## local       0.06164122
## argc_       0.05352078
#Release
print(coef(fit_opt, s=fit_opt$lambda.min))
## 23 x 1 sparse Matrix of class "dgCMatrix"
##                     s1
## (Intercept) .         
## cons        .         
## sttic       .         
## root_       0.17195007
## inlnd       0.17520035
## bb___       0.02759081
## ssd__       .         
## stmt_       .         
## term_       0.19361250
## a_msg       .         
## rval_       .         
## place       .         
## proj_       .         
## proje       0.31055497
## const       0.13832327
## ty_cn       0.01849399
## ty___       .         
## regn_       0.29864183
## args_       .         
## lc_dc       .         
## vdi__       .         
## local       .         
## argc_       .

The lasso made an automatic selection of inputs to use. The absolute scale of the coefficients is not meaningful here (the inputs were normalized); instead, we should look at which ones are relatively high or low, ideally in both debug and release mode.

You can see that “bb___” and “proje” have non-zero coefficients in both modes. Let’s try to make a model with only these two.

Update: with the grouping of CGUs into folds taken into account, “term_” now also has a non-zero coefficient in both modes. Let’s skip it for now, even though it might make a better model.
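
A small helper (purely illustrative, not part of the original analysis) to extract this intersection programmatically:

#Inputs with non-zero coefficients at lambda.min, present in both modes
nonzero <- function(fit) {
  cf <- coef(fit, s = fit$lambda.min)
  rownames(cf)[as.vector(cf) != 0]
}
intersect(nonzero(fit_dbg), nonzero(fit_opt))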

#Now use unnormalized data, since we want to see the real coefficients
data_transformed_dbg <- data_transformed %>% filter(mode == "dbg")
data_transformed_opt <- data_transformed %>% filter(mode == "opt")

f = t_all__ ~ bb___ + proje

fit_dbg <- glmnetUtils::cv.glmnet(f,
                                  data_transformed_dbg,
                                  intercept=FALSE,
                                  foldid=as.numeric(data_transformed_dbg$foldid),
                                  lower.limits=0)
fit_opt <- glmnetUtils::cv.glmnet(f,
                                  data_transformed_opt,
                                  intercept=FALSE,
                                  foldid=as.numeric(data_transformed_opt$foldid),
                                  lower.limits=0)
#Debug
print(coef(fit_dbg, s=fit_dbg$lambda.min))
## 3 x 1 sparse Matrix of class "dgCMatrix"
##                     s1
## (Intercept) .         
## bb___       0.05082782
## proje       0.02219612
#Release
print(coef(fit_opt, s=fit_opt$lambda.min))
## 3 x 1 sparse Matrix of class "dgCMatrix"
##                     s1
## (Intercept) .         
## bb___       0.12139737
## proje       0.09532587

We see that the “bb___” coefficient is ~2x the “proje” coefficient in debug mode, and ~1.25x in release mode. Let’s use something in the middle, and call it the new model:

\[\hat t = 7 \cdot \text{bb\_\_\_} + 4 \cdot \text{proje}\]

Comparing results

We have two new predictions above (the perfect model and our new model), and the old method (est__). Let us compare them. To do so, we first scale each prediction so that it is (on average) equal to the t_all__ value. This is fine because we are aiming for a relative prediction.

data_withpred <- data_all %>%
  mutate(pred_est=est__,
         pred_perfect=predict(fit_perfect, .),
         pred_new=7 * bb___ + 4 * proje
  ) %>%
  pivot_longer(c(pred_est, pred_perfect, pred_new), names_to="model", values_to="pred")
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `pred_perfect = predict(fit_perfect, .)`.
## Caused by warning in `predict.lm()`:
## ! prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
#Adjust for release and debug modes
adjust_factor <- data_withpred %>%
  filter(t_all__ > 50) %>%
  mutate(rel=pred / t_all__) %>%
  group_by(mode, model) %>%
  summarize(adjust=1/mean(rel), .groups="drop")

data_withpred <- data_withpred %>%
  inner_join(adjust_factor, by=join_by(mode, model)) %>%
  mutate(pred=pred*adjust)

print(ggplot(data_withpred, aes(x=pred, y=t_all__, color=model)) + geom_point() +
  scale_x_log10() + scale_y_log10() + geom_abline() + facet_grid(cols=vars(mode)))

A prediction is better if it is vertically closer to the x=y line. Here you see that the predictions generally look quite similar, with the “perfect” and “new” methods a tiny bit closer to the line. Let us also calculate the relative error, as the standard deviation of pred / t_all__ (for t_all__ > 50 ms):

err <- data_withpred %>%
  mutate(rel=pred / t_all__) %>%
  filter(t_all__ > 50) %>%
  group_by(mode, model) %>%
  summarize(relative_error=sd(rel))
## `summarise()` has grouped output by 'mode'. You can override using the
## `.groups` argument.
knitr::kable(err)
mode model relative_error
dbg pred_est 0.3600314
dbg pred_new 0.2273454
dbg pred_perfect 0.1885423
opt pred_est 0.5120801
opt pred_new 0.3690356
opt pred_perfect 0.2510631

This shows that we cannot do better than ~19%/25% relative error for debug/release mode (the standard deviation of the prediction/actual ratio). We also see that the new method does a bit better than the “est__” method.

A few final notes

From the results above, we see that there are many reasonable models for predicting the LLVM time. We chose one here, but it is better to use some domain knowledge to improve on it. E.g., why does the old model use the “sttic” input, even though it has barely any predictive power? The answers to such questions can help make a model more robust, now and in the future.

Note that getting more data is not the answer: the problem in the data is multicollinearity (the inputs are very similar to each other), which makes it difficult to figure out which inputs are important and which are not. The multicollinearity also means that it does not really matter: many models fit the data reasonably well. So why not pick a simple one (preferably with some domain knowledge)? And for a simple model you do not need a large dataset.
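
As a quick illustration of that multicollinearity (a sketch, not part of the analysis above), one could inspect the pairwise correlations between a few of the inputs:

#Pairwise correlations between a few inputs; values close to 1 mean the
#inputs carry nearly the same information.
data_all %>%
  filter(mode == "opt") %>%
  select(bb___, term_, stmt_, place, proje) %>%
  cor() %>%
  round(2)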