Toyota ITC Benchmark

One way to demonstrate the capabilities of RV-Match is to run it on the benchmark developed by Toyota ITC to evaluate different static analysis tools. For more information on this benchmark, refer to the following ISSRE'15 paper:

Shinichi Shiraishi, Veena Mohan, and Hemalatha Marimuthu. 2015. Test Suites for Benchmarks of Static Analysis Tools. In Proceedings of the 26th IEEE International Symposium on Software Reliability Engineering (ISSRE'15).

The Toyota ITC benchmark is publicly available on GitHub. At the time of this writing (December 2015), it consisted of 1,276 tests in total: half contain planted defects and are meant to evaluate the defect rate (DR) of the analysis tools, while the other half are defect-free and are meant to evaluate their false positive rate. The tests are grouped in nine categories: static memory, dynamic memory, stack-related, numerical, resource management, pointer-related, concurrency, inappropriate code, and miscellaneous.
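To give a flavor of how the benchmark is structured, here is a minimal sketch of such a defect/no-defect pair in the spirit of the static memory category. These functions are illustrative only and are not actual tests from the benchmark:

/* "With defect" variant: static buffer overrun (static memory category) */
void static_overrun_with_defect(void)
{
    int buf[5];
    buf[5] = 1;    /* planted defect: index 5 is one past the end of buf */
}

/* "Without defect" variant: the same code with the defect removed */
void static_overrun_without_defect(void)
{
    int buf[5];
    buf[4] = 1;    /* in bounds: the last valid index is 4 */
}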

The Disappearing 2014 Paper

An early 2014 version of the paper above, Quantitative Evaluation of Static Analysis Tools, was presented and published in the same venue one year earlier, namely at ISSRE'14. However, the 2014 paper has since been removed from the IEEE database and we can no longer find it anywhere. According to Prof. Regehr's blog, Coverity has a DeWitt Clause in their EULA that disallows comparing their tool with other tools, which is probably the reason the 2014 paper was removed.

Although the ISSRE'14 paper itself is no longer available, some articles about it have survived on the Internet, such as How Toyota Evaluates Static Analysis Tools and Independent Study Names CodeSonar Best in Class after Head-to-Head Comparison. From these articles we learn that Toyota ITC evaluated six different static analysis tools against their benchmark, and that GrammaTech's CodeSonar was ranked the best overall among them.

The revised version of the Toyota ITC paper published at ISSRE'15 discusses, with approval from GrammaTech and MathWorks, only three static analysis tools: GrammaTech's CodeSonar and MathWorks' Code Prover and Bug Finder. Kudos to the Toyota ITC researchers for their fabulous and immensely useful effort, and to GrammaTech and MathWorks for allowing science to follow its path!

Running RV-Match on the Toyota ITC Benchmark

When executed on the Toyota ITC benchmark, RV-Match scores better than any of the tools for which we were able to obtain scores. RV-Match also distinguishes itself with a perfect false positive rate: zero false positives. The table below shows how RV-Match compares (as of January 2016) to the commercial static analysis tools discussed in the ISSRE'15 paper:

[Table: Toyota ITC benchmark results for RV-Match, gcc, clang, and the commercial static analysis tools from the ISSRE'15 paper]

DR (Defect Rate) is the percentage of tests containing a planted defect in which the defect was detected. FPR (False Positive Rate) is the percentage of defect-free tests in which a false positive was reported, and the underlined FPR in the table stands for 100 - FPR. PM is the productivity metric computed in the paper, namely the geometric mean of DR and 100 - FPR. For the other tools in the table, we used the numbers reported in the original public ISSRE'15 paper by Toyota ITC, because those tools are not freely available. A green box means that the tool had the best score for that metric in that category. A red highlight marks a feature not yet supported by the corresponding tool, yielding a score of 0. The dark red boxes at the bottom compare the final score of RV-Match to that of the best tool in the original ISSRE'15 paper.
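Spelled out as a formula, using the definitions above:

PM = \sqrt{DR \times (100 - FPR)}

For instance, a hypothetical tool that detects 81% of the planted defects (DR = 81) with a false positive rate of 10% (FPR = 10) would score PM = sqrt(81 x 90), which is about 85.4. These example numbers are illustrative only and are not taken from the table.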

Ordinary compilers also perform some minimal static checks on the program. For reference, the table above also includes the numbers we obtained with gcc and clang when run on the benchmark with all their warning flags enabled. As the table illustrates, if no dedicated analysis tools are used in your development process, you should expect your compiler to catch only about 5% of the undefined-behavior bugs lurking in your C code. This is why safety-critical code is very rarely developed without the help of software analysis tools. There is also a variety of non-commercial analysis tools that you can use to improve the quality of your code. The table below shows the numbers we obtained (in January 2016) when running such freely available tools on the Toyota ITC benchmark:

[Table: Toyota ITC benchmark results for the freely available analysis tools]

As expected, the scores of the freely available tools are lower than those of the commercial tools.

The reader should not naively conclude, based on the numbers above alone, that runtime verification is simply the best program analysis approach. The various technologies are best viewed as complementing each other rather than competing with each other. For example, static analysis tools are more forgiving: they can analyze code that does not even compile, so they can help you find errors earlier in the process, and they typically analyze all of your code in one run of the tool. Dynamic analysis tools, on the other hand, aim at very low runtime overhead, unnoticeable if possible, at the expense of a less rigorous analysis. Like dynamic analysis, runtime verification requires the program to actually execute, so it cannot analyze a program that does not compile. Also, runtime verification covers in depth only the fragment of your program that is actually executed, so it needs to be combined with your existing testing infrastructure (e.g., running your unit tests with the RV tool) for best results. However, unlike dynamic analysis tools, the checks that RV-Match performs are mathematically rigorous and follow the formal specification of the ISO C11 standard. For more information on the runtime verification technology underlying RV-Match, you can view this presentation.

If you would like to reproduce the results of the benchmark for yourself, you can do so by executing the following commands:

Note

The Toyota ITC benchmark requires libraries, including libc, to run. Because our native Windows version does not yet fully support libraries, we recommend running the benchmark with the Linux version, or with a Linux-based virtual machine on Windows such as the provided Vagrant VM.

# install the autotools needed to build the benchmark
# (use your package manager if apt is not available)
sudo apt-get install automake
# fetch the benchmark, with our fixes for unintended undefinedness
git clone https://github.com/runtimeverification/toyota-itc-benchmarks
cd toyota-itc-benchmarks
git checkout fix-unintended-undefinedness
# generate the configure script and build with kcc (RV-Match) in place of the usual C compiler
./bootstrap
CC=kcc LD=kcc CFLAGS=-flint ./configure
make -j4

You can then reproduce the results of any test by running the relevant program. For example, to run the first test in the second group of tests with defects, run ./01.w_Defects/01.w_defects 002001. For more information about which test groups correspond to which defect types, refer to main.c. If you want to check the details of our experiments with the freely available tools described above, or to redo those experiments yourself, you can find all our records and the scripts we used on GitHub.

Fixing Bugs in the Benchmark

Interestingly, running RV-Match on the Toyota ITC benchmark uncovered a number of flaws in the benchmark itself, both in the form of undefined behavior that the tests were not intended to exercise and in the form of tests that were intended to contain a defect but were in fact correct. We have made our fixes, with detailed explanations, public in this pull request on GitHub, and we have informed the Toyota ITC authors so they can fix their benchmark. If you have any comments or questions regarding any of these changes, please feel free to contribute to the pull request above and/or contact us at our support page.
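As an illustration only (this is not an actual test from the benchmark or a fix from our pull request), an "unintended" defect is an undefined behavior that occurs in addition to the one a test was designed to plant, for example:

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical sketch of unintended undefinedness, not taken from the
 * benchmark. The defect this test is meant to plant is the memory
 * leak, but the mismatched printf format specifier is a second,
 * unintended undefined behavior that a conforming C11 checker such as
 * RV-Match also has to report. */
void leak_with_unintended_defect(void)
{
    int *p = malloc(sizeof(int));
    if (p == NULL)
        return;
    *p = 42;
    printf("%ld\n", *p);   /* unintended defect: %ld expects long, *p is int */
    /* intended defect: p is never freed (memory leak) */
}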

Note that we used the fixed version of the benchmark in the experiments reported in the RV-Match columns of the table above. Fixing the benchmark bugs was not optional: without some of the fixes, the benchmark would not even compile with RV-Match, due to constraint violation defects in the code. Unfortunately, we do not have access to the static analysis tools referenced in the ISSRE'15 paper and the table above, so for them we simply reproduced the results reported in the ISSRE'15 paper. Because of that, it is possible that the metrics shown for the other tools are off by some amount. If you have access to these static analysis tools, run the fixed benchmark yourself, and find any inconsistencies, please contact us at our support page and help us correct the numbers.