Scaling Testing of Refactoring Engines

Melina Mongiovi
Advisor: Rohit Gheyi


Refactoring engines may have overly weak conditions, overly strong conditions, and transformation issues related to the refactoring definitions. We find that 84% of the test suites of Eclipse and JRRT are concerned with those kinds of bugs. However, the engines still have them. Researchers have proposed a number of automated techniques for testing refactoring engines. Nevertheless, these techniques may have limitations related to the program generator, time consumption, the kinds of bugs detected, and debugging. We propose a technique to scale testing of refactoring engines. We improve the expressiveness of a program generator, use a technique that skips some test inputs to improve performance, and propose new oracles to detect behavioral changes using change impact analysis, overly strong conditions using mutation testing, and transformation issues related to the refactoring definitions. We evaluate our technique on 28 refactoring implementations for Java (Eclipse and JRRT) and C (Eclipse) and find 119 bugs. Using skips reduces the time by 96% while missing only 6% of the bugs. With the new oracle for overly strong conditions, the technique detects more bugs than previous work and makes debugging easier. Finally, we evaluate refactoring implementations of Eclipse and JRRT using the input programs of their refactoring test suites and find a number of bugs not detected by the developers.
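To illustrate the idea of skipping test inputs, the following is a minimal sketch and not the implementation used in DOLLY. It assumes that a skip of n means: evaluate one generated instance, pass over the next n, and repeat. The function name with_skip and the integer stand-ins for generated programs are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def with_skip(instances: Iterable[T], skip: int) -> Iterator[T]:
    """Yield one instance, pass over the next `skip` instances, and repeat.

    Assumption: this is one plausible reading of "skip"; the real generator
    may select instances differently.
    """
    if skip < 0:
        raise ValueError("skip must be non-negative")
    # Taking every (skip + 1)-th element keeps the stream lazy, so skipped
    # instances are never turned into programs or refactored.
    return islice(instances, 0, None, skip + 1)


# Hypothetical stream of 100 generated instances, identified by index.
selected = list(with_skip(range(100), skip=25))
print(selected)
```

Because the selection is lazy, the cost saved is the cost of compiling and refactoring the skipped programs; the trade-off is that a bug exposed only by skipped inputs goes undetected.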



Evaluation of our technique to scale testing of refactoring engines


Table I: Summary of the number of generated programs and the time to evaluate the refactoring implementations.

  

Table II: Summary of the detected bugs related to compilation errors.

  

Table III: Summary of the detected bugs related to behavioral changes.

  

Table V: Summary of the time to find the first failure (TTFF). Bug Type = type of the first detected failure: compilation error (CE) or behavioral change (BC); Generated Programs = number of programs generated to find the first failure.

Table VI: Summary of the evaluation results of JRRT and Eclipse refactoring implementations; Refact. = kind of refactoring; GP = programs generated by DOLLY; No of different messages = number of different messages reported by the engine; LOC added in the mutant schemata = lines of code added in the mutant schemata to disable the refactoring conditions; Overly Strong Conditions = number of detected overly strong conditions in the refactoring implementations; Time = total time to evaluate the refactoring implementations; Time to First Failure = time to find the first failure; "na" = not assessed.

Table VII: Summary of the evaluation results of JRRT and Eclipse refactoring implementations; Refact. = kind of refactoring; GP = programs generated by DOLLY; No of different messages = number of different messages reported by the engine; LOC added in the mutant schemata = lines of code added in the mutant schemata to disable the refactoring conditions; MT Tech. = mutation testing technique; DT Tech. = differential testing technique; Overly Strong Conditions = number of detected overly strong conditions in the refactoring implementations; Time = total time to evaluate the refactoring implementations; Time to First Failure = time to find the first failure; "na" = not assessed.
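The mutation-testing oracle for overly strong conditions can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual tooling: it assumes the engine rejects a transformation with a message, that a mutant of the engine has the corresponding condition disabled, and that the mutant's output is then checked for compilation and behavior preservation. The Outcome record and the function overly_strong_candidate are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Outcome:
    """Result of applying one refactoring to one input program (hypothetical)."""
    rejected_message: Optional[str]   # engine's message, or None if applied
    compiles: bool = False            # the output program compiles
    preserves_behavior: bool = False  # the output behaves like the input


def overly_strong_candidate(original: Outcome, mutant: Outcome) -> Optional[str]:
    """Return the rejection message if the condition looks overly strong.

    Assumption: a condition is flagged when the unmodified engine rejects the
    transformation, but the engine with that condition disabled produces a
    program that compiles and preserves behavior.
    """
    if original.rejected_message is None:
        return None  # the engine applied the refactoring; nothing to flag
    if mutant.rejected_message is not None:
        return None  # still rejected, so another condition is involved
    if mutant.compiles and mutant.preserves_behavior:
        return original.rejected_message
    return None  # the condition prevented a faulty transformation


# Hypothetical example: the message text below is invented for illustration.
rejected = Outcome(rejected_message="hypothetical: method already exists")
applied_ok = Outcome(rejected_message=None, compiles=True, preserves_behavior=True)
print(overly_strong_candidate(rejected, applied_ok))
```

Returning the engine's message ties each flagged case to the condition that produced it, which is what helps with debugging.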

Table VIII: Summary of the evaluation results of Eclipse and JRRT refactoring implementations with our technique using SCA oracle; Refactoring = kind of refactoring; Scope = scope used by DOLLY to generate programs; P = package; C = class; M = method; F = field; Alloy Instances = number of Alloy instances generated by the Alloy Analyzer; GP (using skip of 25) = number of generated programs using skip of 25 in DOLLY 2.0; CP = compilable generated programs; Transformation Issues = number of different kinds of issues related to incorrect transformations; Time = total time to evaluate the refactoring implementations.

  

Evaluation of SafeRefactorImpact


We evaluated SafeRefactor on 45 pairs of programs. First, we evaluated it on 8 non-behavior-preserving transformations applied by Eclipse. Table VIII summarizes the experimental results. The subjects can be downloaded here.

Table VIII (a): Results using a time limit of 0.2s. Methods = number of methods passed to Randoop to generate tests; Time = the total time of the analysis in seconds; Change Coverage = the percentage of impacted methods covered; Relevant Tests = the percentage of relevant tests; Result = it states whether the transformation is behavior preserving.

  

We also analyzed 23 design patterns implemented in Java and AspectJ. Our tool identified that the AO and OO versions of the State pattern are not equivalent. Table IX summarizes the experimental results. The subjects can be found here.

Table IX: Results using a time limit of 0.5s. Impacted Methods = number of methods identified in the change impact analysis; Methods = number of methods passed to Randoop to generate tests; Time = the total time of the analysis in seconds; Change Coverage = the percentage of impacted methods covered; Relevant Tests = the percentage of relevant tests; Result = it states whether the transformation is behavior preserving.
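The two percentage columns can be computed as in the following sketch. It assumes that a test counts as relevant when it exercises at least one impacted method (the caption does not define this), and that an empty set of impacted methods or tests yields 0%. The function names and method signatures are hypothetical.

```python
from typing import Dict, Iterable, Set


def change_coverage(impacted: Iterable[str], covered: Iterable[str]) -> float:
    """Percentage of impacted methods exercised by at least one test."""
    impacted_set = set(impacted)
    if not impacted_set:
        return 0.0  # assumption: no impacted methods is reported as 0%
    hit = impacted_set & set(covered)
    return 100.0 * len(hit) / len(impacted_set)


def relevant_tests(tests: Dict[str, Set[str]], impacted: Iterable[str]) -> float:
    """Percentage of tests that exercise at least one impacted method."""
    impacted_set = set(impacted)
    if not tests:
        return 0.0  # assumption: an empty test suite is reported as 0%
    relevant = [name for name, methods in tests.items() if methods & impacted_set]
    return 100.0 * len(relevant) / len(tests)


# Hypothetical data: four impacted methods and four generated tests.
impacted = {"A.m()", "B.n()", "C.k()", "D.p()"}
tests = {
    "test1": {"A.m()", "X.y()"},
    "test2": {"X.y()"},
    "test3": {"B.n()"},
    "test4": {"Z.w()"},
}
covered = set().union(*tests.values())
print(change_coverage(impacted, covered), relevant_tests(tests, impacted))
```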

  

We used our tools to evaluate two JML compilers that generate AO code. Table X summarizes the experimental results. The subjects can be downloaded here.

Table X: Results using a time limit of 0.2s. Impacted Methods = number of methods identified in the change impact analysis; Methods = number of methods passed to Randoop to generate tests; Time = the total time of the analysis in seconds; Change Coverage = the percentage of impacted methods covered; Relevant Tests = the percentage of relevant tests; Result = it states whether the transformation is behavior preserving.

  


Additionally, we compared 12 transformations applied to real OO and AO programs. Table XI summarizes the experimental results. The subjects can be found here.

Table XI (a): Results using a time limit of 20s. Impacted Methods = number of methods identified in the change impact analysis; Methods = number of methods passed to Randoop to generate tests; Time = the total time of the analysis in seconds; Change Coverage = the percentage of impacted methods covered; Relevant Tests = the percentage of relevant tests; Result = it states whether the transformation is behavior preserving.


Alloy Specifications:
Java meta-model specified in Alloy
Java meta-model specified in Alloy (with abstract and interface)
C meta-model specified in Alloy

Tools used in our experiment:
SafeRefactor/SafeRefactorImpact
Dolly

Subjects used in the evaluation of SafeRefactorImpact:
Toy Examples
Design Patterns
JML Compilers
Larger Subjects

New Bugs of Eclipse JDT
434862: Push Down Field (compilation error)
434878: Pull Up Field (behavioral change)
434881: Move Method (overly strong condition)
434886: Move Method (overly strong condition)
462994: Pull Up Field (overly strong condition)
471952: Push Down Method (transformation issue)
473650: Push Down Method (transformation issue)
473651: Pull Up Method (transformation issue)
473653: Pull Up Method (transformation issue)
476293: Pull Up Method (transformation issue)
473655: Pull Up Method (transformation issue)
476256: Pull Up Method (transformation issue)
471953: Encapsulate Field (transformation issue)
471955: Encapsulate Field (transformation issue)
471961: Encapsulate Field (transformation issue)

New Bugs of Eclipse CDT
426896: Rename Global Variable (compilation error)
426895: Rename Global Variable (compilation error)
426894: Rename Global Variable (compilation error)
426893: Rename Global Variable (compilation error)
426899: Rename Global Variable (behavioral change)
435125: Rename Local Variable (compilation error)
425431: Rename Local Variable (compilation error)
425433: Rename Local Variable (compilation error)
435124: Rename Local Variable (compilation error)
425434: Rename Local Variable (behavioral change)
426891: Rename Define (compilation error)
426890: Rename Define (compilation error)
426889: Rename Define (compilation error)
425491: Rename Define (compilation error)
426925: Rename Function (compilation error)
426924: Rename Function (compilation error)
426922: Rename Function (compilation error)
425466: Rename Function (compilation error)
425465: Rename Function (compilation error)
425426: Rename Parameter (compilation error)
426635: Rename Parameter (compilation error)
425489: Rename Parameter (compilation error)
426636: Rename Parameter (compilation error)
425428: Rename Parameter (behavioral change)
426916: Extract Function (compilation error)
426915: Extract Function (compilation error)
426914: Extract Function (compilation error)
426913: Extract Function (compilation error)
426911: Extract Function (compilation error)
425438: Extract Function (compilation error)
396658: Extract Function (compilation error)
426918: Extract Function (behavioral change)
425446: Extract Local Variable (compilation error)
435122: Extract Local Variable (compilation error)
435121: Extract Local Variable (compilation error)
425445: Extract Local Variable (compilation error)
425444: Extract Local Variable (compilation error)
397278: Extract Local Variable (behavioral change)
397275: Extract Constant (compilation error)
425127: Extract Constant (compilation error)
435132: Extract Constant (compilation error)
435131: Extract Constant (compilation error)
435130: Extract Constant (compilation error)
435129: Extract Constant (compilation error)
435128: Extract Constant (compilation error)
425470: Extract Constant (compilation error)
425471: Extract Constant (behavioral change)

New Bugs of JRRT
1. Move Method (overly strong condition)
2. Move Method (overly strong condition)
Catalog of bugs detected by the technique using mutation testing (overly strong condition)
Catalog of bugs detected by the technique to identify transformation issues (transformation issues)


SPG - Software Productivity Group