Can static type systems speed up programming? An experimental evaluation of static and dynamic type systems

Academic Paper 2013 113 Pages

Computer Science - Miscellaneous


Table of Contents


Zusammenfassung (German Abstract)

Directory of Figures

Directory of Tables

Directory of Listings

1. Introduction

2. Motivation & Background
2.1 Motivation
2.2 Maintenance and Debugging
2.2.1 Maintenance in a Nutshell
2.2.2 Debugging in a Nutshell
2.3 Documentation and APIs
2.3.1 Documentation of Software Systems
2.3.2 APIs and Application of their Design Principles in General Programming
2.4 Type Systems
2.5 Empirical Research in Software Engineering
2.5.1 On Empirical Research
2.5.2 Controlled Experiments
2.5.3 Current State of Empirical Research in Software Engineering

3. Related Work
3.1 Gannon (1977)
3.2 Prechelt and Tichy (1998)
3.3 Daly, Sazawal and Foster (2009)
3.4 Hanenberg (2010)
3.5 Steinberg, Mayer, Stuchlik and Hanenberg - A running Experiment series
3.5.1 Steinberg (2011)
3.5.2 Mayer (2011)
3.5.3 Stuchlik and Hanenberg (2011)

4. The Experiment
4.1 The Research Question
4.2 Experiment Overview
4.2.1 Initial Considerations
4.2.2 Further Considerations: Studies on Using Students as Subjects
4.2.3 Design of the Experiment
4.3 Questionnaire
4.4 Hard- and Software Environment
4.4.1 Environment
4.4.2 Programming Languages
4.5 Workspace Applications and Tasks
4.5.1 The Java Application - A Labyrinth Game
4.5.2 The Groovy Application - A simple Mail Viewer
4.5.3 Important Changes made to both Parts
4.5.4 The Tasks
4.6 Experiment Implementation

5. Threats to Validity
5.1 Internal Validity
5.2 External Validity

6. Analysis and Results
6.1 General Descriptive Statistics
6.2 Statistical Tests and Analysis
6.2.1 Within-Subject Analysis on the complete data
6.2.2 Analysis for residual effects between the two Participant Groups
6.2.3 Within-Subject Analysis on the two Participant Groups
6.2.4 Exploratory Analysis of the Results based on Participants’ Performance
6.2.5 Hypotheses and Task based Analysis

7. Summary and Discussion
7.1 Final Remarks
7.2 Result Summary
7.3 Discussion

8. Conclusion


A. Appendix
A.1 Statistical Methods and Tests
A.1.1. Box plots (box-whisker-diagrams)
A.1.2. Kolmogorov-Smirnov and Shapiro-Wilk
A.1.3. Independent and Dependent t-test
A.1.4. Wilcoxon Signed Rank Test
A.1.5. Mann-Whitney-U and Kolmogorov-Smirnov Z test
A.1.6. Regression Analysis
A.2 Supplemental Data
A.2.1. Participant Results Tasks 1 to 9 (Java)
A.2.2. Participant Results Tasks 1 to 9 (Groovy)
A.2.3. Results of the Tests for Normal Distribution on the results split by the two Groups
A.2.4. Results of Tests for Normal Distribution for the Participant Performance Analyses
A.2.5. Participant Performance Analysis based on the complete data
A.2.5.1. Outperformers
A.2.5.2. Underperformers
A.2.6. Demographic of participants and Questionnaire Results
A.3 An Example of a problematic Experiment Design and Analysis

Directory of Figures

Figure 4-1: Assumed occurrence of learning effect in experiment design (Figure taken from [Stuchlik and Hanenberg 2011])

Figure 4-2: The labyrinth game interface along with some annotations for the participants

Figure 4-3: The mail viewer interface along with some annotations for the participants

Figure 4-4: Simplified call stack containing bug creation and runtime error

Figure 4-5: Simplified call stack containing bug creation and runtime error for task 16

Figure 4-6: Simplified call stack containing bug creation and runtime error for task 18

Figure 6-1: Boxplot of complete experiment results

Figure 6-2: Boxplot of results for the Groovy starter group

Figure 6-3: Sample histogram for a positively skewed frequency distribution of task results

Figure 6-4: Boxplot of results for the Java starter group

Figure 6-5: Scatterplot of the results for the type identification tasks in Groovy

Figure 6-6: Scatterplot of the results for the type identification tasks in Java

Figure 6-7: Boxplot of the Groovy part results for tasks 7 and

Figure 6-8: Boxplot of results for task 4 and 5 results of only the first language used

Figure A-1: Example of a box plot

Directory of Tables

Table 4-1: Experiment Blocking Design

Table 4-2: Summary of the independent variables, their values, corresponding tasks and dependent variables

Table 6-1: Descriptive statistics data of Groovy tasks for all participants (in seconds)

Table 6-2: Descriptive statistics data of Java tasks for all participants (in seconds)

Table 6-3: Descriptive statistics data of total experiment time for all participants (in seconds)

Table 6-4: Results of the tests for normal-distribution for the complete data, comparing task differences based on the language used

Table 6-5: Results of the t-test and Wilcoxon Signed Rank tests on the complete data, comparing tasks based on the language used

Table 6-6: Results of the Mann-Whitney-U and Kolmogorov-Smirnov-Z test when comparing Java task results between Groups (GS=GroovyStarters)

Table 6-7: Results of the Mann-Whitney-U and Kolmogorov-Smirnov-Z test when comparing Groovy task results between Groups (JS = JavaStarters)

Table 6-8: Descriptive statistics of Groovy tasks for participants that started with Groovy (in seconds)

Table 6-9: Descriptive statistics of Java tasks for participants that started with Groovy (in seconds)

Table 6-10: Descriptive statistics of total time for participants that started with Groovy (in seconds)

Table 6-11: Results of the tests for normal-distribution for the Groovy starter group, comparing task time differences based on the language used

Table 6-12: Results of the t-test and Wilcoxon-test for the Groovy starter group

Table 6-13: Descriptive statistics of Groovy tasks for participants that started with Java (in seconds)

Table 6-14: Descriptive statistics of Java tasks for participants that started with Java (in seconds)

Table 6-15: Descriptive statistics of total time for participants that started with Java (in seconds)

Table 6-16: Results of the tests for normal-distribution for the Java starter group, comparing task time differences based on the language used

Table 6-17: Results of the t-test and Wilcoxon-test for the Java starter group

Table 6-18: Descriptive statistics of Groovy tasks for outperformer participants that started with Groovy (in seconds)

Table 6-19: Descriptive statistics of Java tasks for outperformer participants that started with Groovy (in seconds)

Table 6-20: Descriptive statistics of total time for outperformer participants that started with Groovy (in seconds)

Table 6-21: Results of the t-test and Wilcoxon test for the Groovy starter outperformer group

Table 6-22: Descriptive statistics of Groovy tasks for underperformer participants that started with Groovy (in seconds)

Table 6-23: Descriptive statistics of Java tasks for underperformer participants that started with Groovy (in seconds)

Table 6-24: Descriptive statistics of total time for underperformer participants that started with Groovy (in seconds)

Table 6-25: Results of the t-test and Wilcoxon test for the Groovy starter underperformer group

Table 6-26: Descriptive statistics of Groovy tasks for outperformer participants that started with Java (in seconds)

Table 6-27: Descriptive statistics of Java tasks for outperformer participants that started with Java (in seconds)

Table 6-28: Descriptive statistics of total time for outperformer participants that started with Java (in seconds)

Table 6-29: Results of the t-test and Wilcoxon test for the Java starter outperformer group

Table 6-30: Descriptive statistics of Groovy tasks for underperformer participants that started with Java (in seconds)

Table 6-31: Descriptive statistics of Java tasks for underperformer participants that started with Java (in seconds)

Table 6-32: Descriptive statistics of total time for underperformer participants that started with Java (in seconds)

Table 6-33: Results of the t-test and Wilcoxon test for the Java starter underperformer group

Table 6-34: Results of regression analysis for task time depending on number of types to identify and language

Table 6-35: Coefficients of regression analysis for type identification task time depending on number of types to identify and language

Table 6-36: Results of t-test and Wilcoxon test for comparing tasks 7 and 9 for Groovy

Table 6-37: Results of Mann-Whitney-U and Kolmogorov-Smirnov-Z test for first tasks 4 and 5 results

Table A-1: Participants raw results for the Java part in seconds for Groovy Starters

Table A-2: Participants raw results for the Java part in seconds for Java Starters

Table A-3: Participants raw results for the Groovy part in seconds for Groovy Starters

Table A-4: Participants raw results for the Groovy part in seconds for Java Starters

Table A-5: Results of tests for normal distribution on task results split by group (GS = GroovyStarters, JS=JavaStarters)

Table A-6: Tests for Normal Distribution on Differences between Groovy and Java times for Groovy Starter Outperformers

Table A-7: Tests for Normal Distribution on Differences between Groovy and Java times for Groovy Starter Underperformers

Table A-8: Tests for Normal Distribution on Differences between Groovy and Java times for Java Starter Outperformers

Table A-9: Tests for Normal Distribution on Differences between Groovy and Java times for Java Starter Underperformers

Table A-10: Descriptive statistics for outperformer participants over complete experiment (in seconds)

Table A-11: Descriptive statistics for outperformer participants over complete experiment (in seconds)

Table A-12: Descriptive statistics for outperformer participants over complete experiment (in seconds)

Table A-13: Tests for normal distribution on differences between groovy and java times for complete experiment outperformers

Table A-14: Results of t-test and Wilcoxon test for outperformer students of whole experiment

Table A-15: Descriptive statistics for underperformer participants over complete experiment (in seconds)

Table A-16: Descriptive statistics for underperformer participants over complete experiment (in seconds)

Table A-17: Descriptive statistics for underperformer participants over complete experiment (in seconds)

Table A-18: Tests for normal distribution on differences between groovy and java times for complete experiment underperformers

Table A-19: Results of t-test and Wilcoxon test for underperformer students of whole experiment

Table A-20: Questionnaire results for programming skill questions

Table A-21: Descriptive statistics for Groovy group of independent design (in seconds)

Table A-22: Descriptive statistics for Groovy group of independent design (in seconds)

Table A-23: Descriptive statistics for Groovy group of independent design (in seconds)

Table A-24: Descriptive statistics for Groovy group of independent design (in seconds)

Table A-25: Results of Mann-Whitney-U (MW-U) and Kolmogorov-Smirnov-Z (KS-Z) test for independent design

Directory of Listings

Listing 2-1: Examples for variable declarations in a statically typed language

Listing 2-2: Examples for variable declarations in a dynamically typed language

Listing 2-3: Redundancy of information through static type system

Listing 4-1: Simple Java Code Example

Listing 4-2: Simple Groovy Code Example

Listing 4-3: Example code to explain stack and branch size

Listing 4-4: Solution to task 1 (Java)

Listing 4-5: Solution to task 10 (Groovy)

Listing 4-6: Solution to task 2 (Java)

Listing 4-7: Solution to task 11 (Groovy)

Listing 4-8: Solution to task 3 (Java)

Listing 4-9: Solution to task 12 (Groovy)

Listing 4-10: The part of task 4 with the error that leads to wrong behavior

Listing 4-11: Solution to task 4 (Java)

Listing 4-12: Solution to task 13 (Groovy)

Listing 4-13: Part of code from task 14 with missing line to remove reference

Listing 4-14: Solution to task 14 (Groovy)

Listing 4-15: Solution to task 5 (Java)

Listing 4-16: Solution to task 6 (Java)

Listing 4-17: Solution to task 15 (Groovy)

Listing 4-18: The simulated interaction of task 16 (Groovy)

Listing 4-19: Point of bug insertion for task 16

Listing 4-20: Point where bug results in runtime error for task 16

Listing 4-21: Solution to task 8 (Java)

Listing 4-22: Solution to task 17 (Groovy)

Listing 4-23: Simulated interaction for task 18 (Groovy)

Listing 4-24: Point of bug insertion for Task 18

Listing 4-25: Point where bug results in runtime Error for Task 18


Type systems of programming languages are a much discussed topic in software engineering. There are many voices arguing for static as well as for dynamic type systems, although their actual impact on software development is rarely evaluated using rigorous scientific methods. In the context of this work, a controlled experiment with 36 participants was conducted to compare the performance of software developers using a static and a dynamic type system for the same tasks with an undocumented API. The two programming languages used were Java and Groovy. The experiment and its results are analyzed and discussed in this book. Its main hypothesis was that a static type system reduces the time developers need to solve programming tasks in an undocumented API. The main results of the experiment speak strongly in favor of this hypothesis, because the static type system seems to have a significant positive impact on development time.

Zusammenfassung (German Abstract)

Typsysteme von Programmiersprachen sind ein vieldiskutiertes Thema in der Softwaretechnik. Es gibt sowohl für statische als auch dynamische Typsysteme große Gruppen von Befürwortern, obwohl der tatsächliche Einfluss beider auf die Softwareentwicklung selten mithilfe strenger wissenschaftlicher Methoden ausgewertet wurde. Im Kontext dieser Arbeit wurde ein kontrolliertes Experiment mit 36 Teilnehmern durchgeführt, um die Performanz von Softwareentwicklern mit einem statischen und einem dynamischen Typsystem anhand gleicher Aufgaben in einer undokumentierten Anwendung zu vergleichen. Die hierfür genutzten Programmiersprachen waren Java und Groovy. Das Experiment und die Ergebnisse werden in dieser Arbeit analysiert und diskutiert. Die Haupthypothese des Experiments besagt dass ein statisches Typsystem die Zeit verkürzt die ein Entwickler benötigt um Programmieraufgaben in einer undokumentierten Umgebung zu lösen. Die Ergebnisse sprechen stark für diese Hypothese, da das statische Typsystem tatsächlich einen signifikanten positiven Einfluss auf die Entwicklungszeit zu haben scheint.

1. Introduction

Software development is a complex process that remains almost impossible to predict to this day. This is true even if the focus is on the pure programming part of software development and other associated tasks like requirements gathering and specification are ignored. Commonly, software developers need to fix errors in an existing program or extend it. Both tasks are often summarized as software maintenance, which is said to make up a significant part of software project costs ([Boehm 1976], [Lientz et al. 1978] and [Gould 1975]). In addition, different programming languages are in use, most of which have either a static or a dynamic type system. These two kinds of type systems are the subject of a controversial discussion in the scientific world as well as in the software industry. Some argue very strongly for static type systems and the advantages they are supposed to bring, while others counter with their own arguments about how these type systems supposedly restrict and complicate the use of programming languages. It is mostly a battle of beliefs, as both sides’ arguments are speculative and based on logical reasoning rather than on empirical evidence. While static type systems have experienced a popularity boost in recent years and are widespread among popular programming languages (like Java, C#, C++), there is very little scientific evidence of either their positive or negative impact on software development. This work aims to close this knowledge gap a little by conducting a controlled experiment that compares the performance of developers with a static and a dynamic type system on the same tasks. It is not the first experiment conducted on the impact of type systems (others are [Gannon 1977], [Prechelt and Tichy 1998], [Daly et al. 2009], [Hanenberg 2010b] and [Stuchlik and Hanenberg 2011]), but research on the topic is still scarce, as the next chapters will show.

The main focus of the experiment is the impact of a static type system on development time, comparing the time needed to solve tasks in Java (which has a static type system) and Groovy (which has a dynamic type system). For this, several different tasks were designed, and all participants of the experiment had to solve all tasks in both programming languages. Three hypotheses formed the basis of the experiment design. First, tasks in which different classes need to be identified are solved faster in a statically typed language. This was also tested with different numbers of classes to identify. Second, for semantic errors, the kind of type system should make no difference to the completion time. Third, in a dynamically typed language, fixing an error is assumed to take longer the farther the runtime error occurrence is removed from the bug insertion (meaning the faulty location in the code that later results in the runtime error). All this is based on programs that are undocumented, meaning that there are no comments, no documentation, and no variables that explicitly point toward their contained types (more about this notion of no documentation later).

Chapter 2 explains the motivation for this work and the general scientific history of empirical research in software engineering (or lack thereof), and gives background information on some important topics associated with this work’s context. These topics are type systems, maintenance, debugging, and documentation in software engineering. The chapter closes with a summary of empirical research and methods with a focus on controlled experiments. Chapter 3 then summarizes and discusses related work that has previously been conducted on the topic of static and dynamic type systems. In chapter 4, the experiment’s research question and design are explained in detail. Chapter 5 sums up and discusses some threats to the validity of the experiment results. The sixth chapter contains the complete analysis of the gathered experiment data. Then, in the seventh chapter, the experiment results are summarized and discussed. Finally, chapter 8 concludes this work.

2. Motivation & Background

In this chapter, the motivation for this work is presented first, along with some information about certain topics that are directly related to the subject of research. These topics are maintenance and debugging, documentation and APIs, as well as type systems. A summary of empirical research and controlled experiments and the current state of the art in software engineering closes the chapter.

2.1 Motivation

The main motivation of this work is to find out whether static type systems improve developer performance when doing software maintenance. Type systems are said to have some inherent qualities that support a developer when writing and maintaining code, which could lead to faster development. As will be seen later, work on this specific topic is scarce, and this work is meant to provide more insight into it. Part of the motivation is also the possible impact of APIs on software maintenance in this context.

2.2 Maintenance and Debugging

2.2.1 Maintenance in a Nutshell

Almost everyone probably has an intuitive understanding of what maintenance means. Put plainly, to maintain something is to keep it from breaking down or from ceasing to do what it was meant to do. For software maintenance, the IEEE gives a short and understandable definition of the term: “Modification of a software product after delivery to correct faults, to improve performance or other attributes, or to adapt the product to a modified environment”, [IEEE 1998]. It covers not only fixing or preventing faults, but also improvements. Interestingly, in a newer version of the standard released together with the ISO, the new definition states that software maintenance is “the totality of activities required to provide cost-effective support to a software system […]”, [ISO/IEC/IEEE 2006]. This definition is fuzzier than the older one. It is still more appropriate and serves as the base definition for maintenance in this work. The newer standard also defines four corresponding maintenance types: corrective, preventive, adaptive and perfective maintenance. Corrective and adaptive maintenance are the focus of this work.

Something that is related to maintenance but much harder to define is the notion of maintainability. The 2006 standard (the 1998 version did not yet define the term) says that maintainability represents “the capability of the software product to be modified […]”, [ISO/IEC/IEEE 2006]. This definition (along with many similar definitions of maintainability from the software engineering literature, which will not be mentioned here) is very unclear. Worse, there is still no objective measure for maintainability. Many attempts were made either to measure it on a quantitative basis using software metrics (e.g. measuring factors that influence maintainability and inferring a measure of maintainability from them) or to model it on a qualitative basis. Both approaches have yet to yield any commonly accepted measure of maintainability [Broy et al. 2006].

Software maintenance is a huge cost factor. Boehm claimed that software maintenance makes up more than 50% of the total cost of software projects [Boehm 1976]. There are a few more studies that give rough estimates, which range from 40% up to 75% of total project cost (study results summarized by [Lientz et al. 1978]). Although quite a few years have passed, other studies do not give the impression that this has changed. In an older study, Gould reports that about 25% of the time is spent on maintenance/error correction [Gould 1975]. While no current figures could be found that support all these percentages, it can be assumed (and personal experience confirms this) that maintenance still takes up the bulk of a software system’s lifetime and cost.

2.2.2 Debugging in a Nutshell

It was made clear that maintenance of software takes up huge amounts of time and money. An important part of the maintenance process is the fixing of errors (corrective maintenance) which are commonly called bugs. Consequently, removing an error from a program is called debugging. There is actually some history on the origin of the term “bug” for an error in a computer program, but it is controversial and shall not play any further role here.[1]

There are possible classifications of bugs as well as some research on debugging approaches and on modeling them ([Katz and Anderson 1987], [Vessey 1986] and [Ducassé and Emde 1988], for example). According to all of them, the debugging process always involves the following tasks (not necessarily in this exact granularity or order): reading and understanding program code to locate the possible error source, possibly introducing test outputs, gaining enough understanding of the program to find a solution for the problem, and testing the solution.[2] These tasks become significantly harder when the program to fix was written by someone else or when a long time has passed since the programmer last looked at the code he wrote.

A bug can also be a deviation of a program from its specification, which does not necessarily lead to an error. So it is also important that a programmer knows the actual intention of the program in order to fix this type of bug. Ducassé and Emde classify some of the knowledge that helps in the debugging process, including knowledge of the intended program, knowledge of the actual program, an understanding of the programming language, general programming expertise, knowledge of the application domain, knowledge of bugs and knowledge of debugging methods [Ducassé and Emde 1988].

2.3 Documentation and APIs

2.3.1 Documentation of Software Systems

Before being able to change a program, one must first understand it, or at least the part needing change; in other words, one must gain knowledge about it. Schneidewind says that “it is a major detective operation to find out how the program works, and each attempt to change it sets off mysterious bugs from the tangled undergrowth of unstructured code”, [Schneidewind 1987]. Because a good understanding of a program is extremely important for maintenance and debugging tasks, it seems wise to take a look at the documentation aspect of software systems.

Understanding a program or relevant parts of it before being able to make changes takes a significant amount of the total software maintenance time, especially if the program was written by someone else. [Standish 1984] claims it to be 50-90% of total maintenance cost, although a more recent study by Tjortjis estimates about 30% of maintenance time was devoted to program comprehension [Tjortjis and Layzell 2001][3]. There are different types of documentation artifacts, like class diagrams, flow charts, data dictionaries, glossaries, requirements documents, and more. But in many cases, the only documentation available is the program code and possibly contained comments.

So, whenever programmers need to change an existing program, “the automated extraction of design documentation from the source code of a legacy system is often the only reliable description of what the software system is doing”, [Buss and Henshaw 1992]. This is true not only for the automated extraction. Often programmers have to manually read or at least quickly scan huge parts of the code to understand what is going on. Sousa and Moreira conducted a field study and concluded that among the three biggest problems related to the software maintenance process is the lack of documentation of applications [Sousa 1998], leading to the necessity of reading the code. Similar results concerning the primary usage of code as documentation have been reported by [Singer et al. 1997], [de Souza et al. 2005] and [Das et al. 2007].

It should be mentioned that the decision to use source code as a base for program understanding was not always the first choice, but sometimes rather a last resort due to lack of other documentation artifacts. On the other hand, the study by [de Souza et al. 2005] states that source code was always the most used artifact, no matter if other documentation artifacts existed. This contradicts the notion that code is a last resort documentation. Furthermore, there are many reasons for other types of documentation to be either completely absent or outdated in comparison to the current code base. For example, other artifacts are seldom updated along with maintenance in the code to reflect these changes (even comments or annotations in the code tend to “age”), sometimes due to lack of time, budget or motivation.

All these problems are far beyond the scope of this work, which from a documentation view focuses exclusively on the documentation value of the source code. First, source code will always be present even when all other artifacts are missing or outdated, and second, it can be assumed that building a cognitive model from code is easier for a programmer. After all, it is what all programmers are used to seeing regularly. As an interesting side note giving a practitioner’s view on the pros and cons of documenting software, there is a growing community of followers of the so-called “Clean Code” initiative. The “Clean Code” approach originated with Robert C. Martin’s book [Martin 2009]. Basically, he proposes that good source code should document and explain itself, without the need for many comments. Source code should be written so that a reader can read it like a book, with descriptive method and variable names that clearly state their purpose and function.

2.3.2 APIs and Application of their Design Principles in General Programming

Assuming that a programmer only has the available source code as documentation, he needs to make do with what he can get. Depending on the circumstances, it makes a difference whether he reuses components from the outside or simply jumps into an existing program and makes changes there. If code has been written to be used in different contexts and is in itself a finished component, the part of the code that can be used from the outside is commonly known as the API (Application Programming Interface). An API generally consists of different classes with different methods, along with possible documentation. These need to be public so that they can be used from outside the component. Put another way, they are the adapters that should be used to plug the component into another program. The component’s internals are usually hidden behind the API classes and methods.

There are many reusable APIs available; Java, for example, ships with its own set of class libraries that provide functionality ranging from printing to a console window to reading and writing files on disk. Most current programs make smaller or larger use of APIs. In this work, programmers did not have to use any specific third-party API, but only the classes given to them, to which they had full access. This is generally not considered API usage, although the principles of good API design still apply here. An important rule is given by Henning in [Henning 2007]: “APIs should be designed from the perspective of the caller”. He also makes a very important statement which neatly summarizes what APIs are all about: “Even though we tend to think of APIs as machine interfaces, they are not: they are human-machine interfaces”, [Henning 2007]. Disregarding for a moment the fact that having complete access to a program’s code is not exactly what one would call API use, the same principles should nevertheless apply to all parts of program code in general. It is reasonable to assume that a programmer who has to fix a bug in existing code is often still at a loss when trying to understand many functions just by looking at the interface of a class.

2.4 Type Systems

Broken down, the essence of a type system is that it constrains the use of variables and other statements in a program by forcing them to adhere to a certain type (like containing a text, or a number, but not both). Cardelli and Wegner use an interesting metaphor to describe the fundamental purpose of a type system: “A type may be viewed as a set of clothes (or a suit of armor) that protects an underlying untyped representation from arbitrary or unintended use.” [Cardelli and Wegner 1985].

Two types of type systems are common, the static and the dynamic type system. The difference between a dynamic and static type system is the time at which the type of an object is actually checked. A programming language implementing a static type system (like Java, C++ or C# to name just a few popular ones) usually has a type checker that can tell the developer if there are any errors in the program based on the static information in the written code. The code does not need to be executed to find these errors. For example if the programmer tries to put an object of type Ship into a variable of type Car the compiler can detect this error and tell the programmer. Usually this means a program cannot be run until the compiler can detect no more type errors based on static information. A dynamic type system usually does this check only at runtime. This means the program will run, but as soon as the program tries to tell the Car object to “SetSail” (a method only a ship would have), the program terminates with a runtime type error.
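The Ship and Car example from the paragraph above can be sketched in Java as follows. The class names Car and Ship and the method setSail follow the text; the class TypeCheckSketch, the method drive and the helper callByName are hypothetical names introduced only for this illustration. Because Java itself is statically typed, the dynamic case is merely approximated here by a reflective method lookup, which postpones the check until the program runs. It is a minimal sketch, not a description of how a dynamically typed language is implemented.

```java
import java.lang.reflect.Method;

public class TypeCheckSketch {

    public static class Car {
        public String drive() { return "driving"; }
    }

    public static class Ship {
        public String setSail() { return "sailing"; }
    }

    // Approximates a dynamically checked call (hypothetical helper): the
    // method is looked up by name at runtime, so a wrong receiver type is
    // only noticed when this code actually runs.
    public static String callByName(Object receiver, String methodName) {
        try {
            Method method = receiver.getClass().getMethod(methodName);
            return (String) method.invoke(receiver);
        } catch (NoSuchMethodException e) {
            return "runtime type error: " + receiver.getClass().getSimpleName()
                    + " has no method " + methodName;
        } catch (ReflectiveOperationException e) {
            return "runtime error: " + e;
        }
    }

    public static void main(String[] args) {
        // Static check: the following line is rejected by the compiler,
        // so a program containing it could not even be started.
        // Car car = new Ship();   // error: incompatible types

        Object vehicle = new Car();
        // prints "driving"
        System.out.println(callByName(vehicle, "drive"));
        // prints "runtime type error: Car has no method setSail"
        System.out.println(callByName(vehicle, "setSail"));
    }
}
```

The commented-out assignment is caught before execution, whereas the mistaken setSail call on a Car only surfaces when that line is reached at runtime.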

So the main difference between the two type systems is the time at which type constraints are checked. Dynamically typed languages are sometimes mistaken for untyped (typeless) languages, which is wrong: they do use types, but they do not enforce them statically and instead perform the type check at runtime.

[Listing not included in this excerpt]

Listing 2-1: Examples for variable declarations in a statically typed language

[Listing not included in this excerpt]

Listing 2-2: Examples for variable declarations in a dynamically typed language

The first of the above code snippets shows some variable declarations, and the values/objects assigned to them, in a statically typed language. The number variable is declared as type int, which means it can store whole numbers. The last line would result in an error during compilation, because the type checker can see that the programmer is trying to put a text into a variable of type int and reports this. It thus prevents a possible error during program execution. In the second snippet, the information about the variables’ types is omitted. It is entirely possible to put a number into a variable and immediately afterwards replace its contents with a text. This could lead to a runtime error (for example, if the programmer mistakenly tries to multiply two variables that contain texts).
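Since the two listings are not reproduced in this excerpt, the following Java sketch illustrates the same point under stated assumptions. It is not the original listing: the class name DeclarationSketch and the method multiply are invented, and a variable of type Object merely stands in for an untyped variable of a dynamically typed language.

```java
class DeclarationSketch {
    // Multiplies two values whose actual types are only known at runtime.
    static int multiply(Object left, Object right) {
        // Each cast is checked at runtime, similar to a dynamic type check.
        int a = (Integer) left;
        int b = (Integer) right;
        return a * b;
    }

    public static void main(String[] args) {
        int number = 5;        // statically typed: may only hold whole numbers
        String text = "five";
        // number = "five";    // rejected at compile time: incompatible types

        Object value = 5;      // stand-in for an untyped variable
        System.out.println(multiply(value, number));

        value = text;          // the same variable now holds a text
        try {
            multiply(value, number);
        } catch (ClassCastException e) {
            System.out.println("runtime type error: a text cannot be multiplied");
        }
    }
}
```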

Both systems have their intrinsic advantages and disadvantages, some of which are discussed here from the viewpoint of static type systems.

Advantages of a static type system (taken from [Cardelli 1997] and [Pierce 2002])

- A static type system prevents the programmer from making mundane type-related errors by enforcing type discipline. (Cardelli page 6 and Pierce pages 4-5)
- Because static type information is available, the type checker can detect many type-related errors (such as calling a method on the wrong type) during compilation and thus reduce the number of runtime type errors. (Cardelli page 6 and Pierce pages 4-5)
- As a further result of the reduced number of type errors, a static type system also reduces security risks, e.g. by preventing harmful type conversions. (Cardelli page 6 and Pierce pages 6-8)
- A static type system can provide the reader of code with implicit documentation. Because it enforces type declarations for variables, method parameters and return types, it makes the code more self-documenting. (Pierce page 5)
- A type system may enable certain forms of optimization by the compiler or the runtime environment, because type casts and runtime checks become unnecessary in certain situations. It can thus make the language more efficient. (Pierce page 8)

Disadvantages of a static type system (all taken from [Tratt 2009][4] pages 7-10)

- Restrictions on the range of possible applications: a type system can be overly restrictive and sometimes forces the programmer to work around it.
- Limitations on the degree of possible modification at runtime: statically typed programs can usually be changed at runtime only through complex reflection operations.
- Obstacles to simple changes or additions: modifications that would be easy to implement under a dynamic type system can be difficult under a static one, because all dependencies always have to be type correct.

In addition to the disadvantages taken from Tratt, there is also the general notion of documentation redundancy introduced through the use of static type systems, leading to very verbose code. Simply consider the following example to see the point:

[Listing not included in this excerpt]

Listing 2-3: Redundancy of information through static type system
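The original listing is not reproduced in this excerpt. As a hedged illustration of the kind of redundancy meant, the following Java sketch spells out the same type information several times; the class RedundancySketch, the method buildScores and the sample data are invented for this purpose.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RedundancySketch {
    static Map<String, List<Integer>> buildScores() {
        // The full type appears on both sides of the assignment.
        Map<String, List<Integer>> scores = new HashMap<String, List<Integer>>();
        List<Integer> values = new ArrayList<Integer>();
        values.add(42);
        scores.put("alice", values);
        return scores;
    }

    public static void main(String[] args) {
        // A dynamically typed language such as Groovy can express the same data
        // without any type annotations, e.g.: def scores = [alice: [42]]
        System.out.println(buildScores());
    }
}
```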

In the end, both sides have good arguments that are logically coherent. Nonetheless, it remains a conflict of ideologies as long as no reliable results from multiple studies or other scientific methods are available. This experiment focuses mainly on the documentation and error prevention aspect of type systems when trying to shed some light on their advantages and disadvantages.

2.5 Empirical Research in Software Engineering

2.5.1 On Empirical Research

By “research”, this work refers to the rigorous scientific research methods that are usually applied in the natural and some of the social sciences and have matured over hundreds of years. Normally, this type of research is driven by the desire of the researcher to answer a question, possibly after observing some condition in reality and then formulating a theory (an explanation) about the nature of this condition. This can include how or why it might have occurred or what it is made of, among other things. Anything occurring in nature or anywhere else can be the subject of such a theory.

Next, a hypothesis that predicts the condition, or aspects of it, is formulated based on the theory. This hypothesis can be rejected or hardened (it can never be proven) by collecting data relevant to it and analyzing them. If the data confirm the hypothesis, the theory seems more sound; if they contradict the hypothesis, it is considered falsified and has to be rejected (although a new theory based on changed assumptions can be created afterwards). Data collection is commonly done using experiments or other methods of empirical research. Although this approach dates back to Sir Francis Bacon[5] and others who have modified the ideas over time, Karl Popper [Popper 2008], a famous philosopher of science, is quoted most frequently in this context. Popper believed that experimental observation is the key to scientific discovery and proposed that hypotheses should be falsifiable.

The method of scientific research described above, which is concerned with creating theories and hardening or falsifying them using experiments or observations, is nowadays commonly called empirical research. The term empiricism is derived from a Greek word for “based on experience”. Empirical research uses different methods, which also differ between the sciences that apply them, e.g. the natural sciences, psychology, social science and medicine. Each uses methods and approaches that have proven to produce reliable and valid results in its specific area. Validity means that an instrument really measures what it was made to measure, and reliability means that it measures this consistently across different conditions (e.g. when a measurement is repeated while all other variables are assumed to be similar, the two results should be similar).

2.5.2 Controlled Experiments

Although whole books are dedicated to the art of designing experiments and studies, only a short summary of controlled experiments is presented here. More information can be found in [Prechelt 2001], which is a good introduction to experimentation for the software engineering discipline.[6] A summary of further methods can also be found in this author’s own Bachelor thesis [Kleinschmager 2009].

The method of research used in this work is a controlled experiment. Such experiments try to rigorously control the experiment conditions by keeping as many factors constant as possible, while deliberately manipulating only one or a few experiment variables. Conducting a controlled experiment can be very cumbersome and difficult, because planning and implementation take a lot of time, consideration and attention to detail.

There are three types of variables in controlled experiments (not to be confused with programming variables): independent variables, dependent variables, and the sum of all other variables, sometimes also called “noise” or unsystematic variation. The independent variable is manipulated by the experimenter, and the impact of the manipulation is then measured using the dependent variables (which are called dependent because they depend on the independent variable). Variation resulting from manipulation of the independent variables is called systematic variation. An example would be the time an athlete needs until his heartbeat reaches a certain limit (the dependent variable) depending on what kind of exercise he has to do (the independent variable).

Controlled experiments with humans are especially tricky to design because the human factor introduces a practically unlimited amount of unsystematic variation (for example, the athlete might have slept badly the night before the first exercise, but had a very refreshing sleep the night before the second). Unsystematic variation or “noise” can seriously harm the usefulness of any results, and therefore many measures need to be taken in order to reduce it as much as possible. As will be explained later in more detail, one of these measures is to design experiments in which all participants take part in all conditions, in a random or balanced order. This makes it easier to calculate the systematic variation using statistical methods.

All in all, controlled experiments have a high validity and can easily be reproduced many times (when following the exact setup and implementation), producing reliable and comparable results. Their biggest disadvantage is their large cost in time and work for preparation, buildup and evaluation.

2.5.3 Current State of Empirical Research in Software Engineering

Software systems today are getting more and more complex, and more and more expensive to develop. Their impact on everyday life is immense. Cars, planes, medical equipment, computers for financial transactions and almost infinitely more examples of machines or devices depend on software.

Under normal circumstances, one might think that the creation and maintenance of software would be a well-researched and perfected field of work. Sadly, in software engineering the situation is quite the contrary. There is a huge deficit of research based on experiments and falsifiable hypotheses. For the most part, the current state of the art in the software sciences lacks scientific method, which is inadequate for a field that claims to do science. In 1976, Boehm defined software engineering as: “The practical application of scientific knowledge in the design and construction of computer programs and the associated documentation required to develop, operate, and maintain them”, [Boehm 1976]. Intuitively, this definition still fits quite well today, although Boehm leaves open what exactly he means by scientific knowledge. In 1991, Basili and Selby wrote that the “immaturity of the field is reflected by the fact that most of its technologies have not yet been analyzed to determine their effects on quality and productivity. Moreover, when these analyses have occurred the resulting guidance is not quantitative but only ethereal.” [Basili and Selby 1991].

In 1991, software engineering was still young and in the early stages of its evolution, so it might be acceptable to say that such a young field had yet to find its scientific base. But some years later, apparently not much had changed. Lukowicz et al. conducted a study on research articles [Lukowicz et al. 1994] and found that, of over 400 articles, only a fraction included experimental validation (and the quality of the experimental setup and analysis in the other articles is not mentioned). In 1996, Basili again criticized the lack of experimentation and learning, although a few studies had already been conducted. He says that there should be a “cycle of model building, experimentation and learning. We cannot rely solely on observation followed by logical thought” [Basili 1996].

The current state is that observation is still mainly followed by logical thought and arguments towards generalization, maybe model building. In some cases, even a field study is conducted, which is commendable, but hardly sufficient for science.

There are many arguments and excuses for not doing experiments in the software sciences. In 1997, Tichy summarized some of them [Tichy 1997]. Snelting [Snelting 1998] even accuses software scientists of applying constructivism and pleads for a more rigorous methodological research approach in software engineering, but also sees some improvement some years later [Snelting 2001]. So there is evidently a lot of arguing about which direction the software sciences should take. Both sides do have reasonable arguments (some good examples are [Denning 2005] and [Génova 2010]).

All this does not mean that experiments are the holy grail of science, but the status quo should be that no model or theory can hold or be generalized without sound experimental validation. Free speculation and experimentation should work hand in hand, as is demanded in [Génova 2010], even though its author argues strongly towards a more speculative approach. The human factor in software engineering in particular has been neglected for many years, as Hanenberg rightfully criticizes [Hanenberg 2010c]. It is great if people come up with new techniques, models and approaches, as long as they are usable and systematic studies with humans show that they help people develop better software, because it is the humans, in this case especially the developers, who have to put all things together and actually implement the software, no matter how great the tool support and theoretical foundation are. Currently, techniques, models and approaches develop (and often vanish) much more quickly than they can be validated in scientific experiments.

3. Related Work

This chapter summarizes the results of the few preceding studies available on static type systems. All of the experiments presented here use humans as subjects and are specifically targeted at comparing static and dynamic type systems. There are other works which, for example, run a type checker on programs previously written in a dynamically typed language to check whether these programs contain possible errors [Furr et al. 2009]. Some analyze a large base of open source projects and try to measure programmer productivity based on the language used, like [Delorey et al. 2007]. Others compare the general performance of developers on a task using very different programming languages and/or the runtime performance of the final program [Hudak and Jones 1994], [Gat 2000] and [Prechelt 2000]. But as this work is targeted at evaluating the impact on the performance of developers, only papers that implement a comparable design with humans as participants are discussed.

3.1 Gannon (1977)

The oldest study that could be found which directly experimented on the impact of type systems was conducted in 1977 by Gannon [Gannon 1977]. It used a simple repeated-measures cross-over design with 38 graduate and undergraduate students who had to program the solution to a problem twice: once with a statically typed language and once with a dynamically typed language. Both languages were designed specifically for the experiment. The participants were split into two groups, each group starting with a different language and then using the other the second time. The hypothesis was that the reliability of software was enhanced by a language if the errors made by the participants in that language were less numerous than those made in the other language. Gannon did not clearly indicate in the hypothesis which of the two languages he favored.

The methodology has the advantage of the repeated-measures design, which makes a within-subject analysis possible. However, there certainly would have been a carry-over effect, because the same task had to be solved both times, so the participants might already have had a “solution roadmap” in their minds after finishing with the first language. Gannon’s measure against this was to rely on features of the two languages that were explicitly altered: the dynamically typed language did not include any built-in string functionality, which the participants had to build themselves. It can be argued that this considerably threatens the experiment’s internal validity, because it is not clear whether the results measure the impact of the type system or the impact of the missing string operations in the dynamically typed language.

Gannon reported that the statically typed language increased programming reliability and that inexperienced students benefited more from it. These results are based primarily on the number of runs in the environment and the number of error occurrences. Only some of the results are statistically significant. In general, the results are questionable, because the participants had quite a few problems with the missing string functionality in the dynamically typed language.

3.2 Prechelt and Tichy (1998)

In 1998, Prechelt and Tichy conducted a similar experiment using two different variations of the C programming language [Prechelt and Tichy 1998], one employing static type checking and one employing dynamic type checking. 40 participants (most of them PhD students in computer science) were first split into two main groups, one working with the type checker and one without. The design was slightly more complicated than Gannon’s: all participants had to solve both parts of the experiment, each with a different language. Subgroups were assigned which differed in both the order of the tasks (A then B or vice versa) and the order of the languages (first with the type checker, then without, or vice versa). In the end, all participants had been split among four roughly equal-sized groups.
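Such a balanced split into four groups can be sketched as follows. This is a generic illustration of counterbalancing, not the procedure Prechelt and Tichy actually used; the class name, the condition labels, the seed and the round-robin assignment are all assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class CounterbalanceSketch {
    // Hypothetical labels: task order crossed with the order of the type checker.
    static final String[] CONDITIONS = {
        "A then B, checker first", "A then B, checker second",
        "B then A, checker first", "B then A, checker second"
    };

    // Shuffles participant ids and deals them round-robin into the four groups.
    static List<List<Integer>> assign(int participants, long seed) {
        List<Integer> ids = new ArrayList<>();
        for (int id = 1; id <= participants; id++) {
            ids.add(id);
        }
        Collections.shuffle(ids, new Random(seed));

        List<List<Integer>> groups = new ArrayList<>();
        for (int g = 0; g < CONDITIONS.length; g++) {
            groups.add(new ArrayList<>());
        }
        for (int i = 0; i < ids.size(); i++) {
            groups.get(i % CONDITIONS.length).add(ids.get(i));
        }
        return groups;
    }

    public static void main(String[] args) {
        List<List<Integer>> groups = assign(40, 1L);
        for (int g = 0; g < groups.size(); g++) {
            System.out.println(CONDITIONS[g] + ": " + groups.get(g));
        }
    }
}
```

Dealing shuffled ids round-robin keeps the group sizes within one participant of each other, which matches the "roughly equal-sized" groups described above.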

The tasks were supposed to be short and modestly complex. The hypotheses were that type checking increases interface use productivity, reduces the number of defects and also reduces the defect lifetime. The authors made sure that the programmers were familiar with the language, so that the majority of problems would result from using the rather complex library they were given to solve the tasks. The dependent variables were the numbers of defects introduced, changed or removed with each program version, sorted into different defect categories. In addition, the number of compilation cycles and the time until delivery were measured.

The results strengthened all their hypotheses, and they concluded that type checking increases productivity and reduces both the number of defects and the time they stay in the program. This is based primarily on the number of defects in the delivered programs as well as the reduced defect lifetime that was achieved through the static type system. While the results seem sound from a methodological point of view, there are possible sources of strong unsystematic variation. The authors discuss some of them, including the learning effect they definitely measured. But (without having the exact task implementations to look at) it seems, judging from their rough description of the two tasks, that the difference between the tasks could have been a strong source of unsystematic variation. This is something they did not mention. Nevertheless, it is still one of the few methodologically sound experiments conducted on type systems and deserves approval, especially because it was only the second experiment on the topic.

3.3 Daly, Sazawal and Foster (2009)

Many years later, in 2009, Daly, Sazawal and Foster [Daly et al. 2009] conducted a small study using the scripting language Ruby, which is dynamically typed, and Diamondback Ruby (DRuby), a static type system for Ruby. They had four participants and their design was also a repeated-measures cross-over design. All participants were said to be familiar with Ruby and were recruited from a user group of practitioners. In contrast to the other studies mentioned here, they only ran a qualitative analysis on the results without any statistical measures (which would not have yielded any useful results with only four participants anyway).

They could not find any specific advantage of the type system. Apart from some threats to validity that the authors already mentioned themselves, there again is the difference between the tasks, which might have introduced some random effect, even if the authors claim that they were of approximately similar complexity (one was a simplified Sudoku solver and one a maze solver). What is more, the first participant was not given starter code and therefore completed only a fraction of the total work of the tasks. He also did not have internet access, which the experimenters recognized as a mistake and corrected for the next three participants. Both are very possible sources of unwanted effects.

Although the results of the study are more or less exploratory and only qualitative, the authors derived an interesting hypothesis from their analysis: they reason that in small-scale applications like the one from the experiment, developers can compensate for the lack of a type system by relying on their own memory or by giving meaningful names to variables and other code artifacts. This hypothesis would be worth testing in a larger experiment.

3.4 Hanenberg (2010)

Another, much larger experiment was conducted in 2009 by Hanenberg [Hanenberg 2010b], [Hanenberg 2010a]. In his experiment, a total of 49 participants (undergraduate students) had to solve two tasks, writing a scanner and a parser, in a language called Purity, which was written specifically for the experiment and came in a statically and a dynamically typed version. The hypothesis was that the statically typed language would have a positive impact on the development time. He used an independent (between-subject) design in which every participant solved the two tasks only once, with either the statically or the dynamically typed language. The reasoning behind this design was that participants would probably be biased when using the same language with a different type system after having already used it with the other one.

Compared to the previously mentioned experiments, where the total experiment time was about four hours per participant, the participants here had 27 hours of working time (about 45 hours when including teaching time). Hanenberg also took a different approach by making the time a fixed factor: the 27 hours were the maximum time given to the participants, in contrast to other experiments where participants usually had as much time as needed. The experiment was also designed in such a way that it was very hard to actually fulfill all requirements in the provided time frame.

The results were that the static type system never had a significantly positive impact on the result and in one case of the tests even produced a significantly negative impact, even though this did not lead to overall significantly negative results for the type system. Concerning threats to validity, Hanenberg discusses quite a few of them in detail. But it should be noted that a huge threat to the validity of the results, which Hanenberg already mentioned himself, is the amount of unsystematic variation that possibly lurks in the independent design of the experiment. Because no within-subject comparison is possible, any differences between the two type systems’ performance could also have been due to differences in participant quality and many other factors which could have interfered with measuring the intended effect.

3.5 Steinberg, Mayer, Stuchlik and Hanenberg - A running Experiment series

Next, it should be mentioned that a set of experiments was conducted at this institute and they form an experiment series that focuses on the comparison of static and dynamic type systems. Some of them are mentioned with their description and some preliminary results in a summarizing report [Hanenberg 2011]. This work can be considered a part of the series, too.

3.5.1 Steinberg (2011)

One of the still unpublished experiments (the Master thesis by Steinberg [Steinberg 2011]) investigated the impact of the type system on debugging for type errors and semantic errors with 30 participants in a repeated-measures cross-over design. It turned out that the type system sped up the fixing of type errors, and no significant difference was discovered for fixing semantic errors. This could be considered a success, although some contradicting results falsified one of the hypotheses, which stated that the farther the code responsible for an error is from the point where the error actually occurs, the greater the fixing time should be. The study might have suffered from a huge learning effect and from other factors, such as a larger influence of the kinds of programming tasks on the unsystematic variation.

3.5.2 Mayer (2011)

The second still unpublished experiment (the Bachelor thesis by Mayer [Mayer 2011]) has the hypothesis that a static type system aids in the use of an undocumented API and shortens the time needed for a task. Again, a two-group repeated-measures design was used, and 27 participants took part. The analysis was not yet finished at the time of this writing, but initial results indicated rejection of the hypothesis for some tasks and confirmation for others. Again, the results of the experiment seem inconclusive.

3.5.3 Stuchlik and Hanenberg (2011)

Although some work is still underway and unpublished, one of the earlier experiments of the series already led to a separate publication [Stuchlik and Hanenberg 2011] and therefore deserves a deeper look. In the experiment, 21 participants (undergraduates) had to solve seven tasks in random order in Java as well as in Groovy. A separate application was used for each language to reduce the learning effect that had decreased the usefulness of the results in other experiments, even though the two applications were structurally equal and only the methods and classes were renamed. One of the assumptions was that the more type casts were needed for a task with a static type system, the larger the difference in time taken between the static and the dynamic type system would be.

Two of the tasks had to be discarded from the analysis because of many comprehension problems with the task descriptions, which resulted in a lot of variation. But the results show that there was a significant positive impact of the dynamic type system for some of the tasks. The authors’ reasoning is that type casts are not a trivial aspect of static type systems and need some intellectual effort on the part of the developer (even though the results of the better Java developers in the study did not show this effect). Additionally, the initial assumption that more type casts also lead to longer development time had to be rejected, leading the authors to reason that for larger tasks type casts do not play as major a role as had been assumed.

One of the study’s problems was that the tasks were rather artificial and forced the use of type casts where some developers would argue they would not have been needed in a real application. The combined analysis of both groups as one set is also rather problematic, but this fact was already mentioned by the authors and only makes up a small part of the analysis. Using the development time as the major dependent variable is also problematic, since software development is usually so much more than just trying to solve a programming task as quickly as possible, but that can be overlooked because the study was designed as a controlled experiment in which as many factors as possible need to be fixed. Also, there are no other objective measures that can be applied to software.

4. The Experiment

Chapter 4 gives an overview of the complete experiment structure along with the specific research question(s) behind its design. The mentioned research question and the hypotheses are explained first, followed by the experiment overview. The overview starts with a short argumentation and some thoughts on the reasons for this experiment design. It also includes some considerations on the use of students as subjects. It then explains the design starting with the questionnaire, the hard- and software environment and last the involved application and exact task categories and descriptions. A short summary of the experiment implementation concludes the chapter.

4.1 The Research Question

In experimenting, a certain question drives the researcher to formulate one or more hypotheses he then strives to test. As already stated, this work is concerned with questions regarding different developer performance with static and dynamic type systems of programming languages. There have been other experiments aiming in the same direction at this institute, which this work builds upon and tries to deepen the insight on the matter. These mentioned experiments are described in [Steinberg 2011] and [Mayer 2011] (or a summary [Hanenberg 2011]) and have already been explained in more detail in the related work part. The first work was actually the one that led to the creation of the development and measurement environment that was used in the second one and in this experiment. The following hypotheses are similar to some hypotheses from these earlier works.

One assumption behind the first hypothesis is that a static type system documents code and makes programming tasks easier. The way the “make easier” part was measured in this experiment was through the development time it takes a participant to solve a programming task. This leads to the first hypothesis:

Hypothesis 1:

Participants solve a programming task with an undocumented API faster when using a statically typed language.

Hypothesis 1 was used in both preceding experiments and could be verified in the second (by Mayer). A similar conclusion could be drawn in the first, although only for tasks involving debugging. This is why the present experiment reused the hypothesis: in the hope of gaining more insight and verifying the results from the second experiment.

The second assumption is that a type system makes debugging an application easier, but only in certain cases. In other cases, it can be assumed that a static type system does not give any significant advantage for debugging. This leads to two hypotheses and separate measurements.

Hypothesis 2-1:

The further away an actual error is from the bug that is its source, the longer it takes for a participant to fix it when using a dynamically typed language.

Hypothesis 2-2:

It takes the same time to find and fix a semantic programming error no matter whether the language used is statically or dynamically typed.

Hypothesis 2-1 is specifically targeted at dynamically typed languages, as the kind of error it describes leads to a compile-time error in statically typed languages. So the notion of “distance” between a bug and the error that results from it does not apply in a static type system. A similar hypothesis could only be partially verified in Steinberg’s experiment.

Hypothesis 2-2, however, aims in the other direction by assuming that for semantic errors it should make no difference whether the error is sought in a statically or a dynamically typed language. Having type information should not significantly aid in finding these errors. This was also verified in the first experiment by Steinberg.

4.2 Experiment Overview

4.2.1 Initial Considerations

One thing that needs mentioning about experiment design is the fact that some experiments in computer science use a problematic design (one example can be found in [Wohlin 2000] and some more are summarized in [Juristo and Moreno 2001]). In experiments, the goal is to test an assumption or theory by measuring a certain effect when changing one factor of the experiment and keeping everything else constant (so as to not have any other side effects on the results).

To demonstrate the problem of unwanted side effects, the following fictitious design for an experiment similar to this one should be considered: the participants are split into two groups, one group that solves only the Java tasks and one group that solves only the Groovy tasks. In this experiment, the (simplified) goal is to measure developer performance depending on the language used. The language used is the independent variable, which is manipulated by the experiment designer. The important point here is that each group completes only either the Java or the Groovy part, not both. The problem with such a design (called independent design or independent-measures design) is that without being able to compare a participant’s performance in both parts, it cannot be said whether the results of two participants from separate groups differ because of the programming language they used (systematic variation) or because one was simply a really good programmer or had a lucky day while the other was a bad programmer or had a bad day, or because of any other random factor (unsystematic variation). To demonstrate this difference, the appendix contains an analysis of this experiment that treats the results as if it had used the independent-measures design just described (A.3).

It should be mentioned that some of these experiments very probably ended up with a threatened validity because no one or at least very few people in the software engineering area actually have any experience in designing and analyzing experiments. By criticizing these designs, not the effort of conducting an experiment is criticized (an effort which is commendable and should be given credit, considering the circumstances), but the validity of some of the results. Implementing an experiment is a hard piece of work and all experimenters in software engineering are still in the middle of a learning process. This work is no exception to the rule. Even if it may benefit from mistakes others made, it itself might someday be found faulty in some part. This can never be ruled out and the ultimate goal for all experimenting researchers should be a sound experimental methodology for software engineering.

However, to finally argue towards the actual design of this work’s experiment: there is a method to cope with the problem of unsystematic variation, called repeated-measures design. In this kind of design, participant performance is measured with all values of the independent variable (some say under all treatments/conditions). This means participants need to complete all tasks with both languages. Also, participants are commonly assigned to the groups randomly. Now there are two effects at work in the results: one is the manipulation of the independent variable between both parts, and the other summarizes all kinds of other effects that might influence the second part the participant completes. In general, the manipulation effect should be much stronger. Hence, this kind of design was used in the experiment.
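The difference between the two designs can be illustrated with a small sketch. All task times below are invented purely for illustration and are not data from this experiment; the class and method names are hypothetical as well. The sketch compares a between-group difference (independent measures) with per-participant differences (repeated measures) on the same made-up numbers.

```java
import java.util.Arrays;

/** Illustrative sketch: all times are invented and are not data from the experiment. */
class DesignComparison {

    /** Arithmetic mean of a set of task times (in seconds). */
    static double mean(double[] times) {
        return Arrays.stream(times).average().orElse(Double.NaN);
    }

    /** Independent-measures view: difference between the means of two separate groups. */
    static double betweenGroupDifference(double[] groovyGroup, double[] javaGroup) {
        return mean(groovyGroup) - mean(javaGroup);
    }

    /** Repeated-measures view: each participant's Groovy time minus the same participant's Java time. */
    static double[] withinSubjectDifferences(double[] groovyTimes, double[] javaTimes) {
        if (groovyTimes.length != javaTimes.length) {
            throw new IllegalArgumentException("Each participant needs a time for both parts");
        }
        double[] differences = new double[groovyTimes.length];
        for (int i = 0; i < differences.length; i++) {
            differences[i] = groovyTimes[i] - javaTimes[i];
        }
        return differences;
    }

    public static void main(String[] args) {
        // Four hypothetical participants of very different skill; each one is
        // exactly 100 seconds slower with Groovy than with Java.
        double[] groovyTimes = {1200, 600, 2400, 900};
        double[] javaTimes = {1100, 500, 2300, 800};

        // Independent measures: participants 0 and 1 only did Groovy,
        // participants 2 and 3 only did Java. Skill differences dominate.
        double between = betweenGroupDifference(
                new double[] {groovyTimes[0], groovyTimes[1]},
                new double[] {javaTimes[2], javaTimes[3]});
        System.out.println("Between-group difference: " + between);

        // Repeated measures: every participant is compared with themselves.
        double[] within = withinSubjectDifferences(groovyTimes, javaTimes);
        System.out.println("Within-subject differences: " + Arrays.toString(within));
    }
}
```

In this made-up data set the between-group comparison suggests that Groovy is 650 seconds faster, although every single participant was 100 seconds slower with it. The within-subject differences are unaffected by the skill gap between participants.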

4.2.2 Further Considerations: Studies on Using Students as Subjects

Related work that is not directly connected to what is done in this study but may have an important impact on its validity is research on using students as subjects in programming experiments. Many studies in software engineering use students as experiment subjects/participants, and this work is no exception. This raises questions about the possible impact of using students in these experiments and about how this might influence the interpretation of their results as well as their validity. A general discussion of the problem can be found in [Carver et al. 2003]. Some of the studies are summarized here, although only those with an explicit focus on programming/software development.

In 2000, a first study was conducted by Höst, Regnell and Wohlin [Höst et al. 2000]. Students and professional software developers had to solve non-trivial tasks in which they had to assess the effect of different factors on software development project lead-time. The authors found only minor differences in the conception and no significant differences in correctness between the two groups.

In 2003, Runeson [Runeson 2003] did a study comparing the results of freshmen and graduate students and relating them to results from an industry study. It involved solving a set of programming tasks with growing complexity and two main hypotheses focusing on improvement across the task levels and on general performance. The results were that the improvement between the task levels was similar for all three groups (freshmen, graduate, industry) but that the freshmen needed significantly more time to complete the tasks than the graduate students (no comparison was done with the industry group).

Staron did some research in 2007 to evaluate whether students who are used as subjects in software engineering experiments improve their learning process by participating [Staron 2007]. He used a survey to find out the subjective impact that the students felt the experiment had on them and whether they benefitted from taking part. The results show that students generally perceive the experiments as positive and very useful and that their learning benefitted from them.

It can be concluded that students are valid subjects in experiments, especially when considering one important point: what matters is the within-subject data of the experiment, not necessarily the between-subject comparison. Even if students solve tasks more slowly than professionals, it is reasonable to assume that they do so consistently, meaning both parts will be solved more slowly by a student. This should not have any impact on the within-subject analysis.

4.2.3 Design of the Experiment

After deciding on the design of the overall experiment (the repeated-measures approach, as explained above), the tasks were distributed according to the design. All in all, nine tasks were part of the controlled experiment. These tasks had to be solved in both languages, so that nine tasks formed one part in one programming language and the other nine formed the corresponding part in the other programming language. This also means the main independent variable was the programming language, its two values being Groovy and Java (representative of dynamic and static type systems). Again, the experiment’s dependent and independent variables should not be confused with the programming variables in the programs. Although the tasks were similar for both languages, the complete program was modified by renaming all code artifacts for the second language to obfuscate this. In addition, to make the application completely undocumented, variable names were modified so that they did not match the types they contained (more on this approach in the task descriptions). The variable names in the programs were still chosen to represent a useful domain aspect, but did not point to their exact contained types. The dependent experiment variable that was used to measure performance was the time the participants needed for the tasks. In addition to the regular tasks, a small warm-up task was provided for each programming language, so that the participants could familiarize themselves with the experiment environment as well as the task descriptions and the programming language.

The tasks were numbered from 1 to 9 (the Java part) and 10 to 18 (the Groovy part). This numbering system was used primarily to give participants the impression that they are really working on different tasks, not the same in both languages. Every participant had to solve both experiment parts and fill out a questionnaire in order to have completed the experiment (more information on the questionnaire results and the participant demography in the appendix under A.2.6). The order in which the participants had to fulfill the tasks was based on the two parts chosen in a randomly alternating order, so that one group of all participants started with the tasks in Java, and another with the Groovy tasks. Inside the experiment parts, the tasks had to be solved strictly in ascending order: A participant starting with Groovy therefore solved first tasks 10 through 18, and then 1 through 9. The blocking can be summarized in a simple table:

[Table not included in this excerpt]

Table 4-1: Experiment Blocking Design

It is important to note that in most of this work the tasks will be referred to by the numbers 1 through 9, not 1 through 18, as the latter numbering was only introduced to obfuscate the similarities of the two parts.
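The task ordering and the two numbering schemes described above can be sketched as follows. This is only an illustration: the class and method names are hypothetical, and the rule that even-indexed participants start with Java is an assumption made here to model the alternating assignment.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of the blocking described above; names are hypothetical. */
class BlockingDesign {

    static final int TASKS_PER_PART = 9;

    /**
     * Returns the order of task numbers for one participant.
     * Assumption: participants with an even index start with Java (tasks 1..9),
     * participants with an odd index start with Groovy (tasks 10..18).
     */
    static List<Integer> taskOrder(int participantIndex) {
        boolean javaFirst = participantIndex % 2 == 0;
        int firstStart = javaFirst ? 1 : TASKS_PER_PART + 1;
        int secondStart = javaFirst ? TASKS_PER_PART + 1 : 1;

        List<Integer> order = new ArrayList<>();
        // Within each part, tasks are solved strictly in ascending order.
        for (int i = 0; i < TASKS_PER_PART; i++) {
            order.add(firstStart + i);
        }
        for (int i = 0; i < TASKS_PER_PART; i++) {
            order.add(secondStart + i);
        }
        return order;
    }

    /** Maps the participant-facing numbers 1..18 onto the task numbers 1..9 used in the analysis. */
    static int canonicalTask(int taskNumber) {
        if (taskNumber < 1 || taskNumber > 2 * TASKS_PER_PART) {
            throw new IllegalArgumentException("Task number must be between 1 and 18");
        }
        return (taskNumber - 1) % TASKS_PER_PART + 1;
    }

    public static void main(String[] args) {
        System.out.println("Java starter:   " + taskOrder(0));
        System.out.println("Groovy starter: " + taskOrder(1));
        System.out.println("Task 13 corresponds to task " + canonicalTask(13));
    }
}
```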

A learning effect was anticipated in the design, even though measures were taken to minimize it (such as renaming everything in the application; more about the environment and the task design later). The similar nature of the tasks in both languages was nevertheless bound to produce some kind of learning effect. That is why this kind of within-subject design was chosen: to have two groups to compare and to detect the learning effect. The impact of this learning effect and its interaction with the experiment design is depicted in Figure 4-1, which was taken from the mentioned related work by Stuchlik and Hanenberg.

It should be a reasonable assumption that the Java starters group would benefit from a learning effect when solving the Groovy part (keeping in mind that a definite learning effect was to be expected). But the additional effort needed to solve the tasks with Groovy (recall hypothesis 1) should either cancel out that learning effect (resulting in approximately similar times for both Java and Groovy), or the learning effect should be the weaker of the two (so that the Java starters would still be slightly slower with Groovy after Java). For the group starting with Groovy, however, both the learning effect and the positive effect of the type system should help when solving the Java (their second) part. Here the two effects should add up and result in a clear advantage for the statically typed language Java.

[Figure not included in this excerpt]

Figure 4-1: Assumed occurrence of learning effect in experiment design (Figure taken from [Stuchlik and Hanenberg 2011])
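The assumed interplay of the two effects can be written down as a simple additive model. This is a simplification introduced here for illustration, not a model taken from the experiment: it assumes a fixed extra effort for the dynamically typed language and a fixed time saving on whichever part is solved second. All numbers and names are hypothetical.

```java
/** Illustrative additive model; all numbers are invented, not measured. */
class LearningEffectModel {

    /**
     * Expected time for one experiment part under a simple additive model.
     *
     * @param groovy         true for the Groovy part, false for the Java part
     * @param secondPart     true if this part is solved second
     * @param baseTime       time for the Java part when it is solved first
     * @param typeEffect     extra time assumed for the dynamically typed language
     * @param learningEffect time saved on whichever part is solved second
     */
    static double expectedTime(boolean groovy, boolean secondPart,
                               double baseTime, double typeEffect, double learningEffect) {
        double time = baseTime;
        if (groovy) {
            time += typeEffect;
        }
        if (secondPart) {
            time -= learningEffect;
        }
        return time;
    }

    /** Groovy time minus Java time for one group, depending on its starting language. */
    static double groovyMinusJava(boolean startedWithJava,
                                  double baseTime, double typeEffect, double learningEffect) {
        double groovy = expectedTime(true, startedWithJava, baseTime, typeEffect, learningEffect);
        double java = expectedTime(false, !startedWithJava, baseTime, typeEffect, learningEffect);
        return groovy - java;
    }

    public static void main(String[] args) {
        double baseTime = 1000;      // hypothetical seconds
        double typeEffect = 200;     // hypothetical cost of missing type information
        double learningEffect = 150; // hypothetical gain on the second part

        // Java starters: the learning effect works against the type system effect.
        System.out.println("Java starters:   " + groovyMinusJava(true, baseTime, typeEffect, learningEffect));
        // Groovy starters: both effects favor Java, so they add up.
        System.out.println("Groovy starters: " + groovyMinusJava(false, baseTime, typeEffect, learningEffect));
    }
}
```

Under this model the difference between Groovy and Java is the type effect minus the learning effect for Java starters, and the sum of both for Groovy starters, which mirrors the expectation described above.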

The participants were given specific instructions during their introduction to the experiment, which included an explanation of the environment and editor as well as dos and don’ts in the experiment context. For example, they did not need to write their own classes, only modify or use existing ones, and no native API classes had to be used. They also received a picture of the two applications with some annotations to explain the underlying domain model.

4.3 Questionnaire

The questionnaire that was handed out to all participants consisted of two parts: The first was the Big Five questionnaire, also called the NEO-FFI [Costa 1992], and the second part consisted of a few questions about the participant’s programming experience. The Big Five questionnaire was chosen because it is one of the most popular and widely used psychological tests. It tries to measure five personality dimensions called “Openness to Experience”, “Conscientiousness”, “Extraversion”, “Agreeableness” and “Neuroticism”, each with its own set of questions, which altogether add up to about 60 questions.

Evaluation of the questionnaire data was not planned to be part of the experiment but it made a lot of sense to gather as much additional data as possible for future research and analysis. The NEO-FFI part was included for a possible exploratory analysis of personality traits/types and experiment performance, and the programming experience part was mainly included to serve as additional data for possible grouping and a meta-analysis of participant questionnaires from different experiments. The latter is a continuation of work already done by this author and Hanenberg [Kleinschmager and Hanenberg 2011], where participant questionnaire data was correlated to participants’ performance in the experiments to try to find meaningful connections.

4.4 Hard- and Software Environment

4.4.1 Environment

All participants completed the study on identical Lenovo ThinkPad R60 laptops, which were provided by the university, along with a mouse for every laptop.

The prepared software environment was installed on an 8 GB USB stick. It was an Ubuntu live installation of version 11.04, configured and intended to boot and run only from the stick. The only applications/libraries that were installed apart from the experiment application were XVidCapture (a tool used to record screencasts of the whole experiment for each participant) and the Sun Java Runtime and SDK of version 1.6_25. The videos from the screen logging application were used as a backup in case of problems during the experiment and as a source of information that could answer questions which an analysis of the log files alone could not.

The experiment IDE (integrated development environment) itself was called Emperior, an editor specifically designed for empirical programming experiments (it was originally created by Steinberg for his master thesis [Steinberg 2011], where it is also explained in much more detail). Emperior provides very simple editing features like a search function or syntax highlighting and also logs different types of information into log files which can be analyzed later.

When the participants clicked on either the “Test” or the “Run” button, Unix bash (command-line) scripts and Ant scripts (Ant is a Java-based build tool, [Apache Foundation]) worked behind the scenes. Both were tasked with setting the correct path variables, compiling the application and test projects, and returning possible run output to the console or calling JUnit (a unit testing library for Java, see [JUnit]) to show the test results.

4.4.2 Programming Languages

The two programming languages used were Java and Groovy, although Groovy was used only as a “dynamically typed Java”, meaning that no functions or language features specific to Groovy were used except the dynamic type system. A short summary of both languages is provided here.

Java

The third edition of the Java language specification states that “The Java Programming language is a general-purpose concurrent class-based object-oriented language…” [Jones and Kenward 1989]. In [Gosling and McGilton 1996], the design goals of the Java environment are summarized roughly to make Java “simple, object-oriented, and familiar”, “robust and secure”, “architecture neutral and portable”, “high performance”, “interpreted, threaded and dynamic”. It is statically typed and includes automatic memory management by use of a garbage collector and compiles to an intermediary language commonly called “byte code”, which is then run by the Java Virtual Machine. It can run on different operating systems, is generally considered easy to learn and there are no licensing or other costs connected with using it.

[Listing not included in this excerpt]

Listing 4-1: Simple Java Code Example

Groovy

Groovy first appeared in 2003 as a dynamically typed scripting language based on the Java platform [Strachan 2003] and has been under development by an active community ever since (see [Codehaus]). The fact that it is based on Java made it the ideal candidate for the experiment, as it saved a lot of work not to have to use completely separate environments, syntax and frameworks for the two parts.

Listing 4-2 demonstrates the dynamic nature of Groovy: The def keyword declares a dynamically typed variable, which can be assigned an instance of any type at any given time, no matter what type has been assigned earlier. In addition, for methods without a return value, void can be omitted. Method parameters do not need to be typed, either.

[Listing not included in this excerpt]

Listing 4-2: Simple Groovy Code Example
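Since the original listings are not part of this excerpt, the following Java sketch (which is not the original Listing 4-1 or 4-2) illustrates the contrast from the statically typed side. The class and method names are hypothetical. A variable declared as Object is used here only as a rough Java analogue of a Groovy def variable: it accepts any instance, but unlike in Groovy, type-specific methods can only be called after a check and a cast.

```java
/** Sketch of what Java's static type system enforces; not the original Listing 4-1. */
class TypingContrast {

    /** The declared parameter type guarantees that length() exists. */
    static int lengthOfText(String text) {
        return text.length();
    }

    /** With Object as the declared type, a check and a cast are needed first. */
    static int lengthOfAnything(Object value) {
        if (value instanceof String) {
            return ((String) value).length();
        }
        throw new IllegalArgumentException("Not a String: " + value);
    }

    public static void main(String[] args) {
        String greeting = "hello";
        // greeting = 42;            // rejected by the compiler: incompatible types
        System.out.println(lengthOfText(greeting));

        // Rough analogue of a Groovy "def" variable: any instance can be assigned.
        Object anything = "hello";
        System.out.println(lengthOfAnything(anything));
        anything = 42;               // allowed, since an Integer is also an Object
        // Calling lengthOfAnything(anything) now would fail only at run time.
    }
}
```

In Groovy, the corresponding reassignment of a def variable from a string to a number is accepted as well, and a wrong method call on it would likewise surface only when the program runs.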

4.5 Workspace Applications and Tasks

As already stated in the overview, a total of 18 tasks had to be completed by the participants, nine of which had to be solved in Java (designated 1 through 9) and nine in Groovy (designated 10 through 18). Whether a participant started with the Groovy or the Java tasks was decided in a simple alternating manner. The applications that the participants had to work on will be called workspace applications.

4.5.1 The Java Application - A Labyrinth Game

The original program that served as the base for the participants to use as an API and to complete their tasks on was a small turn-based video game written in Java for this author’s bachelor thesis [Kleinschmager 2009]. Figure 4-2 shows the game window along with some annotations that were given to the participants. In this video game, the player controls a character that has to move through a labyrinth riddled with traps and get from a start to a goal in each level. Most of the actual game concepts, like gaining experience or fighting monsters, were not or at least not fully implemented, although enough was finished to give the impression of a real working application.

The video game was then taken apart and customized for the needs of the experiment. Among these changes was the removal/addition of specific classes or methods from the task workspace as well as a complete rework of certain areas of the application. This was to ensure that participants did not accidentally see or use parts of the application that would be required in later tasks. Thus, for each task, certain new methods or classes were added to the workspace application. This heavily modified labyrinth game was the workspace application for the Java part of the experiment.

[Figure not included in this excerpt]

Figure 4-2: The labyrinth game interface along with some annotations for the participants

4.5.2 The Groovy Application - A simple Mail Viewer

After all tasks had been created for the Java application, the second application of the experiment had to be designed in Groovy. This was achieved by taking the now modified first application and renaming all its classes and methods and other code artifacts so that it would turn into an application with a completely different domain model. In this case, it was reformed into a simple e-mail viewer. This approach ensured that for both the Java and Groovy tasks, what ultimately was edited by the participants consisted of an almost identical structure. The concepts of the game were therefore mapped to concepts contained in a mail viewer. E.g. what was a player in the game moving on a game board was now a cursor on the mail document moving along different tags and content.

Additionally, all type information for variables, return types and method parameters was removed from the application to give the impression of an application written in a dynamically typed language. All error messages were rewritten, task descriptions and explanations modified, so that they would not contain the same wording used in the Java application.

[Figure not included in this excerpt]

Figure 4-3: The mail viewer interface along with some annotations for the participants

4.5.3 Important Changes made to both Parts

Some additional restructuring and obfuscation was done to both applications to minimize the danger of participants easily noticing that the Java and Groovy applications were actually almost the same code, just with different names. As already mentioned above, the variable and method parameter names of both applications were also modified so that they did not specifically point to the types they contained. Most of the time, synonyms and paraphrases were used for the variable names to make the APIs truly undocumented (removing the documentation value of variable names, at least concerning their supposed type). For example, a variable that was to contain an instance of the type LevelType was named “levelKind”, so that it did not specifically point the reader to the required class but still represented a reasonable naming choice in the application domain. One could argue that this kind of renaming might lead to additional noise in the final results and that an even stricter approach would be necessary (such as giving the variables completely meaningless names like “a”, “b”, “x”). But the assumption was that the scenario should provide a degree of realism that could very probably also be encountered in a real application, where variable names do not point to their types but to a domain artifact.
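The effect of this naming scheme can be sketched in Java. Only the names LevelType and levelKind are taken from the text above; the Level class, the label field and everything else are hypothetical stand-ins and not code from the experiment.

```java
/** Sketch of the naming approach; only LevelType and levelKind come from the text. */
class NamingExample {

    /** Hypothetical stand-in for the real class; its content is invented. */
    static final class LevelType {
        final String label;

        LevelType(String label) {
            this.label = label;
        }
    }

    /** Hypothetical class that needs a LevelType instance. */
    static final class Level {
        private final LevelType levelKind;

        // Java: the name "levelKind" does not reveal the type, but the declared
        // parameter type still tells the reader which class is required.
        // An untyped Groovy version would only show the name "levelKind".
        Level(LevelType levelKind) {
            this.levelKind = levelKind;
        }

        String describe() {
            return "Level of kind " + levelKind.label;
        }
    }

    public static void main(String[] args) {
        Level level = new Level(new LevelType("swamp"));
        System.out.println(level.describe());
    }
}
```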

4.5.4 The Tasks

The Task Types

Next is an overview of the task types and the additional variables. The tasks are categorized into three task classes: type identification, semantic errors and latent type errors. They were designed for the different hypotheses, and two of them also introduce an additional independent variable because of this. Each task is explained in more detail later, but the task categories and their relation to the hypotheses are described here.

Type Identification

During tasks of the category type identification the participants had to identify and create a number of class instances of different types and had to put these together in a single instance or multiple instances. E.g. create a new instance of a class that needed two other types via its constructor.

The new independent variable introduced for this task category was the number of types that had to be identified. Tasks 1, 2, 3, 6 and 8 (Java) and 10, 11, 12, 15 and 17 (Groovy) belong to this category, making up the majority of the tasks. With rising task number, more types had to be identified, ranging from 2 up to 12 types. For all these tasks, the solution had to be put into the provided method of the Task class in the task package.

Tasks of this type were included for hypothesis 1 (the static type system speeds up development time), although their growing number of types to identify was included to provide more granular data to compare between static and dynamic type systems in case results were mixed.
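The shape of such a task can be sketched as follows. All class names here are hypothetical and are not taken from the workspace applications; the sketch only shows a target class whose constructor requires two other types, so that two types have to be identified. The parameter names deliberately avoid repeating the type names, in line with the naming approach described earlier.

```java
/** All class names are hypothetical; they only mimic the shape of a type identification task. */
class TypeIdentificationExample {

    static final class Position {
        final int row;
        final int column;

        Position(int row, int column) {
            this.row = row;
            this.column = column;
        }
    }

    static final class TrapEffect {
        final int damage;

        TrapEffect(int damage) {
            this.damage = damage;
        }
    }

    /** The target type: it can only be built from two other types. */
    static final class Trap {
        final Position location;
        final TrapEffect consequence;

        Trap(Position location, TrapEffect consequence) {
            this.location = location;
            this.consequence = consequence;
        }
    }

    /** Stand-in for the provided task method that the solution had to be written into. */
    static Trap solveTask() {
        Position location = new Position(2, 3);
        TrapEffect consequence = new TrapEffect(10);
        return new Trap(location, consequence);
    }

    public static void main(String[] args) {
        Trap trap = solveTask();
        System.out.println("Trap at row " + trap.location.row + ", column " + trap.location.column
                + " with damage " + trap.consequence.damage);
    }
}
```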

Semantic Errors

Semantic error tasks contain a semantic error in the application that leads to wrong or unexpected behavior. Semantic errors (sometimes also called logic errors) do not lead to compile-time errors like syntactic errors do. An example is accessing an array index beyond the array bounds; in the wider sense of semantic errors used in this work, another example is a missing call to a remove method, which leads to duplicate references to one object in the program.

Tasks of this type were included for hypothesis 2-2 (no difference for semantic errors between static and dynamic type systems), although no additional quantifiable variable was introduced for these tasks. As a positive side effect, they might have reduced a general learning effect by providing different kinds of tasks to the participants. Tasks 4 and 5 (Java) and 13 and 14 (Groovy) belong to this category.
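The missing-remove example mentioned above can be sketched in Java. The code is hypothetical and not taken from the workspace applications. Both variants compile without complaint, which is the point: the type system cannot tell them apart, and the error only shows in the program’s behavior.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a semantic error: a missing call to remove. */
class SemanticErrorExample {

    /** Minimal stand-in for an object that should live in exactly one place. */
    static final class Piece { }

    /** Buggy move: the piece is added to the target but never removed from the source. */
    static void moveWithBug(Piece piece, List<Piece> from, List<Piece> to) {
        to.add(piece);
    }

    /** Correct move: the piece is removed from the source before it is added to the target. */
    static void moveCorrectly(Piece piece, List<Piece> from, List<Piece> to) {
        from.remove(piece);
        to.add(piece);
    }

    /** Counts how many references to the given piece exist in both lists together. */
    static int countReferences(Piece piece, List<Piece> first, List<Piece> second) {
        int count = 0;
        for (Piece candidate : first) {
            if (candidate == piece) {
                count++;
            }
        }
        for (Piece candidate : second) {
            if (candidate == piece) {
                count++;
            }
        }
        return count;
    }

    /** Moves one piece between two lists and reports how many references exist afterwards. */
    static int referencesAfterMove(boolean useBuggyVersion) {
        Piece piece = new Piece();
        List<Piece> source = new ArrayList<>();
        List<Piece> target = new ArrayList<>();
        source.add(piece);
        if (useBuggyVersion) {
            moveWithBug(piece, source, target);
        } else {
            moveCorrectly(piece, source, target);
        }
        return countReferences(piece, source, target);
    }

    public static void main(String[] args) {
        System.out.println("References after buggy move:   " + referencesAfterMove(true));
        System.out.println("References after correct move: " + referencesAfterMove(false));
    }
}
```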


[1] Interested individuals might take a look at the article on Wikipedia about the term’s origin: http://en.wikipedia.org/wiki/Debugging

[2] Quite a few studies have been conducted to research debugging approaches, especially the differences between novice and expert programmers. A good starting point is the work of Murphy et al. [2008], which also references the most important older studies on the topic.

[3] Buss and Henshaw [1992] cite another paper that claims that “some 30-35% of total life-cycle costs are consumed in trying to understand software after it has been delivered, to make changes”. Unfortunately, the original article Hall [1992] could not be retrieved for reviewing.

[4] Also see Lamport and Paulson [1999] or Bracha [2004] for more discussions

[5] His method was published in his book called “Novum Organum”. An English translation is available on http://www.constitution.org/bacon/nov_org.htm

[6] A much more thorough book on research methods and the corresponding evaluation is Bortz and Döring [2006], whose focus is on the social sciences though. Two other books are available that focus on experimentation in software engineering. These are Wohlin [2000] and Juristo and Moreno [2001].

