On the naturalness of software
TLDR
This paper investigates the conjecture that most software is also natural, in the sense that it is created by humans at work with all the attendant constraints and limitations, and is thus, like natural language, likely to be repetitive and predictable.
An Investigation into Coupling Measures for C++
TLDR
A comprehensive suite of measures to quantify the level of class coupling during the design of object-oriented systems takes into account the different OO design mechanisms provided by the C++ language, but it can be tailored to other OO languages.
The ManyBugs and IntroClass Benchmarks for Automated Repair of C Programs
TLDR
The need for a new set of benchmarks and the requirements for it are outlined, and two datasets, ManyBugs and IntroClass, are presented; between them they consist of 1,183 defects in 15 C programs and are designed to support the comparative evaluation of automatic repair algorithms across a variety of experimental questions.
Mining email social networks
TLDR
This paper begins with a discussion of the infrastructure (including a novel use of Scientific Workflow software) and then discusses the approach to mining the email archives, and presents some preliminary results from the data analysis.
Don't touch my code!: examining the effects of ownership on software quality
TLDR
It is found that in all cases, measures of ownership such as the number of low-expertise developers, and the proportion of ownership for the top owner have a relationship with both pre-release faults and post-release failures.
How, and why, process metrics are better
TLDR
It is found that code metrics have high stasis; this leads to stagnation in the prediction models, leading to the same files being repeatedly predicted as defective; unfortunately, these recurringly defective files turn out to be comparatively less defect-dense.
Are deep neural networks the best choice for modeling source code?
TLDR
This work enhances established language modeling approaches to handle the special challenges of modeling source code, such as frequent changes, larger and changing vocabularies, and deeply nested scopes, and presents a fast, nested language modeling toolkit specifically designed for software.
Gender and Tenure Diversity in GitHub Teams
TLDR
Using GitHub, the largest publicly available collection of OSS projects, it is shown that both gender and tenure diversity are positive and significant predictors of productivity, together explaining a sizable fraction of the data variability.
A Survey of Machine Learning for Big Code and Naturalness
TLDR
This article presents a taxonomy based on the underlying design principles of each model and uses it to navigate the literature and discuss cross-cutting and application-specific challenges and opportunities.
...