As statistical machine learning algorithms and techniques continue to mature, many researchers and developers see statistical machine learning not only as a topic of expert study, but also as a tool for software development. Extensive prior work has studied software development, but little prior work has studied software developers applying statistical machine learning. This paper presents interviews of eleven researchers experienced in applying statistical machine learning algorithms and techniques to human-computer interaction problems, as well as a study of ten participants working during a five-hour study to apply statistical machine learning algorithms and techniques to a realistic problem. We distill three related categories of difficulties that arise in applying statistical machine learning as a tool for software development: (1) difficulty pursuing statistical machine learning as an iterative and exploratory process, (2) difficulty understanding relationships between data and the behavior of statistical machine learning algorithms, and (3) difficulty evaluating the performance of statistical machine learning algorithms and techniques in the context of applications. This paper provides important new insight into these difficulties and the need for development tools that better support the application of statistical machine learning.

Author Keywords
Statistical machine learning, software development.

ACM Classification Keywords
H5.2 Information Interfaces and Presentation: User Interfaces; D2.6 Programming Environments: Integrated Environments.

INTRODUCTION AND MOTIVATION
Statistical machine learning has emerged as an important tool in the development of modern software.
For example, the explosion of information available on the Web has motivated work on interfaces that better support common tasks by automatically identifying relationships within and among Web pages [11, 13]. Concern for demands on human attention imposed by computing and communication systems has prompted the examination of sensor-based statistical models of human interruptibility [7, 12]. The complexity of modern software has led to the exploration of statistical techniques for detecting bugs and anomalous behavior in deployed systems [2, 28]. Advances in low-cost sensing have prompted work on activity modeling and location-aware computing to enhance the limited input and output capabilities of mobile devices [17, 19].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CHI 2008, April 5-10, 2008, Florence, Italy.
Copyright 2008 ACM 1-59593-178-3/07/0004...$5.00.

Statistical machine learning algorithms and techniques remain an important topic of expert study, but the above examples have the common characteristic that they are focused on applying existing algorithms and techniques to solve a problem of interest. In this sense, there are many researchers and developers who see statistical machine learning not as a topic of study, but rather as a tool for software development. Extensive prior research has explored the general difficulties faced by software developers [14, 15, 18, 24, 25], but little work has studied the application of statistical machine learning in software development.
Instead, existing statistical machine learning tools (e.g., [22, 27]) have generally been developed by and for statistical machine learning experts, whose primary interest is often in developing and evaluating new statistical machine learning algorithms and techniques. But the application of statistical machine learning is no longer limited to such experts, and so it is important to understand the difficulties that developers face when applying statistical machine learning as a tool for software development.

This paper takes a two-pronged approach to examining the difficulties that arise in applying statistical machine learning as a tool for software development. We first interview eleven researchers, each of whom has significant experience applying statistical machine learning algorithms and techniques to human-computer interaction research problems. We then study ten participants working during a five-hour study to apply statistical machine learning algorithms and techniques to a realistic problem. Our interviews and our study reveal three related categories of difficulties: (1) difficulty pursuing statistical machine learning as an iterative and exploratory process, (2) difficulty understanding relationships between data and the behavior of statistical machine learning algorithms, and (3) difficulty evaluating the performance of statistical machine learning algorithms and techniques in the context of applications. Analyzing detailed screen and workspace captures of the work of study participants in the context of difficulties distilled from our interviews, this paper provides important new insight into these difficulties and discusses implications for new tools to better support the application of statistical machine learning.

RELATED WORK
Several texts provide appropriate introductions to statistical machine learning algorithms and techniques [10, 23].
One of the most well-studied areas of statistical machine learning is the learning of a function $y = f(x_1, \ldots, x_n)$. Depending on whether $y$ is a continuous or nominal variable, this problem is known as either regression or classification. Both are considered supervised learning, because a set of labels is provided as part of the training data at the time the function is learned. Variables $x_1, \ldots, x_n$ are known as features, and each should capture some useful aspect of the problem being modeled. Importantly, features are assumed as input to statistical machine learning algorithms, and must be provided by the developer.

Existing tools for the application of statistical machine learning provide a library of implementations of common statistical machine learning algorithms, with Weka being a well-known and widely-used example. Environments like YALE provide additional support for configuring experiments to compare the performance of different potential algorithms. Such tools can save a statistical machine learning expert significant implementation effort, but they provide little guidance to developers who are not already familiar with how to successfully employ the provided algorithms. The body of this paper further discusses difficulties faced by developers in using current tools, as our second study includes the use of Weka as part of solving a classification problem.

A number of systems have explored the packaging of appropriate features with statistical machine learning algorithms to ease development in particular domains. For example, Fails and Olsen developed Crayons, a tool for creating camera-based systems using pixel-level classifiers [4, 5]. Crayons uses a coloring metaphor to collect labeled training data, then learns a decision tree classifier using features based on integral images. Hartmann et al.
developed Exemplar, which supports the interactive specification of sensor-based recognizers through a combination of signal filters and a dynamic time warping algorithm. Fogarty and Hudson developed Subtle, a system focused on the sensing capabilities of typical laptop computers that uses operator-based feature generation with wrapper-based feature selection to automatically create classifiers based on labels provided by an application. Maynes-Aminzade et al. present Eyepatch, a tool for developing camera-based interactions that includes support for training different types of vision-based classifiers.

Such tools demonstrate the potential for the application of statistical machine learning algorithms and techniques, but they generally achieve their success by highly constraining both their application domain and the approaches that a developer can take to a problem. One of the most common constraints is to limit the developer to providing training data, an approach taken by both Crayons and Subtle. Both systems package a set of features and algorithms that work well in their domains, allowing the developer to provide examples of the concept they want to model. But if the packaged features and modeling algorithm are not a good fit for the model that a developer wants to learn, such tools provide little recourse. Similarly, Exemplar allows a developer to explicitly manipulate parameters to a dynamic time warping algorithm, but provides little support for determining whether this is the appropriate algorithm for a problem and no support for experimenting with other potential algorithms. In short, systems constrained in such ways provide a low floor (a low barrier to entry) at the expense of a low ceiling (the point at which the tool’s assumptions and constraints become an obstacle to addressing a problem).
In contrast, this work explores support for the end-to-end development of systems based on the application of statistical machine learning algorithms and techniques. By examining the difficulties that developers encounter, we aim to inform the development of new tools that provide both a low floor and a high ceiling.

Ko et al. identify six learning barriers faced by novice developers. Although our focus is on experienced software developers who are not experts in the application of statistical machine learning algorithms and techniques, analogous barriers arise. For example, a developer who cannot conceive of how to frame a problem as a matter of learning a function $y = f(x_1, \ldots, x_n)$ is encountering a barrier analogous to Ko et al.’s design barriers. Our work complements such work, examining in greater depth the unique difficulties encountered in the application of statistical machine learning.

Other areas of related work include studies of engineering and creative design processes [3, 26], the challenges of effective information visualization, and work in knowledge discovery and data mining. We note that results in all of these areas will inform the design of new tools to better support the effective application of statistical machine learning as a tool for software development. But we also note that statistical machine learning continues to grow in importance as a tool for software development, and so tools supporting the effective application of statistical machine learning warrant careful attention. As an analogy, it is clear that user interface toolkits and studies of the development of user interface software have been critical to advancing human-computer interaction.
Because statistical machine learning continues to emerge as an important tool in human-computer interaction and in other fields, our work aims to enable similar long-term impact via a first set of empirical studies examining the difficulties that developers face in applying statistical machine learning.

STUDY OVERVIEWS
As noted in our introduction, we take a two-pronged approach to examining the difficulties that arise in applying statistical machine learning. We first conduct semi-structured interviews of eleven researchers experienced in applying statistical machine learning to human-computer interaction research problems. These interviews provide high-level insight into the difficulties faced by developers.

Based on this high-level insight, we then design and conduct a laboratory think-aloud study examining ten participants working during a five-hour period to apply statistical machine learning algorithms and techniques to a realistic problem. We chose this combination of approaches because the development of software based on statistical machine learning algorithms and techniques typically takes place over an extended period of time, on the order of weeks to months. Starting with interviews allows a high-level exploration of difficulties that arise in applying statistical machine learning, and our lab study then probes exactly how those difficulties manifest as developers work on a problem. This section introduces our interviews and our study, deferring results until later sections.

Semi-Structured Interviews
We interviewed eleven researchers with significant experience applying statistical machine learning algorithms and techniques to human-computer interaction problems. To avoid confusion when discussing our two groups of participants, this paper refers to our interview participants as IP1 through IP11.
As an indication of the breadth of their experience, we note that these researchers have worked on such problems as intelligent digital photo management, vision-based facial expression recognition, availability modeling in instant messaging, EEG-based recognition of brain activity, RFID-based activity recognition for elder care applications, accelerometer-based activity recognition for fitness applications, mixed-initiative pen-based text input, programming-by-demonstration approaches to text editing, interactive tools for creating camera-based interfaces, automated network packet diagnosis, and models of musical style. Each interview participant has published multiple papers in top venues. Participants were selected to include a mixture of backgrounds, including researchers from the statistical machine learning community who are focused on human-computer interaction applications and researchers from the human-computer interaction community who have incorporated statistical machine learning in their work.

Each participant recalled two to three prior projects that included the application of statistical machine learning algorithms and techniques. They then described the lifecycle of one of these projects, from conception to completion, as well as how the application of statistical machine learning related to other aspects of the project. We asked participants to diagram their process while describing the project, and we elicited further discussion and clarification by concurrently annotating and editing their diagrams. After discussing this first project, participants compared and contrasted it with the other projects they had initially recalled. Interviews lasted between 40 and 90 minutes, and we captured audio recordings for later transcription and review.

Interview Results Overview
Although we defer the bulk of discussion until after the presentation of our think-aloud study, several results from our interviews directly inform the design of our think-aloud study. First, our interviews revealed that the application of statistical machine learning is a highly iterative and exploratory process. A typical process requires the formulation of a learning problem, the collection of appropriate training data, the extraction of features from the data, the selection of a modeling algorithm, and experimentation to determine whether the resulting system meets the needs of the application. Although these steps describe a set of linear dependencies, our participants emphasized that the actual development of a system is much more exploratory than such linear dependencies suggest. This led us to ensure that our think-aloud study examined this entire process (as opposed to, for example, focusing only on developer model selection given a predetermined set of features). Second, our interviews revealed the importance of understanding relationships between data and the behavior of statistical machine learning algorithms in order to decide how to proceed. We therefore desired a modeling problem that could be effectively solved using many different approaches, as opposed to one that forced participants towards a single effective solution. Third, our interviews revealed difficulties with evaluating the performance of statistical machine learning in the context of applications, as our interview participants felt that they must often manage concerns other than the straightforward notion of model accuracy.
We therefore designed our think-aloud study to examine two other concerns that interview participants raised: a need for systems to work well when used by different people, and a need to consider computational cost and its implications for responsiveness when developing interactive applications.

Think-Aloud Study
Based on the difficulties described by our interview participants, we designed the Digits task to examine how these difficulties manifest as developers work on a problem. This subsection presents the task, the development environment used by participants, our experimental procedure, and our participants.
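
As a rough illustration of the two application-level concerns raised by interview participants (working well across different people, and computational cost), the following Python sketch evaluates a classifier by holding out one person at a time and measuring the time spent on prediction. It is not the procedure or tooling used in the study: the nearest-centroid classifier, the function names, and the toy data are all hypothetical, chosen only to keep the example self-contained.

```python
import time
from collections import defaultdict


def train_centroids(samples):
    """Fit a nearest-centroid classifier: one mean feature vector per label.

    samples: list of (features, label) pairs. Hypothetical stand-in for
    whatever modeling algorithm a developer might actually choose.
    """
    sums = {}
    counts = defaultdict(int)
    for features, label in samples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def predict(centroids, features):
    """Return the label whose centroid is closest in squared distance."""
    def squared_distance(centroid):
        return sum((c - f) ** 2 for c, f in zip(centroid, features))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def leave_one_person_out(data):
    """Train on all people but one, then test on the held-out person.

    data: list of (person_id, features, label) triples.
    Returns {person_id: (accuracy, mean_seconds_per_prediction)}.
    """
    results = {}
    for held_out in sorted({person for person, _, _ in data}):
        train = [(x, y) for person, x, y in data if person != held_out]
        test = [(x, y) for person, x, y in data if person == held_out]
        model = train_centroids(train)
        start = time.perf_counter()
        correct = sum(predict(model, x) == y for x, y in test)
        elapsed = time.perf_counter() - start
        results[held_out] = (correct / len(test), elapsed / len(test))
    return results


# Toy, made-up data: two features per example, three hypothetical people.
data = [
    ("p1", [0.10, 0.20], "zero"), ("p1", [0.90, 0.80], "one"),
    ("p2", [0.20, 0.10], "zero"), ("p2", [0.80, 0.90], "one"),
    ("p3", [0.15, 0.25], "zero"), ("p3", [0.85, 0.75], "one"),
]

results = leave_one_person_out(data)
for person, (accuracy, seconds) in results.items():
    print(f"{person}: accuracy={accuracy:.2f}, "
          f"mean prediction time={seconds:.6f}s")
```

Reporting accuracy per held-out person, rather than a single pooled number, shows whether a model that looks accurate overall fails for particular individuals. The per-prediction time gives a first, coarse indication of whether the model is fast enough for an interactive application.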