In Programming by Example [PBE, also sometimes called "Programming by Demonstration"] systems, the system records actions performed by a user in the interface and produces a generalized program that can be used later in analogous examples. A key issue is how to describe the actions and objects selected by the user, which determines what kinds of generalizations will be possible. When the user selects a graphical object on the screen, most PBE systems describe the object using properties of the underlying application data. For example, if the user selects a link on a web page, the PBE system might represent the selection based on the link's HTML properties. In this article, we explore a different, and radical, approach: using visual properties of the interaction elements themselves, such as the size, shape, color, and appearance of graphical objects, to describe user intentions. Only recently has the speed of image processing made real-time analysis of screen images by a PBE system feasible. We have not yet fully realized the goal of a complete PBE system using visual generalization, but we feel the approach is important enough to warrant presenting the idea. Visual information can supplement information available from other sources and opens up the possibility of new kinds of generalizations not possible from the application data alone. In addition, these generalizations can map more closely to the intentions of users, especially beginning users, who rely on the same visual information when making selections. Finally, visual generalization can sometimes remove one of the worst stumbling blocks preventing the use of PBE with commercial applications, namely, reliance on application APIs. When necessary, PBE systems can work exclusively from the visual appearance of applications, and do not need explicit cooperation from the application's API.
If you can see it, you should be able to program it

Every Programming by Example system has what Halbert calls the "data description problem": when the user selects an object on the screen, what do they mean by it? Depending on how you describe an object, the procedure recorded and generalized by the system could have very different effects the next time you run it. During a demonstration to a PBE system, if you select an icon for a file foo.bar in a desktop file system, did you mean (1) Just that specific file and no other? (2) Any file whose name is foo.bar? (3) Any icon that happened to be found at the location (35, 122) where you clicked? etc.

Most systems deal with this issue by mapping the selection onto the application's data model [a set of files, e-mail messages, circles and boxes in a drawing, etc.]. They then permit generalizations on the properties of that data [file names, message senders, etc.]. But sometimes the user's intuitive description of an object might depend on the actual visual properties of the screen elements themselves, regardless of whether these properties are explicitly represented in the application's command set. Our proposal is to use these visual properties to permit PBE systems to do "visual generalization".

For an example of why visual generalization might prove useful, suppose we want to write a program to save all the links on a Web page that have not been followed by the user at a certain point in time.

Figure 1. Can we write a program to save all the unfollowed links?

If the Netscape browser happened to have an operation “Move to next unfollowed link”, available as a menu option or in its API, we might be able to automate the activity using a macro recorder such as Quickeys. But unfortunately, Netscape does not have this operation [nor does it even have a “Move to the next link” operation]. Even if we had access to the HTML source of the page, we still wouldn’t know which links had been followed by the user.
This is a general problem for PBE systems in interfacing to almost all applications. Interactive applications make it easy for users to carry out procedures, and do not expect to be treated as a subroutine by an external system. This example shows the conceptual gap between a user's view of an application and its underlying programmable functionality. Bridging this gap can be extremely difficult for a PBE system; its representation of user actions may be a complete mismatch for the user's actual intentions.

But perhaps we are looking at this problem from the wrong perspective. From the user's point of view, the functionality of an interactive application is defined by its user interface. The interface has been carefully developed to cover specific tasks, to communicate through appropriate abstractions, and to accommodate the cognitive, perceptual, and physical abilities of the user. A PBE system might gain significant benefits if it could work in the same medium as a user, that is, if it could process the visual environment with all its information. This is the key insight we explore in this article.

What does visual generalization buy us?

Let's imagine a PBE system that incorporates techniques to process a visual interactive environment and to extract information potentially relevant to the user's intentions. What does the system gain from these capabilities?

• Integration into existing environments. Historically, most PBE systems have been built on top of isolated research systems, rather than commercial applications. Some have been promising, but have not been adopted because of the difficulty of integration. A visual PBE system, independent of source code and API constraints, could potentially reach an unlimited audience.
• Consistency. Independence of an application's source code or API also gives a PBE system flexibility. Similar applications often have similar appearance and behavior; for example, users switch between Web browsers with little difficulty.
A visual PBE system could take advantage of functional and visual consistency to operate across similar applications with little or no modification.
• New sources of information. Most importantly, some kinds of visual information may be difficult or impossible to obtain through other means. Furthermore, this information is generally closely related to the user's understanding of an application.

These are all benefits to the developers of a PBE system, but they apply equally well to the users of a PBE system. In the Netscape example, a visual PBE system would be able to run on top of the existing browser, without requiring the use of a substitute research system. Because Netscape has the convention of displaying the followed links in red and the unfollowed links in blue, a user might specify the "Save the next unfollowed link" action in visual terms as “Move to the next line of blue text, then invoke the Save Link As operation”. This specification exploits a new, visual source of information. Finally, the general consistency between browsers should allow the same system to work with both Netscape and Microsoft Internet Explorer, a much trickier proposition for API-based systems.

Providing a visual processing capability raises some novel challenges for a PBE system:

• Image processing: How can a system extract visual information at the image processing level in practice? This processing must happen in an interactive system, interleaved with user actions and observation of the system, which raises significant efficiency issues. This is an issue of the basic technical feasibility of a visual approach to PBE. Our experience with VisMap, below, shows that real-time analysis of the screen is feasible on today's high-end machines.
• Information management: How can a system process low-level visual data to infer high-level information relevant to user intentions?
For example, a visual object under the mouse pointer might be represented as a rectangle, a generic window region, or a window region specialized for some purpose, such as illustration. A text box with a number in it might be an element of a fill-in form, a table in a text document, or a cell in a spreadsheet. This concern is also important for generalization from low-level events to the abstractions they implement: is the user simply clicking on a rectangle or performing a confirmation action?
• Brittleness: How can a system deal gracefully with visual variations that are beyond the scope of a solution? In the Netscape example of collecting unfollowed links, users may, in fact, change the colors that Netscape uses to display followed vs. unfollowed links, thereby perhaps obsoleting a previously recorded procedure. A link may extend over more than a single line of text, so that the mapping between lines and links is not exact. Similar blue text might appear in a GIF image and be inadvertently captured by the procedure. And, if the program is visually parsing the screen, links that do not appear because they are below the current scrolling position will not be included. Out of sight, out of mind! This last problem, though, might be cured by programming a loop that scrolls through the page as the user would.

It puts most of these problems in a novel light if we observe that they can be difficult even for a human to solve. Almost everyone has been fooled now and then by advertising graphics that camouflage themselves as legitimate interface objects; without further information (such as might be provided by an API call) a visual PBE system cannot hope to do better.

Low-level visual generalization: Just the pixels, ma'am

Potter’s work on pixel-based data access pioneered the approach of treating the screen image as the source for generating descriptions for generalization.
The TRIGGERS system performs exact pattern matching on screen pixels to infer information that is otherwise unavailable to an external system. A “trigger” is a condition/action pair. For example, triggers are defined for such tasks as surrounding a text field with a rounded rectangle in a drawing program, shortening lines so that they intersect an arbitrary shape, and converting text to a bold typeface. The user defines a trigger by stepping through a sequence of actions in an application, adding annotations for the TRIGGERS system when appropriate. Once a set of triggers has been defined, the user can activate them, iteratively and exhaustively, to carry out their actions.

Several strategies can be used to process visual pixel information so that it can be used to generalize computer programs. The strategy used by TRIGGERS is to compute locations of exact patterns within the screen image. For example, suppose a user records a mouse macro that modifies a URL in order to display the next higher directory in a web browser. Running the macro can automate this process, but only for the one specific URL, because the mouse locations are recorded with fixed coordinates. However, this macro can be generalized by using pixel pattern matching on the screen image. The pattern to use is what a user would look for if doing the task manually: the pixel pattern of a slash character. Finding the second-to-last occurrence of this pattern gives the location at which the macro can begin its mouse drag, which generalizes the macro so that it will work with most URLs.

Step 1: Select URL text field.
Step 2: Start mouse drag to select deepest directory.
Step 3: Finish mouse drag.
Step 4: Press backspace to delete the selection.

Figure 2. Steps in a mouse macro to move a browser up one directory, and selecting a pixel pattern that can generalize the macro.
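The slash-matching idea can be made concrete with a small sketch. This is not the TRIGGERS implementation; it is a minimal illustration that assumes the screen is available as a grid of pixel values and that the slash glyph is known as a small bitmap. The 3x3 glyphs, the `render` helper, and the function names are all hypothetical stand-ins for a real screen capture and a real font.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]  # rows of pixels; 1 = ink, 0 = background (toy encoding)

# Hypothetical 3x3 glyphs standing in for real screen fonts.
SLASH: Image = [[0, 0, 1],
                [0, 1, 0],
                [1, 0, 0]]
LETTER: Image = [[1, 1, 1],
                 [1, 0, 1],
                 [1, 1, 1]]


def render(text: str) -> Image:
    """Draw one line of toy text: each glyph is 3 pixels wide plus a 1-pixel gap."""
    rows: Image = [[], [], []]
    for ch in text:
        glyph = SLASH if ch == "/" else LETTER
        for r in range(3):
            rows[r].extend(glyph[r] + [0])
    return rows


def find_pattern(screen: Image, pattern: Image) -> List[Tuple[int, int]]:
    """Return the (x, y) top-left corner of every exact occurrence of pattern."""
    ph, pw = len(pattern), len(pattern[0])
    sh, sw = len(screen), len(screen[0])
    hits = []
    for y in range(sh - ph + 1):
        for x in range(sw - pw + 1):
            # Exact match: every row of the window must equal the pattern row.
            if all(screen[y + dy][x:x + pw] == pattern[dy] for dy in range(ph)):
                hits.append((x, y))
    return hits


def second_to_last(screen: Image, pattern: Image) -> Optional[Tuple[int, int]]:
    """Location of the second-to-last occurrence on a single line of text."""
    hits = sorted(find_pattern(screen, pattern))  # left to right
    return hits[-2] if len(hits) >= 2 else None


screen = render("a/b/c/")                # toy stand-in for a URL text field
print(find_pattern(screen, SLASH))       # [(4, 0), (12, 0), (20, 0)]
print(second_to_last(screen, SLASH))     # (12, 0)
```

The recorded macro's fixed starting coordinate would be replaced by the computed location, so the same drag works whatever the length of the URL. Because the match is exact, a sketch like this would stop finding slashes if the font or rendering changed, which is one form of the brittleness discussed earlier.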
Even though this macro program affects data such as characters, strings, URLs, and web pages, the program's internal data is only low-level pixel patterns and screen coordinates. It is the use within the rich GUI context that gives higher-level meaning to the low-level data. The fact that a low-level program can map so simply to a much higher-level meaning attests to how conveniently the visual information of a GUI is organized for productive work.  gives more examples.

The advantage of this strategy is that the low-level data and operators of the programming system can map to many high-level meanings, even ones not originally envisioned by the programming system developer. The disadvantage is that high-level internal processing of the information is difficult, since the outside context is required for most interpretation.

Another system that performs data access at the pixel level is Yamamoto's AutoMouse, which can search the screen for rectangular pixel patterns and click anywhere within the pattern. Copies of the patterns can be arranged on a document and connected to form simple visual programs. Each pattern can have different mouse and keyboard actions associated with it.

High-level visual generalization: What you see is what you record

Zettlemoyer and St. Amant's VisMap is in some ways a conceptual successor to TRIGGERS. VisMap is a programmable set of sensors, effectors, and skeleton controllers for visual interaction with off-the-shelf applications. Sensor modules take pixel-level input from the display, run the data through image processing algorithms, and build a structured representation of visible interface objects. Effector modules generate mouse and keyboard gestures to manipulate these objects. VisMap is designed as a programmable user model, an artificial user with which developers can explore the characteristics of a user interface. VisMap is not, by itself, a Programming by Example system.
But it does demonstrate that visual generalization is practical in an interface, and we hope to apply its approach in a full PBE system.

VisMap translates the pixel information to data types that have more meaning outside of the GUI context. For example, building on VisMap we have developed VisSolitaire, a simple visual application that plays Microsoft Windows Solitaire. VisMap translates the pixel information to data types that represent the state of a generic game of Solitaire. This state provides input to an AI planning system that plays a reasonable game of Solitaire, from the starting deal to a win or loss. VisSolitaire does not use an API or otherwise have any cooperation from Microsoft Solitaire.

VisSolitaire's control cycle alternates between screen parsing and generalized action. VisSolitaire processes the screen image to identify cards and their positions. When the cards are located, a visual grammar characterizes them based on relative location and visual properties. In this way the system can identify the stacks of cards that form the stock, tableau, and foundation, as well as classify each card based on visual identification of its suit and rank, as shown below.
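To make the idea of classifying cards by relative location more tangible, here is a rough sketch in code. It is not VisSolitaire's visual grammar, whose rules are not given in this text. It assumes an earlier image processing step has already produced a list of detected cards with screen positions, and it uses two hypothetical pixel thresholds (`top_row_max_y`, `foundation_min_x`) to split the screen into stock, foundation, and tableau regions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Card:
    x: int      # left edge of the detected card, in screen pixels
    y: int      # top edge of the detected card
    rank: str   # assumed to come from earlier visual identification
    suit: str


def classify(cards: List[Card],
             top_row_max_y: int,
             foundation_min_x: int) -> Dict[str, List[List[Card]]]:
    """Group cards into stacks and label each stack by its screen region.

    Assumed layout: stock at the top left, foundation at the top right,
    tableau below. Both thresholds are illustrative parameters.
    """
    stacks = defaultdict(list)
    for card in cards:
        if card.y > top_row_max_y:
            region = "tableau"
        elif card.x >= foundation_min_x:
            region = "foundation"
        else:
            region = "stock"
        # Cards in the same region with the same left edge form one stack.
        stacks[(region, card.x)].append(card)

    layout: Dict[str, List[List[Card]]] = {
        "stock": [], "foundation": [], "tableau": []}
    for (region, _x), stack in sorted(stacks.items(), key=lambda kv: kv[0]):
        # Within a stack, order cards from top of screen to bottom.
        layout[region].append(sorted(stack, key=lambda c: c.y))
    return layout


detected = [
    Card(10, 10, "3", "clubs"),      # top left: stock
    Card(300, 10, "A", "hearts"),    # top right: foundation
    Card(10, 120, "K", "spades"),    # first tableau column
    Card(10, 140, "Q", "hearts"),    # overlaps the king, lower on screen
    Card(90, 120, "7", "clubs"),     # second tableau column
]
layout = classify(detected, top_row_max_y=100, foundation_min_x=250)
print([[c.rank for c in stack] for stack in layout["tableau"]])
```

A structured state of this kind, rather than raw pixels, is the sort of input a planner could consume. A real system would also need to handle face-down cards and partially occluded cards, which this sketch ignores.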