Facilitating Knowledge Sharing from Domain Experts to Data Scientists for Building NLP Models

@article{Park2021FacilitatingKS,
  title={Facilitating Knowledge Sharing from Domain Experts to Data Scientists for Building NLP Models},
  author={Soya Park and April Yi Wang and Ban Kawas and Qingzi Vera Liao and David Piorkowski and Marina Danilevsky},
  journal={26th International Conference on Intelligent User Interfaces},
  year={2021}
}
Data scientists face a steep learning curve in understanding a new domain for which they want to build machine learning (ML) models. While input from domain experts could offer valuable help, such input is often limited, expensive, and generally not in a form readily consumable by a model development pipeline. In this paper, we propose Ziva, a framework to guide domain experts in sharing essential domain knowledge to data scientists for building NLP models. With Ziva, experts are able to… 

Bridging Multi-disciplinary Collaboration Challenges in ML Development via Domain Knowledge Elicitation
TLDR
Ziva is introduced, an interface for supporting the sharing of domain knowledge from domain experts to data scientists in two ways: a concept creation interface where domain experts extract important concepts of the domain, and five kinds of justification elicitation interfaces that elicit how the domain concepts are expressed in data instances.
“It’s Like the Value System in the Loop”: Domain Experts’ Values Expectations for NLP Automation
TLDR
The study findings provide groundwork for including the values of domain experts, whose expertise lies outside the field of computing, in the design of automated NLP systems.
Collaboration Challenges in Building ML-Enabled Systems: Communication, Documentation, Engineering, and Process
TLDR
This work identifies key collaboration challenges that teams face when building and deploying ML systems into production, and finds that most of these challenges center around communication, documentation, engineering, and process, and collects recommendations to address these challenges.
Crystalline: Lowering the Cost for Developers to Collect and Organize Information for Decision Making
TLDR
A new system called Crystalline is introduced that automatically collects and organizes information into tabular structures as the user searches and browses the web, and uses passive behavioral signals to infer what information to collect and how to visualize and prioritize it.
More Engineering, No Silos: Rethinking Processes and Interfaces in Collaboration between Interdisciplinary Teams for Machine Learning Projects
TLDR
Key collaboration challenges that teams face when building and deploying ML systems into production are identified; most of these challenges center around communication, documentation, engineering, and process, and recommendations to address them are collected.
Trade-offs in Sampling and Search for Early-stage Interactive Text Classification
TLDR
It is shown that supplementing early-stage sampling with user-guided text search can effectively “seed” a classifier with positive documents without compromising generalization performance—particularly for imbalanced tasks where positive documents are rare.
How AI Developers Overcome Communication Challenges in a Multidisciplinary Team
TLDR
Using the analytic lens of shared mental models, this paper reports on the types of communication gaps that AI developers face, how AI developers communicate across disciplinary and organizational boundaries, and how they simultaneously manage issues regarding trust and expectations.
Empathosphere: Promoting Constructive Communication in Ad-hoc Virtual Teams through Perspective-taking Spaces
TLDR
Empathosphere is introduced, a chat-embedded intervention to mitigate social barriers and foster constructive communication in teams and demonstrates that “experimental spaces,” particularly those that integrate methods of encouraging perspective-taking, can be a powerful means of improving communication in virtual teams.
How Stimulating Is a Green Stimulus? The Economic Attributes of Green Fiscal Spending
When deep recessions hit, some governments spend to rescue and recover their economies. Key economic objectives of such countercyclical spending include protecting and creating jobs while
How Domain Experts Work with Data: Situating Data Science in the Practices and Settings of Craftwork
TLDR
Drawing on an ethnographic study of a craft brewery in Korea, it is shown how craft brewers worked with data by situating otherwise abstract data within their brewing practices and settings.

References

Explainable Active Learning (XAL): An Empirical Study of How Local Explanations Impact Annotator Experience
TLDR
This study shows benefits of AI explanation as interfaces for machine teaching--supporting trust calibration and enabling rich forms of teaching feedback, and potential drawbacks--anchoring effect with the model judgment and cognitive workload.
Gamut: A Design Probe to Understand How Data Scientists Understand Machine Learning Models
TLDR
This work investigated why and how professional data scientists interpret models, and how interface affordances can support data scientists in answering questions about model interpretability, and showed that interpretability is not a monolithic concept.
How do Data Science Workers Collaborate? Roles, Workflows, and Tools
TLDR
It is found that data science teams are extremely collaborative and work with a variety of stakeholders and tools during the six common steps of a data science workflow (e.g., clean data and train model).
Snorkel: Rapid Training Data Creation with Weak Supervision
TLDR
Snorkel is a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data and proposes an optimizer for automating tradeoff decisions that gives up to 1.8× speedup per pipeline execution.
Machine Teaching by Domain Experts: Towards More Humane, Inclusive, and Intelligent Machine Learning Systems
This paper argues that a possible way to escape from the limitations of current machine learning (ML) systems is to allow their development directly by domain experts without the mediation of ML
Neural Ranking Models with Weak Supervision
TLDR
This paper proposes to train a neural ranking model using weak supervision, where labels are obtained automatically without human annotators or any external resources, and suggests that supervised neural ranking models can greatly benefit from pre-training on large amounts of weakly labeled data that can be easily obtained from unsupervised IR models.
The Emerging Role of Data Scientists on Software Development Teams
TLDR
Five distinct working styles of data scientists are identified: Insight Providers, who work with engineers to collect the data needed to inform decisions that managers make; Modeling Specialists, who use their machine learning expertise to build predictive models; Platform Builders, who create data platforms, balancing both engineering and data analysis concerns; and Team Leaders, who run teams of data scientists and spread best practices.
How Data Scientists Use Computational Notebooks for Real-Time Collaboration
TLDR
How synchronous editing in computational notebooks changes the way data scientists work together compared to working on individual notebooks is reported and several design implications aimed at better supporting collaborative editing in synchronous notebooks are proposed, thus improving efficiency in teamwork among data scientists.
Manifold: A Model-Agnostic Framework for Interpretation and Diagnosis of Machine Learning Models
TLDR
Manifold is presented, a framework that utilizes visual analysis techniques to support interpretation, debugging, and comparison of machine learning models in a more transparent and interactive manner and is designed as a generic framework.
Structured Labeling to Facilitate Concept Evolution in Machine Learning
TLDR
The notion of concept evolution, the changing nature of a person’s underlying concept, which can result in inconsistent labels and thus be detrimental to machine learning, is introduced.