Extending Unix Pipelines to DAGs

@article{Spinellis2017ExtendingUP,
  title={Extending Unix Pipelines to DAGs},
  author={Diomidis D. Spinellis and Marios Fragkoulis},
  journal={IEEE Transactions on Computers},
  year={2017},
  volume={66},
  pages={1547-1561}
}
The Unix shell dgsh provides an expressive way to construct sophisticated and efficient non-linear pipelines. Such pipelines can use standard Unix tools, as well as third-party and custom-built components. Dgsh allows the specification of pipelines that perform non-uniform non-linear processing. These form a directed acyclic process graph, which is typically executed by multiple processor cores, thus increasing the processing task’s throughput. A number of existing Unix tools have been adapted… 
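As a taste of the notation, the multipipe blocks described in the paper let a script scatter a stream to several parallel branches and gather their results. The following is a minimal sketch in dgsh's style, not an example taken from the paper:

    #!/usr/bin/env dgsh
    # Count the lines and the bytes of the input concurrently:
    # tee scatters its input to both branches of the multipipe block,
    # and cat gathers the two results into a single output stream.
    tee |
    {{
        wc -l &
        wc -c &
    }} |
    cat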

Citations

An order-aware dataflow model for parallel Unix pipelines
TLDR
A dataflow model for parallel Unix shell pipelines is presented; the semantics of transformations that exploit the data parallelism available in Unix shell computations are captured and proved correct.
PaSh: light-touch data-parallel shell processing
TLDR
PaSh, a system for parallelizing POSIX shell scripts, is presented; it adds POSIX constructs that explicitly guide parallelism, coupled with PaSh-provided Unix-aware runtime primitives that address performance- and correctness-related issues.
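The rewrite PaSh performs can be sketched by hand for a simple stage: split the input on line boundaries, run the command on each chunk in parallel, and combine the partial results. A rough illustration of that shape using GNU coreutils (the file name and pattern are hypothetical, and this hand-rolled version only mimics the transformation, it is not PaSh itself):

    # Sequential form:
    grep -c 'ERROR' logs.txt

    # Hand-rolled data-parallel form of the same computation:
    split -n l/4 logs.txt chunk.          # split on line boundaries (GNU split)
    for f in chunk.*; do
        grep -c 'ERROR' "$f" &            # one grep per chunk, in parallel
    done | awk '{s += $1} END {print s}'  # combine: per-chunk counts add up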
An Order-aware Dataflow Model for Extracting Shell Script Parallelism
TLDR
This work uses a large number of real scripts to evaluate the parallel performance delivered by the dataflow transformations, including the contributions of individual transformations, achieving an average speedup of 6.14× and a maximum of 61.1× on a 64-core machine.
Automatic synthesis of parallel unix commands and pipelines with KumQuat
TLDR
KumQuat automatically synthesizes combine operators, with a domain-specific combiner language acting as a strong regularizer that promotes efficient inference of correct combiners and enables effective parallelization of the benchmark scripts.
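For intuition about what a synthesized combiner computes: for sort, the correct combine operator is a merge of sorted partial outputs. A hand-written sketch of the divide/apply/combine shape (file names hypothetical; KumQuat infers the combiner automatically rather than having it written by hand):

    split -n l/2 input.txt part.    # divide the input on line boundaries
    sort part.aa > sorted.aa &      # apply the command to each part
    sort part.ab > sorted.ab &
    wait
    sort -m sorted.aa sorted.ab     # combine: a k-way merge of sorted runs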
Streamlining the Genomics Processing Pipeline via Named Pipes and Persistent Spark Datasets
TLDR
It is demonstrated that Ignite can improve the runtime performance of in-memory RDD actions and that keeping pipeline components in memory with Ignite and named pipes eliminates a major I/O bottleneck.
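Named pipes are the standard Unix mechanism behind this: two independently started programs stream data through the kernel instead of materializing an intermediate file. A minimal illustration (file names hypothetical):

    mkfifo stage.fifo
    gzip -dc records.gz > stage.fifo &      # producer decompresses into the pipe
    sort -u < stage.fifo > records.sorted   # consumer starts before the producer ends
    rm stage.fifo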
Practically Correct, Just-in-Time Shell Script Parallelization
TLDR
Results show that PaSh-JIT can be used as a drop-in replacement for any non-interactive shell use, providing significant speedups whenever parallelization is possible, without any risk of breakage.
Automatic Synthesis of Parallel and Distributed Unix Commands with KumQuat
We present KumQuat, a system for automatically synthesizing parallel and distributed versions of Unix shell commands. KumQuat follows a divide-and-conquer approach, decomposing commands into (i) a…
How to Analyze Git Repositories with Command Line Tools: We're not in Kansas Anymore
  • D. Spinellis, G. Gousios
  • Computer Science
    2018 IEEE/ACM 40th International Conference on Software Engineering: Companion (ICSE-Companion)
  • 2018
TLDR
This work examines the tools and techniques that can be most effectively used to perform Git data analytics on the command line, following a pattern of fetching, selection, processing, summarization, and reporting.
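The pattern maps directly onto a pipeline; for instance, a hedged illustration (not taken from the paper) that reports the ten most frequent commit authors:

    git log --format='%an' |   # fetch: one author name per commit
    sort | uniq -c |           # process and summarize: commits per author
    sort -rn | head -10        # report: the top ten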
The Once and Future Shell
TLDR
Improving the UNIX shell holds much promise for development, ops, and data processing; several avenues of research building on recent advances are outlined.
Unix shell programming: the next 50 years
TLDR
This paper aims to help manage the shell's essential shortcomings (dynamism, power, and abstruseness) and address its inessential ones.
...

References

Showing 1-10 of 43 references
Composing and executing parallel data-flow graphs with shell pipes
TLDR
These extensions enable the implementation of a class of data-flow computation with strong deterministic properties, and provide a simple yet powerful coordination layer for leveraging multi-language and legacy components for large-scale parallel computation.
UNIX time-sharing system: The UNIX shell
  • S. R. Bourne
  • Computer Science
    The Bell System Technical Journal
  • 1978
The UNIX shell is a command programming language that provides an interface to the UNIX operating system. It contains several mechanisms found in algorithmic languages such as control-flow…
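A minimal sketch of such control flow wrapped around pipelines (not an example from the article):

    # Report each log file that contains error lines.
    for f in *.log; do
        if grep -q 'ERROR' "$f"; then
            printf '%s: %s error lines\n' "$f" "$(grep -c 'ERROR' "$f")"
        fi
    done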
NMRPipe: A multidimensional spectral processing system based on UNIX pipes
TLDR
The asynchronous pipeline scheme provides other substantial advantages, including high flexibility, favorable processing speeds, choice of both all-in-memory and disk-bound processing, easy adaptation to different data formats, simpler software development and maintenance, and the ability to distribute processing tasks on multi-CPU computers and computer networks.
Dryad: distributed data-parallel programs from sequential building blocks
TLDR
The Dryad execution engine handles all the difficult problems of creating a large distributed, concurrent application: scheduling the use of computers and their CPUs, recovering from communication or computer failures, and transporting data between vertices.
Support for graphs of processes in a command interpreter
TLDR
This paper explains why linear pipe structures are too restrictive and how the standard UNIX shells have dictated the interface strategy of new commands, and presents a new command interpreter that supports general directed process graphs.
The synchronous data flow programming language LUSTRE
The authors describe LUSTRE, a data flow synchronous language designed for programming reactive systems, such as automatic control and monitoring systems, as well as for describing hardware. The data…
The UNIX® system document preparation tools: A retrospective
TLDR
This paper examines the family of programs in the UNIX system's document-preparation suite, focusing on their most characteristic aspects, and on the lessons they have learned about both document preparation and software development.
Data flow languages
  • W. Ackerman
  • Computer Science
    1979 International Workshop on Managing Requirements Knowledge (MARK)
  • 1979
TLDR
There are several computer system architectures which have the goal of exploiting parallelism—multiprocessors, vector machines and array processors—and there have been attempts to design compilers to optimize programs written in conventional languages (e.g. "vectorizing" compilers for the FORTRAN language).
Unix tools as visual programming components in a GUI-builder environment
TLDR
It is described how specially designed reflective components can be used in an industry-standard visual programming environment to graphically specify sophisticated data transformation pipelines that interact with GUI elements.
Dataflow process networks
TLDR
Dataflow process networks are shown to be a special case of Kahn process networks, a model of computation in which concurrent processes communicate through unidirectional FIFO channels with nonblocking writes and blocking reads.
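Unix pipes are a close practical analogue of such channels: unidirectional FIFOs with blocking reads (writes block only once the kernel's bounded buffer fills), which is what lets a shell pipeline be read as a dataflow network. A trivial demonstration:

    # The reader blocks on each read until the writer produces an item.
    seq 3 | while read -r x; do
        echo "received $x"
    done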
...