Extending Unix Pipelines to DAGs

@article{Spinellis2017ExtendingUP,
  title={Extending Unix Pipelines to DAGs},
  author={Diomidis D. Spinellis and Marios Fragkoulis},
  journal={IEEE Transactions on Computers},
  year={2017},
  volume={66},
  pages={1547-1561}
}
The Unix shell dgsh provides an expressive way to construct sophisticated and efficient non-linear pipelines. Such pipelines can use standard Unix tools, as well as third-party and custom-built components. Dgsh allows the specification of pipelines that perform non-uniform non-linear processing. These form a directed acyclic process graph, which is typically executed by multiple processor cores, thus increasing the processing task’s throughput. A number of existing Unix tools have been adapted… 
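As a concrete illustration, here is a minimal sketch in dgsh's multipipe-block notation (the {{ ... }} construct the paper describes); the dgsh-adapted tee scatters its input across the block's branches, which run in parallel, and the dgsh-adapted cat gathers their results:

  #!/usr/bin/env dgsh
  # Benchmark three compressors on the same input stream: tee scatters
  # the input to the three branches, which run concurrently on
  # separate cores; cat gathers the resulting sizes.
  tee |
  {{
    xz -c | wc -c
    bzip2 -c | wc -c
    gzip -c | wc -c
  }} |
  cat

The process graph here is a diamond rather than a line: one source fans out to three independent branches whose outputs are merged, which is exactly what a conventional linear pipeline cannot express.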

Citations

An Order-aware Dataflow Model for Extracting Shell Script Parallelism
TLDR
This work uses a large number of real scripts to evaluate the parallel performance delivered by the dataflow transformations, including the contributions of individual transformations, achieving an average speedup of 6.14× and a maximum of 61.1× on a 64-core machine.
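The transformation can be pictured as a hand-written divide-and-conquer rewrite of a single pipeline stage. The sketch below (using GNU split; file names are illustrative) splits the input on line boundaries, runs the stage on each chunk in parallel, and merges the partial results with the order-aware aggregator sort -m:

  split -n l/4 input.txt chunk.              # divide the input on line boundaries
  for f in chunk.*; do
    grep 'pattern' "$f" | sort > "$f.out" &  # run the stage on each chunk in parallel
  done
  wait                                       # let all parallel stages finish
  sort -m chunk.*.out                        # order-aware merge of sorted partial results
  rm chunk.*                                 # clean up chunks and partial results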
Automatic synthesis of parallel unix commands and pipelines with KumQuat
TLDR
KumQuat automatically synthesizes the combine operators, with a domain-specific combiner language acting as a strong regularizer that promotes efficient inference of correct combiners and enables the effective parallelization of the authors' benchmark scripts.
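To make the idea concrete, here is a hand-rolled sketch of the scheme KumQuat synthesizes automatically, for the command wc -l, whose combine operator is addition (file names are illustrative):

  split -n l/4 input.txt part.       # divide the input into four chunks
  for f in part.*; do
    wc -l < "$f" &                   # apply the original command to each chunk
  done |
  awk '{s += $1} END {print s}'      # combine operator: sum the partial counts
  rm part.*

Because addition is commutative, the nondeterministic order in which the parallel instances finish does not affect the result.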
Streamlining the Genomics Processing Pipeline via Named Pipes and Persistent Spark Datasets
TLDR
It is demonstrated that Ignite can improve the runtime performance of in-memory RDD actions and that keeping pipeline components in memory with Ignite and named pipes eliminates a major I/O bottleneck.
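The named-pipe technique itself is plain Unix. A minimal sketch (the stage commands aligner and caller are hypothetical placeholders) connects two pipeline stages through a FIFO so the intermediate data never lands on disk:

  mkfifo intermediate.fifo                # create the named pipe
  aligner reads.fq > intermediate.fifo &  # producer writes into the FIFO
  caller < intermediate.fifo > out.vcf    # consumer reads concurrently
  rm intermediate.fifo                    # remove the FIFO when done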
Automatic Synthesis of Parallel and Distributed Unix Commands with KumQuat
We present KumQuat, a system for automatically synthesizing parallel and distributed versions of Unix shell commands. KumQuat follows a divide-and-conquer approach, decomposing commands into (i) a…
How to Analyze Git Repositories with Command Line Tools: We're not in Kansas Anymore
  • D. Spinellis, Georgios Gousios
  • Computer Science
    2018 IEEE/ACM 40th International Conference on Software Engineering: Companion (ICSE-Companion)
  • 2018
TLDR
This work examines the tools and techniques that can be most effectively used to perform Git data analytics on the command line, through a pattern that involves fetching, selection, processing, summarization, and reporting.
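The pattern is easy to instantiate with standard tools; for example, a sketch that reports a repository's ten most active committers:

  git log --pretty='%an' |   # fetch: one author name per commit
  sort |                     # selection/processing: group identical authors
  uniq -c |                  # summarization: count commits per author
  sort -rn |                 # order by commit count
  head -10                   # reporting: the ten most active committers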
The Once and Future Shell
TLDR
Improving the UNIX shell holds much promise for development, ops, and data processing; several avenues of research building on recent advances are outlined.
Unix shell programming: the next 50 years
TLDR
This paper aims to help manage the shell's essential shortcomings (dynamism, power, and abstruseness) and address its inessential ones.
The future of the shell: Unix and beyond
TLDR
This 90-minute panel brings together researchers and engineers from disparate communities to think about the Unix shell's strengths and weaknesses, challenges and opportunities around the shell, and the shell's future.
Report on the "The Future of the Shell" Panel at HotOS 2021
This document summarizes the challenges and possible research directions around the shell and its ecosystem, collected during and after the HotOS21 Panel on the future of the shell. The goal is to…
An order-aware dataflow model for parallel Unix pipelines
TLDR
A dataflow model for parallel Unix shell pipelines is presented, and the semantics of transformations that exploit the data parallelism available in Unix shell computations are captured and proved correct.

References

Showing 1–10 of 44 references
Composing and executing parallel data-flow graphs with shell pipes
TLDR
These extensions enable the implementation of a class of data-flow computation with strong deterministic properties, and provide a simple yet powerful coordination layer for leveraging multi-language and legacy components for large-scale parallel computation.
UNIX time-sharing system: The UNIX shell
  • S. R. Bourne
  • Computer Science
    The Bell System Technical Journal
  • 1978
The UNIX shell is a command programming language that provides an interface to the UNIX operating system. It contains several mechanisms found in algorithmic languages such as control-flow…
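For illustration (a generic example, not code from the paper), the control-flow mechanisms the shell borrowed from algorithmic languages look like this:

  for f in *.log; do               # iterate over matching files
    if grep -q 'ERROR' "$f"; then  # conditional on grep's exit status
      echo "$f contains errors"
    fi
  done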
NMRPipe: A multidimensional spectral processing system based on UNIX pipes
TLDR
The asynchronous pipeline scheme provides other substantial advantages, including high flexibility, favorable processing speeds, choice of both all-in-memory and disk-bound processing, easy adaptation to different data formats, simpler software development and maintenance, and the ability to distribute processing tasks on multi-CPU computers and computer networks.
Dryad: distributed data-parallel programs from sequential building blocks
TLDR
The Dryad execution engine handles all the difficult problems of creating a large distributed, concurrent application: scheduling the use of computers and their CPUs, recovering from communication or computer failures, and transporting data between vertices.
Support for graphs of processes in a command interpreter
TLDR
This paper explains why linear pipe structures are too restrictive and how the standard UNIX shells have dictated the interface strategy of new commands, and presents a new command interpreter that supports general directed process graphs.
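The restriction is easy to demonstrate: in a conventional shell a pipeline is a straight line, so feeding one stream to two consumers requires a workaround rather than a first-class graph. A small sketch using bash's process substitution:

  # Count both lines and bytes of the same stream; the two counts
  # appear on stdout in nondeterministic order.
  seq 1 100 | tee >(wc -l) | wc -c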
The synchronous data flow programming language LUSTRE
The authors describe LUSTRE, a data flow synchronous language designed for programming reactive systems, such as automatic control and monitoring systems, as well as for describing hardware. The data…
The UNIX® system document preparation tools: A retrospective
TLDR
This paper examines the family of programs in the UNIX system's document-preparation suite, focusing on their most characteristic aspects, and on the lessons they have learned about both document preparation and software development.
Data flow languages
  • W. Ackerman
  • Computer Science
    1979 International Workshop on Managing Requirements Knowledge (MARK)
  • 1979
TLDR
There are several computer system architectures which have the goal of exploiting parallelism—multiprocessors, vector machines and array processors—and there have been attempts to design compilers to optimize programs written in conventional languages (e.g. "vectorizing" compilers for the FORTRAN language).
Unix tools as visual programming components in a GUI‐builder environment
TLDR
It is described how specially designed reflective components can be used in an industry‐standard visual programming environment to graphically specify sophisticated data transformation pipelines that interact with GUI elements.
The UNIX system: The evolution of the UNIX time-sharing system
  • D. Ritchie
  • Computer Science
    AT&T Bell Lab. Tech. J.
  • 1984
TLDR
A brief history of the early development of the UNIX™ operating system is presented, focusing on the evolution of the file system, the process-control mechanism, and the idea of pipelined commands.