R U :-) or :-( ? Character- vs. Word-Gram Feature Selection for Sentiment Classification of OSN Corpora
TLDR
This work investigates applying the character n-gram model to text classification of corpora from online social networks, where text is known to be rich in so-called unnatural language. It is the first such documented study and also introduces a novel corpus of Facebook photo comments.
'The First Day of Summer': Parsing Temporal Expressions with Distributed Semantics
TLDR
SUTime, a state-of-the-art NLP system, is extended to incorporate the proposed alternate paradigm of distributed temporal semantics, in which a probability density function models the relative probabilities of the various interpretations.
HarmonicIO: Scalable Data Stream Processing for Scientific Datasets
TLDR
HarmonicIO is presented: a lightweight streaming framework specialized for scientific datasets that has a smart dynamic architecture, is highly elastic, and uses container technology to enforce a clear separation between framework components and the application execution environment.
Adapting the Secretary Hiring Problem for Optimal Hot-Cold Tier Placement Under Top-K Workloads
TLDR
An approach for optimal tiered storage allocation under stream processing workloads that use top-K queries; it derives expressions for optimal parameter values in terms of tier storage and transport costs a priori, without needing to monitor the application.
Apache Spark Streaming, Kafka and HarmonicIO: A Performance Benchmark and Architecture Comparison for Enterprise and Scientific Computing
TLDR
A benchmark of stream processing throughput is presented, comparing Apache Spark Streaming with HarmonicIO, a prototype P2P stream processing framework, and suggesting which frameworks and streaming sources are likely to offer good performance for a given load.
Resource- and Message Size-Aware Scheduling of Stream Processing at the Edge with application to Realtime Microscopy
TLDR
This paper investigates scheduling stream processing in hybrid cloud/edge deployment settings, with sensitivity to CPU costs and message size, aiming to maximize throughput given limited edge resources.
Rapid development of cloud-native intelligent data pipelines for scientific data streams using the HASTE Toolkit
TLDR
A model is proposed that introduces automated, autonomous decision making in data pipelines, so that a stream of data can be partitioned into a tiered or ordered data hierarchy based on data content rather than a priori metadata.
Apache Spark Streaming and HarmonicIO: A Performance and Architecture Comparison
TLDR
A performance benchmark of Apache Spark Streaming (ASS), under both file and TCP streaming modes, against HarmonicIO is presented, comparing maximum throughput over a broad domain of message sizes and CPU loads.
Rapid development of cloud-native intelligent data pipelines for scientific data streams using the HASTE Toolkit
TLDR
The HASTE Toolkit is introduced: a proof-of-concept, cloud-native software toolkit, built on the proposed pipeline model, that partitions and prioritizes data streams to optimize the use of limited computing resources, enabling more efficient data-intensive experiments.
Smart Resource Management for Data Streaming using an Online Bin-packing Strategy
TLDR
Using a real-world use case from large-scale microscopy pipelines, two different auto-scaling strategies, implemented in the HarmonicIO and Spark Streaming frameworks, are compared for efficient resource utilization.