Real-time intelligent big data processing: technology, platform, and applications

@article{Zheng2019RealtimeIB,
  title={Real-time intelligent big data processing: technology, platform, and applications},
  author={Tongya Zheng and Gang Chen and Xinyu Wang and Chun Chen and Xingen Wang and Sihui Luo},
  journal={Science China Information Sciences},
  year={2019},
  volume={62}
}
Human beings keep exploring the physical space by means of information. Only recently, with the rapid development of information technologies and the increasing accumulation of data, have human beings been able to learn more about the unknown world with data-driven methods. Given the timeliness of data, there is a growing awareness of the importance of real-time data. Two categories of technologies account for data processing: batch processing and stream processing, which have not been integrated… 
Challenges and Solutions for Processing Real-Time Big Data Stream: A Systematic Literature Review
TLDR
This study found that various algorithms exist for implementing real-time join processing at the ETL stage for structured data, whereas less work on unstructured data is found in this area.
Intelligent cloud workflow management and scheduling method for big data applications
TLDR
A cloud workflow scheduling strategy based on an intelligent algorithm is proposed and realized and the two-tier scheduling of cloud workflow tasks is realized by adjusting the combination strategy for cloud service resources.
Dynamic multi-variant relational scheme-based intelligent ETL framework for healthcare management
TLDR
An efficient dynamic multi-variant relational intelligent ETL framework is presented in this article, which improves the performance of ETL with the least time complexity.
Research and Implementation of Distributed Intelligent Processing Architecture
In order to meet the needs of modern development, a distributed intelligent processing architecture that is easy to manage is designed and implemented for the problem of inconvenient management of
Pre‐filtering based summarization for data partitioning in distributed stream processing
TLDR
The results show that the proposed pre‐filtering approach significantly outperforms existing designs in terms of prediction accuracy and achieves a more balanced load as compared to the existing designs.
BP Neural Network-Based Big Data Intelligent Travel Algorithm and Its Application
Yan Huang, Computer Science, Scientific Programming, 2022
TLDR
A networked intelligent computing platform for urban transportation big data, integrated with various modes of transportation, that senses the operating state of the urban comprehensive transportation system in real time, accurately grasps the space-time distribution of urban transportation supply and demand, and significantly improves the ability of coordinated operation, organization, and management of transportation in large cities.
Distribution-Free One-Pass Learning
TLDR
This paper proposes a simple yet effective approach for distribution-free one-pass learning, without requiring prior knowledge about the change, where every data item can be discarded once scanned.
WavingSketch: An Unbiased and Generic Sketch for Finding Top-k Items in Data Streams
TLDR
This paper proposes a new sketch, WavingSketch, which is much more accurate than existing unbiased algorithms, and shows how it can be applied to four applications: finding top-k frequent items, finding top-k heavy changes, finding top-k persistent items, and finding top-k Super-Spreaders.
ColumnSGD: A Column-oriented Framework for Distributed Stochastic Gradient Descent
TLDR
This work proposes a novel, column-oriented scheme (ColumnSGD) that partitions training data by columns rather than by rows, leading to a distributed configuration where individual data and model partitions can be collocated on the same machine.
On-Off Sketch: A Fast and Accurate Sketch on Persistence
TLDR
The space complexity of the On-Off sketch is much better than that of the state-of-the-art (PIE), and in experiments it reduces the error by up to 4 orders of magnitude and achieves 2.84 times higher throughput than prior algorithms.

References

Low latency analytics for streaming traffic data with Apache Spark
TLDR
This work studies the state-of-the-art in distributed and parallel computing, storage, query and ingestion methods, and evaluates tools for periodical and real-time analysis of heterogeneous data, and introduces a Big Data cloud platform with ingestion, analysis, storage and data query APIs.
Performance Evaluation of Yahoo! S4: A First Look
TLDR
An empirical evaluation of one application on Yahoo! S4, focused on performance in terms of scalability, lost events, and fault tolerance, which can help in understanding the challenges of developing stream-based data-intensive computing tools and thus provide a guideline for future development.
Towards a Big Data Analytics Framework for IoT and Smart City Applications
TLDR
This chapter shows what an integrated Big Data analytical framework for Internet of Things and Smart City applications could look like and presents an initial version of such a framework, mainly addressing the volume and velocity challenges.
Liquid: Unifying Nearline and Offline Big Data Integration
TLDR
Liquid is described, a data integration stack that provides low latency data access to support near real-time in addition to batch applications, and is cost-efficient and highly available.
Wide-Area Spark Streaming: Automated Routing and Batch Sizing
TLDR
This paper presents the design and implementation of an extended Spark Streaming framework to automatically and optimally schedule tasks, select data flow routes and determine micro-batch sizes across geo-distributed datacenters in wide-area networks and proposes a sparsity-regularized ADMM algorithm to efficiently solve a nonconvex optimization problem.
An introduction to Microsoft SQL server StreamInsight
TLDR
This course covers the key concepts in Microsoft StreamInsight and provides developers with a step-by-step guidance to build their first data streaming applications.
Wide-Area Spark Streaming: Automated Routing and Batch Sizing
TLDR
This paper focuses on reducing latencies for Spark Streaming queries in wide-area networks by automatically selecting data flow routes and determining micro-batch sizes across geo-distributed datacenters, using a nonconvex optimization problem and an efficient heuristic algorithm based on readily measurable operating traces.
Towards Low-Latency Batched Stream Processing by Pre-Scheduling
TLDR
A pre-scheduling straggler mitigation framework called Lever is presented, which can reduce job completion time by 30.72 to 42.19 percent over Spark Streaming, a widely adopted batched stream processing system, and significantly outperforms traditional techniques.