Real-time intelligent big data processing: technology, platform, and applications
@article{Zheng2019RealtimeIB, title={Real-time intelligent big data processing: technology, platform, and applications}, author={Tongya Zheng and Gang Chen and Xinyu Wang and Chun Chen and Xingen Wang and Sihui Luo}, journal={Science China Information Sciences}, year={2019}, volume={62} }
Human beings keep exploring the physical space by means of information. Only recently, with the rapid development of information technologies and the increasing accumulation of data, have human beings been able to learn more about the unknown world with data-driven methods. Given data timeliness, there is a growing awareness of the importance of real-time data. Two categories of technology handle data processing: big data batch processing and stream processing, which have not been integrated…
16 Citations
Challenges and Solutions for Processing Real-Time Big Data Stream: A Systematic Literature Review
- Computer Science
- IEEE Access
- 2020
This study found that various algorithms exist for implementing real-time join processing at the ETL stage for structured data, whereas less work has been done on this subject for unstructured data.
Intelligent cloud workflow management and scheduling method for big data applications
- Computer Science
- Journal of Cloud Computing
- 2020
A cloud workflow scheduling strategy based on an intelligent algorithm is proposed and realized and the two-tier scheduling of cloud workflow tasks is realized by adjusting the combination strategy for cloud service resources.
Dynamic multi-variant relational scheme-based intelligent ETL framework for healthcare management
- Computer Science
- Soft Computing
- 2022
An efficient dynamic multi-variant relational intelligent ETL framework has been presented in this article which improves the performance of ETL with least time complexity and higher performance.
Research and Implementation of Distributed Intelligent Processing Architecture
- Computer Science
- ICGEC
- 2019
In order to meet the needs of modern development, a distributed intelligent processing architecture that is easy to manage is designed and implemented for the problem of inconvenient management of…
Pre‐filtering based summarization for data partitioning in distributed stream processing
- Computer Science
- Concurr. Comput. Pract. Exp.
- 2021
The results show that the proposed pre‐filtering approach significantly outperforms existing designs in terms of prediction accuracy and achieves a more balanced load as compared to the existing designs.
BP Neural Network-Based Big Data Intelligent Travel Algorithm and Its Application
- Computer Science
- Scientific Programming
- 2022
A networked intelligent computing platform for big data of urban transportation integrated with various modes of transportation to sense the operation situation of urban comprehensive transportation system in real time, accurately grasp the space-time distribution of urban transportation supply and demand, and significantly improve the ability of coordinated operation, organization, and management of transportation in large cities.
Distribution-Free One-Pass Learning
- Computer Science
- IEEE Transactions on Knowledge and Data Engineering
- 2021
This paper proposes a simple yet effective approach for distribution-free one-pass learning, without requiring prior knowledge about the change, where every data item can be discarded once scanned.
WavingSketch: An Unbiased and Generic Sketch for Finding Top-k Items in Data Streams
- Computer Science
- KDD
- 2020
This paper proposes a new sketch, WavingSketch, which is much more accurate than existing unbiased algorithms, and shows how it can be applied to four applications: finding top-k frequent items, finding top-k heavy changes, finding top-k persistent items, and finding top-k Super-Spreaders.
ColumnSGD: A Column-oriented Framework for Distributed Stochastic Gradient Descent
- Computer Science
- 2020
This work proposes a novel, column-oriented scheme (ColumnSGD) that partitions training data by columns rather than by rows, leading to a distributed configuration where individual data and model partitions can be collocated on the same machine.
On-Off Sketch: A Fast and Accurate Sketch on Persistence
- Computer Science
- Proc. VLDB Endow.
- 2020
The space complexity of the On-Off sketch is much better than that of the state-of-the-art (PIE), and in experiments it reduces the error by up to 4 orders of magnitude and achieves 2.84 times higher throughput than prior algorithms.
References
Low latency analytics for streaming traffic data with Apache Spark
- Computer Science
- 2015 IEEE International Conference on Big Data (Big Data)
- 2015
This work studies the state-of-the-art in distributed and parallel computing, storage, query and ingestion methods, and evaluates tools for periodical and real-time analysis of heterogeneous data, and introduces a Big Data cloud platform with ingestion, analysis, storage and data query APIs.
The rise of "big data" on cloud computing: Review and open research issues
- Computer Science
- Inf. Syst.
- 2015
Performance Evaluation of Yahoo! S4: A First Look
- Computer Science
- 2012 Seventh International Conference on P2P, Parallel, Grid, Cloud and Internet Computing
- 2012
An empirical evaluation of one application on Yahoo! S4, focused on performance in terms of scalability, lost events, and fault tolerance; it can help in understanding the challenges in developing stream-based data-intensive computing tools and thus provides a guideline for future development.
Towards a Big Data Analytics Framework for IoT and Smart City Applications
- Computer Science
- 2015
This chapter shows what an integrated Big Data analytical framework for Internet of Things and Smart City applications could look like and presents an initial version of such a framework, mainly addressing the volume and velocity challenges.
Liquid: Unifying Nearline and Offline Big Data Integration
- Computer Science
- CIDR
- 2015
This paper describes Liquid, a data integration stack that provides low-latency data access to support near real-time applications in addition to batch applications, and that is cost-efficient and highly available.
Wide-Area Spark Streaming: Automated Routing and Batch Sizing
- Computer Science
- IEEE Transactions on Parallel and Distributed Systems
- 2019
This paper presents the design and implementation of an extended Spark Streaming framework to automatically and optimally schedule tasks, select data flow routes and determine micro-batch sizes across geo-distributed datacenters in wide-area networks and proposes a sparsity-regularized ADMM algorithm to efficiently solve a nonconvex optimization problem.
An introduction to Microsoft SQL server StreamInsight
- Computer Science
- COM.Geo '10
- 2010
This course covers the key concepts in Microsoft StreamInsight and provides developers with step-by-step guidance for building their first data streaming applications.
Wide-Area Spark Streaming: Automated Routing and Batch Sizing
- Computer Science
- 2017 IEEE International Conference on Autonomic Computing (ICAC)
- 2017
This paper focuses on reducing latencies for Spark Streaming queries in wide-area networks by automatically selecting data flow routes and determining micro-batch sizes across geo-distributed datacenters, using a nonconvex optimization problem and an efficient heuristic algorithm based on readily measurable operating traces.
Towards Low-Latency Batched Stream Processing by Pre-Scheduling
- Computer Science
- IEEE Trans. Parallel Distributed Syst.
- 2019
A pre-scheduling straggler mitigation framework called Lever is presented, which can reduce job completion time by 30.72 to 42.19 percent over Spark Streaming, a widely adopted batched stream processing system, and significantly outperforms traditional techniques.