01-12-2014 · Case study · Issue 1/2014 · Open Access
Comparative study between incremental and ensemble learning on data streams: Case study
Journal: Journal of Big Data, Issue 1/2014
Important notes
Electronic supplementary material
The online version of this article (doi:10.1186/2196-1115-1-5) contains supplementary material, which is available to authorized users.
Authors’ contributions
WZ and PZ made substantial contributions to conception and design. WZ was involved in drafting the manuscript; PZ and CZ revised it critically for important intellectual content; LG gave final approval of the version to be published. All authors read and approved the final manuscript.
Background
We are now entering the era of big data. In government, business and industry, big data are generated rapidly and steadily, growing at a rate of millions of records per day. Moreover, these data often exhibit temporal and spatial correlations. Typical examples include wireless sensor data, RFID data and Web traffic data. Such data often arrive unboundedly and rapidly, forming a new class of data called "big stream data".
The key challenge in learning from big stream data is addressing concept drift. Concept drift was first introduced by Widmer and Kubat [1], who noticed that the concept (the classification boundary or clustering centers) changes continuously as time elapses. Based on the speed of change, we formally divide concept drift into loose concept drift and rigorous concept drift [2]. In the former, concepts in adjacent data chunks are sufficiently close to each other; in the latter, genuine concepts in adjacent data chunks may change randomly and rapidly.
Incremental learning [3] and ensemble learning [4] are two fundamental methods for learning from big stream data with concept drift. Incremental learning follows a machine-learning paradigm in which the learning process takes place whenever new examples emerge, adjusting what has been learned accordingly. Ensemble learning, by contrast, employs multiple base learners and combines their predictions. The fundamental principle of dynamic ensemble learning is to divide a large data stream into small data chunks and train a classifier on each data chunk independently. The most prominent difference between incremental learning and traditional machine learning is that incremental learning does not assume a sufficient training set is available before the learning process; training examples appear over time. The biggest difference between incremental learning and ensemble learning is that ensemble learning may discard outdated training data, whereas incremental learning may not.
Although these two types of methods have their own strengths in data stream mining, comparisons between them are rare. Tsymbal [5] described several types of concept drift and related work handling them; however, that survey does not clearly categorize incremental and ensemble learning algorithms, nor does it include experiments on different learning frameworks.
In this paper we comparatively study incremental learning and ensemble learning algorithms, compare their performance in both accuracy and efficiency, and give suggestions for choosing a suitable classifier.
This paper is organized as follows. Section "Incremental learning" reviews and summarizes incremental learning algorithms. Section "Ensemble learning" reviews and classifies ensemble learning algorithms. Section "Experiment results" analyzes and compares incremental and ensemble learning algorithms under a unified standard and presents the experimental results and discussion, followed by the conclusion.
Incremental learning
Generally, the classification problem is defined as follows. A set of N training examples of the form (x, y) is given, where y is a discrete class label and x is a vector of d attributes (each of which may be symbolic or numeric). The goal is to produce from these examples a model y = f(x) which will predict the classes y of future examples x with high accuracy.
To solve this problem, traditional statistical analysis methods would load all training data into memory at once. However, compared to the explosive growth of today's information, storage capacity is far from sufficient. Moreover, traditional data mining algorithms have shown limitations on temporal series. Incremental learning algorithms are an efficient answer to these problems.
According to the underlying learning method, incremental learning can be sorted into three categories: incremental decision trees, incremental Bayesian learning and incremental SVM. According to the number of new instances added to the model at a time, it can be sorted into instance-by-instance learning and block-by-block learning.
Incremental decision tree
VFDT (very fast decision tree) [6] and CVFDT (concept-adapting very fast decision tree) [7] are two classical and influential incremental decision tree algorithms.
VFDT was first proposed by Domingos and Hulten in 2000. Using Hoeffding bounds, the authors showed that a small sample of the available examples suffices when choosing the split attribute at any given node, and that the output is asymptotically nearly identical to that of a conventional learner.
According to the Hoeffding bound, given n independent observations of a real-valued random variable r with range R, with confidence 1 − δ the true mean of r is at least $\bar{r} - \epsilon$, where $\bar{r}$ is the observed mean of the samples and

$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$

(1)
Let G(X_i) be the heuristic measure used to choose test attributes, let X_a be the attribute with the best heuristic measure and X_b the attribute with the second best, and let $\Delta\bar{G} = G(X_a) - G(X_b)$. Applying the Hoeffding bound to $\Delta\bar{G}$: if $\Delta\bar{G} > \epsilon$, we can confidently select X_a as the split attribute. VFDT is thus a real-time system, able to learn from large amounts of data within practical time and memory constraints.
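The split test above can be sketched in a few lines of Python; the function names are ours, and this is an illustrative fragment rather than the VFDT implementation:

```python
import math

def hoeffding_epsilon(value_range, delta, n):
    """Eq. (1): with probability 1 - delta, the observed mean of n samples
    of a variable with range R lies within epsilon of the true mean."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(g_best, g_second_best, value_range, delta, n):
    """VFDT-style split test: split once the observed heuristic advantage
    G(X_a) - G(X_b) exceeds epsilon for the n examples seen at the node."""
    return (g_best - g_second_best) > hoeffding_epsilon(value_range, delta, n)
```

For information gain with binary labels the range R is 1; as n grows, ε shrinks, so the same observed advantage that is inconclusive after 10 examples becomes a confident split decision after a few hundred.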
When it comes to rigorous concept drift, however, VFDT has limitations. To solve this problem, Hulten and Spencer proposed the CVFDT algorithm [7] in 2001, based on VFDT. In CVFDT, each internal node has a list of alternate subtrees being considered as replacements for the subtree rooted at that node, and a parameter limits the total number of alternate trees grown at any one time. Each node with a non-empty set of alternate subtrees, $l_{test}$, enters a testing mode to determine whether it should be replaced by one of its alternates: $l_{test}$ collects the next m training examples that arrive and compares the accuracy of the subtree it roots with the accuracies of all of its alternate subtrees. If the most accurate alternate subtree is more accurate than $l_{test}$, then $l_{test}$ is replaced by that alternate. CVFDT also prunes alternate subtrees during the test phase. For each alternate subtree $l_{alt}^{i}$ of $l_{test}$, CVFDT remembers the smallest accuracy difference ever achieved between the two, $\Delta_{min}(l_{test}, l_{alt}^{i})$, and prunes any alternate whose current test-phase accuracy difference is at least $\Delta_{min}(l_{test}, l_{alt}^{i}) + 1\%$. By this means of alternate subtrees, CVFDT adapts to concept drift better than VFDT.
In summary, both algorithms are real-time methods for data stream mining. VFDT is faster and costs less memory, while CVFDT adapts better to concept drift.
Incremental Bayesian algorithm
Besides advantages such as feasibility, accuracy and speed shared by all incremental learning algorithms, incremental Bayesian algorithms [8-10] can handle training instances without labels. Generally speaking, a Bayesian algorithm implements incremental learning by constantly updating the prior probability according to incoming training instances, as illustrated in Figures 1 and 2.
In the Bayesian algorithm the prior probability P(θ | I_0) is a known quantity, while in incremental Bayesian learning the prior changes to P(θ | S, I_0) as new training instances arrive. What we are concerned with is how to update the prior probability incrementally.
First, we fix some notation. The sample space S is composed of the attribute space I and the class space C, denoted S = {S_1, S_2, …, S_n} = &lt;I, C&gt;. Each sample is S_i = {a_1, a_2, …, a_m, c_l}; attribute A_i takes the values {a_{ik}}, and the class attribute C is composed of l discrete values (c_1, c_2, …, c_l). The task of the classifier is to learn the attribute space I and the class space C and then find the mapping relation between them: given any sample s_i = {a_1, a_2, …, a_m} ∈ I, exactly one c_i in the class attribute set C = (c_1, c_2, …, c_l) corresponds to it. That is to say, for each instance x = (a_1, a_2, …, a_m) ∈ I there exists exactly one c_i such that P(c = c_i | x) ≥ P(c = c_j | x) (j = 1, 2, …, l).
For the training samples D = {x_1, x_2, …, x_n}, assume that the prior probability follows a Dirichlet distribution. We can estimate the parameters as follows:

$\theta_{ikr} = P(A_{ik} \mid c_r; \theta) = \frac{1 + \mathrm{count}(A_{ik} \wedge c_r)}{|A_i| + \mathrm{count}(c_r)}$

(2)

$\theta_{r} = P(c_r \mid \theta) = \frac{1 + \mathrm{count}(c_r)}{|C| + |D|}$

(3)

where A_{ik} is the k-th value of attribute A_i, |A_i| is the number of values of attribute A_i, and |D| is the size of the training sample.

According to the incoming instances T = {x'_1, x'_2, …, x'_m}, we consider two different situations: labeled instances and unlabeled instances. For labeled instances, we update the parameters as follows:

$\theta'_{ikr} = P(A_{ik} \mid c_r; \theta') = \frac{1 + \mathrm{count}(A_{ik} \wedge c_r) + \mathrm{count}'(A_{ik} \wedge c_r)}{|A_i| + \mathrm{count}(c_r) + \mathrm{count}'(c_r)}$

(4)

$\theta'_{r} = P(c_r \mid \theta') = \frac{1 + \mathrm{count}(c_r) + \mathrm{count}'(c_r)}{|C| + |D| + |D'|}$

(5)

where count'(·) and |D'| refer to the incoming instances. For unlabeled instances, let c'_p be the class predicted for the incoming instance; we update the parameters as follows:

$\theta'_{r} = \begin{cases} \frac{\delta}{1+\delta}\,\theta_{r} & c_r \ne c'_p \\ \frac{\delta}{1+\delta}\,\theta_{r} + \frac{1}{1+\delta} & c_r = c'_p \end{cases}$

(6)

where δ = |C| + |D|, and

$\theta'_{ikr} = \begin{cases} \frac{\delta}{1+\delta}\,\theta_{ikr} & c_r = c'_p \wedge A_{ik} \ne A'_{ip} \\ \frac{\delta}{1+\delta}\,\theta_{ikr} + \frac{1}{1+\delta} & c_r = c'_p \wedge A_{ik} = A'_{ip} \\ \theta_{ikr} & c_r \ne c'_p \end{cases}$

(7)
In summary, the Bayesian algorithm itself has an incremental property. For incoming training instances with labels, it is easy to implement an incremental algorithm; for instances without labels, sampling policies and various classification-loss expressions are discussed to simplify and improve the classifiers.
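For discrete attributes and labeled instances, Eqs. (2)-(5) amount to maintaining Laplace-smoothed counts that grow as data arrive. A minimal sketch (class and method names are ours, not from the cited algorithms):

```python
from collections import defaultdict

class IncrementalNaiveBayes:
    """Sketch of Eqs. (2)-(5): Laplace-smoothed counts updated per labeled
    instance; attributes are discrete and indexed by position."""

    def __init__(self, n_values_per_attr, classes):
        self.n_values = n_values_per_attr      # |A_i| for each attribute i
        self.classes = list(classes)           # class space C
        self.class_count = defaultdict(int)    # count(c_r)
        self.joint_count = defaultdict(int)    # count(A_ik ^ c_r)
        self.n_seen = 0                        # |D| (grows with |D'|)

    def update(self, x, c):
        """Absorb one labeled instance by updating the counts."""
        self.n_seen += 1
        self.class_count[c] += 1
        for i, a in enumerate(x):
            self.joint_count[(i, a, c)] += 1

    def class_prior(self, c):
        # Eqs. (3)/(5): (1 + count(c_r)) / (|C| + |D|)
        return (1 + self.class_count[c]) / (len(self.classes) + self.n_seen)

    def cond_prob(self, i, a, c):
        # Eqs. (2)/(4): (1 + count(A_ik ^ c_r)) / (|A_i| + count(c_r))
        return (1 + self.joint_count[(i, a, c)]) / (self.n_values[i] + self.class_count[c])

    def predict(self, x):
        def score(c):
            p = self.class_prior(c)
            for i, a in enumerate(x):
                p *= self.cond_prob(i, a, c)
            return p
        return max(self.classes, key=score)
```

Because only counts are stored, each update is O(m) in the number of attributes, which is what makes the Bayesian approach naturally incremental.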
Incremental SVM
The two core ideas of the SVM algorithm are mapping input vectors into a high-dimensional feature space and structural risk minimization. SVM has a useful property: classification equivalence between the SV set and the whole training set. Based on this property, incremental SVM [11-18] can be trained by preserving only the SVs at each step and adding them to the training set for the next step. Different situations call for different ways of selecting the training set at each step.
The problems discussed in incremental SVM are how to discard historical samples optimally and how to select new training instances in the successive learning procedure. There are still intrinsic difficulties, however. First, the support vectors (SVs) depend heavily on the selected kernel function. Second, when concept drift happens, previous support vectors can become useless.
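The SV-retention scheme can be sketched independently of any particular SVM library. Here `fit_svm` is a hypothetical callback (returning a trained model and the indices of its support vectors); the loop itself is the only part specific to incremental training:

```python
def incremental_svm_train(chunks, fit_svm):
    """Sketch of SV-retention training: after each chunk, keep only the
    support vectors and carry them into the next training round.
    fit_svm(X, y) is a user-supplied callback -> (model, sv_indices)."""
    kept = []    # retained (x, label) support-vector pairs
    model = None
    for chunk in chunks:
        data = kept + list(chunk)
        X = [x for x, _ in data]
        y = [label for _, label in data]
        model, sv_idx = fit_svm(X, y)
        kept = [data[i] for i in sv_idx]
    return model, kept
```

In practice `fit_svm` would wrap a batch solver such as LibSVM; the point of the sketch is that memory stays bounded by the SV set plus one chunk rather than the whole stream.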
Decision trees, Bayesian learning and SVM are three main algorithm families in data mining. The question in incremental learning is how to use previous training results to accelerate the successive learning procedure. Incremental decision trees (Hoeffding trees, VFDT) use a statistical result (the Hoeffding bound) to guarantee learning from abundant data within practical time and memory constraints. Incremental Bayesian algorithms update the prior probability dynamically according to incoming instances. Incremental SVM is based on the classification equivalence of the SV set and the whole training set, so only the support vectors (SVs) need be added to the incoming training set to train a new model incrementally. Of the three, incremental decision trees and incremental Bayesian algorithms are based on empirical risk minimization, while incremental SVM is based on structural risk minimization; the former two are faster, and incremental SVM has better generalization ability.
All of the algorithms above update a classifier dynamically using newly arriving data. On one hand, we need not load all data into memory at once; on the other hand, we can modify the classification model in real time according to new training instances, so the classifier can adapt to concept drift. However, incremental learning algorithms still have shortcomings: they can only absorb new data streams unceasingly and cannot remove old instances from the classification model. Because of this, incremental algorithms are helpless when faced with rigorous concept drift.
Ensemble learning
The fundamental principle of dynamic ensemble learning is to divide a large data stream into small data chunks, train a classifier on each data chunk independently, and finally develop heuristic rules to organize these classifiers into one super classifier.
This structure has many advantages. First, each data chunk is relatively small, so the cost of training a classifier on it is low. Second, we save a trained classifier instead of all the instances in the chunk, which costs much less memory. Third, it can adapt to various concept drifts via different weighting policies. Dynamic ensemble models can therefore cope with both unboundedly increasing amounts of data and concept drift in data stream mining.
There are many heuristic algorithms for ensemble learning. According to the way the base classifiers are formed, they can be roughly divided into two classes: the horizontal ensemble framework and the vertical ensemble framework.
Horizontal ensemble framework
The horizontal ensemble framework tends to select the same type of classifier, train copies independently on different data chunks, and then use a heuristic algorithm to organize them together, as illustrated in Figure 3.
In this framework, almost all research centers on three issues: the weighting policy, data selection and the choice of base classifiers. It can be formulated as:

$f_{HE} = \Sigma_{i=1}^{N} \alpha_i f_i(x)$

(8)

where α_i is the weight assigned to the i-th data chunk, f_i(x) is the classifier trained on the i-th data chunk, and 1 to N index the selected data chunks.
The weighting policy is the most important means in ensemble learning of guaranteeing accuracy. Street [19] proposed the SEA algorithm, which combines all the decision tree models using majority voting; in this algorithm $\alpha_i = \frac{1}{N}$ (i = 1, 2, …, N). Kolter [20] proposed the Dynamic Weighted Majority (DWM) algorithm. Yeon [21] proved that majority voting is the optimal solution in the case of no concept drift. In order to trace concept drift, Wang [22] proposed an accuracy-weighted ensemble (AWE) algorithm, which assigns each classifier a weight derived from its error on the up-to-date chunk: $\alpha_i = MSE_r - MSE_i$, where $MSE_i = \frac{1}{|S_n|} \Sigma_{(x,c) \in S_n} (1 - f_c^i(x))^2$ is the mean square error of f_i(x), S_n is the training set, and $MSE_r = \Sigma_c p(c)(1 - p(c))^2$ is the mean square error of a random classifier, with c ranging over the class labels. Tsymbal [5] proposed a dynamic integration of classifiers in which each base classifier is given a weight proportional to its local accuracy. Zhang [23] developed a kernel mean matching (KMM) method that minimizes the discrepancy of the data chunks in the kernel space for smooth concept drift, and optimal weight values for classifiers trained on the most recent data chunk for abrupt concept drift. Yeon [21] proposed an ensemble model in the form of a weighted average with a ridge regression combiner; the angle between the estimated weights and the optimal weights is used to detect concept drift. When concept drift is smooth, $\alpha_i = \frac{1}{N}$ (i = 1, 2, …, N); otherwise

$\alpha = \arg\min_{\alpha} \Sigma_{i=1}^{n} \left( y_i - \Sigma_{j=1}^{m} \alpha_j f_j(x_i) \right)^2 + \lambda \Sigma_{j=1}^{m} \alpha_j^2 \quad \text{subject to} \quad \Sigma_{j=1}^{m} \alpha_j = 1,\ \alpha_j \ge 0$

where y_i is the label of instance x_i, m is the number of classifiers and n is the number of instances. A penalty coefficient is employed to trace different levels of concept drift.
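The AWE-style weighting just described can be sketched directly from its formulas. Function names are ours; `prob_of_true_class(x, c)` is assumed to return the probability the classifier assigns to the true class c of example x:

```python
def mse_of_classifier(prob_of_true_class, test_set):
    """MSE_i = (1/|S_n|) * sum over (x, c) in S_n of (1 - f_c^i(x))^2."""
    return sum((1.0 - prob_of_true_class(x, c)) ** 2
               for x, c in test_set) / len(test_set)

def random_mse(class_dist):
    """MSE_r = sum over c of p(c) * (1 - p(c))^2, for a random classifier."""
    return sum(p * (1.0 - p) ** 2 for p in class_dist.values())

def awe_weight(mse_i, mse_r):
    """AWE weight for classifier i: alpha_i = MSE_r - MSE_i."""
    return mse_r - mse_i
```

A classifier that does no better than random receives weight 0 or below and can be dropped, which is how the ensemble forgets models built on outdated concepts.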
As to instance selection, weighted instances, data-discarding policies, etc. are discussed. Fan [24] proposed a benefit-based greedy approach which can safely remove more than 90% of the base models while guaranteeing acceptable accuracy. Fan [25] proposed a simple, efficient and accurate cross-validation decision tree ensemble method that discards old data and combines with new data to construct the optimal model for an evolving concept. Zhao [26] proposed a pruning method (PMEP) to keep the ensemble at a proper size. Lu [27] proposed a heuristic metric that considers the trade-off between accuracy and diversity to select the top p percent of ensemble members, depending on resource availability and tolerable waiting time. Kuncheva [28] proposed a concept of "forgetting" by ageing at a variable rate.
Vertical ensemble framework
The vertical ensemble framework tends to select different types of classifiers and train them independently on the up-to-date data chunk, then uses a heuristic algorithm to organize them together. This framework is often used in situations of rigorous concept drift, with little or no correlation between the decision concepts of adjacent data chunks. It is illustrated in Figure 4.
In this framework, we focus more on classifier diversity and a suitable weighting policy. It can be formulated as:

$f_{VE}^{n}(x) = \Sigma_{i=1}^{m} \beta_i f_{in}(x)$

(9)

where β_i is the weight assigned to the i-th classifier and f_{in}(x) is the i-th classifier trained on the n-th data chunk.
In the vertical ensemble framework, classifier diversity is a primary factor in guaranteeing accuracy. Zhang [29] proposed a semi-supervised ensemble method, UDEED, which works by maximizing the accuracy of base learners on labeled data while maximizing the diversity among them on unlabeled data. Zhang [2] proposed optimal weight values for classifiers in the case of abrupt concept drift; in this algorithm the classifiers use different learning algorithms (e.g., decision tree, SVM, LR) and build prediction models on, and only on, the up-to-date data chunk. Minku [30] showed that low-diversity ensembles obtain low error in the case of smooth concept drift, while high-diversity ensembles are better when abrupt concept drift happens.
The weighting policies of the horizontal framework are commonly used in the vertical framework as well: majority voting, accuracy-based weighting, weights fitted by a regression algorithm, and so on.
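When each base classifier outputs a hard label, Eq. (9) reduces to a weighted vote, regardless of whether the classifiers come from different chunks (horizontal) or different algorithms (vertical). A minimal sketch, with illustrative names:

```python
def weighted_vote(classifiers, weights, x):
    """Weighted-vote combiner: each classifier f returns a class label for x,
    and the label accumulating the largest total weight wins."""
    scores = {}
    for beta, f in zip(weights, classifiers):
        label = f(x)
        scores[label] = scores.get(label, 0.0) + beta
    return max(scores, key=scores.get)
```

Setting all weights equal recovers SEA-style majority voting; plugging in accuracy-derived weights recovers the AWE-style combiners described above.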
The horizontal ensemble framework builds classifiers on different data chunks, which makes it robust to noisy streams and concept drift because the final decision is based on classifiers trained from different chunks: even if noisy data chunks or concept drift deteriorate some base classifiers, the ensemble can still maintain relatively stable prediction accuracy. The vertical ensemble framework builds classifiers using different learning algorithms on the same data chunk, which decreases the expected bias error compared to any single classifier. When we have no prior knowledge of the incoming data it is difficult to determine which type of classifier is better, so combining multiple types of classifiers is likely to be a better solution than simply choosing either of them. We can also aggregate the two frameworks, combining the base classifiers into an aggregate ensemble through the model defined in Eq. 10.
$f_{AE} = \Sigma_{i=1}^{n} \Sigma_{j=1}^{m} \alpha_i \beta_j f_{ij}(x)$

(10)
In a word, the core idea of ensemble learning is to organize different weak classifiers into one strong classifier, and the main method used is divide-and-conquer: a large data stream is divided into small data chunks and a classifier is trained on each chunk independently. The difficult problems discussed in ensemble learning are as follows. First, which base classifier should we choose? Second, how should the size of a data chunk be set? Third, how should weights be assigned to different classifiers? Finally, how should previous data be discarded? As for chunk size, a large data chunk is more robust while a small data chunk adapts better to concept drift; and the weighting policy directly influences accuracy.
Experiment results
The aim of the experiments is to compare incremental learning with ensemble learning algorithms. Among incremental algorithms, the incremental decision trees (VFDT and CVFDT), the incremental Bayesian algorithm and incremental SVM were verified experimentally. Among ensemble algorithms, the horizontal and vertical ensemble frameworks were implemented, with AWE chosen to represent the horizontal ensemble framework. We compare the basic characteristics of all algorithms on popular synthetic and real-life data sets.
All of the tested algorithms were implemented in Java as part of the MOA and Weka frameworks. We implemented the AWE algorithm ourselves and implemented incremental SVM in LibSVM, while all the other algorithms were already part of MOA or Weka. The experiments were run on a machine equipped with an AMD Athlon(tm) II X3 435 @ 2.89 GHz processor and 3.25 GB of RAM. To make the experiments more reliable, we ran every algorithm on each data stream (from different starting points) 10 times and calculated the mean and variance of the results; a t-test was used for significance testing. Classification accuracy was calculated using the data block evaluation method, which works similarly to the test-then-train paradigm: incoming examples are read without processing until they form a data block of size d; each new data block is first used to test the existing classifier and is then used to update it.
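The data block evaluation loop can be sketched as follows; `classifier` is assumed to expose hypothetical `predict` and `train` methods, and the names are illustrative rather than MOA's API:

```python
def block_evaluate(stream, classifier, block_size):
    """Data-block (test-then-train) evaluation: buffer examples into blocks
    of size d; test the current model on each full block, then train on it."""
    accuracies = []
    block = []
    for x, y in stream:
        block.append((x, y))
        if len(block) == block_size:
            correct = sum(1 for bx, by in block if classifier.predict(bx) == by)
            accuracies.append(correct / block_size)
            classifier.train(block)  # the tested block becomes training data
            block = []
    return accuracies
```

Because every block is tested before it is learned, the accuracy sequence reflects how well the model tracks the stream, including any drop caused by concept drift.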
Synthetic and real data streams in experiment
This part lists the five data streams used in the experiments: four synthetic data streams (Hyperplane 1, Hyperplane 2, Hyperplane 3 and KDDcup99) and one real data stream (the sensor data stream).
Hyperplane 1, Hyperplane 2 and Hyperplane 3 are generated by the Hyperplane generator in MOA. They each have 9 attributes and one label with 2 classes, and each stream contains 800,000 instances. The difference between the three is the level of concept drift: Hyperplane 1 has no concept drift, Hyperplane 2 has a medium level of concept drift and Hyperplane 3 has abrupt concept drift. The KDDcup99 stream was collected for the KDD CUP 1999 challenge, where the task is to build predictive models capable of distinguishing between intrusions and normal connections; clearly, its instances do not flow the way genuine stream data do. Each instance has 41 attributes and one label with 23 classes. The sensor stream contains information (temperature, humidity, light and sensor voltage) collected from 54 sensors deployed in the Intel Berkeley Research Lab. The whole stream contains consecutive readings recorded over a two-month period (one reading per 1-3 minutes). The sensor ID is used as the class label, so the learning task is to identify the sensor (1 out of 54) purely from the sensor data and the corresponding recording time. As the data stream flows over time, so do the concepts underlying the stream: for example, lighting during working hours is generally stronger than at night, and the temperature of specific sensors (e.g., in a conference room) may rise regularly during meetings. This data stream has 5 attributes and a label with 54 classes.
To make a visual representation of the concept drift, we divided these data streams into small data chunks, trained a C4.5 decision tree on the first chunk, then used this classifier to predict the labels of the following chunks and recorded the accuracy. If there is no concept drift the accuracies will be stable; otherwise they will change dramatically. As shown in Figure 5, the KDDcup99 and Hyperplane 1 data streams have no concept drift, the sensor data stream has the most rigorous concept drift, and Hyperplane 2 and Hyperplane 3 have a medium level of concept drift, with Hyperplane 2's drift relatively rigorous and Hyperplane 3's relatively loose.
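The drift check above can be sketched with any fixed base learner; here a majority-class model stands in for the C4.5 tree, and the chunk format is illustrative:

```python
def drift_profile(chunks):
    """Fit a majority-class model on the first chunk (a stand-in for the
    C4.5 tree), then score each following chunk. Stable accuracies suggest
    no drift; large swings suggest drift."""
    labels = [y for _, y in chunks[0]]
    majority = max(set(labels), key=labels.count)
    return [sum(1 for _, y in chunk if y == majority) / len(chunk)
            for chunk in chunks[1:]]
```

Plotting the returned accuracy sequence per stream reproduces the kind of curve shown in Figure 5: flat for KDDcup99 and Hyperplane 1, volatile for the sensor stream.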
Comparative study
Incremental learning and ensemble learning are two major solutions to large-scale data and concept drift in big stream data mining. Incremental learning is a style of learning in which the learner updates its model of the environment whenever a new significant experience becomes available, while ensemble learning adopts a divide-and-conquer approach to organize different base classifiers into one super classifier. Both can handle infinitely increasing amounts of data and time series, and both meet real-time demands. Beyond these shared advantages, each has its own relative merits, discussed in detail in the following.
Comparative study on accuracy and efficiency
Comparative study on various concept drift
Incremental algorithms cannot adapt well to sudden concept drift, because almost all incremental algorithms update their model according to the incoming data stream but never discard historical knowledge. For example, in incremental Bayesian algorithms the prior probability is updated smoothly according to incoming instances. In incremental SVM, the support vectors (SVs) are directly related to the decision plane and the kernel function, so it is very sensitive to concept drift. Only CVFDT, an incremental decision tree algorithm, can handle time-changing concepts, by growing alternative subtrees; but it costs additional space to maintain the alternative paths, which decreases its efficiency dramatically.
Compared with incremental algorithms, ensemble learning algorithms are more flexible with respect to concept drift. First, the size of the data chunk can be set to fit different levels of drift: small chunks for sudden concept drift and large chunks for smooth concept drift. Second, different weights can be assigned to different base classifiers to handle various concept drifts. Third, different policies for selecting and discarding base classifiers also help.
As a result, ensemble learning algorithms adapt much better to concept drift than incremental learning algorithms.
Generally speaking, incremental algorithms are faster and have better anti-noise capacity than ensemble algorithms, while ensemble algorithms are more flexible and adapt better to concept drift. Moreover, incremental algorithms have more restrictions: not every classification algorithm can be made incremental, but almost every classification algorithm can be used in an ensemble.
Therefore, when there is no concept drift or the drift is smooth, an incremental algorithm is recommended; when huge or abrupt concept drift exists, ensemble algorithms are recommended to guarantee accuracy. Likewise, for relatively simple data streams, or when a high level of real-time processing is demanded, incremental learning is the better choice; for complicated data streams or those of unknown distribution, ensemble learning is better.
Experiments on incremental algorithms
In this part we experiment with different incremental algorithms.
Table 1 shows the accuracy of four kinds of incremental learning algorithms: VFDT (very fast decision tree), CVFDT (concept-adapting very fast decision tree), the incremental Bayesian algorithm and incremental SVM. We can see that accuracy decreases as concept drift increases. CVFDT adapts relatively better to concept drift, but, as Table 1 shows, when a data set has a large number of attributes (KDDcup99) CVFDT cannot work properly. In a word, the majority of incremental algorithms can meet the requirement of real-time processing but do not adapt well to abrupt concept drift.
Table 1 The accuracy (%) of four kinds of incremental learning algorithms

Algorithm              Hyperplane 1   Hyperplane 2   Hyperplane 3   Sensor        KDDcup99
VFDT                   90.42 ±0.13    78.93 ±1.7     82.73 ±3.08    92.21 ±2.19   99.69 ±0.01
CVFDT                  90.44 ±0.14    80.22 ±1.55    84.51 ±2.68    92.22 ±2.12   —
Incremental Bayesian   93.8  ±0.001   73.54 ±2.89    81.65 ±0.023   93.29 ±0.22   98.50 ±0.002
Incremental SVM        90.8  ±0.14    70.5  ±3.79    80.12 ±2.96    91.89 ±1.21   97.96 ±0.97
Experiments on ensemble algorithms
In this section, the horizontal ensemble framework is discussed first, followed by the vertical ensemble framework; finally the two frameworks are compared.
Tables 2, 3 and 4 show the relations between data-chunk size, number of classifiers, and classification accuracy under different levels of concept drift. Two of the most popular and representative horizontal-framework ensemble algorithms, SEA and AWE, are tested; in all ensembles a decision tree is selected as the base classifier. We can see that for smooth concept drift a relatively large data chunk and a small number of classifiers are preferable, while for abrupt concept drift a small data chunk is better.
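The chunk-based training these tables evaluate can be sketched as follows. This is a minimal illustration in which a simple nearest-class-mean learner stands in for the decision trees actually used; all function names and parameters are our own:

```python
# Horizontal (chunk-based) ensemble paradigm: split the stream into
# fixed-size chunks, train one base learner per chunk, keep only the
# most recent learners, and combine them by majority vote.

def train_centroid(chunk):
    """Fit one base learner on a single data chunk: per-class mean vectors."""
    sums, counts = {}, {}
    for x, y in chunk:
        s = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            s[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def predict_centroid(model, x):
    """Predict the class whose mean vector is closest to x."""
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(c, x))
    return min(model, key=lambda y: dist(model[y]))

def ensemble_predict(models, x):
    """Unweighted majority vote over the retained base learners."""
    votes = {}
    for m in models:
        y = predict_centroid(m, x)
        votes[y] = votes.get(y, 0) + 1
    return max(votes, key=votes.get)

def train_ensemble(stream, chunk_size=100, max_models=10):
    """Train one base learner per chunk; discard the oldest when full."""
    models = []
    for start in range(0, len(stream), chunk_size):
        models.append(train_centroid(stream[start:start + chunk_size]))
        if len(models) > max_models:
            models.pop(0)   # discard-oldest policy, one of several options
    return models

# Toy repeating two-class stream of 200 labelled points.
stream = [((0.0, 0.0), "a"), ((0.1, 0.0), "a"),
          ((1.0, 1.0), "b"), ((0.9, 1.1), "b")] * 50
models = train_ensemble(stream, chunk_size=4, max_models=10)
```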
Table 2
The mean of the accuracy on data stream Hyperplane1 (%)

Classifier number   Algorithm   Chunk size 500   Chunk size 1000   Chunk size 2000
10                  SEA         83.66 ±2.49      84.23 ±2.01       86.39 ±0.35
10                  AWE         92.88 ±0.11      93.34 ±0.05       93.61 ±0.10
20                  SEA         85.82 ±3.14      87.06 ±0.53       87.49 ±0.79
20                  AWE         93.33 ±0.12      93.63 ±0.06       93.79 ±0.09
30                  SEA         84.23 ±2.01      86.94 ±1.62       88.14 ±0.27
30                  AWE         93.49 ±0.10      93.70 ±0.06       93.85 ±0.08

Table 3
The mean of the accuracy on data stream Hyperplane2 (%)

Classifier number   Algorithm   Chunk size 500   Chunk size 1000   Chunk size 2000
10                  SEA         77.06 ±0.97      87.91 ±1.97       87.36 ±0.78
10                  AWE         84.94 ±3.87      85.72 ±0.53       89.09 ±0.012
20                  SEA         86.32 ±1.21      87.17 ±0.99       91.15 ±0.29
20                  AWE         87.96 ±0.52      89.27 ±0.96       91.42 ±0.26
30                  SEA         89.22 ±1.56      88.49 ±0.5        90.44 ±0.08
30                  AWE         90.36 ±0.43      89.44 ±0.46       90.39 ±0.04

Table 4
The mean of the accuracy on data stream Hyperplane3 (%)

Classifier number   Algorithm   Chunk size 500   Chunk size 1000   Chunk size 2000
10                  SEA         77.08 ±9.59      78.44 ±11.07      77.42 ±4.03
10                  AWE         88.04 ±2.52      86.99 ±2.97       85.13 ±2.20
20                  SEA         79.4  ±18.41     78.58 ±5.78       75.8  ±4.98
20                  AWE         88.26 ±2.61      87.17 ±3.12       85.53 ±2.04
30                  SEA         79.26 ±6.22      78.82 ±8.45       75.44 ±1.84
30                  AWE         88.75 ±2.36      87.86 ±2.70       86.44 ±1.84

In these tables we can see that the AWE algorithm outperforms the SEA algorithm, especially under concept drift (Hyperplane3). Moreover, different weighting policies lead directly to different accuracies on the test instances, and many papers on ensemble algorithms focus on weighting policies. Besides the weighting policy, data-chunk size and classifier number are two other factors that influence the performance of ensemble algorithms.
Figure 6 shows accuracy against the number of classifiers. With little or no concept drift more classifiers are better, while with abrupt concept drift fewer classifiers are better, although the influence of this factor is not pronounced.
Figure 7 shows accuracy against data-chunk size. With little or no concept drift a large data chunk is better, and under concept drift a small data chunk is better; the influence of chunk size is obvious. In both Figures 6 and 7, streams with less concept drift achieve better performance.
Figure 8 shows variance against the number of classifiers. The influence of classifier number is weak; only under high-level concept drift do more base classifiers yield more stable results.
Figure 9 shows variance against data-chunk size: the bigger the data chunk, the more stable the algorithm's performance, and this tendency is very clear. In both Figures 8 and 9 the AWE algorithm is more stable than SEA, less concept drift leads directly to better performance, and the influence on variance is more obvious than the influence on accuracy.
As shown in Figure 10, apart from KDDcup99, the vertical ensemble performs better on the sensor stream than on the other data streams. KDDcup99 is a stream on which every classification algorithm achieves an outstanding result, and sensor is the stream with the highest level of concept drift; that is, under large concept drift the vertical ensemble algorithm is a better choice. In vertical ensemble algorithms we generally select a data chunk of no more than 1,000 instances; moreover, chunks of fewer than 500 instances are unstable and cannot achieve good performance.
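An accuracy-based weighting policy of the kind discussed above can be sketched as follows, weighting each base classifier's vote by its accuracy on the newest labelled chunk. This is a simplification of AWE's MSE-based weights, and all names are illustrative:

```python
def chunk_accuracy(predict, chunk):
    """Accuracy of one base classifier on the newest labelled chunk."""
    correct = sum(1 for x, y in chunk if predict(x) == y)
    return correct / len(chunk)

def weighted_vote(classifiers, newest_chunk, x):
    """Accuracy-weighted vote: recently accurate classifiers count more."""
    votes = {}
    for predict in classifiers:
        w = chunk_accuracy(predict, newest_chunk)  # the weighting policy
        y = predict(x)
        votes[y] = votes.get(y, 0.0) + w
    return max(votes, key=votes.get)

# Hypothetical base classifiers: two constant voters and one accurate one.
newest = [((0,), "a"), ((1,), "b")]
clfs = [lambda x: "a",
        lambda x: "b",
        lambda x: "a" if x[0] == 0 else "b"]
```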
As shown in Table 5, the horizontal ensemble framework does a better job when concept drift is relatively smooth, while the vertical framework does better under abrupt concept drift. The vertical ensemble framework can also be regarded as an extreme case of the horizontal ensemble framework in which only one base classifier is selected and trained on the latest data chunk.
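The vertical idea of training several base learners on only the most recent chunk and combining their votes can be sketched as follows; the two toy base learners are our own stand-ins for the heterogeneous classifiers a real vertical ensemble would use:

```python
def majority_class(chunk):
    """Base learner 1: always predicts the chunk's most frequent class."""
    counts = {}
    for _, y in chunk:
        counts[y] = counts.get(y, 0) + 1
    label = max(counts, key=counts.get)
    return lambda x: label

def one_nn(chunk):
    """Base learner 2: 1-nearest-neighbour over the chunk."""
    def predict(x):
        dist = lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x))
        return min(chunk, key=dist)[1]
    return predict

def vertical_ensemble(latest_chunk):
    """Train every base learner on the single most recent chunk only."""
    learners = [majority_class(latest_chunk), one_nn(latest_chunk)]
    def predict(x):
        votes = {}
        for clf in learners:
            y = clf(x)
            votes[y] = votes.get(y, 0) + 1
        return max(votes, key=votes.get)  # ties break by insertion order
    return predict

# Only the latest chunk is kept; older data has already been discarded.
latest = [((0.0,), "a"), ((0.1,), "a"), ((1.0,), "b")]
predict = vertical_ensemble(latest)
```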
Table 5
Competitive study on accuracy (%)

Algorithm                            Hyperplane 1   Hyperplane 2   Hyperplane 3   Sensor        KDDcup99
Average vote horizontal ensemble     88.14 ±0.27    91.15 ±0.29    78.82 ±8.45    89.05 ±3.85   99.44 ±0.01
Accuracy based horizontal ensemble   93.85 ±0.08    91.42 ±0.26    88.75 ±2.36    87.34 ±2.17   99.31 ±0.04
Average vote vertical ensemble       92.4  ±0.01    82.5  ±0.07    83.9  ±0.04    95    ±0.02   98.6  ±0.01
Accuracy based vertical ensemble     93.6  ±0.01    84.3  ±0.06    86.6  ±0.04    94    ±0.05   98.4  ±0.01

Comparative experiments
In this section, we comparatively discuss the advantages and disadvantages of incremental and ensemble algorithms. For comparability, a decision tree is selected as the base classifier in every algorithm: VFDT serves as the representative incremental algorithm, and an accuracy-based weighting scheme is used in both the horizontal and the vertical ensemble algorithms.
Table 6 shows that the ensemble algorithms are more accurate than the incremental algorithm. Under high-level concept drift the vertical ensemble performs better, while under smooth or no concept drift the horizontal ensemble is better. However, when a single classifier already classifies very well, ensemble learning is not as good as incremental learning.
Table 6
The accuracy of different algorithms (%)

Algorithm             Hyperplane 1   Hyperplane 2   Hyperplane 3   Sensor        KDDcup99
VFDT                  90.42 ±0.13    78.93 ±0.17    82.73 ±3.08    92.21 ±2.19   99.69 ±0.01
Horizontal ensemble   93.85 ±0.08    86.28 ±0.74    88.75 ±2.63    87.34 ±2.17   99.31 ±0.043
Vertical ensemble     93.9  ±0.013   84.3  ±0.67    86.6  ±0.04    95.1  ±0.35   98.4  ±0.005

Table 7 shows the time cost of the different algorithms. Hyperplane 1, Hyperplane 2 and Hyperplane 3 each contain 300,000 instances, sensor contains 10,000 instances, and KDDcup99 contains 100,000 instances. We can see that the incremental algorithm is clearly faster than the ensemble algorithms, while the horizontal and vertical ensemble algorithms have similar time costs.
Table 7
Cost time of different algorithms

Algorithm             Hyperplane 1   Hyperplane 2   Hyperplane 3   Sensor   KDDcup99
VFDT                  3254.9         3310.6         3246.9         203.2    3479
Horizontal ensemble   67760          67913          64237          5426     348114
Vertical ensemble     63145          59876          62110          5897     30146

In a word, ensemble learning is more accurate than incremental learning, and incremental learning is more efficient than ensemble learning.
Conclusion
Unbounded growth of big stream data and concept drift have been the two most difficult problems in data-stream mining. There are two mainstream solutions: incremental learning and ensemble learning. Incremental learning algorithms update a single model by incorporating newly arrived data, while ensemble learning algorithms use a divide-and-conquer method that cuts the large stream into small data chunks, trains a classifier on each chunk independently, and then combines these classifiers with a heuristic algorithm. For incremental algorithms, the central question is how to retain previous knowledge while adapting to new knowledge; for ensemble algorithms, it is how to design a weighting policy for the base classifiers.
Both families can handle big stream data and concept drift, and each has its own strengths. Incremental learning algorithms are more efficient, while ensemble learning adapts better to concept drift and is more stable. The data-chunk size is another important factor influencing ensemble performance: generally, the higher the level of concept drift, the smaller the data chunk should be to achieve high accuracy.
Therefore, in the case of loose concept drift or no concept drift an incremental algorithm is recommended, and in the case of rigorous concept drift an ensemble algorithm is the better choice. Likewise, when efficiency is the first consideration we tend to select an incremental algorithm, and when accuracy matters most we choose an ensemble algorithm; the choice can be made according to the real distribution of the data stream.
Weighting policies, instance selection, classifier diversity and so on are the main issues discussed in previous research. With the rapid development of the information industry, we must face the reality of information explosion: more and more classifiers will be trained, and real-time processing will become a challenge. The next step is therefore how to manage a large number of classifiers effectively, for example with pruning methods or indexing techniques over the classifiers, or with parallel algorithms for organizing them.
Acknowledgments
This work was supported by the NSFC (No. 61370025), 863 projects (No.2011AA01A103 and 2012AA012502), 973 project (No. 2013CB329605 and 2013CB329606), and the Strategic Leading Science and Technology Projects of Chinese Academy of Sciences (No.XDA06030200).
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 2.0 International License (
https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.