Problem-dependent, genetically evolving data mining solutions.
Neukart, Florian; Moraru, Sorin-Aurel; Szakacs Simon, Peter et al.
Abstract: The previously introduced generic Data Mining system SHOCID
(System applying High Order Computational Intelligence in Data Mining)
(Neukart et al., 2011) applies Computational Intelligence (CI) paradigms
to solve any numeric Data Mining problem statement. In this paper, we
introduce the evolutionary approach by which the system is able to
decide on its own which of the possible evolutionary approaches is best
suited to a presented problem statement. Moreover, by the application of
genetic algorithms, the system is able to adapt the architecture and
learning method of the Data Mining solution until it reaches the optimal
solution, or at least comes close to it.
Key words: computational intelligence, data mining, genetic
algorithms, metaheuristic algorithms
1. INTRODUCTION
The application of CI paradigms within Data Mining applications
still lacks sophistication and, furthermore, requires the user to
understand the underlying algorithms. SHOCID does not require the user
to understand how the result of a mining process is achieved. Depending
on the problem, the system is able to combine techniques and, when
allowed to, to decide on its own which strategy suits best. CI
substitutes intensive computation for insight into how the system works
(Karplus, 1998).
2. CURRENT RESEARCH
Genetic algorithms (GAs) belong to the class of metaheuristic
algorithms and are used for solving optimization problems or for
training Artificial Neural Networks (ANNs). GAs are adaptive, robust
algorithms, and particularly useful for applications conducting search
and optimization. GAs are population-based algorithms in which all
communication and interaction are carried out within the population;
they therefore possess a high degree of implicit parallelism, which is
especially useful for solving NP-hard problems. There is, however, no
unified definition of what genetic algorithms exactly are, but there are
certain generally accepted elements that all descriptions of GAs have in
common (Fulcher et al., 2008):
* A population of chromosomes encoding candidate solutions,
* a mechanism for reproduction,
* selection according to fitness, and
* genetic operators.
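The following minimal Python sketch illustrates the four elements
listed above on a toy problem (maximizing the number of 1-bits in a
chromosome); the population size, rates, and helper names are
illustrative choices for this example, not SHOCID's actual parameters:

import random

POP_SIZE, GENE_LEN, P_CROSS, P_MUT = 20, 16, 0.7, 0.01

def fitness(chrom):
    # Candidate solutions are encoded as bit strings; here fitness simply
    # counts the 1-bits ("OneMax"), standing in for a problem-specific measure.
    return sum(chrom)

def select(population):
    # Fitness-proportionate (roulette-wheel) selection according to fitness.
    total = sum(fitness(c) for c in population)
    pick = random.uniform(0, total)
    acc = 0
    for c in population:
        acc += fitness(c)
        if acc >= pick:
            return c
    return population[-1]

def crossover(a, b):
    # Single-point recombination of two parent chromosomes.
    if random.random() < P_CROSS:
        point = random.randrange(1, GENE_LEN)
        return a[:point] + b[point:], b[:point] + a[point:]
    return a[:], b[:]

def mutate(chrom):
    # Each gene flips independently with probability P_MUT.
    return [g ^ 1 if random.random() < P_MUT else g for g in chrom]

population = [[random.randint(0, 1) for _ in range(GENE_LEN)]
              for _ in range(POP_SIZE)]
for generation in range(50):
    offspring = []
    while len(offspring) < POP_SIZE:
        child1, child2 = crossover(select(population), select(population))
        offspring += [mutate(child1), mutate(child2)]
    population = offspring
print(max(fitness(c) for c in population))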
In ANNs making use of GAs, evolution can be introduced at various
levels, starting from weight evolution, proceeding to architecture
adaptation, and finally leading to the evolution of the learning
mechanism (Jain et al., 2008; Yao, 1999). Genetic algorithms represent
possible solutions to a problem as chromosomes, and the set of all
chromosomes forms a population. Some chromosomes might represent fairly
good solutions, others might not. Whether a solution is good or bad is
determined by a so-called fitness function. The fitness function
constitutes the measure of fitness of a given individual in the
population, which allows the selection of the fittest individuals (that
is, those with the highest fitness value), in accordance with the
evolutionary principle of the survival of the fittest (Rutkowski, 2008).
These chromosomes are granted the privilege of recombination. For ANNs,
the fitness function represents the error function. Both recombination
and mutation are carried out with a predefined probability: the
probability of recombination is evaluated for each chromosome, and the
mutation probability for each single gene. At recombination, the
selected chromosomes are grouped into pairs, and for each pair of
chromosomes a random number within a given range is drawn.
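As a hedged illustration of these per-pair and per-gene probabilities,
the following sketch applies them to real-valued chromosomes such as
ANN weight vectors; the rates and the Gaussian mutation step are
assumptions made for the example:

import random

P_RECOMB, P_MUT, SIGMA = 0.7, 0.05, 0.1

def recombine_pair(a, b):
    # For each pair a random number is drawn; if it falls below the
    # recombination probability, the parents swap genes after a random point.
    if random.random() < P_RECOMB:
        point = random.randrange(1, len(a))
        return a[:point] + b[point:], b[:point] + a[point:]
    return a[:], b[:]

def mutate(chrom):
    # The mutation probability is applied to each single gene.
    return [g + random.gauss(0.0, SIGMA) if random.random() < P_MUT else g
            for g in chrom]

# The selected chromosomes are grouped into pairs and recombined.
selected = [[random.uniform(-1, 1) for _ in range(6)] for _ in range(4)]
offspring = []
for a, b in zip(selected[::2], selected[1::2]):
    c1, c2 = recombine_pair(a, b)
    offspring += [mutate(c1), mutate(c2)]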
3. ALGORITHM FOR EVOLVING DM-SOLUTIONS
The evolutionary search for the solution of a presented DM problem
statement is conducted by an evolutionary adaptation of each ANN's
weights within an ANN committee, combined with the evolutionary
adaptation of each ANN's architecture, as long as neither an optimal
solution nor one close to it has been found.
3.1 Genetic determination of ANN weights
Several chromosomes, or possible solutions (ANNs) differing in
their weights, are created and evaluated against the optimal solution
during the training phase. The evaluation is conducted by applying the
Root Mean Square Error (RMSE):
X_{rmse} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (actual_i - desired_i)^2}   (1)
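A short Python sketch of Eq. (1) as it would be used to score a weight
chromosome; the forward callable is a hypothetical stand-in for the
ANN's output computation and is not part of SHOCID's published
interface:

import math

def rmse(actual, desired):
    # X_rmse = sqrt( (1/n) * sum_i (actual_i - desired_i)^2 ), Eq. (1)
    n = len(actual)
    return math.sqrt(sum((a - d) ** 2 for a, d in zip(actual, desired)) / n)

def chromosome_fitness(weights, data, forward):
    # A lower RMSE means a fitter chromosome, so fitness is the negated error.
    actual = [forward(weights, x) for x, _ in data]
    desired = [y for _, y in data]
    return -rmse(actual, desired)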
3.2 Genetic determination of solution architecture
The evolutionary architecture adaptation is achieved by a
constructive algorithm, meaning that the algorithm adds complexity to an
ANN starting from a simple structure (Frean, 1990; Mascioli et al.,
1995; Omlin et al., 1993; Stepniewski et al., 1997).
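A minimal sketch of the constructive idea, under the assumption that
complexity is added stepwise: neurons are appended to the current hidden
layer until a maximum is reached, and only then is a further hidden
layer opened (the parameter names anticipate the breakdown in Fig. 1):

def grow_architecture(n_min, n_max, m_max, step=1):
    layers = [n_min]               # start from a simple structure
    while True:
        yield list(layers)         # current candidate architecture
        if layers[-1] < n_max:
            layers[-1] += step     # add hidden neurons to the current layer
        elif len(layers) < m_max:
            layers.append(n_min)   # open a further hidden layer
        else:
            return

for arch in grow_architecture(n_min=2, n_max=4, m_max=2):
    print(arch)   # [2], [3], [4], [4, 2], [4, 3], [4, 4]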
3.3 Evolving committee machines
In our approach, we let SHOCID combine ANNs into committee machines.
However, when applying a committee machine the detection of a correct
solution is highly probable, but the detection of the optimal one
unfortunately is not. Only a percentage of the overall training data is
used for training; the remaining data sets are used for verifying the
solution. Through this verification it can be determined how accurately
the created DM model solves a presented problem, which is likewise done
by applying the RMSE. The system requires any presented problem to run
through the following algorithm (Fig. 1):
Fig. 1. Problem-dependent genetic evolution
1. Start
2. Creation of an initial committee with n_c members and, for each
   member, a population of n_p ANNs with m_min hidden layers and
   n_min hidden neurons for each hidden layer
3. Repeat
   a) If n < n_max:
      i. Increase n by c_n
   b) If n == n_max:
      i. Increase m by c_m
   c) Repeat
      i. Creation of a population of n_p ANNs with m_cur hidden layers
         and n hidden neurons for each hidden layer
      ii. Randomization of the weights and threshold values of each
          chromosome
      iii. Repeat
         1. Calculate the network output for the value d ∈ D
         2. Evaluate the fitness of each chromosome:
            1. Calculate the error \delta_{d \in D} for each output
               neuron o_n:
               \delta_{d \in D} = (desired_{d \in D} - actual_{d \in D})
               \, actual_{d \in D} \, (1 - actual_{d \in D})
            2. Calculate the error \delta_{hid} for each hidden neuron
               hid:
               \delta_{hid} = actual_{hid} \, (1 - actual_{hid})
               \sum_{d \in D} (w_{hidden} \, \delta_{d \in D})
         3. Selection of chromosomes to recombine
         4. Repeat
            1. Crossover of chromosomes
            2. Mutation of offspring
         5. Until all selected chromosomes are recombined
      iv. Until the stopping criterion is reached
   d) Until the number of committee members is reached
   e) Verification
      i. Present each verification data set
      ii. Increase the committee's error array by 1 for each
          verification data set with RMSE > r_max
4. Until m_max is reached
5. Comparison of the stored ANN solutions by RMSE
6. Present the best solution
7. End
Breakdown:
n_c: number of committee members
n_p: number of chromosomes per committee
m_min / m_max: minimum/maximum number of hidden layers
n_min / n_max: minimum/maximum number of neurons per hidden layer m_n
c_n / c_m: hidden neuron/layer counters
d ∈ D: actual input
\delta_{d \in D}: error for input d ∈ D at output neuron o_n in the
current generation
\delta_{hid}: error for hidden neuron hid
r_max: allowed RMSE
Before applying the algorithm, SHOCID determines the number of
input and output neurons according to the presented training data.
n_max can be determined automatically, based on the number of input
neurons. m_max will never exceed 2, as with 2 hidden layers ANNs are
capable of representing an arbitrary decision boundary to arbitrary
accuracy with rational activation functions and can approximate any
smooth mapping to any accuracy (Heaton Research, 2008). The algorithm
shows that the system creates a number of committees n_c, each of them
having n + 1 neurons in the actual hidden layer compared to the last
committee. When the system has reached n_max for the current hidden
layer m_cur, the second hidden layer, which simultaneously represents
m_max, is created. For m_max, the same constructive procedure for
creating the neurons is applied, with the exception that for calculating
n_max the hidden layer m_max - 1 is drawn on. During the verification
phase, an array is created for each committee, which is increased by 1
whenever r_max has been exceeded for the actual data set. The best
solution is the one with the lowest array value.
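The verification and selection step described above can be sketched as
follows; evaluate is a hypothetical stand-in for running one committee
on one verification data set and returning its RMSE:

def verify_committees(committees, verification_sets, r_max, evaluate):
    # One error counter per committee, increased by 1 whenever the RMSE
    # on a verification data set exceeds the allowed RMSE r_max.
    error_counts = [0] * len(committees)
    for data_set in verification_sets:
        for i, committee in enumerate(committees):
            if evaluate(committee, data_set) > r_max:
                error_counts[i] += 1
    # The best solution is the one with the lowest array value.
    best = min(range(len(committees)), key=lambda i: error_counts[i])
    return committees[best], error_counts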
4. RESULTS
Through the application of problem-dependent genetic evolution of a
DM solution, the system is able to detect the most accurate solution
without user interaction. In the conducted tests, the system was able
to find solutions of at least the accuracy that manually defined ANNs
produce on a specific problem. At least, because manually defined ANNs
are usually built by applying rules of thumb, such as those for
calculating the number of hidden neurons only. This performs well for
most of the presented problems, but the contrary might also happen. As
the evolving architecture does not only rely on such rules, but also
verifies the performance of committees consisting of ANNs with more or
fewer hidden neurons and layers, SHOCID also detects such solutions and
takes them into consideration. Current verification results on the
hitherto presented training data showed an accuracy between 80 and 90%.
However, the introduced algorithm is restricted to the use of
feed-forward ANNs for classification, prediction and regression.
5. FUTURE RESEARCH
The committee machines of SHOCID, as well as the different
implementations of evolutionary approaches working together in the
mining strategy processes, have been developed according to agent
theory. Each of the ANNs is a single program, fulfilling a task in its
own environment by the use of its own special skills. The combination
of the single agents' skills empowers them to solve complex Data Mining
problems. Future research will target BDI agents to further increase
the system's performance (Neukart et al., 2011).
6. CONCLUSION
SHOCID takes decisions for the user, a capability we denominate
Decision Intelligence. In summary, SHOCID is the first DM system
capable of solving most presented numeric DM problems by CI approaches,
regardless of their nature. This is made possible by the system's
ability to come to decisions on its own.
7. REFERENCES
Abraham, A. et al. (2008). Engineering Evolutionary Intelligent
Systems. Berlin Heidelberg: Springer-Verlag, p. 5
Frean, M. (1990). The upstart algorithm: a method for constructing
and training feed-forward neural networks. Neural Computation 2:198-209
Fulcher, J. et al. (2008). Computational Intelligence: A Compendium.
Berlin Heidelberg: Springer
Heaton Research (2008). The Number of Hidden Layers [accessed
2011-09-28]; URL: http://www.heatonresearch.com/node/707
Jain, L. C. et al. (2008). Computational Intelligence Paradigms:
Innovative Applications. Berlin Heidelberg: Springer-Verlag, p. 4
Karplus, W. (1998). Cited in: Kaynak, O., Zadeh, L. A., Turksen, B.,
Rudas, I. J. (eds.) Computational Intelligence: Soft Computing and
Fuzzy-Neuro Integration with Applications. Berlin: Springer-Verlag
Mascioli, F. et al. (1995). A constructive algorithm for binary
neural networks: the oil spot algorithm. IEEE Transactions on Neural
Networks 6(3):794-797
Neukart, F. et al. (2011). High Order Computational Intelligence in
Data Mining: A generic approach to systemic intelligent Data Mining.
Proceedings of the 6th Conference on Speech Technology and
Human-Computer Dialogue (SpeD), IEEE Press, p. 1-9
Omlin, C. W. et al. (1993). Pruning recurrent neural networks for
improved generalization performance. Technical Report No. 93-6, CS
Department, Rensselaer Polytechnic Institute, Troy, NY
Rutkowski, L. (2008). Computational Intelligence: Methods and
Techniques. Berlin Heidelberg: Springer-Verlag, p. 268
Stepniewski, S. W. et al. (1997). Pruning back-propagation neural
networks using modern stochastic optimization techniques. Neural
Computing & Applications 5:76-98
Yao, X. (1999). Evolving artificial neural networks. Proceedings of
the IEEE 87(9), 1423-1447