LU Xu, CHEN Yi-hong, XIONG Zhang-rui, LIAO Bin-yu. A fast parallel decision tree algorithm for big data analysis[J]. Journal of Yunnan University: Natural Sciences Edition, 2020, 42(2): 244-251. DOI: 10.7540/j.ynu.20190502

A fast parallel decision tree algorithm for big data analysis

  • To improve the efficiency of training decision trees on large-scale data, this paper proposes SPDT, a parallel decision tree algorithm built on the Spark platform. First, the data are partitioned by column, so that each attribute column is kept whole within a single partition. The data node that caches a partition can then compute information entropy independently, which reduces the network traffic caused by exchanging information between data nodes. Second, after column partitioning, the data are cached in memory as dense vectors, and SPDT compresses them to reduce memory consumption. Finally, SPDT discretizes continuous attributes with a method based on judging the class labels at boundary points, which reduces the number of information entropy calculations during training, and it splits the training set by information gain ratio to reduce the bias of information gain toward attributes with many distinct values. Experimental results show that SPDT trains trees significantly faster than the Apache Spark MLlib decision tree algorithm (MLDT) and the Spark-based vertically partitioned decision tree algorithm (Yggdrasil) while maintaining classification accuracy.
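
The discretization and splitting ideas in the abstract can be illustrated with a minimal single-machine sketch in plain Python. This is not the authors' SPDT implementation and it does not use Spark. It assumes one common reading of "boundary point": a cut is a candidate only between adjacent distinct attribute values whose class labels differ. The function names (`entropy`, `boundary_points`, `gain_ratio`, `best_split`) are hypothetical and chosen for illustration only.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def boundary_points(values, labels):
    """Candidate thresholds for one continuous attribute column.

    Assumption: a midpoint between two adjacent distinct values is kept
    only if the class labels on either side are not one and the same
    single class. All other midpoints are skipped, so fewer entropy
    evaluations are needed.
    """
    classes_by_value = {}
    for v, y in zip(values, labels):
        classes_by_value.setdefault(v, set()).add(y)
    distinct = sorted(classes_by_value)
    cuts = []
    for lo, hi in zip(distinct, distinct[1:]):
        a, b = classes_by_value[lo], classes_by_value[hi]
        if len(a) > 1 or len(b) > 1 or a != b:
            cuts.append((lo + hi) / 2)
    return cuts


def gain_ratio(values, labels, threshold):
    """Information gain ratio of the binary split `value <= threshold`."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0
    n = len(labels)
    conditional = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    gain = entropy(labels) - conditional
    # Split information: entropy of the partition sizes themselves.
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return gain / split_info


def best_split(values, labels):
    """Best boundary-point threshold by gain ratio, or None if there is none."""
    cuts = boundary_points(values, labels)
    if not cuts:
        return None
    return max(cuts, key=lambda t: gain_ratio(values, labels, t))


if __name__ == "__main__":
    column = [1, 2, 3, 4, 5, 6]
    classes = ["a", "a", "a", "b", "b", "b"]
    print(boundary_points(column, classes))
    print(best_split(column, classes))
```

Because `best_split` needs only one attribute column plus the labels, it could in principle run independently on whichever node holds that column, which is the property the column-wise partitioning described above relies on.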