A fast parallel decision tree algorithm for big data analysis

LU Xu; CHEN Yi-hong; XIONG Zhang-rui; LIAO Bin-yu

doi:10.7540/j.ynu.20190502

LU Xu, CHEN Yi-hong, XIONG Zhang-rui, LIAO Bin-yu. A fast parallel decision tree algorithm for big data analysis[J]. Journal of Yunnan University: Natural Sciences Edition, 2020, 42(2): 244-251. DOI: 10.7540/j.ynu.20190502

Abstract

To improve the efficiency of training decision trees on large-scale data, this paper proposes SPDT, a parallel decision tree algorithm built on the Spark platform. First, SPDT partitions the data by column, so that each attribute column is held entirely within a single partition. The data node that caches a partition can then compute information entropy for that attribute independently, which reduces the network traffic caused by exchanging information between data nodes. Second, after column partitioning, the data is cached in memory as dense vectors, and SPDT compresses it to reduce memory consumption. Finally, SPDT discretizes continuous attributes by judging the class labels at boundary points, which reduces the number of information entropy calculations during tree training, and it splits the training data set by information gain ratio to reduce the bias of information gain toward attributes with many distinct values. Experimental results show that SPDT trains trees significantly faster than the Apache Spark MLlib decision tree algorithm (MLDT) and the Spark-based vertically partitioned decision tree algorithm (Yggdrasil), while maintaining classification accuracy.
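The two split-selection ideas named in the abstract can be illustrated on a single attribute column. The following is a minimal single-machine sketch, not the SPDT implementation: it assumes that "boundary points" means midpoints between adjacent distinct values whose class labels differ, and that the gain ratio is the usual C4.5-style ratio of information gain to split information. The function names (`entropy`, `boundary_points`, `gain_ratio`, `best_split`) are hypothetical and chosen for illustration only.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def boundary_points(values, labels):
    """Candidate thresholds for one continuous attribute column.

    Assumption: a boundary point is the midpoint between two adjacent
    distinct values whose sets of class labels are not the same single
    class. Midpoints inside a run of one class are skipped, so fewer
    entropy evaluations are needed.
    """
    classes_at = {}
    for v, y in zip(values, labels):
        classes_at.setdefault(v, set()).add(y)
    distinct = sorted(classes_at)
    cuts = []
    for lo, hi in zip(distinct, distinct[1:]):
        a, b = classes_at[lo], classes_at[hi]
        if a != b or len(a) > 1:
            cuts.append((lo + hi) / 2)
    return cuts


def gain_ratio(values, labels, threshold):
    """Information gain of the binary split at `threshold`, divided by
    the split information (C4.5-style gain ratio)."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    w_left, w_right = len(left) / n, len(right) / n
    gain = entropy(labels) - w_left * entropy(left) - w_right * entropy(right)
    split_info = -sum(w * log2(w) for w in (w_left, w_right) if w > 0)
    return gain / split_info if split_info > 0 else 0.0


def best_split(values, labels):
    """Threshold with the highest gain ratio among the boundary points,
    or None if the column has no boundary point."""
    cuts = boundary_points(values, labels)
    if not cuts:
        return None
    return max(cuts, key=lambda t: gain_ratio(values, labels, t))
```

Because every function here needs only one attribute column plus the class labels, a node holding that column could in principle evaluate all of its candidate splits without fetching other columns, which is the property the column-wise partitioning is meant to provide.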
