Abstract:
To improve the efficiency of training decision trees on large-scale data, this paper proposes a parallel decision tree algorithm (SPDT) based on the Spark platform. First, we propose a column-wise data partitioning method that keeps each attribute column intact within a single partition, so the data node caching a partition can compute information entropy independently; this reduces the network traffic caused by information exchange between data nodes. Second, after column-wise partitioning, the data are cached in memory as dense vectors, and SPDT compresses them to reduce memory consumption. Finally, SPDT discretizes continuous attributes using a method based on boundary-point category judgment, which reduces the number of information entropy calculations during training, and it splits the training set by information gain ratio to mitigate the bias of information gain toward attributes with many distinct values. Experimental results show that SPDT trains trees significantly faster than the Apache Spark MLlib decision tree algorithm (MLDT) and the Spark-based vertically partitioned decision tree algorithm (Yggdrasil), while maintaining classification accuracy.
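As a rough, self-contained illustration of the two entropy-related ideas the abstract names (not the paper's actual Spark implementation, and omitting all partitioning and caching machinery), the following sketch shows a standard information gain ratio computation and boundary-point candidate selection for a continuous attribute, where only midpoints between adjacent sorted samples of different classes are considered as splits:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """Information gain ratio of a discrete attribute.

    values[i] is the attribute value of sample i, labels[i] its class.
    Gain ratio = information gain / split information; the split-information
    denominator penalizes attributes with many distinct values.
    """
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    cond = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - cond
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    return gain / split_info if split_info > 0 else 0.0

def boundary_cut_points(values, labels):
    """Candidate cut points for a continuous attribute.

    Sort samples by value; only midpoints between adjacent samples whose
    class labels differ ("boundary points") are kept as candidate splits,
    so entropy is evaluated far fewer times than with all midpoints.
    """
    pairs = sorted(zip(values, labels))
    cuts = []
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 != y2 and v1 != v2:
            cuts.append((v1 + v2) / 2)
    return cuts
```

For example, a perfectly separating attribute yields a gain ratio of 1.0, and for a continuous attribute with samples (1, 2, 3, 4) labeled (0, 0, 1, 1), only the single midpoint 2.5 between the two classes is retained as a candidate split.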