A fast parallel decision tree algorithm for big data analysis

陆旭; 陈毅红; 熊章瑞; 廖彬宇

doi:10.7540/j.ynu.20190502

Abstract

To improve the efficiency of training decision trees on large-scale data, this paper proposes SPDT, a parallel decision tree algorithm built on the Spark platform. First, the data are partitioned by column so that each attribute column is kept whole within a single partition; the data node that caches a partition can then compute information entropy independently, which reduces the network traffic caused by communication between data nodes. Second, after column partitioning the data are cached in memory as dense vectors, and SPDT compresses them to reduce memory consumption. Finally, SPDT handles continuous attributes with a discretization method based on the class labels at boundary points, which reduces the number of information entropy calculations during training, and it splits the training set by information gain ratio, which reduces the dependence of the information gain calculation on attributes with many distinct values. Experimental results show that, while maintaining classification accuracy, SPDT achieves a clear improvement in training efficiency over the Apache Spark-MLlib decision tree algorithm (MLDT) and the Spark-based vertically partitioned decision tree algorithm (Yggdrasil).
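The two ideas in the last step, boundary-point discretization and splitting by information gain ratio, can be illustrated with a minimal single-process sketch. It is not the SPDT implementation: it uses plain Python rather than Spark, the function names (`entropy`, `boundary_points`, `gain_ratio`, `best_cut`) are hypothetical, and it assumes the common definition of a boundary point as the midpoint between two adjacent distinct attribute values whose examples do not all share one class.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def boundary_points(values, labels):
    """Candidate thresholds for one continuous attribute column.

    Only midpoints between adjacent distinct values whose examples do not
    all belong to one class are kept (assumed boundary-point definition).
    """
    classes_at = defaultdict(set)
    for v, y in zip(values, labels):
        classes_at[v].add(y)
    distinct = sorted(classes_at)
    cuts = []
    for lo, hi in zip(distinct, distinct[1:]):
        if len(classes_at[lo] | classes_at[hi]) > 1:
            cuts.append((lo + hi) / 2)
    return cuts


def gain_ratio(values, labels, cut):
    """Information gain ratio of the binary split `value <= cut`."""
    left = [y for v, y in zip(values, labels) if v <= cut]
    right = [y for v, y in zip(values, labels) if v > cut]
    n = len(labels)
    conditional = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    gain = entropy(labels) - conditional
    # Split information: entropy of the partition sizes themselves.
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return gain / split_info if split_info > 0 else 0.0


def best_cut(values, labels):
    """Boundary point with the highest gain ratio, or None if there is none."""
    cuts = boundary_points(values, labels)
    return max(cuts, key=lambda c: gain_ratio(values, labels, c), default=None)


if __name__ == "__main__":
    column = [1, 2, 3, 4, 5, 6]
    classes = ["a", "a", "a", "b", "b", "b"]
    print(boundary_points(column, classes))
    print(best_cut(column, classes))
```

Because only boundary points are evaluated, the six-value column above yields a single candidate threshold instead of five, which is where the reduction in entropy calculations comes from. Each function takes one attribute column plus the labels, which mirrors the column-partitioned layout described in the abstract, where a node holding a whole column can evaluate its splits without data from other nodes.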
