LU Xu, CHEN Yi-hong, XIONG Zhang-rui, LIAO Bin-yu. A fast parallel decision tree algorithm for big data analysis[J]. Journal of Yunnan University: Natural Sciences Edition, 2020, 42(2): 244-251. DOI: 10.7540/j.ynu.20190502

A fast parallel decision tree algorithm for big data analysis


Abstract: To improve the efficiency of decision tree training on large-scale data, this paper proposes a parallel decision tree algorithm on the Spark platform (SPDT). First, the data is partitioned by column so that each attribute column is kept intact within a single partition; the data node caching that partition can then compute information entropy independently, reducing the network traffic caused by information exchange between data nodes. Second, after column-wise partitioning the data is cached in memory as dense vectors, and SPDT compresses the data to reduce memory consumption. Finally, SPDT discretizes continuous attributes with a method based on boundary-point class judgment, which reduces how often information entropy must be computed during training, and it splits the training set by information gain ratio to reduce the bias of information gain toward attributes with many distinct values. Experimental results show that, in terms of training efficiency, SPDT clearly outperforms the Apache Spark-MLlib decision tree algorithm (MLDT) and the Spark-based vertically partitioned decision tree algorithm (Yggdrasil) while maintaining classification accuracy.
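Two of the techniques named in the abstract, restricting continuous-attribute split candidates to boundary points (adjacent sorted values whose class labels differ) and scoring splits by gain ratio rather than raw information gain, can be sketched on a single machine in plain Python. This is an illustrative sketch only, not the paper's SPDT implementation: the function names and toy data are the editor's own, and the distributed column-partitioned entropy computation on Spark is not reproduced here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def boundary_points(values, labels):
    """Candidate thresholds for a continuous attribute: midpoints between
    adjacent sorted values whose class labels differ. Skipping same-class
    midpoints is what cuts down the number of entropy evaluations."""
    pairs = sorted(zip(values, labels))
    cuts = []
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 != y2 and v1 != v2:
            cuts.append((v1 + v2) / 2)
    return cuts

def gain_ratio(values, labels, threshold):
    """C4.5-style gain ratio for a binary split at `threshold`:
    information gain normalized by the split's intrinsic information,
    which penalizes splits that merely shatter the data."""
    n = len(labels)
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0
    cond = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    gain = entropy(labels) - cond
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return gain / split_info if split_info > 0 else 0.0

# Toy attribute: only two of the five adjacent midpoints are boundary
# points, so only two thresholds need an entropy evaluation.
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labs = ["a", "a", "b", "b", "a", "a"]
cuts = boundary_points(vals, labs)          # [2.5, 4.5]
best = max(cuts, key=lambda t: gain_ratio(vals, labs, t))
```

In a column-partitioned layout like the one the abstract describes, each worker would run this kind of per-attribute scan locally over its own complete columns and ship back only the best (threshold, gain ratio) pair per attribute, rather than exchanging raw data between nodes.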

     
