Distributed Balanced Cascade Support Vector Machines on the Flink Platform

Abstract: Support Vector Machines (SVMs) are powerful tools for both classification and regression, but in big-data environments they suffer from excessive resource consumption and slow parameter search. Existing SVM implementations on big-data frameworks speed up the search, yet their prediction accuracy still falls short of direct training, and they do not allocate the resources of the training nodes in a reasoned way. This paper therefore proposes a distributed balanced cascade support vector machine on the Flink platform. Building on the previous cascade approach, the method partitions the dataset into balanced subsets that contain samples in the same proportions and rescales the training parameters of each subset; at the same time, it uses Flink's dynamic resource allocation strategy for iterative jobs to shrink each node's resources to just what training requires. The effectiveness of the method is analysed, and the resource usage and model accuracy obtained on several datasets under different training schemes are compared. Experimental results show that the proposed training scheme allocates resources reasonably and flexibly while keeping the loss of model prediction accuracy within 0.1%.
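
The partitioning step described in the abstract, splitting the training data into balanced subsets that keep the same sample proportions, can be pictured as a stratified, round-robin split. The Java sketch below is illustrative only: the names `BalancedPartitioner`, `split`, and `Sample` are not from the paper, the split is interpreted here as preserving per-class proportions, and the paper's per-subset parameter rescaling is only indicated by a comment because its exact form is not reproduced here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Minimal illustration: split a labelled dataset into k subsets that each
 *  preserve the global class proportions (a stratified, "balanced" split). */
public final class BalancedPartitioner {

    /** A labelled sample; the field names are placeholders. */
    public record Sample(double[] features, int label) {}

    public static List<List<Sample>> split(List<Sample> data, int k, long seed) {
        // Group samples by class label.
        Map<Integer, List<Sample>> byClass = new HashMap<>();
        for (Sample s : data) {
            byClass.computeIfAbsent(s.label(), x -> new ArrayList<>()).add(s);
        }
        // Prepare k empty subsets.
        List<List<Sample>> subsets = new ArrayList<>();
        for (int i = 0; i < k; i++) subsets.add(new ArrayList<>());
        // Shuffle each class and deal its samples out round-robin, so every
        // subset receives (approximately) the same fraction of every class.
        Random rng = new Random(seed);
        for (List<Sample> samples : byClass.values()) {
            Collections.shuffle(samples, rng);
            for (int i = 0; i < samples.size(); i++) {
                subsets.get(i % k).add(samples.get(i));
            }
        }
        // The per-subset SVM training parameters (e.g. the penalty term) would
        // be rescaled here according to the paper's rule before local training.
        return subsets;
    }
}
```

In a distributed setting, each subset would then be trained independently on its own node before the cascade stage combines the partial results; the code above only covers the data preparation, not the Flink job itself.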

     
