Abstract:
Support Vector Machines (SVMs) are powerful tools for classification and regression, but in big data environments they suffer from high resource consumption and slow optimization. Existing SVM implementations on big data frameworks speed up the optimization, but their prediction accuracy falls short of that of direct training, and they do not allocate resources to training nodes reasonably. This paper therefore proposes a distributed balanced cascade vector machine on the Flink platform. Building on the previous approach, it divides the dataset into balanced subsets that contain the same proportion of samples and scales down the training parameters of the subsets. It also uses Flink's dynamic resource allocation strategy for iterative operations, so that each node receives only the resources needed to meet its training demand. Experiments comparing resource occupation and model accuracy on several datasets under different training methods show that the proposed method allocates resources reasonably and flexibly while reducing the model's prediction accuracy error to about 0.1%.
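The balanced partitioning step can be illustrated with a minimal sketch. It assumes that "the same proportion of samples" means each subset preserves the class proportions of the full dataset. The helper name `balanced_partitions` and the round-robin assignment are hypothetical choices made for illustration and are not taken from the paper; the sketch runs on a single machine and does not use Flink.

```python
from collections import defaultdict


def balanced_partitions(labels, num_subsets):
    """Split sample indices into subsets with near-equal class proportions.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    if num_subsets < 1:
        raise ValueError("num_subsets must be at least 1")

    # Group sample indices by class label, preserving input order.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # Deal each class's indices round-robin across the subsets, so every
    # subset receives roughly 1/num_subsets of each class.
    subsets = [[] for _ in range(num_subsets)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            subsets[position % num_subsets].append(index)
    return subsets


# Toy dataset (assumed): six positive and three negative samples.
labels = [1, 1, 1, 1, 1, 1, -1, -1, -1]
parts = balanced_partitions(labels, 3)
for part in parts:
    print(part, [labels[i] for i in part])
```

In this toy run each of the three subsets keeps the 2:1 class ratio of the full dataset, which is the property a cascade of sub-SVMs would rely on when each node trains on one subset.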