Abstract:
Hawthorn fruits of different varieties differ in nutritional composition, sensory properties, and other characteristics, and therefore require different processing during product development. Traditional analytical methods are time-consuming, costly, and rely on destructive sample preparation, so non-destructive techniques for variety identification are needed to support large-scale production of hawthorn-based foods. In this study, a total of 240 hawthorn fruit samples from four varieties were analyzed by near-infrared spectroscopy, and the collected spectral data were pre-processed with different algorithms. To achieve non-destructive identification of hawthorn varieties, natural language processing (NLP) models were applied for data analysis, including long short-term memory (LSTM) and gated recurrent unit (GRU) neural networks, logistic regression, naive Bayes, decision trees, and k-nearest neighbors. The two deep learning models (LSTM and GRU) both performed best on spectra pre-processed by principal component analysis (PCA), with accuracies of 99.46% ± 0.00% on the validation set and 100% ± 0.00% on the test set. The logistic regression model showed excellent discrimination ability for hawthorn fruit spectra in general, but poorer discrimination ability for spectra pre-processed by the second-order difference (D2) (accuracy of 96.65% on the validation set and 89.58% on the test set). The naive Bayes model also discriminated well on PCA-processed spectra, with an accuracy of 95.65% on the validation set and 95.83% on the test set. These results confirm the feasibility of applying NLP models to the non-destructive near-infrared identification of hawthorn fruit varieties.
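As an illustration of the second-order difference (D2) pretreatment named above, the following minimal sketch computes the discrete second difference of a single spectrum. The abstract does not give the exact D2 formula, so the definition used here (x[i+2] - 2*x[i+1] + x[i]), the function name `second_difference`, and the absorbance values are assumptions made for illustration only, not the study's implementation or data.

```python
from typing import List, Sequence


def second_difference(spectrum: Sequence[float]) -> List[float]:
    """Return the second-order difference of a 1-D spectrum.

    Assumed definition: each output value is x[i+2] - 2*x[i+1] + x[i],
    so the result is two points shorter than the input.
    """
    if len(spectrum) < 3:
        raise ValueError("need at least three points for a second-order difference")
    return [
        spectrum[i + 2] - 2 * spectrum[i + 1] + spectrum[i]
        for i in range(len(spectrum) - 2)
    ]


# Hypothetical absorbance values: a curved band sitting on a sloping linear baseline.
baseline = [0.10 + 0.05 * i for i in range(6)]
band = [0.0, 0.0, 0.2, 0.5, 0.2, 0.0]
raw = [b + p for b, p in zip(baseline, band)]

# The linear baseline contributes (numerically) nothing to the second
# difference, so d2 reflects only the curvature of the band.
d2 = second_difference(raw)
```

Under this definition a constant offset or linear slope in the spectrum is removed, which is a common motivation for difference-based pretreatments. The sharper features that remain also amplify noise, which may be one reason a classifier behaves differently on D2 spectra than on PCA-processed ones.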