In the digital era, listeners can choose from a large number of radio stations. The choice, however, is usually based only on music genre, and listeners must find out for themselves whether a station's program, schedule, and amount of talking suit their preferences. The ratio of music to talk on a station can be measured manually by listening, but it can also be automated with machine learning. This study concentrates on distinguishing speech from non-speech in radio productions. It optimizes the extraction of numerical features, the algorithms, and the methods for combining them in order to improve the accuracy with which the categories and labels are distinguished. Building on earlier research and combining it with recently introduced technologies and ideas, the paper experiments with a multi-layer classical machine learning setup. The numerical features are extracted from the audio input using existing research and techniques from digital signal processing and audio processing, in combination with optimized parameters.
Based on the literature review, the experimental setup extracts a set of features from the audio tracks, which are manually labeled to create ground-truth data. The experiments cover three algorithms and compare not only the algorithms but also the extraction methods, by tuning the hop and window sizes. Furthermore, two algorithms in the multi-layer setup are tuned using grid search to arrive at a setup specialized for the numerical data.
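To illustrate what tuning the hop and window sizes means in practice, the following is a minimal sketch of frame-based feature extraction. It is an illustration under stated assumptions and not the setup used in the study: the function names are hypothetical, and short-time energy and zero-crossing rate are chosen only as common examples of speech/non-speech features, since the features actually used are not named here.

```python
import math


def frame_signal(samples, window_size, hop_size):
    """Split samples into frames of window_size, advancing by hop_size.

    A hop smaller than the window produces overlapping frames.
    Trailing samples that do not fill a whole window are dropped.
    """
    if window_size <= 0 or hop_size <= 0:
        raise ValueError("window_size and hop_size must be positive")
    frames = []
    start = 0
    while start + window_size <= len(samples):
        frames.append(samples[start:start + window_size])
        start += hop_size
    return frames


def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose sign differs."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def extract_features(samples, window_size, hop_size):
    """Return one (energy, zcr) pair per frame (illustrative features only)."""
    return [
        (short_time_energy(f), zero_crossing_rate(f))
        for f in frame_signal(samples, window_size, hop_size)
    ]


if __name__ == "__main__":
    # Assumed test input: one second of a 440 Hz sine at 8 kHz.
    sample_rate = 8000
    signal = [
        math.sin(2 * math.pi * 440 * n / sample_rate)
        for n in range(sample_rate)
    ]
    # The same audio yields a different number of feature vectors
    # depending on the window/hop combination.
    for window, hop in [(400, 200), (400, 400), (800, 200)]:
        feats = extract_features(signal, window, hop)
        print(f"window={window} hop={hop} -> {len(feats)} frames")
```

The window size sets how much audio each feature vector summarizes, while the hop size sets how densely the frames are sampled, so the two together determine both the time resolution and the number of training examples a classifier sees.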
The results indicate that the numerical extraction, and in particular the choice of hop and window size, is among the most critical factors. The results also indicate that both MLP and XGBoost perform very well, with similar results and negligible differences between them. Further research and experiments are needed to improve the performance of the models, for example by focusing on silence periods and reducing the impact of background noise.