In clinical practice, manual scoring by technicians is the primary method for sleep arousal detection, but it is time-consuming and subjective. This study aimed to achieve end-to-end detection of sleep arousal events by constructing a convolutional neural network based on multi-scale convolutional layers and a self-attention mechanism, using 1-min single-channel electroencephalogram (EEG) signals as input. Compared with the baseline model, the proposed method improved both the mean area under the precision-recall curve and the area under the receiver operating characteristic curve by 7%. We also compared the effects of single-modality and multi-modality inputs on the performance of the proposed model. The results demonstrated the effectiveness of single-channel EEG signals for automatic sleep arousal detection, whereas simply combining multi-modality signals may be counterproductive for model performance. Finally, we explored the scalability of the proposed model by transferring it to the automated sleep staging task on the same dataset. The average accuracy of 73% suggests the potential of the proposed method for task transfer. This study provides a potential solution for the development of portable sleep monitoring and paves the way for automatic sleep data analysis using transfer learning.