In this paper, we present a framework for multimodal emotion recognition based on a novel approach to automatic peak frame selection from audio-visual video sequences. Given a video containing an emotional expression, the peak frames are those at which the emotion is at its apex. Peak frame selection summarizes the emotion expressed over a video sequence, which simplifies training of the automatic emotion recognition system. The main steps of the proposed framework are the extraction of video and audio features based on peak frame selection, unimodal classification, and decision-level fusion of the audio and visual results. We evaluated our approach on the eNTERFACE'05 audio-visual database, which contains six basic emotion classes. Experimental results show that the proposed system is effective and outperforms other methods reported in the literature.
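To make the final step concrete, the following is a minimal sketch of decision-level fusion, not the method used in this paper. It assumes that each unimodal classifier outputs one score per emotion class and that the two score vectors are combined by a weighted sum, which is only one of several possible fusion rules. The function names `fuse_decisions` and `predict_emotion` and the weight `w_audio` are hypothetical and introduced for illustration; the class labels are those of the eNTERFACE'05 database.

```python
# Illustrative sketch only: weighted-sum decision-level fusion.
# The fusion rule, function names, and w_audio are assumptions,
# not taken from the paper.

# The six basic emotion classes of eNTERFACE'05.
EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]


def fuse_decisions(p_audio, p_visual, w_audio=0.5):
    """Combine per-class scores from the audio and visual classifiers.

    p_audio, p_visual: sequences of non-negative class scores, same length.
    w_audio: weight given to the audio classifier, in [0, 1].
    Returns the fused scores, normalized to sum to 1.
    """
    if len(p_audio) != len(p_visual):
        raise ValueError("score vectors must have the same length")
    if not 0.0 <= w_audio <= 1.0:
        raise ValueError("w_audio must lie in [0, 1]")
    fused = [w_audio * a + (1.0 - w_audio) * v
             for a, v in zip(p_audio, p_visual)]
    total = sum(fused)
    if total <= 0.0:
        raise ValueError("fused scores must have a positive sum")
    return [f / total for f in fused]


def predict_emotion(p_audio, p_visual, w_audio=0.5):
    """Return the emotion label with the highest fused score."""
    fused = fuse_decisions(p_audio, p_visual, w_audio)
    best = max(range(len(fused)), key=fused.__getitem__)
    return EMOTIONS[best]
```

In such a scheme the weight would typically be tuned on held-out data, since the relative reliability of the audio and visual classifiers depends on the features and the database.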