HUMAN ACTION RECOGNITION

The customer is in the multimedia and entertainment industry and wants a model that processes a video or scene and generates a caption for it.

TECHNOLOGIES / TOOLS

Caffe

OpenCV

CNN – Object recognition – VGG pretrained on the ILSVRC-2012 dataset

LSTM networks

MODEL HIGHLIGHTS

Sequence to Sequence

Handles variable-length video input and generates variable-length caption output

Two-stage LSTM: the first LSTM encodes the input video frame sequence, and the second LSTM takes part in both encoding and decoding to model the output word sequence
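
The highlights above can be illustrated with a minimal sketch of how inputs might be laid out over time for the two LSTMs. It is not the project's actual code. It assumes a layout common in sequence-to-sequence video captioning: during encoding the first LSTM reads one frame feature per step while the word input is padding, and during decoding the frame input is padding while the second LSTM reads the previous word and predicts the next. The function name build_schedule, the tokens <pad>, <BOS>, <EOS>, and the frame_N labels are hypothetical placeholders.

```python
from typing import List, Tuple

# Hypothetical placeholder tokens (assumptions, not taken from the project).
PAD = "<pad>"  # fed to a layer when it has no real input at this step
BOS = "<BOS>"  # begin-of-sentence token fed at the first decoding step
EOS = "<EOS>"  # end-of-sentence token the decoder should emit last

Step = Tuple[str, str, str, str]  # (stage, frame_input, word_input, target)


def build_schedule(num_frames: int, caption: List[str]) -> List[Step]:
    """Lay out per-time-step inputs for a two-LSTM video captioner."""
    steps: List[Step] = []

    # Encoding stage: one step per frame, no words read or predicted.
    for t in range(num_frames):
        steps.append(("encode", f"frame_{t}", PAD, PAD))

    # Decoding stage: frames are exhausted, so the frame input is padding.
    # Each step reads the previous word and is trained to predict the next.
    previous_words = [BOS] + caption
    targets = caption + [EOS]
    for prev, target in zip(previous_words, targets):
        steps.append(("decode", PAD, prev, target))

    return steps


# Usage: a 3-frame clip with a 3-word caption.
for step in build_schedule(3, ["a", "man", "runs"]):
    print(step)
```

Because the number of encoding steps follows the frame count and the number of decoding steps follows the caption length, the same layout covers clips and captions of any length.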