My research interests lie in computer vision, machine learning, and multimedia, with a special focus on developing models that learn general, high-level knowledge about the world from multimodal data such as videos and language.
A large-scale video dataset with densely annotated paragraph timestamps that enables the new research direction of multi-paragraph video grounding on both long-form videos and long-term queries.
First attempt to explore the weakly supervised setting of video paragraph grounding, proposing a Siamese learning framework that jointly performs feature alignment and boundary regression.
Introducing hierarchical modeling into video paragraph grounding by hierarchically aligning semantic correspondences between videos and paragraphs for temporal decoding at multiple granularities.