Aligning objects with corresponding textual descriptions is a fundamental challenge and a realistic requirement in vision-language understanding. While recent multimodal embedding models excel at ...
Abstract: Text-Pedestrian Image Retrieval employs textual description of pedestrian's appearance to identify the corresponding pedestrian image. This task involves modality discrepancy and the ...
Abstract: Contemporary advancements in Earth observation technologies have generated substantial data resources for remote sensing image retrieval applications. However, existing models exhibit ...