Recent advances in text-video retrieval have been driven largely by contrastive learning frameworks. However, existing methods often overlook the impact of the modality gap, which forces anchor representations into in-place optimization (i.e., optimization tension): gradients from positive and negative pairs cancel each other out, limiting alignment capacity. Moreover, noisy hard negatives further distort the semantic learning of anchors. To address these issues, we propose GARE, a Gap-Aware Retrieval framework that introduces a learnable, pair-specific increment between each text-video pair, offloading the optimization tension from the shared anchor representations.
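As a rough illustration of the idea (a minimal sketch, not the repository's implementation; `delta_net`, the tensor shapes, and the loss form are all assumptions), the pair-specific increment can be viewed as a learned offset added to the text anchor before computing each text-video similarity:

```python
import torch
import torch.nn.functional as F

def gap_aware_contrastive(text_emb, video_emb, delta_net, temp=0.07):
    """Toy sketch: symmetric InfoNCE with a learnable pair-specific increment.

    text_emb:  (B, D) L2-normalized text embeddings
    video_emb: (B, D) L2-normalized video embeddings
    delta_net: any module mapping a concatenated (text, video) pair to a (D,) offset
    """
    B = text_emb.size(0)
    t = text_emb.unsqueeze(1).expand(B, B, -1)   # (B, B, D) text anchors
    v = video_emb.unsqueeze(0).expand(B, B, -1)  # (B, B, D) video candidates
    delta = delta_net(torch.cat([t, v], dim=-1)) # (B, B, D) pair-specific increment
    sims = F.cosine_similarity(t + delta, v, dim=-1) / temp  # (B, B) logits
    labels = torch.arange(B, device=sims.device)
    # Matched pairs sit on the diagonal; average both retrieval directions.
    return 0.5 * (F.cross_entropy(sims, labels) + F.cross_entropy(sims.t(), labels))
```

To set up the environment: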
conda create -n GARE python=3.8
conda activate GARE
pip install -r requirements.txt
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117

cd tvr/models
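A quick sanity check that the CUDA build of PyTorch is active (plain PyTorch API, nothing repo-specific):

```python
import torch

# Expect "1.13.1+cu117 True" on a correctly configured GPU machine.
print(torch.__version__, torch.cuda.is_available())
```

Download the pretrained CLIP weights (ViT-B/32 by default; the commented lines fetch the alternative variants):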
wget https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt
# wget https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt
# wget https://openaipublic.azureedge.net/clip/models/b8cca3fd41ae0c99ba7e8951adf17d267cdb84cd88be6f7c2e0eca1737a03836/ViT-L-14.pt

Then compress the raw videos for training:

python preprocess/compress_video.py --input_root [raw_video_path] --output_root [compressed_video_path]

This script compresses each video to 3 fps and resizes it so that its shorter side is 224 pixels (width 224 or height 224). Edit the variables in the script to customize the output.
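For reference, a minimal sketch of what such a compression step does, assuming an ffmpeg-style re-encode (the repository's preprocess/compress_video.py is the authoritative version; this standalone helper and its names are illustrative):

```python
import os
import subprocess

def compress_video(src, dst, fps=3, short_side=224):
    """Re-encode one video to `fps` frames/s, scaling its shorter side to
    `short_side` while preserving aspect ratio (-2 keeps dimensions even)."""
    os.makedirs(os.path.dirname(dst) or ".", exist_ok=True)
    # Scale whichever side is shorter down to `short_side`, then resample frames.
    vf = (f"scale='if(gt(iw,ih),-2,{short_side})'"
          f":'if(gt(iw,ih),{short_side},-2)',fps={fps}")
    subprocess.run(["ffmpeg", "-y", "-i", src, "-vf", vf, dst], check=True)
```

Training on MSR-VTT with 4 GPUs: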
export CUDA_VISIBLE_DEVICES=0,1,2,3
DATA_PATH=/home/username/gare/data/MSR-VTT
python -m torch.distributed.launch \
--master_port 29510 \
--nproc_per_node=4 \
main_retrieval.py \
--do_train 1 \
--workers 8 \
--n_display 50 \
--epochs 5 \
--lr 1e-4 \
--coef_lr 1e-3 \
--batch_size 128 \
--batch_size_val 128 \
--anno_path data/MSR-VTT/anns \
--video_path ${DATA_PATH}/3fps_videos \
--datatype msrvtt \
--max_words 32 \
--max_frames 12 \
--video_framerate 1 \
--output_dir ckpts/ckpt_msrvtt_retrieval_looseType \
--temp 3 \
--alpha 2 \
--beta 0.07 \
--lambda_dir 0.01 \
--lambda_epsilon 0.01 \
--lambda_lower 0.5
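As a loose, illustrative reading of the loss-related flags (every regularizer name below is an assumption, not the repository's actual code; --temp and --beta presumably scale the similarity logits, while the three --lambda_* weights balance auxiliary regularizers on the learned increments):

```python
def total_loss(contrastive_loss, direction_reg, increment_reg, lower_bound_reg,
               lambda_dir=0.01, lambda_epsilon=0.01, lambda_lower=0.5):
    """Hypothetical composition of the training objective; the term names are
    assumptions. Consult main_retrieval.py for the real decomposition."""
    return (contrastive_loss                   # temperature-scaled InfoNCE term
            + lambda_dir * direction_reg       # e.g., align increments with the text-video gap
            + lambda_epsilon * increment_reg   # e.g., penalize large increment norms
            + lambda_lower * lower_bound_reg)  # e.g., lower-bound style constraint
```

The default weights above mirror the values passed on the command line.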