
Code release for "Improving diffusion-based protein backbone generation with global-geometry-aware latent encoding"

meneshail/TopoDiff

TopoDiff

Log & TODO

  • 2025-08-04
    • Updated data preprocessing scripts. ⚙️
  • August 2025 (Scheduled)
    • 「Gone with the Flow」: Update flow-based code for training and sampling. 🌊

🌟 Introduction

This is the official code repository of the paper "TopoDiff: Improving Diffusion-Based Protein Backbone Generation with Global-Geometry-aware Latent Encoding".

Building on the success of diffusion-based protein backbone generation, we propose TopoDiff, a novel framework for the unsupervised learning and use of a global-geometry-aware latent representation. This representation helps to enhance the coverage of generated backbones and provides additional control over the generation process. We provide the scripts and weights used for all the experiments in the paper.

Installation

We recommend using conda/mamba to install the dependencies. We also recommend installing OpenFold to allow the use of a memory-efficient kernel for attention computation, although this is optional and should not affect the results.

git clone https://github.com/meneshail/TopoDiff.git <repo_name>

# create the conda environment
cd <repo_name>/TopoDiff
conda env create -n topodiff_env -f env.yml
conda activate topodiff_env

# installation
# in <repo_name>/TopoDiff/
pip install -e .
cd ..

# download and uncompress the weights and (optionally) the dataset from https://zenodo.org/records/13879812, and put them in the data directory
mkdir data
mv path/to/download/weights data/
mv path/to/download/dataset data/

The final project structure should look like this:

repo
├── data
│   ├── dataset
│   ├── weights
│   ├── ...
├── notebook
│   ├── 0_ ...
│   ├── 1_ ...
│   ├── 2_ ...
├── TopoDiff

🚀 Usage

Sampling script

python: run_sampling.py [-h] [-o OUTDIR] [-v VERSION] [-m MODE] [-s START] [-e END] [-i INTERVAL] [-n NUM_SAMPLES] [--pred_sc] [--min_sc MIN_SC] [--max_sc MAX_SC] [--pred_novelty] [--min_novelty MIN_NOVELTY] [--max_novelty MAX_NOVELTY] [--pred_alpha] [--min_alpha MIN_ALPHA] [--max_alpha MAX_ALPHA] [--pred_beta] [--min_beta MIN_BETA] [--max_beta MAX_BETA] [--pred_coil] [--min_coil MIN_COIL] [--max_coil MAX_COIL] [--soft_prob SOFT_PROB] [--seed SEED] [--gpu GPU] [--num_k NUM_K] [--epsilon EPSILON]


# e.g.
# sample 10 backbones of length 100, 110, 120, sampling in all_round preference mode (recommended)
# python run_sampling.py -o sampling_result -s 100 -e 120 -n 10 -i 10 -m all_round

# same, but sampling in base preference mode
# python run_sampling.py -o sampling_result -s 100 -e 120 -n 10 -i 10
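In the examples above, `-s 100 -e 120 -i 10` yields the lengths 100, 110 and 120, with the end value included. The following is a minimal sketch of that enumeration, assuming the script simply steps from START to END in increments of INTERVAL. The helper name `sampling_lengths` is hypothetical and not part of the repository, and the documented length limits are not checked here.

```python
def sampling_lengths(start=100, end=100, interval=10):
    """Enumerate backbone lengths from start to end (inclusive).

    Hypothetical helper mirroring the documented -s/-e/-i options;
    the actual run_sampling.py may differ in detail.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    if start > end:
        raise ValueError("start must not exceed end")
    return list(range(start, end + 1, interval))


print(sampling_lengths(100, 120, 10))  # [100, 110, 120]
```

With `-n 10`, each of these lengths would receive 10 samples, for 30 backbones in total.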

Arguments:

-h, --help            show this help message and exit
-o OUTDIR, --outdir OUTDIR
                    The output directory
-v VERSION, --version VERSION
                    The version of the model, default: v1_1_2 (recommended)
-m MODE, --mode MODE  
                    The mode of sampling (model variants with different sampling preference), default: None. 
                    Available options [base, designability, novelty, all_round]. (The variants used in the paper) 
                    Note that setting this to a valid option will override the pred_* options.
-s START, --start START
                    The start length of sampling, must be larger than 50, default: 100
-e END, --end END     
                    The end length of sampling (inclusive), must be smaller than 250, default: 100
-i INTERVAL, --interval INTERVAL
                    The interval of sampling length, default: 10
-n NUM_SAMPLES, --num_samples NUM_SAMPLES
                    The number of samples to generate for each length, default: 5
--pred_sc             Whether to predict designability score, default: False
--min_sc MIN_SC       The minimum predicted designability score of the latent, default: 0.0
--max_sc MAX_SC       The maximum predicted designability score of the latent, default: 1.0
--pred_novelty        Whether to predict novelty score, default: False
--min_novelty MIN_NOVELTY
                    The minimum predicted novelty score of the latent, default: 0.0
--max_novelty MAX_NOVELTY
                    The maximum predicted novelty score of the latent, default: 1.0
--pred_alpha          Whether to predict alpha ratio, default: False
--min_alpha MIN_ALPHA
                    The minimum predicted alpha ratio of the latent, default: 0.0
--max_alpha MAX_ALPHA
                    The maximum predicted alpha ratio of the latent, default: 1.0
--pred_beta           Whether to predict beta ratio, default: False
--min_beta MIN_BETA   The minimum predicted beta ratio of the latent, default: 0.0
--max_beta MAX_BETA   The maximum predicted beta ratio of the latent, default: 1.0
--pred_coil           Whether to predict coil ratio, default: False
--min_coil MIN_COIL   The minimum predicted coil ratio of the latent, default: 0.0
--max_coil MAX_COIL   The maximum predicted coil ratio of the latent, default: 1.0
--soft_prob SOFT_PROB
                    The probability of accepting latent codes that fail to pass all classifiers, default: 0.1
--seed SEED           The random seed for sampling, default: 42
--gpu GPU             The gpu id for sampling, default: None
--num_k NUM_K         The number of k to decide the expected length of the latent, default: 1
--epsilon EPSILON     The range of variation of the expected length of the latent, default: 0.2
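One way to read the `--min_*`, `--max_*` and `--soft_prob` options together is as a soft rejection filter on latent codes: a latent whose predicted scores all fall inside the requested ranges is kept, and any other latent is still kept with probability `soft_prob`. The sketch below illustrates that reading only. The function `accept_latent`, the score dictionary and its keys are assumptions made for illustration, not the repository's API.

```python
import random


def accept_latent(scores, bounds, soft_prob=0.1, rng=random):
    """Decide whether to keep a latent code given predicted scores.

    scores: dict such as {"sc": 0.8, "alpha": 0.3} (hypothetical predictor outputs)
    bounds: dict mapping the same keys to (min, max) ranges
    A latent inside every range is always kept; otherwise it is kept
    with probability soft_prob. This is an assumed reading of --soft_prob.
    """
    passed = all(lo <= scores[name] <= hi for name, (lo, hi) in bounds.items())
    return passed or rng.random() < soft_prob


rng = random.Random(42)
bounds = {"sc": (0.5, 1.0), "alpha": (0.0, 0.4)}

# Inside every range: always accepted.
print(accept_latent({"sc": 0.8, "alpha": 0.3}, bounds, rng=rng))

# Outside the "sc" range: accepted only about soft_prob of the time.
kept = sum(accept_latent({"sc": 0.2, "alpha": 0.3}, bounds, rng=rng) for _ in range(1000))
print(kept, "of 1000 failing latents kept")
```

Under this reading, setting `soft_prob` to 0 would enforce the ranges strictly, while larger values trade strictness for sampling speed.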

The output directory will be arranged as follows:

outdir
├── length_100
│   ├── sample_0.pdb
│   ├── sample_1.pdb...
├── length_110
│   ├── sample_0.pdb
│   ├── sample_1.pdb...
...
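For downstream analysis it can be handy to gather the generated files by length. The following is a small sketch that relies only on the directory layout shown above (`length_<L>/sample_<k>.pdb`); the helper `collect_samples` is a hypothetical name, not something shipped with the repository.

```python
from pathlib import Path


def collect_samples(outdir):
    """Map each sampled length to its list of PDB files, ordered by sample index.

    Assumes the documented layout: outdir/length_<L>/sample_<k>.pdb.
    """
    samples = {}
    for length_dir in Path(outdir).glob("length_*"):
        if not length_dir.is_dir():
            continue
        length = int(length_dir.name.split("_", 1)[1])
        samples[length] = sorted(
            length_dir.glob("sample_*.pdb"),
            # Sort numerically so sample_10 comes after sample_2.
            key=lambda p: int(p.stem.split("_", 1)[1]),
        )
    return dict(sorted(samples.items()))


# Prints nothing if the directory does not exist yet.
for length, files in collect_samples("sampling_result").items():
    print(length, len(files))
```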

Notebook

We also provide a series of notebooks to help you walk through the functionalities of the model. They are located in the notebook directory.

Training

# inside repo
# download also from https://zenodo.org/records/13879812
mv path/to/download/train_data data/

# training settings for all stages are available in TopoDiff.config; here we start directly from stage 3 (with encoder)
mkdir experiments

# structure diffusion, suppose we have 4 gpus to use
CUDA_VISIBLE_DEVICES="0,1,2,3" torchrun --nproc_per_node 4 --master_port <port> ./TopoDiff/run_training.py -o ./experiments --stage 3 --model structure --init_ckpt ./data/weights/v1_1/model.ckpt --gpu 0,1,2,3

# latent diffusion
python ./TopoDiff/run_training.py -o ./experiments --model latent --latent_epoch <epc> --gpu 0

# this will pack all necessary model weights and config into a single file at ./experiments/ckpt/epoch_<epc>.ckpt, and you can use it for sampling with the following command
python ./TopoDiff/run_sampling.py -s 125 -e 125 -n 25 -v custom --ckpt ./experiments/ckpt/epoch_<epc>.ckpt -o ./experiments/sample/

Data Preprocessing (Optional)

If you want to process your own PDB dataset, you can use the provided preprocessing script. This script will automatically scan for all .pdb files in the input directory, process them in parallel into feature files, and save them in the output directory.

# Example: Process PDB files in data/raw_pdbs/ using 32 workers
# The output will be saved in data/processed_data/, along with a metadata file info.json
topodiff-preprocess --input_dir ./data/raw_pdbs/ --output_dir ./data/processed_data/ --n_worker 32
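The scan, parallel-process and write-metadata pattern described above can be pictured with the following sketch. It is not the `topodiff-preprocess` implementation: the per-file "feature" here is just a C-alpha count, the scan is non-recursive, threads stand in for worker processes, and the `info.json` layout is invented for illustration. All function names are hypothetical.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def count_ca_atoms(pdb_path):
    """Placeholder 'feature': number of C-alpha atoms in one PDB file."""
    n_res = 0
    with open(pdb_path) as handle:
        for line in handle:
            # Atom name occupies columns 13-16 of a PDB ATOM record.
            if line.startswith("ATOM") and line[12:16].strip() == "CA":
                n_res += 1
    return pdb_path.stem, n_res


def preprocess(input_dir, output_dir, n_worker=4):
    """Scan input_dir for .pdb files, process them concurrently, write info.json."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    pdb_files = sorted(Path(input_dir).glob("*.pdb"))
    with ThreadPoolExecutor(max_workers=n_worker) as pool:
        results = list(pool.map(count_ca_atoms, pdb_files))
    # Hypothetical metadata layout: one entry per structure.
    info = {name: {"length": n_res} for name, n_res in results}
    (out / "info.json").write_text(json.dumps(info, indent=2))
    return info
```

Calling `preprocess("./data/raw_pdbs/", "./data/processed_data/", n_worker=32)` would mirror the command-line example, under the assumptions stated above.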

Evaluation

Diversity & Coverage

We currently provide the evaluation scripts for the diversity and the newly proposed coverage metrics. They are located in the TopoDiff/evaluation directory.

We recommend first walking through the notebook 3_metrics.ipynb to understand the usage of the evaluation script.

The evaluation script requires additional precomputed data, which must be downloaded from our Zenodo repository.

# inside repo
# download also from https://zenodo.org/records/13879812
mv path/to/download/evaluation/ data/

# init and download progres as submodule
git submodule init
git submodule update

# download the model and CATH embeddings following official instructions
mkdir -p TopoDiff/progres/progres/trained_models/v_0_2_0/
wget https://zenodo.org/records/7782089/files/trained_model.pt -O TopoDiff/progres/progres/trained_models/v_0_2_0/trained_model.pt
mkdir -p TopoDiff/progres/progres/databases/v_0_2_0/
wget https://zenodo.org/records/7782089/files/cath40.pt -O TopoDiff/progres/progres/databases/v_0_2_0/cath40.pt

Designability

We provide the scripts for the designability evaluation in the 'topodiff_eval/sc/' directory. Due to the different dependencies, we recommend installing the package in a new environment.

  • installation:
# inside <repo>
# create a new environment
mamba env create -n topodiff_eval -f se3.yml
mamba activate topodiff_eval

# install the package
cd topodiff_eval
pip install -e .
  • run the script:
python topodiff_eval/sc/run_sc.py \
    --gpu_list GPU_LIST \
    --sample_root SAMPLE_ROOT \
    --sc_test_root SC_TEST_ROOT \
    --length_list LENGTH_LIST \
    --n_sample N_SAMPLE \
    --seq_per_sample SEQ_PER_SAMPLE \
    --run_phase_1 \
    --run_phase_2

The script is adapted from FrameDiff. A notable modification is the computation of RMSD. We utilize the more commonly used formula:

$$\mathrm{RMSD}_{\mathrm{standard}} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left\| \vec{x}_i - \vec{y}_i \right\|_2^2}$$

while the original implementation uses:

$$\mathrm{RMSD}_{\mathrm{FrameDiff}} = \frac{1}{N} \sum_{i=1}^{N} \left\| \vec{x}_i - \vec{y}_i \right\|_2$$

You can find related discussion in this issue.
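To make the difference between the two formulas concrete, here is a minimal pure-Python sketch that evaluates both on the same pair of coordinate sets. It assumes the structures are already superimposed; the function names are illustrative and do not correspond to functions in `run_sc.py`.

```python
import math


def rmsd_standard(xs, ys):
    """Square root of the mean squared distance between paired 3D points."""
    sq = [sum((a - b) ** 2 for a, b in zip(x, y)) for x, y in zip(xs, ys)]
    return math.sqrt(sum(sq) / len(sq))


def mean_distance(xs, ys):
    """Mean Euclidean distance between paired points (the second formula)."""
    dists = [math.dist(x, y) for x, y in zip(xs, ys)]
    return sum(dists) / len(dists)


xs = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
ys = [(1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]

print(rmsd_standard(xs, ys))  # sqrt((1 + 9) / 2) = sqrt(5), about 2.236
print(mean_distance(xs, ys))  # (1 + 3) / 2 = 2.0
```

The root-mean-square value is never smaller than the mean distance, so a fixed RMSD cutoff is stricter under the standard formula whenever per-residue deviations are uneven.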

Reference

Improving Diffusion-Based Protein Backbone Generation with Global-Geometry-aware Latent Encoding

❤️ Acknowledgements

We adapted some code from OpenFold, FrameDiff, diffae and progres. We thank the authors for their impressive work.

  1. Ahdritz, G., Bouatta, N., Kadyan, S., Xia, Q., Gerecke, W., O'Donnell, T. J., ... & AlQuraishi, M. (2022). OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization. bioRxiv, 2022-11.
  2. Yim, J., Trippe, B. L., De Bortoli, V., Mathieu, E., Doucet, A., Barzilay, R., & Jaakkola, T. (2023). SE(3) diffusion model with application to protein backbone generation. arXiv preprint arXiv:2302.02277.
  3. Preechakul, K., Chatthee, N., Wizadwongsa, S., & Suwajanakorn, S. (2022). Diffusion autoencoders: Toward a meaningful and decodable representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10619-10629).
  4. Greener, J. G., & Jamali, K. (2022). Fast protein structure searching using structure graph embeddings. bioRxiv, 2022-11.
