
ONNX adapter created by "olive convert-adapters" command cannot work with ONNX model created by "olive auto-opt" #2277

Description

@zhenchaoni

Describe the bug
I have the Hugging Face model "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B" and a PEFT adapter. I use the auto-opt command with both the HF model and the PEFT adapter as inputs to generate the ONNX model, and the convert-adapters command with the PEFT adapter as input to generate the ONNX adapter file. However, the ONNX model and the ONNX adapter do not work together. The runtime error is "RuntimeError: Invalid input name: model.layers.12.self_attn.v_proj.lora_A.weight".

To Reproduce
generate a PEFT adapter

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
model = AutoModelForCausalLM.from_pretrained(model_name)

# Create a LoRA config
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj","k_proj","v_proj","o_proj"],
    lora_dropout=0.0,
)

# Create a PEFT model wrapper
peft_model = get_peft_model(model, lora_config)

# Optionally train the model. But this won't impact the repro of the bug

# Save the LoRA adapter
peft_model.save_pretrained("empty_lora")

generate the ONNX model

Please note that both the HF model name and the PEFT adapter are inputs. auto-opt internally uses the ModelBuilder and ExtractAdapters passes, so it generates an ONNX model with adapter slots as well as an ONNX adapter file. Only the ONNX model is used for this repro; a quick way to inspect its adapter inputs is sketched after the command below.

olive auto-opt \
    --model_name_or_path "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B" \
    --adapter_path empty_lora \
    --device cpu \
    --provider CPUExecutionProvider \
    --use_model_builder \
    --output_path basemodel-with-slots \
    --log_level 0
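
As a quick check, here is a minimal sketch for listing the adapter inputs the generated model actually exposes. It assumes the auto-opt output places a model.onnx directly under basemodel-with-slots, and that adapter slot inputs contain "lora" in their names; adjust the path and filter if your layout differs.

import onnx

# Load only the graph structure; external weight data is not needed to read input names
onnx_model = onnx.load("basemodel-with-slots/model.onnx", load_external_data=False)

# Adapter slots appear as extra graph inputs; here they are filtered by "lora" in the name
lora_inputs = [inp.name for inp in onnx_model.graph.input if "lora" in inp.name.lower()]
print("\n".join(lora_inputs))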

generate ONNX adapter file

olive convert-adapters \
    --adapter_path empty_lora \
    --output_path convert_adapter_result \
    --log_level 0

inference

I mostly leverage the inference code from the Olive examples; the same code is pasted below.

import onnxruntime_genai as og
import time

model_folder = "basemodel-with-slots" #olive auto-opt generated
#adapter_path = "basemodel-with-slots/adapter_weights.onnx_adapter" #olive auto-opt generated, inference OK
adapter_path = "convert_adapter_result.onnx_adapter" #olive convert-adapters generated, cannot inference

# Load the base model and tokenizer
model = og.Model(model_folder)
print(dir(model))
adapters = og.Adapters(model) #Adapter code
adapters.load(adapter_path, "en_medical_reasoning") #Adapter code
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

prompt_template = """
Below is an instruction that describes a task, paired with an input that provides further context. 
Write a response that appropriately completes the request. 
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. 
Please answer the following medical question. 

### Question:
{}

### Response:
<think>
"""

question = """
        A 54-year-old construction worker with a long history of smoking presents with swelling in his upper extremity and face, along with 
        dilated veins in this region. After conducting a CT scan and venogram of the neck, what is the most likely diagnosis for the cause of these symptoms?
"""
prompt = prompt_template.format(question, "")

# Encode the prompt using the tokenizer
input_tokens = tokenizer.encode(prompt)

# Create params and generator
params = og.GeneratorParams(model)
generator = og.Generator(model, params)
generator.set_active_adapter(adapters, "en_medical_reasoning") #Adapter code

# Append input tokens to the generator
generator.append_tokens(input_tokens)

print("")
print("Output: ", end="", flush=True)

token_times = []

# Stream the output
while True:
    start_time = time.time()
    if generator.is_done():
        break
    generator.generate_next_token()
    end_time = time.time()
    
    # Record the time for this token generation
    token_time = end_time - start_time
    token_times.append(token_time)

    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end="", flush=True)

print()

# Calculate and display timing statistics
if token_times:
    total_tokens = len(token_times)
    avg_time = sum(token_times) / total_tokens
    
    print(f"Total tokens generated: {total_tokens}")
    print(f"Average time per token: {avg_time:.4f} seconds")
    print(f"Tokens per second: {total_tokens / sum(token_times):.2f}")

del generator

Actual behavior

  • The ONNX adapter file generated by convert-adapters does not work with the ONNX model generated by auto-opt. From inspecting the ONNX model, I believe the root cause is that the adapter input names in the model and the parameter names in the adapter file do not match (see the sketch below).
  • The ONNX adapter file and the ONNX model that are both generated by auto-opt do work together, but that is not what this issue is about.
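
Below is a minimal sketch of that comparison, assuming onnxruntime's AdapterFormat API (the format used for .onnx_adapter files) exposes read_adapter/get_parameters, and that the auto-opt output contains a model.onnx directly under basemodel-with-slots (adjust paths if your layout differs).

import onnx
from onnxruntime import AdapterFormat

# Adapter inputs exposed by the auto-opt model (filename is an assumption, adjust as needed)
onnx_model = onnx.load("basemodel-with-slots/model.onnx", load_external_data=False)
model_inputs = {inp.name for inp in onnx_model.graph.input if "lora" in inp.name.lower()}

# Parameter names stored in the convert-adapters output
adapter = AdapterFormat.read_adapter("convert_adapter_result.onnx_adapter")
adapter_names = set(adapter.get_parameters().keys())

# Any non-empty difference here explains the "Invalid input name" runtime error
print("in adapter file but not in model:", sorted(adapter_names - model_inputs))
print("in model but not in adapter file:", sorted(model_inputs - adapter_names))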

Expected behavior
The ONNX adapter file generated by convert-adapters should work with the ONNX model generated by auto-opt.
If this issue is fixed, I only need to create the ONNX model once with the auto-opt command. After every new fine-tuning run, I can convert the PEFT adapter to an ONNX adapter without regenerating the ONNX model.

Other information

  • OS: Windows
  • Olive version: 0.10.1
  • ONNXRuntime package and version: onnxruntime 1.23.2, onnxruntime_genai 0.10.0
  • Transformers package version: [e.g. transformers 4.57.1]
