Describe the bug
I have the Hugging Face model "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B" and a PEFT adapter. I use the auto-opt command with both the HF model and the PEFT adapter as inputs to generate the ONNX model, and the convert-adapters command with the PEFT adapter as input to generate the ONNX adapter file. However, the ONNX model and the ONNX adapter do not work together; inference fails with "RuntimeError: Invalid input name: model.layers.12.self_attn.v_proj.lora_A.weight".
To Reproduce
generate a PEFT adapter
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
model = AutoModelForCausalLM.from_pretrained(model_name)
# Create a LoRA config
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.0,
)
# Create a PEFT model wrapper
peft_model = get_peft_model(model, lora_config)
# Optionally train the model here; this does not affect the repro of the bug
# Save the LoRA adapter
peft_model.save_pretrained("empty_lora")

generate the ONNX model
Please note that both the HF model name and the PEFT adapter are inputs. auto-opt internally uses ModelBuilder and the ExtractAdapters pass, so it generates an ONNX model with adapter slots as well as an ONNX adapter file. Only the ONNX model is used for this repro.
olive auto-opt
--model_name_or_path "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
--adapter_path empty_lora
--device cpu
--provider CPUExecutionProvider
--use_model_builder
--output_path basemodel-with-slots
--log_level 0
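For reference, the adapter slot names that the generated model expects can be listed by inspecting its graph inputs with the onnx package. This is a minimal sketch; the file name basemodel-with-slots/model.onnx is my assumption about auto-opt's output layout and may need adjusting:

import onnx

# Assumption: auto-opt places the ONNX file at this path; adjust if the layout differs
model = onnx.load("basemodel-with-slots/model.onnx", load_external_data=False)
lora_inputs = [i.name for i in model.graph.input if "lora" in i.name.lower()]
print("\n".join(lora_inputs))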
generate ONNX adapter file
olive convert-adapters
--adapter_path empty_lora
--output_path convert_adapter_result
--log_level 0
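To see which parameter names end up inside the converted adapter file, something like the following can be used, assuming this onnxruntime version exposes the AdapterFormat helper used to read and write .onnx_adapter files (the class and method names here are my assumption):

from onnxruntime import AdapterFormat

# Assumption: AdapterFormat.read_adapter / get_parameters are available in this onnxruntime build
adapter = AdapterFormat.read_adapter("convert_adapter_result.onnx_adapter")
for name in adapter.get_parameters():
    print(name)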
inference
I mostly leverage the inference code from the Olive example; the same code is pasted below.
import onnxruntime_genai as og
import time
model_folder = "basemodel-with-slots" #olive auto-opt generated
#adapter_path = "basemodel-with-slots/adapter_weights.onnx_adapter" #olive auto-opt generated, inference OK
adapter_path = "convert_adapter_result.onnx_adapter" #olive convert-adapters generated, cannot inference
# Load the base model and tokenizer
model = og.Model(model_folder)
print(dir(model))
adapters = og.Adapters(model) #Adapter code
adapters.load(adapter_path, "en_medical_reasoning") #Adapter code
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()
prompt_template = """
Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.
### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
Please answer the following medical question.
### Question:
{}
### Response:
<think>
"""
question = """
A 54-year-old construction worker with a long history of smoking presents with swelling in his upper extremity and face, along with
dilated veins in this region. After conducting a CT scan and venogram of the neck, what is the most likely diagnosis for the cause of these symptoms?
"""
prompt = prompt_template.format(question, "")
# Encode the prompt using the tokenizer
input_tokens = tokenizer.encode(prompt)
# Create params and generator
params = og.GeneratorParams(model)
generator = og.Generator(model, params)
generator.set_active_adapter(adapters, "en_medical_reasoning") #Adapter code
# Append input tokens to the generator
generator.append_tokens(input_tokens)
print("")
print("Output: ", end="", flush=True)
token_times = []
# Stream the output
while True:
    start_time = time.time()
    if generator.is_done():
        break
    generator.generate_next_token()
    end_time = time.time()
    # Record the time for this token generation
    token_time = end_time - start_time
    token_times.append(token_time)
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end="", flush=True)
print()
# Calculate and display timing statistics
if token_times:
    total_tokens = len(token_times)
    avg_time = sum(token_times) / total_tokens
    print(f"Total tokens generated: {total_tokens}")
    print(f"Average time per token: {avg_time:.4f} seconds")
    print(f"Tokens per second: {total_tokens / sum(token_times):.2f}")
del generator

Actual behavior
- The ONNX adapter file generated by convert-adapters does not work with the ONNX model generated by auto-opt. By inspecting the ONNX model, I believe the root cause is that the adapter input names in the model and in the adapter file do not match (a comparison sketch follows below).
- The ONNX adapter file and ONNX model that are both generated by auto-opt do work together, but that is not what this issue is about.
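A quick way to make the mismatch concrete, with the same path and API assumptions as the inspection sketches above:

import onnx
from onnxruntime import AdapterFormat

# Assumption: model path and AdapterFormat API as in the earlier sketches
model = onnx.load("basemodel-with-slots/model.onnx", load_external_data=False)
model_inputs = {i.name for i in model.graph.input if "lora" in i.name.lower()}
adapter_params = set(AdapterFormat.read_adapter("convert_adapter_result.onnx_adapter").get_parameters())

print("in adapter but not in model:", sorted(adapter_params - model_inputs))
print("in model but not in adapter:", sorted(model_inputs - adapter_params))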
Expected behavior
The ONNX adapter file generated by convert-adapters should work with the ONNX model generated by auto-opt.
If this is fixed, I only need to create the ONNX model once with the auto-opt command; every time I finetune a new adapter, I can simply convert the PEFT adapter to an ONNX adapter without regenerating the ONNX model.
Other information
- OS: Windows
- Olive version: 0.10.1
- ONNXRuntime package and version: onnxruntime 1.23.2, onnxruntime_genai 0.10.0
- Transformers package version: [e.g. transformers 4.57.1]