
How To Do Stable Diffusion Textual Inversion TI Text Embeddings By Automatic1111 Web UI Tutorial


How To Do Stable Diffusion Textual Inversion (TI) / Text Embeddings By Automatic1111 Web UI Tutorial


Our Discord : https://discord.gg/HbqgGaZVmr. Grand Master tutorial for Textual Inversion / Text Embeddings. If I have been of assistance to you and you would like to show your support for my work, please consider becoming a patron on 🥰 https://www.patreon.com/SECourses

Playlist of Stable Diffusion Tutorials, Automatic1111 and Google Colab Guides, DreamBooth, Textual Inversion / Embedding, LoRA, AI Upscaling, Pix2Pix, Img2Img:

https://www.youtube.com/playlist?list=PL_pbwdIyffsmclLl0O144nQRnezKlNdx3

In this video, I explain almost every aspect of Stable Diffusion Textual Inversion (TI) / Text Embeddings. I demonstrate a live example of how to train a person's face with all of the best settings, including technical details.

TI Academic Paper: https://arxiv.org/pdf/2208.01618.pdf

Automatic1111 Repo: https://github.com/AUTOMATIC1111/stable-diffusion-webui

Easiest Way to Install & Run Stable Diffusion Web UI on PC

https://youtu.be/AZg6vzWHOTA

How to use Stable Diffusion V2.1 and Different Models in the Web UI

https://youtu.be/aAyvsX-EpG4

Automatic1111 Used Commit : d8f8bcb821fa62e943eb95ee05b8a949317326fe (a scripted checkout sketch follows this link list)

Git Bash : https://git-scm.com/downloads

Automatic1111 Command Line Arguments List: https://bit.ly/StartArguments

S.D. 1.5 CKPT: https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main

Latest Best S.D. VAE File: https://huggingface.co/stabilityai/sd-vae-ft-mse-original/tree/main

VAE File Explanation: https://bit.ly/WhatIsVAE

Cross attention optimizations bug: https://bit.ly/CrosOptBug

Vector Pull Request: https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/6667

All of the tokens list in Stable Diffusion: https://huggingface.co/openai/clip-vit-large-patch14/tree/main

Example training dataset used in the video:

https://drive.google.com/file/d/1Hom2XbILub0hQc-zmLizRcwFrKwHYGcc/view?usp=sharing

Inspect-Embedding-Training Script repo:

https://github.com/Zyin055/Inspect-Embedding-Training

How to Inject Your Trained Subject: https://youtu.be/s25hcW4zq4M

Comparison of training techniques: https://bit.ly/TechnicComparison

Embedding file name list generator script:

https://jsfiddle.net/MonsterMMORPG/Lg0swc1b/10/
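
The video pins the web UI to the commit listed above. For reference, here is a minimal scripted equivalent of the Git Bash clone-and-checkout steps shown in the video — a sketch only; it assumes git is installed and on your PATH, and the destination folder name is just an example:

```python
# Sketch: clone the Automatic1111 web UI and pin it to the commit used in the video.
# Assumes git is installed and on PATH; the destination folder name is an example.
import subprocess

REPO = "https://github.com/AUTOMATIC1111/stable-diffusion-webui"
COMMIT = "d8f8bcb821fa62e943eb95ee05b8a949317326fe"  # commit listed above
DEST = "stable-diffusion-webui"                      # example destination folder

subprocess.run(["git", "clone", REPO, DEST], check=True)
subprocess.run(["git", "checkout", COMMIT], cwd=DEST, check=True)
```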

00:00:00 Introduction to #StableDiffusion #TextualInversion Embeddings

00:01:00 Which commit of the #Automatic1111 Web UI we are using and how to checkout / switch to specific commit of any Git project

00:04:07 Used command line arguments of Automatic1111 webui-user.bat file

00:04:35 Automatic1111 command line arguments

00:05:31 How to and where to put Stable Diffusion models and VAE files in Automatic1111 installation

00:06:05 Why we use the latest VAE file and what a VAE file does

00:08:24 Training settings of Automatic1111

00:10:38 All about names of text embeddings

00:11:00 What is initialization text of textual inversion training

00:11:32 Embedding inspector extension of Automatic1111

00:11:52 Technical and detailed explanation of tokens and their numerical weight vectors in Stable Diffusion

00:14:25 How to set the number of vectors per token when doing Textual Inversion training

00:16:00 How prompts get tokenized - turned into tokens - using the tokenizer extension

00:18:58 Setting number of training vectors

00:20:24 Where embedding files are saved in automatic1111 installation

00:20:38 All about preprocess images before TI training

00:23:06 Training tab of textual inversion

00:23:18 What to and how to set embedding learning rate

00:23:40 What are the Batch size and Gradient accumulation steps and how to set them

00:24:40 How to set training learning rate according to Batch size and Gradient accumulation steps

00:26:21 What are prompt templates, what are they used for, how to set and use them in textual inversion training

00:29:06 What filewords are and how they are used in training in the Automatic1111 web UI

00:29:35 How to edit image captions when doing textual inversion training

00:31:07 From the training images pool, how and why I chose some of them and not all of them

00:31:54 Why I added noise to the backgrounds of some training dataset images

00:32:07 What your training dataset should look like and what makes a good training dataset

00:34:48 Save TI training checkpoints

00:36:31 Which latent sampling method is best

00:38:08 Overclock GPU to get a 10% training speed up

00:38:32 Where to find TI training preview images

00:39:15 Where to see the final prompts used during training

00:39:59 Training started

00:41:34 How to use inspect_embedding_training script to determine overtraining of textual inversion

00:42:31 What is training loss

00:48:23 Technical difference of Textual Inversion, DreamBooth, LoRA and HyperNetworks training

00:52:17 Over 200 epochs and already got very good sample preview images

00:54:28 How to set newest VAE file as default in the settings of automatic1111 web ui

00:55:06 How to use generated embeddings checkpoint files

00:58:31 How to test different checkpoints via X/Y plot and embedding files name generator script

01:07:27 How to upscale image by using AI

01:08:42 How to use multiple embeddings in a prompt

Video Transcription

  • 00:00:01 Greetings everyone. Welcome to the most  comprehensive, technical, detailed and yet  

  • 00:00:07 still beginner-friendly Stable Diffusion Text  Embeddings, also known as Textual Inversion  

  • 00:00:12 training tutorial. In this video I am going to  cover all of the topics that you see here and  

  • 00:00:18 more. Currently I am hovering my mouse over there.  You can pause the video and check them out if you  

  • 00:00:25 wish. Also, you see here the training dataset we used and here the results of the trained textual embedding.

  • 00:00:32 Let's start by quickly introducing what textual inversion is and its officially released academic

  • 00:00:39 paper. If you are interested in reading this paper, you can open the link and read it.

  • 00:00:47 I am also going to show some of the important parts of this paper when we are going to use

  • 00:00:56 them. I will explain them through the paper. So to do training, we are going to use

  • 00:01:02 Automatic1111 web UI. If you don't know how to  install and set up the Automatic1111 web UI,  

  • 00:01:10 I already have a video for that on my channel:  Easiest Way to Install & Run Stable Diffusion  

  • 00:01:16 Web UI. Also, I have another video How to use  Stable Diffusion V2.1 and Different Models.  

  • 00:01:24 So I am going to use a specific version of the Automatic1111 web UI. It is constantly

  • 00:01:30 getting updated, and therefore it is constantly getting broken, and you keep asking me:

  • 00:01:37 which version did you use? I am going to use this specific version, this commit, because after the bump of

  • 00:01:44 Gradio to 3.16, it has given me a lot of errors. So how am I going to use this specific version?

  • 00:01:55 To use a specific version, I am going to clone it with Git Bash. If you haven't installed

  • 00:02:00 Git Bash yet, you can find it via Google. Just search for Git Bash on Google.

  • 00:02:06 You can download it from this website and install it. It is so easy to install.

  • 00:02:10 First I am going to select the folder where I want  to clone my Automatic1111 web UI. I am entering my  

  • 00:02:18 F drive and in here I am generating a new folder  with right click new folder. Let's give it a name  

  • 00:02:25 as tutorial web UI. OK, then we will move inside  this folder in our Git Bash window to do that.  

  • 00:02:36 Type cd F: and now we are in the F drive. Then type cd, put the folder path inside quotation marks like this, and hit enter.

  • 00:02:45 Now we are inside this folder. Now we can clone  Automatic1111 with git clone and copy the URL from  

  • 00:02:55 here like this and paste it into here. Right  click, paste and it will clone it. OK, it is  

  • 00:03:03 cloned inside this folder. So I will enter it with cd "s", then Tab, and it will be automatically completed

  • 00:03:11 like this and hit enter. Now we will check out to  certain version from here. Let me show you again.  

  • 00:03:21 Click the commits from here, and here I am moving  to the commit that I want: enable progress bar  

  • 00:03:27 without gallery. This is the commit ID. I will  also put this into the description of the video.  

  • 00:03:32 Then we are going to do git checkout like this, and right click, paste. Now we are on that commit

  • 00:03:44 and we are using that specific version inside our folder. So before starting setup, I am copy

  • 00:03:53 pasting this webui-user.bat file, because I am going to add my command line arguments to it.

  • 00:04:01 OK, right click the copy and edit and let me zoom  in copy paste. So I am going to use xformers,  

  • 00:04:09 no-half and disable-safe-unpickle. So how did I come up with these command line arguments?

  • 00:04:15 xformers is going to increase your speed significantly and reduce the VRAM usage

  • 00:04:21 of your graphics card. No-half is necessary for xformers to work correctly when you are using

  • 00:04:29 the SD 1.5 or 2.1 versions. And disable-safe-unpickle: according to the web UI documentation, you see the

  • 00:04:38 URL here, Command line arguments and settings: it disables checking pytorch models for malicious

  • 00:04:44 code. Why am I using this? Because if you train your model on Google Colab, sometimes it is not

  • 00:04:50 working otherwise. It is not necessary, but I am just using it, and I am not downloading any model

  • 00:04:54 without knowing it. OK, then we save and run,  then we are going to get our fresh installation.  

  • 00:05:04 OK, like this, it will install all necessary  things. And you see, Let me zoom in It is  

  • 00:05:12 using Python 3.10.8 version. By the way, you  have to have installed Python correctly for  

  • 00:05:23 this to install. It is also showing the  commit hash that I am using like this.  

  • 00:05:30 I also need to put my Stable Diffusion models into the models folder. So let's open it. Open

  • 00:05:36 the Stable Diffusion folder and copy-paste from my previous download. And another thing: I

  • 00:05:43 am going to use the latest VAE file that I have downloaded from the Internet, which I am going to

  • 00:05:49 show you right now. So where do we put this VAE file? Go to the stable-diffusion-webui folder and in

  • 00:05:54 here you will look for the VAE folder. It is inside the models folder, and

  • 00:06:02 inside models/VAE goes this VAE file. Why are we using this VAE file? Because it is improving

  • 00:06:10 generation of person images. And now let me show  the link. OK, this is the link of the VAE file.  

  • 00:06:17 This is the latest version of VAE file. Just click  the CKPT file from here and click the download  

  • 00:06:24 button. I will also put the link of this into the  description. So if you are wondering the technical  

  • 00:06:30 description, technical details of the VAE files,  there is a long explanation in here in this  

  • 00:06:37 thread. I will also put the link of this thread  into this description of the video and there is a  

  • 00:06:42 shorter description which I liked: each generation is done in a compressed representation. The VAE

  • 00:06:48 takes the compressed results and turns them into full-sized images. SD comes with a VAE already,

  • 00:06:54 but certain models may supply a custom VAE that  works better for that model and SD 1.5 version  

  • 00:07:03 model is not using the latest VAE file. Therefore,  we are downloading this and putting that into our  

  • 00:07:10 folder. SD 2.1 version is using the latest VAE  file. And which SD 1.5 version model I am using?  

  • 00:07:20 I am using the 1.5 pruned ckpt. And where did I download it? I have downloaded it from this

  • 00:07:29 URL, and we are using the pruned ckpt because it is better for training than the ema-only file, which,

  • 00:07:36 as you see, is smaller in size. By the way, the things I am going to show in this video can

  • 00:07:43 be applied to any model, such as Protogen  or SD 2.1 version. Actually, I have made  

  • 00:07:51 experiments on Protogen training as well, and  I will show the results of that too to you.  

  • 00:07:59 Okay, the fresh installation has been completed.  No errors, and these are the messages displayed,  

  • 00:08:04 and it has started on this URL and I have  already opened it. You can copy and paste  

  • 00:08:11 this URL into my browser. So currently it has selected Protogen by default, and I am going

  • 00:08:18 to make this tutorial on version 1.5 pruned, the official version. Okay, before starting training,

  • 00:08:26 I am going to go to the settings first and show you the settings that we need. Go to the

  • 00:08:32 training tab in here and check this checkbox,  Move VAE and CLIP to RAM when training. This  

  • 00:08:40 requires a lot of RAM actually. I have 64 GB, and if you have checked this, it will reduce VRAM usage;

  • 00:08:48 VRAM is the GPU RAM, which is our more limited RAM. Then you can also check this:

  • 00:08:56 Turn on pin_memory for DataLoader. This makes training slightly faster, but increases memory

  • 00:09:01 usage. I think this is increasing the RAM usage, not the VRAM usage. So you can test this.

  • 00:09:07 In other videos you will see this checkbox: Use cross attention optimizations while

  • 00:09:13 training. This will significantly increase your training speed and reduce the VRAM usage. However,

  • 00:09:19 it also significantly reduces your training success. So, if your graphics card can do training

  • 00:09:27 without checking this, do not check it, because it will reduce your training success and

  • 00:09:33 it will reduce your learning rate. How do I know this? According to vladmandic on GitHub,

  • 00:09:41 this is causing a lot of problems. He has opened a bug report on the Stable

  • 00:09:50 Diffusion web UI issues page about it. Let me show you.

  • 00:09:59 He says that when he disabled cross attention  for training and rerun exactly the same settings,  

  • 00:10:04 the results are perfect and I can verify this.  So do not check this if your graphic card can  

  • 00:10:11 run it. There is also one more setting that we are going to set: Save a CSV containing the loss

  • 00:10:18 to log directory every N steps. So I am going to make this 1. Why? Because I will show you how we

  • 00:10:24 are going to use this during the training. Then  go to the apply settings. Okay. Then reload UI.  
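
Once training is running, the CSV file that this setting writes into the training log folder can be summarized with a few lines of Python. This is only a sketch: the exact file name and column names can differ between web UI versions, so adjust the path and the "loss" key to whatever you see on disk.

```python
# Sketch: summarize the loss CSV written when "Save csv containing the loss to log
# directory every N steps" is enabled. File name and columns may vary by version.
import csv
from statistics import mean

csv_path = "textual_inversion/2023-01-01/tutorial training/textual_inversion_loss.csv"  # example path

with open(csv_path, newline="") as f:
    rows = list(csv.DictReader(f))

losses = [float(r["loss"]) for r in rows if r.get("loss")]
print(f"{len(losses)} logged steps, mean loss {mean(losses):.4f}, last loss {losses[-1]:.4f}")
```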

  • 00:10:31 Okay, settings were saved and UI is reloaded.  Now go to the train tab. Okay. First of all,  

  • 00:10:39 we are going to give a name to our  embedding. The name is not important at all,  

  • 00:10:46 so you can give it any name. This will be used to activate our embedding. Okay,

  • 00:10:54 so I am going to give it a name such as training example. It can be any length;

  • 00:11:00 it won't affect your results or token count. Initialization text: now, what does this mean?

  • 00:11:07 For example, you are teaching a face and you  want it to be similar to Brad Pitt. Then you  

  • 00:11:14 can type Brad Pitt. So what does this mean?  Actually, to show you that first we are going  

  • 00:11:20 to install an extension. Go to the extensions available tab, load from here, and type embed into your search bar,

  • 00:11:29 and you will see embedding inspector. This is an extremely useful extension, so let's install it.

  • 00:11:39 Okay, the extension has been installed, so let's  just restart with this. Okay, now we can see the  

  • 00:11:50 embedding inspector. So everything in Stable Diffusion is composed of tokens. What does that

  • 00:12:00 mean? You can think of tokens as keywords, but not exactly like that. For example, when we type cat

  • 00:12:07 and click inspect, the cat is a single token  and it has an embedding ID and it has weights.  

  • 00:12:16 So every token has numerical weights, like  this. And when we do training with embeddings,  

  • 00:12:27 actually we are going to generate a new vector  that doesn't exist in the stable diffusion. We  

  • 00:12:34 are going to do training on that. So when you  set initialization text like this, by the way,  

  • 00:12:42 it is going to generate a vector with the weights of this. However, this is two tokens. How do I

  • 00:12:51 know? Go to the embedding inspector and type Brad. So you see, Brad is a single token.

  • 00:12:57 It has weights. And let's type Pitt; Pitt is another token and it also has a vector.

  • 00:13:05 So these weights will be assigned initially to our new vectors. However, we have to use at least

  • 00:13:14 two vectors, otherwise we wouldn't be able to hold both tokens' weights. So if we start our training with Brad

  • 00:13:24 Pitt, our first initial weights will be according  to the Brad Pitt and our model will learn upon  

  • 00:13:32 that. Is this good? If your face is very similar  to Brad Pitt, yes, but if it is not, no. So  

  • 00:13:42 Shondoit from the Automatic1111 community has done extensive experimentation and found that

  • 00:13:54 leaving the initialization text empty, so that we start with zeroed vectors, performs

  • 00:14:04 better than starting with, for example, *. Because * is also just another token, and you can see it from

  • 00:14:13 here. Just type * here. It is just some vectors  like this. So starting with empty vectors is  

  • 00:14:21 better. And now the number of vectors per token. So every token has a vector in

  • 00:14:30 Stable Diffusion, and you may wonder how many tokens there are. To find that out, we are going to

  • 00:14:38 check out clip-vit-large-patch14. In here you will see the tokenizer JSON. Inside this JSON

  • 00:14:45 file all of the tokens are listed. So you see, let me show you, there are word IDs and the words themselves,

  • 00:14:56 like here, you see. So the list starts from here. Each one of these is a token, and

  • 00:15:04 it goes on to the bottom like this: for example, sickle, whos, lamo, etour, finity. So these are

  • 00:15:12 all of the tokens, all of the embeddings that Stable Diffusion contains. If you wonder how many

  • 00:15:19 there are exactly, there are exactly 49408 tokens and each contains one vector. For SD 1.x versions

  • 00:15:33 the vector size is 768, and for the SD 2 versions it is 1024. So when we do embedding inspector,

  • 00:15:45 you see it is showing the vector. So  everything is composed by numerical weights  

  • 00:15:50 and they are being used by the machine learning algorithms to do inference. Also, every

  • 00:16:00 prompt we type gets tokenized, and I will show that tokenization right now.

  • 00:16:06 Before we do that, go to the extensions available tab, load it here, and search for token, and you will

  • 00:16:13 see there is a tokenizer extension. Just install it, restart the UI, and now you will

  • 00:16:22 see tokenizer. So type your prompt here and see  how it is getting tokenized. So let's say I am  

  • 00:16:29 going to use this kind of prompt. It is showing  in the web UI that fifty eight tokens are being  

  • 00:16:37 used and we are limited to seventy five tokens.  But we are not using fifty eight words here.  

  • 00:16:44 If you count the number of words it is not fifty  eight. So let's copy this and go to the tokenizer,  

  • 00:16:50 paste it and tokenize, and now it is showing all of the tokenization. So face is a single token

  • 00:16:57 with an ID of 1810. Photo is a single token, and let's see: OK, so artstation is two

  • 00:17:06 tokens. It is art and station. Commas are also single tokens, as you can see, and let's see if there is

  • 00:17:13 anything else being split into multiple tokens. Photorealistic: photorealistic is also two

  • 00:17:21 tokens, and artstation is two tokens. So this is how tokenization works. Each of these tokens has

  • 00:17:29 its own vector, and you can see the weights in the embedding inspector. However, that is not very

  • 00:17:35 useful on its own, because these numbers don't mean anything individually, but in the bigger scheme they

  • 00:17:42 work very well with the machine learning algorithms. Machine learning is all about weights.
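
If you want to reproduce this token counting outside the web UI, here is a small sketch using the Hugging Face transformers library and the CLIP tokenizer linked earlier on this page (the prompt is just an example):

```python
# Sketch: count CLIP tokens outside the web UI with the Hugging Face "transformers"
# library and the openai/clip-vit-large-patch14 tokenizer linked earlier on this page.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
print(tokenizer.vocab_size)  # 49408 tokens in total

prompt = "face photo of a man, photorealistic, artstation"  # example prompt
tokens = tokenizer.tokenize(prompt)
print(len(tokens), tokens)  # "photorealistic" and "artstation" each split into two tokens
```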

  • 00:17:50 Also, in the official paper of textual inversion,  on the page four, you see they are showing a photo  

  • 00:17:58 of a star, which is our embedding name. So you see  there is a tokenizer and token IDs and they have  

  • 00:18:07 vectors like this. So it is all about vectors  and their weights. OK. Now we can return to  

  • 00:18:14 train tab. Now we have an idea of our tokenization. So let's give it a name such as tutorial training. You

  • 00:18:23 can give this any name; this will be the activation text. Initialization text: I am just leaving it empty to

  • 00:18:30 obtain the best results, so our vectors will start at zero. Let's say you are training

  • 00:18:37 a bulldog image; then you could start with bulldog weights, and it may make your

  • 00:18:46 training better. However, for faces, since we are training a new face that the model has no idea about,

  • 00:18:54 I think leaving it empty is better. So, number of vectors: now you know that

  • 00:19:01 each token has one vector, which means that when  we type Brad Pitt, only two vectors are used for  

  • 00:19:10 that. So all of the Brad Pitt images are represented in the Stable Diffusion model with just two vectors,

  • 00:19:18 which means that two vectors is a good number for our face training or

  • 00:19:29 for our subject training. I also have made a lot of experiments with one, two,

  • 00:19:35 three and four vectors, and I have found that two vectors work best. However, this is

  • 00:19:42 based on my training data set. You can also try  one, two, three, four, five and you will see that  

  • 00:19:49 the quality is decreasing as you increase the  number of vectors. Also, in the official papers  

  • 00:19:55 the researchers have used up to three vectors. You see extended latent spaces: this is the

  • 00:20:01 vector count as described in the official paper, and they have used up to three. You see it

  • 00:20:07 denoted as two words and three words, but it is up to you to experiment, and I am going to

  • 00:20:14 use two. If you check overwrite old embedding, it will overwrite an existing embedding with the same name. So

  • 00:20:20 let's click create embedding, and it is created. So where is it saved? Go to your installation

  • 00:20:30 folder and in here you will see embeddings, and in there we can see our embedding has already been created.

  • 00:20:37 Then let's go to preprocess images. This is a generic tab of the web UI. It lets you crop

  • 00:20:48 images, create flipped copies, split oversized images, do auto focal point crop, use BLIP for

  • 00:20:53 captioning, or use deepbooru for captioning. There is a source directory and a destination directory.

  • 00:21:00 So I have a folder like this for experimentation  and showing I am copying its address like this  

  • 00:21:07 and pasting it in here as source and I am going  to give it a destination directory like a1. They  

  • 00:21:15 are going to be auto-resized and cropped. So let's  check. Let's check this checkbox. Create flipped  

  • 00:21:22 copies. By the way, for faces, I am not suggesting  to use this. It is not improving quality. You can  

  • 00:21:28 also split oversized images, but this doesn't  make sense for faces. Autofocal point: yes,  

  • 00:21:34 let's just also click that. Use BLIP for  caption. So it will use BLIP algorithm for  

  • 00:21:40 captioning. This is better for real images; deepbooru is better for anime images, I think. OK,

  • 00:21:48 and then let's just click preprocess. By the way, why are we using 512 by 512? Because version 1.5,

  • 00:21:58 Stable Diffusion version 1.5, is based on 512 by 512 pixels. If you use version 2.1 of Stable

  • 00:22:07 Diffusion, it has both 512-pixel and 768-pixel models. So you need to process images based on

  • 00:22:19 the native resolution of the model that you are going to train on. In the training

  • 00:22:26 tab it will use the model selected here. So be careful with that. And the first time

  • 00:22:32 when you do preprocessing, it is downloading the  necessary files as usual. OK, the processing has  

  • 00:22:38 been finished. Let's open the processed folder, the a1 folder, from pictures. And now

  • 00:22:45 you see there are flipped copies and they were automatically cropped to 512 by 512 pixels. And

  • 00:22:52 there are also descriptions generated by the BLIP.  When you open the descriptions, you will see like  

  • 00:22:58 this: a man standing in front of a metal door in  a building with a blue shirt on and black pants.  
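
For the curious, the resize/crop part of the preprocess tab can be approximated with Pillow. This is only a rough sketch of what happens under the hood: the folder paths are examples, BLIP captioning and auto focal point crop are not reproduced, and flipped copies are optional (this tutorial advises against them for faces).

```python
# Rough sketch of the resize/crop part of the "Preprocess images" tab.
# Paths are examples; BLIP captioning and auto focal point crop are not reproduced.
from pathlib import Path
from PIL import Image, ImageOps

src = Path("training_images_raw")   # example source directory
dst = Path("training_images_512")   # example destination directory
dst.mkdir(exist_ok=True)

for img_path in src.glob("*.jpg"):
    img = Image.open(img_path).convert("RGB")
    # Center-crop to a square and resize to the model's native resolution
    # (512x512 for SD 1.5; use 768x768 for the 768-pixel SD 2.1 model).
    img = ImageOps.fit(img, (512, 512), method=Image.LANCZOS)
    img.save(dst / img_path.name)
    # Optional flipped copy (not recommended for faces in this tutorial).
    ImageOps.mirror(img).save(dst / f"{img_path.stem}_flipped{img_path.suffix}")
```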

  • 00:23:05 So now we are ready with the preprocess images,  we can go to the training tab. In here we are  

  • 00:23:13 selecting the embedding that we are going to train, and the embedding learning rate. There are various,

  • 00:23:20 let's say, discussions on this learning rate, but in the official paper 0.005 is used. Therefore,

  • 00:23:29 I believe that this is the best learning rate. The gradient clipping is related to hypernetwork

  • 00:23:36 training, so just don't touch it. So the batch size and gradient

  • 00:23:42 accumulation steps: this is also explained in the official paper. The batch size and gradient

  • 00:23:49 accumulation steps will just increase your training speed if you have a sufficient amount

  • 00:23:54 of RAM and VRAM. However, make sure that the number of training images is divisible by

  • 00:24:00 the product of these two numbers. So let's say you have 10 training images; then you can set

  • 00:24:10 these as batch size 2 and gradient accumulation 5, which is two multiplied by five, equal to 10,

  • 00:24:20 or some other combination. Or let's say you have 40 training images; then you can set it as 20, 10 or 5,

  • 00:24:28 it is up to you. However, this will significantly increase your VRAM usage. And

  • 00:24:35 let's say the product of these two numbers is equal to 10. Then you should also multiply the

  • 00:24:42 learning rate by 10. Why? Because this requires the learning rate to be increased. How do I know that?
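
A tiny sketch of that scaling rule as I read it (the paper's own example, discussed next, multiplies the base rate by eight):

```python
# Sketch of the learning-rate scaling rule discussed here: scale the base rate
# by batch size x gradient accumulation steps (x number of GPUs).
def scaled_learning_rate(base_lr: float, batch_size: int,
                         grad_accum_steps: int = 1, num_gpus: int = 1) -> float:
    return base_lr * batch_size * grad_accum_steps * num_gpus

# The paper's example discussed next: 2 GPUs x batch size 4 -> 0.005 * 8 = 0.04
print(round(scaled_learning_rate(0.005, batch_size=4, num_gpus=2), 5))          # 0.04
# This tutorial's example: batch size 2 x gradient accumulation 5 -> 0.005 * 10
print(round(scaled_learning_rate(0.005, batch_size=2, grad_accum_steps=5), 5))  # 0.05
```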

  • 00:24:51 In the official paper, in  the implementation details,  

  • 00:24:55 they say that they are using two graphics cards with a batch size of four. Then they are scaling

  • 00:25:03 the base learning rate by multiplying it by eight. Why? Because with two graphics cards and

  • 00:25:08 batch size four, four multiplied by two is eight, and when you multiply 0.005 by 8 we obtain

  • 00:25:17 0.04. So be careful with that: if you increase the batch size and gradient accumulation steps,

  • 00:25:25 also make sure that you increase the learning rate as well. However, for this tutorial

  • 00:25:31 I am going to use a batch size of one and gradient accumulation steps of one. Actually, until you

  • 00:25:37 obtain good initial results, I suggest you don't change them; then you can change them. Then you

  • 00:25:44 need to set your training data set directory.  So let's say I am going to use these images,  

  • 00:25:51 then I am going to set them. Also, there is log  directory, so the training logs will be logged  

  • 00:25:59 in this directory where it is. When we open our  installation folder, we will see that there is a  

  • 00:26:11 textual inversion. However, since we still didn't  start yet, it is not generated. So when the first  

  • 00:26:18 time we start, it will be generated. I  suggest you to not change this. Okay,  

  • 00:26:23 prompt template. So what are prompts templates?  Why are they used? Actually, there is not a clear  

  • 00:26:31 explanation of this in the official paper.  When you go to the very bottom, you will see  

  • 00:26:38 training prompt templates. So these templates are  actually derived from here. From my experience,  

  • 00:26:46 I have a theory that these prompts are used like this. Let's say you are teaching a photo of a

  • 00:26:55 person; then the vectors of these tokens are also used to obtain your target image. So

  • 00:27:04 they are helping to reach your target image. This is my theory. So it is using the vector of photo,

  • 00:27:14 the vectors of a and of, or if you are teaching a style, then it is using that. So these templates

  • 00:27:23 are actually these ones. When you open the prompt template folder, which is in here, let's go to the

  • 00:27:32 textual inversion templates and you will see the template files like this. So, when

  • 00:27:38 you open subject_filewords, you will get a list like this: a photo of a [name], [filewords],

  • 00:27:44 and so on. The [name] is the activation name that we have given. It will be treated specially. It will

  • 00:27:52 not get turned into a regular token. For example, tutorial training would be tokenized like this if

  • 00:28:00 it was not an embedding: tutorial training. Let's click tokenize. You see, tutorial training is

  • 00:28:07 actually three tokens; tutorial is tokenized into tutor and ial, plus training. However, since

  • 00:28:16 it will be our special embedding name, it will be treated as a number of special

  • 00:28:25 tokens, based on the number of vectors per token we decided on. If we set this

  • 00:28:33 to 10, then it will use 10 tokens of space from our prompt, so it will take 10 spaces in here.

  • 00:28:42 However, it will now take only two instead of three, because it will be specially treated. Okay,

  • 00:28:51 let's go back to the train tab. So, sorry about that: this [name] is the name of

  • 00:29:01 our embedding, and then there are the [filewords]. The filewords are the description generated here.

  • 00:29:08 So, basically, the prompt for training will become tutorial training, plus the filewords;

  • 00:29:16 let's say we are training on this particular image. It will just take this caption and append it

  • 00:29:22 here, and this becomes the final prompt for that image during training. So, what

  • 00:29:30 should we? How should we edit this description? You should describe the parts that you don't want

  • 00:29:39 the model to learn. Which parts don't I want the model to learn? I don't want it to learn this clothing,

  • 00:29:45 these walls, for example, or this item here. So I have to describe them as fully as possible.

  • 00:29:53 So if I want the model to learn the glasses, then I need to remove glasses from the caption, okay,

  • 00:29:58 and for example, if I want the model to learn my smile, I should just remove it. Okay, I want

  • 00:30:06 the model to learn my face; therefore I can just remove that, and so on. However,

  • 00:30:15 i am not going to use file words in this  training, because i have found that if you  

  • 00:30:21 pick your training data set carefully, you don't  need to use filewords. So, how am i going to do  

  • 00:30:29 training in this case? I am just going to create a new text file here and name it my special.

  • 00:30:41 Okay, let's just open it, and here I am just going to type [name]. You have to use at least [name],

  • 00:30:47 otherwise it won't work; it will throw an error. And I am not going to use [filewords].
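
Roughly, this is how a template line gets expanded into the final training prompt: [name] is replaced by the embedding name and [filewords] by the caption in the image's .txt file. A small illustration only — the template line, name and caption below are examples, not the web UI's actual code.

```python
# Illustration of how a prompt template line is expanded during training:
# [name] -> embedding name, [filewords] -> the caption from the image's .txt file.
def expand_template(template: str, embedding_name: str, filewords: str = "") -> str:
    prompt = template.replace("[name]", embedding_name)
    prompt = prompt.replace("[filewords]", filewords)
    return prompt.strip().strip(",").strip()

# A subject_filewords-style template line with an example caption:
print(expand_template("a photo of a [name], [filewords]",
                      "tutorial training",
                      "a man standing in front of a metal door"))
# The minimal custom template used in this tutorial, containing only [name]:
print(expand_template("[name]", "tutorial training"))
```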

  • 00:30:55 Also, I am not going to use myself in this training. I am going to use one of my followers.

  • 00:31:03 He sent me his pictures. Let me show you the original pictures he sent me. Okay, these are

  • 00:31:10 the images he sent me. However, I didn't use all of them. You see the images right now. There

  • 00:31:17 are different angles and, different backgrounds.  When you are doing textual inversion, you should  

  • 00:31:25 only teach one subject at a time, but if you want  to combine multiple subjects, then you can train  

  • 00:31:32 multiple embeddings and combine all of them when generating images.

  • 00:31:38 So which ones did I pick? Let me show you. I have picked these ones, okay, and now you will notice

  • 00:31:46 something here. You see, the background here is like this: you see green and some noise. Why?

  • 00:31:55 Because I don't want the model to learn the background. So if multiple images contain the same background,

  • 00:32:01 I just noise out those backgrounds. And why did I not noise out the other backgrounds? Because

  • 00:32:07 the other backgrounds are different. So you see, in your training dataset, only the subject should

  • 00:32:14 be the same, and all other things need to be different, like backgrounds, clothing and other things.

  • 00:32:21 So the training will learn only your subject,  in this case the face. It will not learn the  

  • 00:32:28 background or the clothing. Okay, so let me show  the original one. So in the original one you see  

  • 00:32:34 this image, this image, this image and these two  images have same backgrounds. So i have edited  

  • 00:32:40 those same backgrounds with Paint.NET, which is free editing software. You can also edit with

  • 00:32:47 Paint. How did I edit it? It is actually so simple and amateur, you may say. So let's set a brush

  • 00:32:55 size here and just, for example, change the color like this. Then I added some noise: select it

  • 00:33:04 with a selection tool, set the tolerance from here, go to the effects menu, and

  • 00:33:11 in here you will see distort and frosted glass; when you click it, it will change the appearance.

  • 00:33:20 You can also try other distortion. By the  way, i am providing these images to you for  

  • 00:33:27 testing. Let me show you the link. So i have  uploaded images into a google drive folder  

  • 00:33:33 and I am going to put the link of this into the description so you can download this data

  • 00:33:38 set, do training and see how it performs, and whether you are able to obtain results as good as mine.

  • 00:33:45 Okay, so i am going to change my training data  set folder from pictures and i am going to use  

  • 00:33:55 example training set folder. I am going to set it  in my training here. Okay, and i am going to use  

  • 00:34:05 my prompt template. Just refresh it and go to the  my special. So what was my special? My special was  

  • 00:34:12 only containing [name]. It does not contain any file descriptions. I have found that this

  • 00:34:18 works great if you optimize your training dataset as I did. You can try both approaches: you

  • 00:34:26 can try with [filewords] and you can try without [filewords], and you can see how each

  • 00:34:33 works. Okay, do not check resize images, because our images are already 512 pixels. Max steps: now,

  • 00:34:40 this can be set to anything. I will show you a way to understand whether you have started over training

  • 00:34:48 or not, so this can stay like this. In how many steps do we want to save? Okay, this is rather

  • 00:34:56 different than epochs in DreamBooth, if you have watched my DreamBooth videos. Each image

  • 00:35:03 is one step and there is no epoch-based saving here; it is step-based saving. How many training images do I

  • 00:35:10 have? I have 10 images in total; therefore, for saving every 10 epochs we need to set this to 100.

  • 00:35:18 So the formula is like this: one epoch equals the number of training images. 10 epochs

  • 00:35:24 for 10 training images is 10 multiplied by 10, which is 100, so it will save every 10 epochs.
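
The same arithmetic as a trivial helper (the numbers are this tutorial's example, with batch size 1 and gradient accumulation 1):

```python
# Step/epoch arithmetic used here: with batch size 1 and gradient accumulation 1,
# one step = one image, so one epoch = num_training_images steps.
def save_every_steps(num_training_images: int, epochs_between_saves: int) -> int:
    return num_training_images * epochs_between_saves

print(save_every_steps(10, 10))  # 100 -> save every 10 epochs for 10 training images
```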

  • 00:35:31 Save images with embedding in PNG chunks: this will save the embedding info

  • 00:35:41 in the generated preview images. I will show you. Read parameters from text to image tab

  • 00:35:48 when generating preview images: I don't want that, so it will just use the regular prompts

  • 00:35:57 that we will see in here. Shuffle tags by comma, which means that if you use filewords,

  • 00:36:05 the words in there will be shuffled when doing  training. This can be useful. You can test it  

  • 00:36:11 out. And drop out tags when creating prompts: this means that it will randomly drop the

  • 00:36:18 file descriptions, the file captions, that you have used. This is, I think,

  • 00:36:23 percentage based, so if you set it to 0.1 it will randomly drop out 10 percent. And I am not

  • 00:36:30 going to use filewords, therefore this will have zero effect. Okay, choose latent sampling method.

  • 00:36:36 I have also researched this. In the official paper, random is used. However, one of the

  • 00:36:43 community developers proposed deterministic, and he found that deterministic works best. So

  • 00:36:49 choose deterministic. And now we are ready, so we can start training. I am going to click train

  • 00:36:57 embedding. Okay, training has started, as you can see. It is displaying the number of

  • 00:37:05 epochs and the number of steps. It is displaying the iteration speed, so currently it is 1.30 seconds

  • 00:37:14 per iteration. Why? Because I am recording and that is already taking a lot of GPU power. I have an

  • 00:37:21 RTX 3060. It has 12 gigabytes of memory. Let me also show you what is taking up the memory.

  • 00:37:32 You see, OBS Studio is already using a lot of GPU memory, and the training does too. But since they are

  • 00:37:39 using different parts of the GPU, I think it is working fine. When we open the performance tab,

  • 00:37:43 we can see that the training is using the 3D part of the GPU and OBS is using the video encode part

  • 00:37:52 of the GPU. That is how I am still able to record, but sometimes it is dropping out my voice. I

  • 00:37:59 hope that it is currently recording very well.  Okay, and i also did some overclocking to my gpu  

  • 00:38:10 by using MSI Afterburner. I have increased  the core clock by 175 and i have increased  

  • 00:38:18 memory clock by 900, so this boosted my training  speed like 10%. You can also do that if you want.  

  • 00:38:26 I didn't do any core voltage increase. So it has already generated two preview images; where

  • 00:38:34 are we going to find them? Let me show you. They will be inside the textual inversion folder. You

  • 00:38:42 see it has just appeared, and when you open it you will see the date of the training, and

  • 00:38:51 you will see the name of the embedding we are training, and in here you will see embeddings. This

  • 00:38:56 is the checkpoint, so you can use any checkpoint to generate images, and these are the images

  • 00:39:04 that it has generated. So this is the first image, and there is also image embeddings. This image

  • 00:39:11 embedding contains the embedding info. Why is this generated? Because we checked that checkbox.

  • 00:39:19 You will see the used prompts here. Since I didn't use any filewords and I just used [name],

  • 00:39:25 it is only using this name as a prompt. And what  does that mean? That means that it is only using  

  • 00:39:34 the vectors we have generated in the beginning  to learn our subject, to learn the details of  

  • 00:39:40 our subject, which is the face, and we have two vectors to learn with; and Brad Pitt is also

  • 00:39:48 based on two vectors, so why not? We can also be taught to the model with two vectors.

  • 00:39:56 Okay, just at the 20th epoch, we are already getting some similarity.

  • 00:40:03 Actually, I already did the same training, so I already have the trained data,

  • 00:40:12 but I am recording while training again for you, to explain it to you better.

  • 00:40:21 It also shows here the estimated time for training to be completed. This time is based on

  • 00:40:28 100,000 steps, but we are not going to train that much. Actually, I have found that at around three

  • 00:40:35 thousand steps we are getting very good results with the training dataset I have. It will totally

  • 00:40:42 depend on your training dataset how many steps it takes to teach your subject.

  • 00:40:48 I will show you the way how to determine which one  is best, which checkpoint is best, which number of  

  • 00:40:56 steps is best. Okay, at 30 epochs we already got a very similar image. You see, with just 30

  • 00:41:06 epochs we are starting to get very similar images. It is starting to learn our subject very well

  • 00:41:12 with just 30 epochs, and when we get over 100 epochs, we will get much better quality images.

  • 00:41:20 Okay, it has been over 600 steps and over 60 epochs,

  • 00:41:26 and we got six preview images, since we are generating preview images and checkpoints

  • 00:41:32 every 10 epochs. Now I am going to show you how you can determine whether you are overtraining or

  • 00:41:40 not with a community-developed script. The script name is Inspect Embedding Training.

  • 00:41:50 It is hosted on github. It's a public project.  I will put the link of this project to the  

  • 00:41:55 description as well. Everything, every link,  will be put to the description. So check out  

  • 00:41:59 the video description and in here, just click  code and download as zip. Okay, it is downloaded.  

  • 00:42:06 When you open it you will see the Inspect Embedding Training files. Extract them into your textual

  • 00:42:14 inversion, tutorial training folder, as I have shown, so you will see these files there. To extract them,

  • 00:42:21 just drag and drop. Why are we extracting them here? Because we are going to analyze the loss.

  • 00:42:28 And so, the loss: what is loss? You always see the loss here; its numeric value is here.

  • 00:42:37 Loss is the penalty for a bad prediction. That is, loss is a number indicating how bad

  • 00:42:43 the model's prediction was on a single example. If the model's prediction is perfect, the loss is

  • 00:42:49 zero; otherwise the loss is greater. In our case we can think of it as how close the model-generated image

  • 00:42:56 is to our training subject, our training images. So if you get a zero loss,

  • 00:43:05 that means the model is learning very well, okay. If your loss is too high, that means that

  • 00:43:11 your model is not learning. Now, with this script  we have extracted here, we are going to see the  

  • 00:43:20 loss. And how are we going to use this script?  This script requires torch installation and  

  • 00:43:28 torch is already installed in our web UI folder, inside the venv (virtual environment) folder,

  • 00:43:36 and inside here, Scripts. So we are going to use the python.exe here to do that. First copy the

  • 00:43:43 path of it. Open a notepad file like this, put quotation marks, and just type python.exe like

  • 00:43:54 this: okay, then we are going to get the path  of the file. Let me show you. The script file  

  • 00:44:02 is in this folder. So, with quotation marks,  just copy and paste it in here and type the  

  • 00:44:12 script file name like this: then open  a new cmd window by typing like this:  

  • 00:44:19 okay, let me zoom in, copy and paste the path, the command, like this, and just hit enter,

  • 00:44:28 and you will see it has generated some info for us: the learning rate at each step,

  • 00:44:34 a loss jpg, a vector jpg, and the average vector strength. So let's open our folder in

  • 00:44:42 here and we will see the files. When we open the  loss file we are going to see a graph like this:  

  • 00:44:50 the average loss is below 0.2, which means it is learning very well. The closer it is to 0,

  • 00:44:58 the better, and the closer it is to 1, the worse. So currently we are able to

  • 00:45:04 learn very well. Now I will show you how to determine whether we are overtraining or not.

  • 00:45:13 To do that, we are going to add a parameter here, --folder, and just give the folder of the

  • 00:45:21 embedding files here. Just copy-paste it again, do not forget the quotation marks, and open a new

  • 00:45:28 cmd window. Just copy and paste it and hit enter. It will calculate the average strength of the

  • 00:45:37 vectors, and when this strength is over 0.2, that usually means that you have started overtraining. How

  • 00:45:46 do we know? According to the developer of this script, if the average

  • 00:45:54 strength of all the vectors is greater than 0.2, the embedding starts to become inflexible. That

  • 00:46:02 means overtraining. So you will not be able to stylize your trained subject; you won't

  • 00:46:14 be able to get good images like this if you overtrain, if the strength

  • 00:46:20 of the vectors becomes too strong. And what was the vector strength? It was simple: when we opened

  • 00:46:29 the embedding inspector tab, we were able to see the values of the vectors. So this

  • 00:46:37 strength means the average of those values, and when the average of these values is

  • 00:46:42 over 0.2, that means you are starting to overtrain. You need to check this

  • 00:46:50 to determine that. By the way, it is said that DreamBooth is best for teaching faces,

  • 00:46:58 and in the official paper of textual inversion the authors, the researchers,

  • 00:47:04 have used, as you see, objects like this, or they have trained on styles, let me show you, like

  • 00:47:13 here. However, as I have just demonstrated to you, these textual embeddings are

  • 00:47:22 also very good, very successful, for teaching faces as well, and for objects, of course,

  • 00:47:29 they work very well too. And for styles, I think textual inversion, text embeddings,

  • 00:47:36 is much better than DreamBooth. So if you want to teach objects or styles, then I suggest

  • 00:47:44 you use textual inversion. Actually, for faces I think the textual inversion of Automatic1111

  • 00:47:51 works very well too. And for DreamBooth, to obtain very good results you need to merge your

  • 00:47:59 learned subject into a new model, which I have shown in my video. So if you use DreamBooth,

  • 00:48:05 you should inject your trained subject into a good custom model to obtain very good images.

  • 00:48:10 But on textual inversion, you can already obtain  very good images. Okay, we are over 170 epoch and  

  • 00:48:20 meanwhile training is going on. I will show you  the difference of DreamBooth, textual inversion,  

  • 00:48:27 LoRA and hypernetworks. One of the community members on Reddit, use_excalidraw, prepared

  • 00:48:36 an infographic like this, and it is very useful. So in DreamBooth, we are modifying the weights of

  • 00:48:43 the model itself. You already know by now that all of the prompt words we use

  • 00:48:52 each have vectors, and these vectors are getting modified in

  • 00:48:59 DreamBooth, all of them. The token we selected for DreamBooth is also getting modified, and in

  • 00:49:06 DreamBooth we are not able to add a new vector; we have to use one of the existing vectors

  • 00:49:15 of the model. Therefore, we are selecting one of the existing tokens in the model,

  • 00:49:20 such as sks or ohwx. So in DreamBooth we are basically modifying, altering, the model itself.

  • 00:49:32 Okay, in Textual Inversion we are adding a new  token. Actually, this is displayed incorrectly  

  • 00:49:39 because it is generating a unique new vector which  does not exist in the model, and we are modifying  

  • 00:49:50 the weights of these new vectors. So when we set  the vector count as two, it is actually using two  

  • 00:49:58 unique new tokens. So it is modifying two vectors.  If we set the vector count 10, it is using 10  

  • 00:50:06 unique tokens. It is being specially treated, it  is adding new 10 vectors and it is not modifying  

  • 00:50:15 any of the existing vectors of the model. So if we set the vector count to 10, then

  • 00:50:25 when we generate an image in here, it will use 10 vectors. It will use 10 tokens out

  • 00:50:31 of the 75-token limit we have. So this is how it works. Also, if you use 10 vectors,

  • 00:50:41 you will see that you get very bad results for faces. I have made tests. Okay,

  • 00:50:47 LoRA is very similar to DreamBooth: it is modifying the existing vectors

  • 00:50:55 of the model. I have found that LoRA is inferior to DreamBooth, but it just uses

  • 00:51:05 less VRAM and it is faster; therefore, people are choosing it. However, for quality, DreamBooth is

  • 00:51:12 better, as shown here. And then the hypernetworks: hypernetworks don't have an official academic

  • 00:51:21 paper. I think they were built upon leaked code, and this is the least successful method. It

  • 00:51:29 gives the worst quality, so just don't waste time with it; I don't suggest using it.

  • 00:51:38 So in hypernetworks, the original weights, the original vectors of the model, are not modified;

  • 00:51:44 instead they are swapped in at inference time. Inference means that when you generate an image from text to image,

  • 00:51:50 that is inference. So you see there are some images here labelled

  • 00:51:58 training sample, apply noise, compare, and there is the loss. So this is how the model

  • 00:52:06 is learning, basically. Of course, there are a lot of details; if you are interested in them,

  • 00:52:11 you can just read the official paper, but it is very hard to understand and quite complex.

  • 00:52:19 Okay, we are over 200 epochs, so we have 20 example images, and the last one is extremely

  • 00:52:27 similar to our original training set, as you can see. So let's also check the strength

  • 00:52:34 of the training vectors. I am just hitting the up arrow on my keyboard, and it

  • 00:52:43 retypes the last executed command; hit enter. Okay, so our average strength is 0.13,

  • 00:52:53 actually almost 0.14. We are getting close to 0.2.  After 0.2, we can assume we started over training.  
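
If you want to check that number yourself rather than through the script, here is a sketch that computes the same average-absolute-weight figure from a saved embedding checkpoint. It assumes the usual Automatic1111 .pt layout with a string_to_param entry; the path is an example, so adjust it (and the loading call) to your own files.

```python
# Sketch: compute the average vector "strength" (mean absolute weight) of a saved
# embedding checkpoint, assuming the usual Automatic1111 .pt layout.
import torch

emb_path = "embeddings/tutorial training-2000.pt"  # example checkpoint path
data = torch.load(emb_path, map_location="cpu")    # add weights_only=False on newer torch if needed

# The trained vectors usually live under "string_to_param" (typically keyed by "*").
vectors = next(iter(data["string_to_param"].values()))
print(tuple(vectors.shape))         # e.g. (2, 768) for 2 vectors on SD 1.x
print(float(vectors.abs().mean()))  # average strength; above ~0.2 suggests overtraining
```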

  • 00:53:02 Of course, this will depend on your training dataset, but it is an indication according

  • 00:53:08 to the experience of this developer. It also makes sense, because as the strength of the vector

  • 00:53:16 increases, it will override the other vectors. You see, since they are all floating point numbers,

  • 00:53:27 the bigger numbers usually render the smaller numbers

  • 00:53:36 ineffective. This is how machine learning usually works, depending on the chosen algorithms. They

  • 00:53:41 are extremely complex, but this is one of the, let's say, common principles in many

  • 00:53:49 of the numerical-weight-based machine learning algorithms. Therefore, it also makes sense.

  • 00:53:57 Okay, we are over 500 epochs at the moment. So let me show you the generated sample images. These

  • 00:54:05 are the sample images. They are already very similar, and the latest one, you see, looks like it is getting

  • 00:54:11 overtrained. So let's check with the script we have. Just hit the up arrow and hit enter,

  • 00:54:20 and you see, we are now over 0.2 strength. Therefore, I am going to cancel the training, and

  • 00:54:28 now I will show you how to use these embeddings.  But before doing that, first let's set the newest  

  • 00:54:36 VAE file to generate better quality images. To do that, let's go to, let me find it,

  • 00:54:49 okay, go to the Stable Diffusion tab in the settings, and in here, you see,

  • 00:54:54 SD VAE is set to automatic. I am going to select the one we put there. Let's apply settings, okay,

  • 00:55:02 and then we will reload to UI. Okay, settings  applied and UI is reloaded. So how are we going to  

  • 00:55:11 use these generated embeddings? It is easy. First let's enter our textual inversion directory, and

  • 00:55:21 inside here let's go to the embeddings folder. Let me show you what the path looks like. I know

  • 00:55:29 that it looks small, so: this is where I have installed my Automatic1111. This is the

  • 00:55:40 main folder: textual inversion. This is the date  of the training, when it was started. This is the  

  • 00:55:46 embedding name that I have given and this is the  folder where the embedding checkpoints are saved.  

  • 00:55:53 When we analyze the weights, we see how they have changed. So I am going to pick about 20

  • 00:56:01 of them to compare. How am I going to do that? I will pick one every 200 steps, like this:

  • 00:56:10 okay, I have selected 24. Right click, copy. By the way, for selecting each one of them I have

  • 00:56:18 used the Ctrl key. You can select all of them; that is just fine. Then move to the main

  • 00:56:23 installation folder, and in here you will see the embeddings folder. Go there; I'm just going to delete

  • 00:56:28 the original one, and I am pasting in the checkpoint ones. So how are we going to use them? Just

  • 00:56:37 type their name like this; this is the equivalent of ohwx in the DreamBooth tutorials that we have.

  • 00:56:45 And let's see: currently it says that it is using seven tokens, but this is not correct,

  • 00:56:53 actually. It should be using just two. Okay, maybe it didn't refresh. Let's do a generation.

  • 00:57:03 Okay, we got our picture. I think it was showing seven because it had not loaded the embedding yet. Okay,

  • 00:57:15 yeah, so okay, now it's fixed. Now you see it is using only two tokens. Why? Because now it has loaded

  • 00:57:25 the embedding by its file name, and our embedding was composed of two vectors. Therefore,

  • 00:57:34 it is using two vectors. However, if this was not our embedding name, if it was just a

  • 00:57:42 regular prompt, if we go to the tokenizer we can see how many tokens it was going to take. Let me show you: one,

  • 00:57:50 two, three, four, five, six, seven, eight tokens.  You see each number is a token. This is a token.  

  • 00:57:58 So it was going to use eight tokens, but since it  is an embedding name and the embedding is only two  

  • 00:58:05 vectors, it is using only two tokens because  in the background, in the technical details,  

  • 00:58:13 it is composed of two unique tokens, since we set the vector count to 2. So for each vector a

  • 00:58:20 token is generated, and with a textual embedding we are able to generate

  • 00:58:28 new tokens, unlike DreamBooth; DreamBooth can only use the existing tokens. Okay, so now we are going

  • 00:58:35 to generate a test case using the X/Y plot. I have tested CFG values and the prompt strength.

  • 00:58:45 By prompt strength I mean the prompt attention emphasis, and it is explained in the wiki

  • 00:58:51 of the Automatic1111 Stable Diffusion web UI. So you see, when you use parentheses like this,

  • 00:58:57 it increases the attention by a factor of 1.1. You can also set the attention directly like

  • 00:59:03 this. So I have tested the prompt attention with embeddings, and it always resulted in bad quality

  • 00:59:12 for me, but still you can test with them. I also  played with the CFG higher values. They were also  

  • 00:59:18 not very good, but now i will show you how to  test each one of the embeddings. So instead of  

  • 00:59:26 typing each one of the names manually, I have prepared a public JSFiddle script. I will

  • 00:59:34 also share the link of this script so you will be able to use it as well. So, the starting checkpoint:

  • 00:59:41 the starting checkpoint is 400, so let's set it as 400. Our increment is 200, matching the checkpoints we selected,

  • 00:59:48 and our embedding name is the tutorial name. Okay, so let's just type it in here. Then just click run,

  • 00:59:58 and you see it has generated all of the names for me; I have names up to 5000. I copy them with Ctrl+

  • 01:00:07 C or the copy button. Then we are going to paste the list into the X/Y plot here and select Prompt S/R.
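
The JSFiddle script linked above simply builds that comma-separated list; an equivalent Python sketch is below. The embedding name tutorial_test and the exact checkpoint naming are assumptions, so match them to your own run. Note that the first value of a Prompt S/R axis is the text to search for, and every following value replaces it.

```python
# Build the value list for the X/Y plot "Prompt S/R" axis.
keyword = "kw"                    # the placeholder keyword typed into the prompt
embedding_name = "tutorial_test"  # hypothetical; use your own embedding's name
start, end, increment = 400, 5000, 200

values = [keyword] + [f"{embedding_name}-{step}" for step in range(start, end + 1, increment)]
print(", ".join(values))
# Paste the printed line into the Prompt S/R values field of the X/Y plot.
```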

  • 01:00:15 Then we need to set a placeholder keyword in the prompt. Okay, let's set the keyword as kw, or test; it is not important,

  • 01:00:26 you can simply use anything here. Now I will copy and paste some good prompts. To do

  • 01:00:33 that I will use PNG Info: drag and drop an image. Okay, I have lots of experiments. As you can see,

  • 01:00:42 these experiments are from Protogen training with textual embeddings. It was

  • 01:00:47 also extremely successful for my face. Okay, let's pick from today's experiments, which are under

  • 01:00:59 here. Let's just pick one of them. Okay, now I am going to copy and paste

  • 01:01:06 it into the text-to-image tab. You see, when you use PNG Info, it shows all of

  • 01:01:11 the parameters of the selected picture, as long as the image was generated by the web

  • 01:01:19 UI with its default settings, which embed the parameters into the PNG.
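
The PNG Info tab works because the web UI, by default, writes the generation parameters into a text chunk of every PNG it saves. If you ever want to read them outside the UI, here is a minimal sketch with Pillow; the file path is a placeholder.

```python
from PIL import Image

# Placeholder path to an image generated by the web UI.
img = Image.open(r"outputs/txt2img-images/example.png")

# Automatic1111 stores the prompt and settings in a text chunk named "parameters".
params = img.text.get("parameters") if hasattr(img, "text") else None
print(params or "no embedded parameters found")
```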

  • 01:01:27 Okay, so you see the prompt: "face photo of", and let me zoom in like this: testp2400. This is from my previous embedding, so the prompt is currently 60 tokens. Now I am going to

  • 01:01:34 replace that with my test keyword. During the X/Y run it will be replaced with each of the tutorial

  • 01:01:43 training names, which are my embedding checkpoint names, via Prompt S/R. Okay. You see,

  • 01:01:51 the token count is now reduced to 55. You can actually try CFG values as well, if you want, or you can try the prompt

  • 01:02:01 strength, the prompt emphasis. To do that, just add another keyword here as another placeholder, okay, and

  • 01:02:11 add it as a second axis; not a prompt-strength axis, but Prompt S/R again, okay, and

  • 01:02:19 replace it with 1.0 and 1.1, for example. So you will see the results of different prompt emphasis,

  • 01:02:30 that is, attention emphasis, as explained here. You can test them. You can also test the CFG values.
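
As a reminder of the attention-emphasis rules mentioned above (from the web UI's prompt syntax documentation): each pair of parentheses multiplies the attention by 1.1, square brackets divide it by 1.1, and (keyword:1.3) sets the weight explicitly. A tiny sketch of the arithmetic, not web UI code:

```python
def emphasis(parenthesis_levels: int) -> float:
    """Attention multiplier from nesting plain parentheses around a prompt term."""
    return round(1.1 ** parenthesis_levels, 4)

print(emphasis(1))  # (keyword)     -> 1.1
print(emphasis(2))  # ((keyword))   -> 1.21
print(emphasis(3))  # (((keyword))) -> 1.331
```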

  • 01:02:35 It is totally up to you. Do not check this box, because you want to see

  • 01:02:42 images from the same seed. Actually, since these are different checkpoints, you are not going to

  • 01:02:47 get exactly the same image anyway. By the way, about the command line argument we use here, let me show you: when

  • 01:02:58 we use --xformers, even if you use the same seed you will not get the same image, because

  • 01:03:07 its optimizations are not fully deterministic, so it will not give you exactly the same image

  • 01:03:14 even if you use the same model and the same seed. And also, there is one more thing;

  • 01:03:21 actually, there are two more options if you have low VRAM. Let me show you. So on the command

  • 01:03:29 line arguments page of the wiki, if we search for VRAM, like this, you will see there

  • 01:03:37 are --medvram (medium VRAM) and --lowvram (low VRAM). If you add one of these parameters to your command line

  • 01:03:45 arguments, like this, let me show you, okay, --medvram or --lowvram, it will allow you to run the web

  • 01:03:55 UI on a GPU with less VRAM, and with --lowvram or --medvram you can still generate images with a very

  • 01:04:04 small amount of GPU memory. However, --lowvram will not allow you to do training. So you can

  • 01:04:13 add --medvram to your command line arguments, and that will still allow you to do textual embedding,

  • 01:04:20 textual inversion training, on a GPU with less VRAM. Okay, now we are ready. I'm not going

  • 01:04:30 to test the strength, so I'm only going to test the different embedding checkpoints. Okay: draw

  • 01:04:39 legend, include separate images, keep -1 for seeds. Okay, we are ready. I'm not going to apply

  • 01:04:45 Restore faces, Tiling or the high-resolution fix, okay, so let's just click and see the results.

  • 01:04:56 Oh, by the way, to get a better idea, I am setting the batch size to eight. So in each

  • 01:05:03 generation it will produce eight images for each one of the embedding checkpoints.

  • 01:05:14 Okay, let me also show you the speed. It is going to generate 25 grids, because we have

  • 01:05:20 selected 25 checkpoints, and each one will contain eight images. Therefore, it will generate 200 images in total.

  • 01:05:27 Currently it shows the speed as 5.73 seconds per iteration. Actually, one iteration here covers

  • 01:05:37 eight images, one step for each of them, because we are generating eight images in parallel as a batch.

  • 01:05:46 Therefore, it is roughly eight times faster than generating a single image at a time.
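
As a quick sanity check on those numbers, here is the arithmetic as a small sketch (5.73 s/it is simply the figure shown on screen):

```python
checkpoints = 25              # embedding checkpoints selected for the X/Y plot
batch_size = 8                # images generated in parallel per checkpoint
seconds_per_iteration = 5.73  # one iteration = one sampling step for the whole batch

print(checkpoints * batch_size)               # 200 images in total
# Each iteration advances all eight images of the batch by one step, so the
# per-image cost of a step is roughly an eighth of the batch iteration time.
print(round(seconds_per_iteration / batch_size, 2))  # ~0.72 s per image per step
```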

  • 01:05:54 Okay, since this was going to take one hour and it's already 3 AM and I want to finish this

  • 01:06:00 video today, I am going to show the results of my previous training with exactly the same dataset

  • 01:06:07 and exactly the same settings, and you are going to get this kind of output after generating the grid

  • 01:06:14 images. It is actually, let me see, 90 megabytes. So you see, these are the different checkpoints,

  • 01:06:24 as you can see, and from these images you need to decide which one looks best. For example, I

  • 01:06:33 picked testp-2400 in this example, the checkpoint at 2400 steps, which with 10 training images means 240 epochs,

  • 01:06:47 and I have generated a lot of images from this checkpoint; actually they are the ones that I

  • 01:06:54 showed at the beginning of the video, these ones. So these were generated from the testp-2400

  • 01:07:07 step checkpoint, as you can see. Also, the name is written in each image's description. Let me

  • 01:07:12 show you one of the examples so you can see how good it is. It is a 3D rendering of the person we

  • 01:07:22 trained, and you see the quality. This is the raw output; I didn't upscale it or do anything, and

  • 01:07:28 it is just amazing. Let's upscale it and see how it looks at a bigger resolution.

  • 01:07:34 Okay, to do that, let's go to the Extras tab, and in here I will drag and drop it, one moment.

  • 01:07:45 Okay, this image, okay, and then I am going to use R-ESRGAN 4x+; I find it the best one. Actually,

  • 01:07:57 for this image you could also try the anime variant, and let's just upscale it four times.
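
If you would rather script this upscaling step than use the Extras tab, the web UI exposes an HTTP API when it is started with the --api command line argument. The endpoint and field names below are taken from the API schema served at /docs and may differ between versions, so treat this as a hedged sketch; the file names are placeholders.

```python
import base64
import requests

URL = "http://127.0.0.1:7860/sdapi/v1/extra-single-image"

with open("raw_render.png", "rb") as f:  # placeholder input image
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "image": image_b64,
    "upscaler_1": "R-ESRGAN 4x+",  # the same upscaler picked in the Extras tab
    "upscaling_resize": 4,         # upscale four times, as in the video
}
result = requests.post(URL, json=payload, timeout=600).json()

with open("upscaled.png", "wb") as f:
    f.write(base64.b64decode(result["image"]))
```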

  • 01:08:05 Okay, the upscale is done. Look at the quality: it is amazingly stylized, and these

  • 01:08:13 are the original images. You see how good it is. It is exactly the same person and

  • 01:08:19 one hundred percent stylized, as we wanted. If you asked an artist to draw this,

  • 01:08:25 I think the artist could draw it only about this well. I also didn't generate too many images,

  • 01:08:32 because I had little time; I have been doing a lot of research and experimentation to explain

  • 01:08:37 everything to you in this video with as much detail as possible. Now, how can you

  • 01:08:45 combine multiple embeddings in a single prompt? Let's say you have trained multiple people or

  • 01:08:51 multiple objects and you want to use them, or you have trained multiple styles and you want to apply

  • 01:08:57 them in the same prompt. It is just so easy: all you need to do is type their names. So

  • 01:09:08 if you add this one here, and then you also add this other one, they will both be used;

  • 01:09:16 since these two were trained on the same tokens, both of their strengths will be applied,

  • 01:09:25 both of their weights and vectors will be applied. And if they were different

  • 01:09:32 embedding files, both of them would likewise be applied. So this is how you use embeddings

  • 01:09:41 in the text-to-image tab.
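
The same applies if you drive the web UI through its API instead of the browser: combining embeddings is just a matter of putting both names in the prompt text. A hedged sketch, again assuming the --api flag is enabled; the embedding names here are hypothetical.

```python
import base64
import requests

URL = "http://127.0.0.1:7860/sdapi/v1/txt2img"

payload = {
    # Hypothetical embedding names: use the file names (without .pt) that sit
    # in your embeddings folder; the web UI expands each into its trained vectors.
    "prompt": "portrait photo of tutorial_test-2400 in the style of my_style_embedding",
    "negative_prompt": "blurry, low quality",
    "steps": 30,
    "cfg_scale": 7,
    "width": 512,
    "height": 512,
}
result = requests.post(URL, json=payload, timeout=600).json()

with open("combined_embeddings.png", "wb") as f:
    f.write(base64.b64decode(result["images"][0]))
```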

  • 01:09:50 Hopefully, I plan to work on an experiment on teaching a style and an object and make another video about that, but the principles are the same. It may just require you to

  • 01:09:57 prepare a good training dataset. You see, this training dataset is not even good: the

  • 01:10:04 images are blurry and not high quality, and the lighting is not very good. As you can see,

  • 01:10:10 this is actually a blurry image, and this is also a blurry image, and you will get the link to this

  • 01:10:16 dataset to look at on your computer as well. However, even though these images are not very good, the results

  • 01:10:24 are just amazing. As you can see, textual embeddings are very strong for teaching faces as well,

  • 01:10:31 and you can do the training on the official pruned model, or you can do the training

  • 01:10:38 on Protogen, a very good custom model, or on SD 2.1. And one advantage of

  • 01:10:47 textual inversion over DreamBooth is that, for example, I did DreamBooth training on Protogen and

  • 01:10:53 it was a failure; however, it was a great success with textual inversion. By the way,

  • 01:11:00 the grid images from an X/Y plot generation will be saved under the outputs folder, inside the text-to-image grids folder, like this,

  • 01:11:07 and the regular outputs are saved in the text-to-image images folder, like this.
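
In a default install those two locations are outputs/txt2img-grids and outputs/txt2img-images inside the web UI folder (both are configurable in the settings). A tiny sketch that lists the newest files in each, with the install path as a placeholder:

```python
from pathlib import Path

WEBUI_DIR = Path(r"C:\stable-diffusion-webui")  # placeholder install path

# Default output locations; images are usually grouped into dated subfolders.
for folder in (WEBUI_DIR / "outputs" / "txt2img-grids",
               WEBUI_DIR / "outputs" / "txt2img-images"):
    newest = sorted(folder.rglob("*.png"), key=lambda p: p.stat().st_mtime)[-3:]
    print(folder.name, "->", [p.name for p in newest])
```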

  • 01:11:14 And this is all for today. I hope you have enjoyed it. I have worked a lot on preparing

  • 01:11:24 this tutorial; I have read a lot of technical documents and done a lot of research

  • 01:11:30 and experimentation. Please subscribe, and if you join and support us, I appreciate it. Like the

  • 01:11:38 video, share it, and if you have any questions, just join our Discord channel. To do that, go to

  • 01:11:44 our About tab and there you will see the official Discord channel; just click it. And if you support

  • 01:11:49 us on Patreon, I would appreciate that very much. So far we have 10 patrons and I thank them a

  • 01:11:57 lot; they keep me motivated to prepare more and better videos. Hopefully, see you in another video.
