RTX 3090 Ti vs RTX 3060 Ultimate Showdown for Stable Diffusion, ML, AI & Video Rendering Performance
Full tutorial link > https://www.youtube.com/watch?v=lgP1LNnaUaQ
Discord: https://bit.ly/SECoursesDiscord. RTX 3090 review along with RTX 3060. Stable Diffusion, OpenAI's Whisper, DaVinci Resolve and FFmpeg benchmarked. If I have been of assistance to you and you would like to show your support for my work, please consider becoming a patron on 🥰 https://www.patreon.com/SECourses
Summary And Conclusions PDF
https://www.patreon.com/posts/81374648
Playlist of Stable Diffusion Tutorials, Automatic1111 and Google Colab Guides, DreamBooth, Textual Inversion / Embedding, LoRA, AI Upscaling, Pix2Pix, Img2Img
https://www.youtube.com/playlist?list=PL_pbwdIyffsmclLl0O144nQRnezKlNdx3
Technology & Science: News, Tips, Tutorials, Tricks, Best Applications, Guides, Reviews
https://www.youtube.com/playlist?list=PL_pbwdIyffsnkay6X91BWb9rrfLATUMr3
The GitHub gist file shown in the video
https://gist.github.com/FurkanGozukara/63e1ef499b5f0c16882a5a202c37f33f
Whisper tutorial
How to install Python and Automatic1111 Web UI tutorial
Whisper github
https://github.com/openai/whisper
Davinci Resolve tutorial
Best DreamBooth training settings
How to install Torch 2 for Stable Diffusion Automatic 1111 Web UI
00:00:00 Box opening of Gainward #RTX3090 Ti and Cougar #GEX1050
00:00:51 Installation of RTX 3090 GPU and Cougar GEX1050 PSU into the computer case
00:05:03 Final view of the installed case
00:05:23 CPU, Ram and other hardware overview of the used PC
00:06:34 Explanation of the gist file used in this video
00:07:04 How to install latest Nvidia GeForce driver
00:07:32 What is the difference between Nvidia Game Ready drivers and Studio drivers?
00:08:43 OpenAI Whisper speech to text transcription benchmarks
00:09:52 How to verify installed and used PyTorch, CUDA and cuDNN versions via my custom script
00:10:30 How to update Whisper to latest version
00:10:53 Testing command used for Whisper
00:11:20 Demo of Whisper transcription benchmarks
00:12:32 How to install Torch version 2 on main Python installation
00:13:13 How to install the latest cuDNN DLL files
00:14:24 Benchmark results of all Whisper tests
00:17:00 RTX 3090 and #RTX3060 transcribing speech at the same time
00:18:01 4K Video rendering tests in Davinci Resolve
00:19:10 How to change rendering GPU in Davinci Resolve
00:19:35 Rendering results of Davinci Resolve benchmarks
00:20:22 Bug in Davinci Resolve, RTX 3060 is not used
00:23:00 Where to download FFmpeg with hardware acceleration - CUDA and GPU support
00:24:00 How to set default FFmpeg via environment variables path
00:25:27 Testing setup of the FFmpeg 8k video rendering
00:27:19 Demonstration of the FFmpeg benchmark
00:27:58 Final results of FFmpeg benchmarks on both RTX 3060 and RTX 3090
00:29:45 Starting to benchmark Stable Diffusion via Automatic1111 Web UI
00:30:06 How to see used Torch, CUDA and cuDNN DLL version of your Web UI
00:30:38 How to update Web UI xFormers version
00:31:55 it/s (iterations per second) testing
00:32:20 Demo of testing methodologies that will be used for Stable Diffusion benchmarks
00:36:57 Starting result analysis of Stable Diffusion benchmarks Torch 1.13
00:42:48 Used DreamBooth training settings for benchmarking
00:44:00 Stable Diffusion benchmarks with Torch 2.0
00:46:48 How to make sure that Web UI uses second device in all cases
00:48:26 opt-sdp-attention benchmark results with Stable Diffusion
00:51:19 The discovery I made about optimizers used in Stable Diffusion Web UI
00:53:32 Solution for Stable Diffusion NansException: A Tensor with all NaNs in Unet
The world of artificial intelligence and machine learning is rapidly growing, and as it expands, the demand for powerful and efficient hardware is skyrocketing. Among the most critical components of this hardware are graphics cards, which play a pivotal role in the performance and capability of machine learning applications. The Nvidia RTX 3090 and RTX 3060 are two notable examples of this new generation of graphics cards, designed with machine learning in mind. This article will explore the features of these two cards, and discuss the importance of graphics cards in the field of machine learning.
The Nvidia RTX 3090 and RTX 3060
Nvidia's GeForce RTX 3090 and RTX 3060 are part of the company's Ampere architecture, which aims to provide a significant leap in performance and efficiency compared to previous generations. The RTX 3090, known as the "BFGPU" (Big Ferocious GPU), is the flagship model, boasting 24 GB of GDDR6X memory, 10,496 CUDA cores, and a memory bandwidth of 936 GB/s. This card delivers unparalleled performance, making it ideal for high-end machine learning applications, rendering, and gaming.
The RTX 3060, on the other hand, is a more budget-friendly option, but still packs a punch in terms of performance. With 12 GB of GDDR6 memory, 3,584 CUDA cores, and a memory bandwidth of 360 GB/s, the RTX 3060 provides excellent value for money, while still offering enough power to handle many machine learning tasks.
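Put side by side, the headline numbers quoted above work out like this (the values are taken directly from the two paragraphs above; the comparison itself is just illustrative):

```python
# Spec figures quoted in the text above, compared programmatically.
specs = {
    "RTX 3090": {"vram_gb": 24, "cuda_cores": 10496, "bandwidth_gb_s": 936},
    "RTX 3060": {"vram_gb": 12, "cuda_cores": 3584, "bandwidth_gb_s": 360},
}
for key in ("vram_gb", "cuda_cores", "bandwidth_gb_s"):
    ratio = specs["RTX 3090"][key] / specs["RTX 3060"][key]
    print(f"{key}: {ratio:.1f}x")  # 2.0x, 2.9x, 2.6x respectively
```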
00:00:00 Greetings everyone, I want to express my sincere gratitude for your amazing Patreon, YouTube
00:00:05 join and super thanks support.
00:00:08 Your support is tremendously important to me on this journey.
00:00:11 Thankfully, I was able to purchase the powerful 24GB Gainward RTX 3090 Ti Phantom GPU and
00:00:22 the Cougar GEX1050 80 Plus Gold power supply unit (PSU).
00:00:25 My main purpose for this purchase is to conduct machine learning, artificial intelligence
00:00:31 and video editing tasks, allowing me to carry out more research and produce high-quality
00:00:36 videos for all of you.
00:00:38 I will be showing you a sped-up unboxing and installation of the GPU and the power
00:00:43 supply unit (PSU) while explaining the context and the flow of this video.
00:00:48 The video will be divided into chapters so you can easily navigate to the parts that
00:00:54 interest you and skip the GPU and power supply unit (PSU) installation part.
00:00:58 You can find the chapters in the video description.
00:01:01 The installation took about one hour, and then I had to spend roughly two hours figuring
00:01:06 out the problem that prevented the motherboard from starting.
00:01:10 At first, I thought the new power supply unit (PSU) was problematic.
00:01:14 I even had to test my older power supply unit as well.
00:01:17 After carefully testing each hardware component, I found that one of the older hard drives
00:01:23 was preventing the motherboard from starting and passing the GPU check with the new power
00:01:28 supply unit.
00:01:30 The fear of defective new hardware, or of breaking existing hardware, was real.
00:01:35 Today, I will conduct a comprehensive review complete with benchmarks in machine learning,
00:01:40 artificial intelligence and video rendering.
00:01:42 I will test the RTX 3090's performance in text-to-image tasks using Stable Diffusion, speech-to-text
00:01:50 tasks with OpenAI's Whisper, and video rendering tasks using the free edition of DaVinci Resolve
00:01:56 for 4K resolution video rendering, plus FFmpeg video rendering to upscale 4K resolution video
00:02:03 into 8K.
00:02:04 Furthermore, I will compare the 24GB RTX 3090 to my current 12GB Gainward GeForce RTX 3060
00:02:15 Ghost Edition in each of these tests.
00:02:19 Gainward is the most affordable option available here in Turkey, which is why I chose this
00:02:24 brand.
00:02:25 I have an MSI MAG B460M Mortar motherboard that has two PCI Express revision 3 x16
00:02:34 slots, allowing me to use both the RTX 3060 and RTX 3090 simultaneously.
00:02:41 I will also demonstrate how you can run two instances of the Stable Diffusion Automatic1111
00:02:47 Web UI, one on each card.
00:02:49 First, I will show you how to install the latest Nvidia GPU driver, followed by the
00:02:55 installation of PyTorch version 2 and the latest CUDA toolkit.
00:02:59 I will also compare the performance difference between PyTorch version 1.13 and PyTorch version
00:03:05 2 for the Stable Diffusion and OpenAI Whisper tests.
00:03:09 Next, I will start testing with OpenAI's Whisper.
00:03:13 I will show you the speed of the large model transcription on each card one by one
00:03:18 and also simultaneously.
00:03:20 The large model of Whisper is extremely GPU demanding and uses a lot of VRAM.
00:03:25 After that, I will test the video encoding speed of the DaVinci Resolve free edition.
00:03:29 I will render a 4K resolution video and compare the speed difference between the RTX 3060 and
00:03:37 RTX 3090.
00:03:38 4K resolution video rendering is super GPU intensive.
00:03:42 However, since I only have the free edition of DaVinci Resolve currently, it is not fully
00:03:48 utilizing the GPU.
00:03:49 Therefore, I will also upscale a 4K resolution video into 8K with FFmpeg.
00:03:55 FFmpeg is able to utilize 100% of the GPU.
00:03:58 Then I will move on to the famous Stable Diffusion text-to-image generative AI model using the
00:04:05 latest Automatic1111 web UI.
00:04:07 I will show how to install the latest Torch and xFormers for Stable Diffusion.
00:04:12 I will test each card individually and both at the same time with two instances.
00:04:17 I will compare the iterations per second speed with the base model and single batch size.
00:04:22 Then I will test the speed of batch size 8 and the maximum image resolution that each card can
00:04:28 generate.
00:04:29 Furthermore, I will test the performance difference between scaled dot product attention, also known
00:04:34 as SDP attention, which is a Torch version 2 optimization, and the famous xFormers.
00:04:40 Moreover, I will replace the default cuDNN DLL files that come with the PyTorch installation
00:04:46 with the latest cuDNN version 8.8.1 and see if it improves performance when using
00:04:54 Stable Diffusion.
00:04:55 Finally, I will test DreamBooth training speed on both cards.
00:04:59 Additionally, I will monitor the power consumption of my computer using a wattmeter so we will
00:05:05 see exactly how much power is being consumed when each card is working individually or
00:05:11 when both are working simultaneously.
00:05:13 Let's dive in together and explore the capabilities of these powerful GPUs.
00:05:17 Thank you again for your support and I hope this review will be helpful to all of you.
00:05:23 Let's begin by looking at the hardware that I have.
00:05:26 For this task, I am going to use the TechPowerUp GPU-Z and CPU-Z applications.
00:05:33 In the GPU-Z window on the left, we are seeing the features of the RTX 3090 Ti graphics
00:05:41 card that I have.
00:05:42 In the one on the right, we are seeing the features of the RTX 3060.
00:05:47 There is one thing that I want you to pay attention to.
00:05:50 You see the second card is working at PCI Express version 3 with 4 lanes (x4).
00:05:58 Currently, it is displaying as version 1.1, but when it is under load, it displays PCI
00:06:05 Express version 3 with only 4 lanes.
00:06:08 And here, you are seeing my CPU features.
00:06:11 Intel Core i7-10700F CPU.
00:06:15 It usually works at 4.6 GHz, even when it is under load.
00:06:20 I have 64GB of RAM.
00:06:23 These are the clock speed, latency and other timings.
00:06:27 You can also see the other features in CPU-Z as well.
00:06:30 And this is my motherboard.
00:06:31 MAG B460M Mortar.
00:06:33 The download links of the applications that are used and shown in this video will be posted
00:06:39 in my gist file.
00:06:41 The link of this gist file will be in the description of the video.
00:06:45 You are seeing the currently available download links, and this file will get updated later
00:06:50 if necessary.
00:06:51 You are also seeing the PCI Express Speed.py file.
00:06:55 This is the script that I have used to get these results.
00:06:59 You will see more scripts in here.
00:07:01 I will show you during the video.
00:07:03 So when you click the raw button here, it will open the raw file, and we will begin with
00:07:09 downloading and installing the latest Nvidia driver.
00:07:12 To do that, go to this link: right click and open it.
00:07:16 In here, select your product type as GeForce, or whatever card you have.
00:07:21 Select your card series from here.
00:07:23 This is my card series.
00:07:25 Everything is correct.
00:07:26 There are two types of drivers.
00:07:28 The first one is the Game Ready driver and the second one is the Studio driver.
00:07:32 You may ask what the difference between them is.
00:07:34 The difference is that the Studio drivers are more stable.
00:07:38 However, I find that Game Ready drivers are more up to date and work as well as Studio
00:07:45 drivers for AI and machine learning tasks.
00:07:48 Therefore, I will be installing the latest Game Ready driver.
00:07:51 After selecting everything, click search and it will give you a download link.
00:07:56 Once the download has been completed,
00:07:58 right click the downloaded file.
00:08:00 Click run as administrator.
00:08:02 It will ask you to start, click yes!
00:08:05 Then it will extract the files into the selected drive.
00:08:09 You don't need to change anything.
00:08:10 Click ok. Once the extraction has been completed, it will open the Nvidia installer: agree and
00:08:16 continue.
00:08:17 You can also pick only the Nvidia graphics driver.
00:08:19 It is up to you.
00:08:21 I am not using GeForce Experience.
00:08:23 Agree and continue.
00:08:25 I am picking custom (advanced) and here I am selecting perform a clean installation.
00:08:31 This restores all the Nvidia settings to their default values.
00:08:35 This is what I prefer.
00:08:37 Then click next and it will do the installation.
00:08:40 After the installation has been completed,
00:08:41 it is better to restart your computer.
00:08:44 Now I will show you the Whisper tests of the cards.
00:08:48 The testing strategy will be: PyTorch 1.13, PyTorch version 2, and PyTorch version 2
00:08:55 plus cuDNN 8.8.1.
00:08:57 Each card one by one and both cards together, and I will also calculate their watt usage.
00:09:02 If you don't know what Whisper is and how to install and use it, I have an excellent
00:09:06 tutorial for that on my channel.
00:09:08 It is an awesome technology and an awesome public model.
00:09:11 You can use it to transcribe speech into text.
00:09:14 When I am doing nothing, the watt usage of the computer is 132 watts.
00:09:18 So first I will install the PyTorch 1.13 version.
00:09:23 I am using Python 3.10.8 as my default Python version and all of the benchmarks will be
00:09:30 made on this version.
00:09:32 If you don't know how to install Python, I have an excellent tutorial for that, too.
00:09:36 You can watch it to learn how to install Python.
00:09:39 So to install PyTorch version 1.13, I am going to execute this command.
00:09:44 It will download and install the PyTorch 1.13 version with CUDA 11.7.
00:09:50 The installation of Torch has been completed.
00:09:53 Now it is time to verify the installed versions.
00:09:55 To do that, I have written a simple script.
00:09:59 This script is also shared in the gist file.
00:10:03 Just execute the CUDA version.py script with Python and it will display the installed Torch
00:10:10 version, the CUDA version, and the cuDNN DLL files in use.
00:10:14 So you see the Torch version is 1.13 with CUDA 11.7,
00:10:22 and the version of the cuDNN DLL file used by PyTorch is 8.5.0.0.
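The gist script itself is not reproduced in this description, but a minimal sketch in the spirit of that version-check script, assuming a standard PyTorch install, could look like this:

```python
# Minimal version-check sketch: prints the Torch build, its bundled CUDA
# version, and the cuDNN version reported by the loaded libraries.
import torch

print("torch :", torch.__version__)
print("cuda  :", torch.version.cuda)
print("cudnn :", torch.backends.cudnn.version())  # integer, e.g. 8500 for cuDNN 8.5.0
print("cuda available:", torch.cuda.is_available())
```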
00:10:28 Now I will update my Whisper to the latest version.
00:10:31 The version that I am going to use is also shared in the gist file.
00:10:34 To do that, just copy this command, open cmd, paste it, and execute.
00:10:40 It will update Whisper to the latest version.
00:10:43 For testing Whisper, I am going to use the first five minutes of this video.
00:10:47 I have extracted the first five minutes as an mp3 file.
00:10:51 This is the Whisper command that I am going to use for testing purposes.
00:10:55 I am going to use the large model, version one.
00:10:58 I am also using best_of 10 and beam size 10.
00:11:01 These significantly increase the GPU VRAM usage and the GPU usage.
00:11:07 And with --device cuda I will specify which GPU it should use.
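As a rough sketch, a command of that shape can be assembled like this. The mp3 filename is a placeholder, not the exact file from the video; the flags are standard Whisper CLI options:

```python
# Hedged sketch of a benchmark command like the one described above.
audio = "first_five_minutes.mp3"  # placeholder name for the extracted clip
cmd = [
    "whisper", audio,
    "--model", "large-v1",   # large model, version one
    "--best_of", "10",       # significantly increases VRAM and GPU usage
    "--beam_size", "10",
    "--device", "cuda:1",    # second GPU, as used in the video
]
print(" ".join(cmd))
```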
00:11:13 So for testing with maximum performance I will stop the video recording.
00:11:18 But for demonstration, let me show you how it works.
00:11:22 So you see, currently my RTX 3060 graphics card is not used at all.
00:11:27 I am starting a cmd window, pasting the command and hitting enter, and it will start
00:11:34 processing the mp3 speech file.
00:11:38 We will see that GPU one, which is the RTX 3060, will start processing the mp3 file.
00:11:45 We will see the increase of the VRAM memory usage.
00:11:49 So it started processing right now.
00:11:51 You see, this is the VRAM memory usage.
00:11:54 Let's also see the values in GPU-Z.
00:11:57 In the sensors tab we can see how much VRAM is used.
00:12:02 So it is using almost all of the VRAM.
00:12:04 It is showing here.
00:12:06 The GPU load is about 90%.
00:12:08 However, the task manager is not displaying that.
00:12:12 Currently I can see the GPU load in TechPowerUp GPU-Z.
00:12:16 It is also displaying the power that the card is using.
00:12:19 The wattmeter is also displaying 392 watts at the moment.
00:12:24 So with this strategy, I will now do all of the tests while not recording video, and I
00:12:30 will show you all of the results.
00:12:33 The Torch 1.13 tests have been completed.
00:12:35 Now I will install Torch version 2 to continue testing, and after all tests have been completed,
00:12:41 I will show you all of them.
00:12:43 To install Torch version 2,
00:12:44 first we need to uninstall the previous Torch version:
00:12:47 "pip uninstall torch".
00:12:49 It will uninstall Torch version 1.13.
00:12:50 It will ask you to proceed: press y and it will uninstall.
00:12:55 Then we will execute this command.
00:12:57 This command will install Torch version 2.
00:13:00 The Torch version 2 installation has been completed.
00:13:03 Let's verify our latest CUDA version with our script.
00:13:06 So you see, now the CUDA version is 11.8.
00:13:09 The Torch version 2 tests also have been completed.
00:13:13 Now I will install the latest cuDNN files.
00:13:16 This is the link to download them.
00:13:18 This is the raw version of the gist file.
00:13:21 To download cuDNN you have to make a free account and agree to their terms, then
00:13:27 you can click this link to download it as a zip.
00:13:30 After downloading it as a zip, extract it and enter inside.
00:13:34 Inside the folder you will see the bin (binary) folder; copy the DLL files from there.
00:13:41 So this is the path of the DLL files that you need to copy. Then go inside your main
00:13:46 Python installation folder.
00:13:49 This is my Python installation folder.
00:13:51 Inside, go to the Lib folder.
00:13:53 There you will see site-packages, and inside you will see torch, and in torch
00:13:59 you will see the lib folder.
00:14:01 Inside there,
00:14:02 you will see DLL files with the same names: paste the new ones there, replacing them.
00:14:07 Now when we run our CUDA version py file, we see that the cuDNN DLL version used by PyTorch
00:14:15 is 8801, i.e. version 8.8.1.
00:14:17 So this is the newest DLL file provided by Nvidia.
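The manual copy-and-replace step above can also be scripted. This is only a sketch; the two example paths in the comment are assumptions and must be adjusted to your own cuDNN extract and Python installation:

```python
# Sketch of the DLL swap described above: copy every cuDNN DLL from the
# extracted bin folder into torch's lib folder, overwriting same-named files.
import shutil
from pathlib import Path

def replace_cudnn_dlls(cudnn_bin: Path, torch_lib: Path) -> list:
    copied = []
    for dll in sorted(cudnn_bin.glob("*.dll")):
        shutil.copy2(dll, torch_lib / dll.name)  # replaces the existing file
        copied.append(dll.name)
    return copied

# Example (hypothetical paths, adjust for your system):
# replace_cudnn_dlls(Path(r"C:\cudnn\bin"),
#                    Path(r"C:\Python310\Lib\site-packages\torch\lib"))
```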
00:14:22 All Whisper tests have been completed.
00:14:25 I didn't notice any significant difference between Torch version 2 and Torch version
00:14:30 1,
00:14:31 or with the latest cuDNN file.
00:14:33 The RTX 3090 is able to process 2.7 seconds of speech data per second, and the RTX 3060 is able to
00:14:42 process 1.78 seconds per second.
00:14:46 To exclude model loading time, I waited each time until the first 30 seconds of speech data had been
00:14:54 processed.
00:14:55 Then I started the timer.
00:14:56 So these are the pure processing times, ignoring the model loading times.
00:15:01 So this is the screenshot of the RTX 3060 when processing speech data.
00:15:08 You see, these are the values.
00:15:09 Unfortunately, the TechPowerUp application does not have a bigger window, so let me show
00:15:16 you by zooming in.
00:15:17 These are the RTX 3060 data: GPU clock, memory clock, GPU temperature.
00:15:23 I took the screenshots when each card had processed over 4 minutes and 35 seconds.
00:15:28 This is the memory used.
00:15:29 It is using all of the memory.
00:15:31 This is the GPU load, video engine load, board power draw (you see, this is the power consumption),
00:15:37 percent of TDP.
00:15:38 This is the hot spot in the GPU.
00:15:41 So these are the values of the RTX 3060 during the speech-to-text transcription using
00:15:47 OpenAI's Whisper.
00:15:49 The RTX 3060 took 2 minutes 16 seconds to process 4 minutes 30 seconds of speech data.
00:15:57 Now we are seeing the RTX 3090 data when it had processed 4 minutes 30 seconds of speech
00:16:04 data.
00:16:06 It took 1 minute 30 seconds for the RTX 3090 to process 4 minutes 30 seconds of speech data.
00:16:12 This is the GPU clock, memory clock, GPU temperature, hot spot temperature, memory temperature,
00:16:17 fan speed.
00:16:18 GPU load is 96%: very, very good.
00:16:21 It is using almost all of the GPU power.
00:16:24 Memory controller load, board power draw.
00:16:27 It is drawing 400 watts, as you can see.
00:16:30 So for the RTX 3090, when working alone with Whisper, I saw about 500 watts maximum system usage.
00:16:39 From that you can calculate the PSU you need for an RTX 3090.
00:16:43 However, this is probably not the maximum power that this card and the system can use.
00:16:48 This is for Whisper only.
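For the PSU sizing mentioned above, a common rule of thumb (my assumption, not a figure from the video) is to keep the observed peak below about 80% of the PSU's rated capacity:

```python
# Rule-of-thumb PSU sizing: rated capacity should exceed the observed peak
# with some headroom. The 80% load target is an assumption, not from the video.
def recommended_psu_watts(observed_peak_w, headroom=0.8):
    return observed_peak_w / headroom

print(recommended_psu_watts(500))  # the ~500 W Whisper peak measured above -> 625.0
```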
00:16:50 So these are the other values that the RTX 3090 used while transcribing speech into text.
00:16:57 The other Torch version and cuDNN version screenshots and results are similar.
00:17:02 I will now show both cards running.
00:17:06 Now we are seeing the screenshot of when both cards started transcribing at the same moment,
00:17:13 you see.
00:17:14 So when the RTX 3090 had processed 4 minutes 30 seconds of speech data, the RTX 3060 was only able
00:17:24 to process 2 minutes 36 seconds of speech data.
00:17:28 So these are the values of the transcription when both cards are working.
00:17:32 They are actually not much different from when each card works alone.
00:17:36 So you see, these are the data.
00:17:38 You can pause the video and look at the data if you wish.
00:17:42 So when both the RTX 3060 and RTX 3090 were doing Whisper transcribing at the same time,
00:17:50 I saw the wattmeter show 680 watts at peak.
00:17:56 However, it was using 560 watts overall, so these are the watt usages as well.
00:18:01 Now I will do the 4K video rendering with DaVinci Resolve version 18.1: since I have the
00:18:08 free edition, I am not able to fully utilize it.
00:18:11 So these are the settings that I can utilize, as you see in the left panel here.
00:18:15 Currently the RTX 3090 is selected.
00:18:18 My approach is to start rendering and see the time that it takes.
00:18:27 Let me demonstrate.
00:18:28 Then I will close video recording and recalculate the time, because video recording also takes
00:18:34 a lot of GPU power. Let me show you.
00:18:37 So you see, when I am recording video, OBS Studio and Nvidia
00:18:44 Broadcast are using a lot of GPU.
00:18:45 Therefore, I have to stop video recording and do the speed test.
00:18:49 So I am adding the project to the render queue and hitting render all.
00:18:53 In the right tab you see it is showing the remaining time.
00:18:57 I will wait a little bit, then this remaining time will become stable and I will see the
00:19:03 duration that it is going to take for each card.
00:19:06 The RTX 3090 rendering test has been completed.
00:19:10 Now I am selecting the default GPU as the RTX 3060.
00:19:15 These are the settings that I am using in DaVinci Resolve.
00:19:18 Then I click save.
00:19:20 The DaVinci Resolve results are very interesting.
00:19:23 I have probably discovered a bug in the software.
00:19:27 By the way, if you don't know how to use DaVinci Resolve, I have an excellent tutorial for
00:19:31 DaVinci Resolve: the link will be in the description.
00:19:34 So when we look at the rendering results of the RTX 3090, we see that the GPU load is
00:19:41 15% and the video engine load is 91%.
00:19:44 This screenshot was taken two minutes after the rendering started.
00:19:49 So these are the values of each card during the rendering of the 4K resolution video.
00:19:55 It is using a minimal amount of memory and not much GPU, but a lot of video engine
00:20:02 load.
00:20:03 This is the board power it uses.
00:20:05 I checked the wattmeter and it was using around 250 watts in total.
00:20:10 So these are the values.
00:20:12 So rendering the 40-minute 4K video took 68 minutes in total, and these are the
00:20:20 rendering settings for the 4K video.
00:20:22 Now, the interesting part is that when I change the selected GPU from here like this, save,
00:20:29 and restart the application, it simply does not obey our settings.
00:20:34 Why am I saying that? You see that the RTX 3090 is still being used for the video engine, which
00:20:41 means rendering, as you can see.
00:20:43 However, this time the GPU load of the RTX 3060 is not zero, so it is very weird.
00:20:50 I don't know what is actually happening.
00:20:52 The GPU load has now moved from the RTX 3090 to the RTX 3060.
00:20:59 However, the video engine load is still on the RTX 3090.
00:21:05 So these are the values of each card when the RTX 3060 is selected.
00:21:10 I am not able to select both of the cards at the same time with the DaVinci Resolve free
00:21:15 edition.
00:21:16 This is probably a bug.
00:21:18 You see, the bus interface load is only 16%.
00:21:22 Even though this card is working with PCI Express revision 3 with only four lanes (x4)
00:21:28 instead of sixteen, it is only using 16% of the x4 link.
00:21:34 The RTX 3090 is only using 3% for video rendering with DaVinci Resolve.
00:21:39 Even when the two cards were being used simultaneously for the Whisper transcribing task, we see
00:21:46 that the bus interface load was still very minimal for the RTX 3060: only 4%.
00:21:53 So in both benchmarks, the second PCI Express slot being x4 didn't have any significant
00:22:00 effect on the RTX 3060, and we were able to utilize the graphics card fully.
00:22:06 In the Whisper test the graphics cards were used at the GPU load you are seeing.
00:22:11 So before starting the FFmpeg video encoding tests, I also wanted to show what kind of clip I
00:22:18 rendered in DaVinci Resolve.
00:22:21 I didn't use any effects or other things.
00:22:23 This is the clip that has been rendered.
00:22:26 So these are the clip properties you are seeing right now, and when rendering,
00:22:32 the sound was muted.
00:22:33 However, I still exported the audio as well.
00:22:36 The recording was made with my Xiaomi phone.
00:22:39 It has an overall bit rate of 42 megabits per second, as you are seeing right now.
00:22:44 This is only a part of the clip.
00:22:46 The full render was a composition of multiple clips.
00:22:50 So it was a huge clip actually.
00:22:52 It is using chroma subsampling 4:2:0,
00:22:54 8 bits, progressive scan.
00:22:58 So these are all the attributes of the clip that was used in the DaVinci Resolve
00:23:04 rendering benchmark.
00:23:06 For the FFmpeg rendering tests we need a hardware-acceleration-enabled FFmpeg build,
00:23:14 a version with CUDA support.
00:23:16 To download it, we are going to use this link:
00:23:18 the Gyan.dev releases.
00:23:19 The Gyan.dev FFmpeg releases have hardware acceleration support, so you can download the FFmpeg 6.0 full
00:23:28 build, or whatever the latest version is when you are watching this video.
00:23:34 Now I will set the environment variable path for the newly downloaded FFmpeg.
00:23:39 I will extract it into my C drive.
00:23:41 So I am cutting the binary files from the extracted FFmpeg folder.
00:23:46 So this is the folder where I put the FFmpeg files.
00:23:50 Currently, when I open my cmd window and type ffmpeg -version, this is the version currently
00:23:57 installed.
00:23:58 Now I will change the default FFmpeg to this newest file.
00:24:03 To set the default FFmpeg,
00:24:04 we open the environment variables: type "env" in the Start menu and you will see "Edit environment
00:24:10 variables".
00:24:11 In this screen,
00:24:12 click Environment Variables; here we will add the path of FFmpeg.
00:24:17 I am going to do that for both user variables and system variables.
00:24:21 First, click edit for the user Path variable, click new, click browse, and select the folder where you
00:24:27 have installed FFmpeg.
00:24:28 So this is my latest installed one.
00:24:31 I am selecting it, then I am moving it to the very top, like this.
00:24:35 So now it is the first one.
00:24:37 Click ok, then do the same thing for the Path variable in the system variables: click edit, click
00:24:44 new, click browse, select the FFmpeg folder where you have extracted the binary files,
00:24:50 then move it to the very top.
00:24:52 Whatever is at the top of the path in the environment variables will be used first.
00:24:59 Click ok, click ok, click ok.
00:25:01 Then start a new cmd window.
00:25:04 Let's execute the same ffmpeg -version command, and now we are seeing that the latest FFmpeg version
00:25:10 is being used.
00:25:12 It is displaying all of the capabilities.
00:25:14 Let's also see the CUDA support.
00:25:16 So these are the supported acceleration methods of the latest FFmpeg I am using.
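Instead of opening a new cmd window by hand, the PATH ordering can also be sanity-checked from Python (my addition, not shown in the video):

```python
# Check which ffmpeg binary wins on PATH after the reordering above,
# then print the first line of its version banner.
import shutil
import subprocess

def first_ffmpeg_on_path():
    return shutil.which("ffmpeg")  # honors PATH order, like a fresh cmd window

path = first_ffmpeg_on_path()
if path:
    out = subprocess.run([path, "-version"], capture_output=True, text=True)
    print(out.stdout.splitlines()[0])  # first line names the build and version
else:
    print("ffmpeg not found on PATH")
```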
00:25:22 I have prepared the FFmpeg commands that we are going to use for benchmarking and put
00:25:27 them in the gist file as well.
00:25:29 When you click raw, you will get the full description.
00:25:33 Let's see it in Notepad++ so we can see it better.
00:25:37 When you execute this,
00:25:39 it will hide the first GPU so that I will be able to use the second GPU with FFmpeg.
00:25:47 This is the real command that we are going to execute for the test.
00:25:51 It is using the slow preset and variable bitrate, targeting 100 megabits per second.
00:25:58 It is also looking ahead at the next 60 frames to calculate the required bitrate,
00:26:05 max bitrate, buffer size, and other options.
00:26:08 All of the options used are listed here.
00:26:12 You can read them.
00:26:13 You can modify the command as you wish.
00:26:15 This is an extremely useful file.
00:26:18 So for the test file, I am going to use this mp4 file.
00:26:23 It has a 42 megabits per second overall bit rate.
00:26:28 These are the attributes of the file, as you are seeing right now.
00:26:32 By the way, I also need to mention that the FFmpeg command we are going to use is going
00:26:38 to double the resolution of the input video file.
00:26:42 The input video file has 4K resolution: 3840 pixels wide and 2160 pixels high.
00:26:51 We are going to upscale the video to 7680 pixels wide and 4320 pixels high.
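The exact command lives in the gist; a hedged sketch of an NVENC upscale of this shape (the rate-control values and filenames are my assumptions, not the gist's exact flags) looks like this:

```python
# Sketch of the 4K->8K NVENC upscale described above. Setting
# CUDA_VISIBLE_DEVICES=1 hides the first GPU so FFmpeg only sees the second.
import os
import shlex

os.environ["CUDA_VISIBLE_DEVICES"] = "1"

cmd = (
    "ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i input_4k.mp4 "
    "-vf scale_cuda=7680:4320 "              # 3840x2160 doubled to 7680x4320
    "-c:v hevc_nvenc -preset slow "
    "-rc vbr -b:v 100M -maxrate 120M -bufsize 200M "
    "-rc-lookahead 60 "                      # look ahead 60 frames for rate control
    "output_8k.mp4"
)
print(shlex.split(cmd))
```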
-
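Doubling both dimensions means each output frame has four times as many pixels, which is what makes this upscale demanding; the arithmetic:

```python
w, h = 3840, 2160                 # 4K input
out_w, out_h = w * 2, h * 2       # doubled in each dimension
print(out_w, out_h)               # 7680 4320
print((out_w * out_h) / (w * h))  # 4.0 times the pixels per frame
```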
00:27:01 Moreover, I have set all of the sensor readings to show average values in the GPU-Z application.
-
00:27:08 I have also opened Core Temp so when I take a screenshot during the actual benchmarking,
-
00:27:13 we will see the average values.
-
00:27:15 When I start the benchmarking, I will reset the value so we will see the average values
-
00:27:20 of the testing.
-
00:27:21 So this will be the command.
-
00:27:23 It will start encoding.
-
00:27:24 When we reset, we will get the average values of the testing.
-
00:27:29 So currently it is using 100% of the video engine of the RTX 3090 graphics card.
-
00:27:37 The CPU load is pretty low actually.
-
00:27:40 As we are seeing right now in the core temp application.
-
00:27:42 The FFmpeg video encoding results are extremely shocking to me because the RTX 3060 performed
-
00:27:51 almost the same as the 3090 with much lower system power usage.
-
00:27:56 Let's see the screenshots.
-
00:27:58 So currently we are looking at the screenshot of the 3060.
-
00:28:03 Let me show you.
-
00:28:04 These are the GPU clock, memory clock, GPU temperature, hotspot temperature, fan speed,
-
00:28:10 the memory used, GPU load.
-
00:28:11 You see the video engine load is 100%, the board power draw, and the other variables.
-
00:28:18 Bus interface load is 0%.
-
00:28:20 That means that the 4x PCI Express link didn't have any effect in this test either.
-
00:28:25 The CPU usage is also very low.
-
00:28:27 Full load was on the GPU and here you see the bitrate of the generated file.
-
00:28:33 So far processed duration of the video file.
-
00:28:36 This is the speed.
-
00:28:37 This is the file size so far generated.
-
00:28:40 So these are all the results of RTX 3060.
-
00:28:42 Meanwhile, you can see the values of RTX 3090 as well.
-
00:28:48 Video engine load is 0% as expected.
-
00:28:51 Now we are seeing the values of RTX 3090 when encoding the video with FFmpeg.
-
00:28:57 It uses 5 gigabyte VRAM memory.
-
00:29:00 GPU load is 3%, video engine load is 100%, board power draw is 128 watts.
-
00:29:07 The speed is almost the same as the RTX 3060.
-
00:29:10 This is very, very interesting if you ask me.
-
00:29:15 So it processed 67 seconds of video in 4 minutes of processing time.
-
00:29:18 CPU load was also very low for this test as well.
-
00:29:22 So these are the values of RTX 3060.
-
00:29:26 Meanwhile, RTX 3090 is processing.
-
00:29:28 RTX 3060 was not used.
-
00:29:30 When both cards started processing at the same time,
-
00:29:33 it didn't make any difference because the CPU was already able to handle all of the processing,
-
00:29:39 and these are the results when both cards are processing at the same time.
-
00:29:44 Nothing significantly different.
-
00:29:45 Now the fun part starts.
-
00:29:48 I am going to extensively test Automatic1111 web UI.
-
00:29:52 I have made two fresh installations of the web UI.
-
00:29:55 This is the first installation.
-
00:29:57 These are the only command line arguments that I have set:
-
00:30:00 xFormers and device id, currently.
-
00:30:04 Let's also see the venv folder values.
-
00:30:06 So go to the venv Scripts folder, type cmd, activate, then I will execute the CUDA version .py script in
-
00:30:13 this venv environment.
-
00:30:14 Hold the Shift key, right click, Copy as path, then type python and paste it.
-
00:30:21 This script will be executed in this venv environment.
-
00:30:25 Currently, you see this venv environment has CUDA version 11.7.
-
00:30:30 Torch version 1.13 and the xFormers version is not up to date.
-
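The CUDA version .py from the gist is not reproduced here, but a minimal script in the same spirit (my sketch, with graceful fallbacks when a package is missing) that reports what a venv is actually using could look like:

```python
def collect_versions() -> dict:
    """Report the Torch / CUDA / cuDNN / xFormers versions visible in this environment."""
    info = {}
    try:
        import torch
        info["torch"] = torch.__version__               # e.g. "1.13.1+cu117"
        info["cuda"] = torch.version.cuda               # e.g. "11.7"
        info["cudnn"] = torch.backends.cudnn.version()  # e.g. 8500
    except ImportError:
        info["torch"] = info["cuda"] = info["cudnn"] = "not installed"
    try:
        import xformers
        info["xformers"] = xformers.__version__
    except ImportError:
        info["xformers"] = "not installed"
    return info

for name, version in collect_versions().items():
    print(f"{name}: {version}")
```

Because the script reads whatever packages the active interpreter sees, running it inside each venv (as done in the video) reports that venv's own versions.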
00:30:34 I will also update it with the command I have on the gist.
-
00:30:38 Click raw.
-
00:30:39 So for updating xFormers, I am copying the command here.
-
00:30:43 It is starting from here.
-
00:30:45 Copy it, paste it, hit enter and it will install the latest xFormers version compatible with
-
00:30:52 the Torch version 1.13.
-
00:30:53 This was the default version installed with the Automatic1111 web ui.
-
00:31:00 I will do the same in the second installation of the web ui as well.
-
00:31:04 Cmd activate execute command.
-
00:31:06 Then I am starting the web ui instances with the webui-user.bat file.
-
00:31:12 The first one is going to use device id 0 which means that it will use RTX 3090.
-
00:31:19 The second web ui instance is going to use device id 1.
-
00:31:23 So this is the first instance of the web ui.
-
00:31:26 It is using these versions: Python 3.10.8, Torch 1.13, CUDA 11.7, xFormers 0.0.18 dev 494
-
00:31:38 gradio commit and checkpoint.
-
00:31:40 Second instance of the web ui also started this time.
-
00:31:43 It is using the port 7861.
-
00:31:47 So this is the second instance.
-
00:31:49 It is also using the same versions.
-
00:31:51 This one is running on RTX 3060.
-
00:31:55 So I will do numerous testing on Stable Diffusion.
-
00:31:58 The first test is the it per second.
-
00:32:00 How am I going to do that?
-
00:32:01 I will use a simple prompt such as sports car.
-
00:32:05 You see this is the default model that Stable Diffusion web ui downloads.
-
00:32:11 I am setting the sampling steps to 150 and the batch count to 100.
-
00:32:17 So that I will be able to see how many iterations per second it is doing.
-
00:32:22 Let's start generate.
-
00:32:23 We are seeing about 40 it per second for RTX 3090.
-
00:32:28 Let's also look at the TechPowerUp GPU-Z values.
-
00:32:32 By the way, currently I am recording video.
-
00:32:34 It is using a lot of GPU already.
-
00:32:37 Therefore these are not best values.
-
00:32:39 I will stop the video recording, do the test, and then I will show you the screenshots.
-
00:32:44 I am just showing you the methodology that I am going to use.
-
00:32:48 Currently RTX 3090 is using six gigabyte VRAM memory and RTX 3060 is using 2.7 gigabyte
-
00:32:57 VRAM memory.
-
00:32:58 Let's also do some testing with that as well.
-
00:33:01 So, the RTX 3060 is performing about 7 it per second.
-
00:33:07 It is almost half of the RTX 3090.
-
00:33:13 Let's also see the Bus usage.
-
00:33:15 Let's reset.
-
00:33:16 So the Bus interface load of RTX 3060 is only 2%.
-
00:33:21 That means that using PCI Express version three with only four lanes is not limiting
-
00:33:28 our GPU performance.
-
00:33:30 By the way, RTX 3060 is consuming 150 watts and RTX 3090 is consuming 325 watts.
-
00:33:38 I see 640 watts on my watt meter currently.
-
00:33:43 The watt usage is pretty high.
-
00:33:46 Also, CPU usage is not very high actually.
-
00:33:49 It is looking decent.
-
00:33:51 So this is the first test that I am going to do.
-
00:33:53 The second test methodology that I am going to do is the maximum resolution of images
-
00:33:59 that they can generate.
-
00:34:00 So for example, let's try 2048x2048 pixel generation on both cards.
-
00:34:08 Okay, I am doing the same.
-
00:34:11 So this will of course significantly increase the VRAM usage.
-
00:34:15 Let's see.
-
00:34:16 The VRAM usage of the RTX 3090 is.
-
00:34:19 Oh, looks like this failed.
-
00:34:22 Okay, looks like to be able to use this,
-
00:34:24 I need to add --no-half, so I will also do that.
-
00:34:29 When you add the --no-half command line argument, it significantly reduces your speed.
-
00:34:35 Therefore, it is not optimal to test it per second.
-
00:34:39 When I am doing it per second testing I won't use it.
-
00:34:43 When I am doing maximum resolution testing, I will use it.
-
00:34:47 And the third test that I will do will be DreamBooth training.
-
00:34:50 I will repeat all of these tests with Torch version 1.13, Torch version 2, and with the
-
00:34:58 latest cuDNN files like I have done in the Whisper testing.
-
00:35:03 So now I will start testing and I will show you the screenshots of the results.
-
00:35:08 All Stable Diffusion tests are completed but before delving into amazing results, shocking
-
00:35:15 results, and some new discoveries, I would like to show you my Stable Diffusion playlist.
-
00:35:22 So if you don't know how to use Stable Diffusion, I have an amazing playlist that is up to date.
-
00:35:27 If you don't have a good graphic card, the first two videos will teach you how to use
-
00:35:32 Stable Diffusion Dreambooth training on Google Colab for free.
-
00:35:36 Then the third video will show you how to install and use the Automatic1111 web UI.
-
00:35:42 Fourth video will also help you with that.
-
00:35:45 The fifth video is my master video for DreamBooth training.
-
00:35:48 The sixth video is my master video for Textual Inversion training.
-
00:35:53 The seventh video is the master video for LoRA training.
-
00:35:56 Then eighth video, ninth video, and other videos will help you to improve your knowledge.
-
00:36:02 As you go down, you will get more recently recorded videos.
-
00:36:05 For example, sketches into epic art will teach you how to install and use ControlNet.
-
00:36:11 The ultimate RunPod tutorial will teach you how to use Stable Diffusion Automatic1111
-
00:36:16 Web UI on the RunPod.
-
00:36:18 Then this epic Web UI DreamBooth update video will teach you a lot of settings, configuration
-
00:36:25 about DreamBooth training on Automatic1111 Web UI.
-
00:36:29 You see all of the videos I have are extremely useful.
-
00:36:33 You can learn how to teach a style and yourself into a model at the same time.
-
00:36:39 That means teaching two concepts at the same time into a model.
-
00:36:43 I also have epic animation video and also Kandinsky 2.1 version tutorial as well.
-
00:36:51 So check out the playlist.
-
00:36:53 The playlist link will be in the description and also will be in the comments as well.
-
00:36:57 So now we can start analyzing our results.
-
00:37:01 The first results I have made are based on the PyTorch 1.13 version, xFormers version
-
00:37:08 0.0.18 dev 494 and the cuDNN DLL files are 8500.
-
00:37:17 When generating images with 512x512 on the base model, RTX 3090 reached 14.61 it per
-
00:37:26 second.
-
00:37:27 Let's open the testing image.
-
00:37:30 So this is the image of CUDA version 11.7, Torch version 1.13, RTX 3090.
-
00:37:37 You see it has reached 14.61 it per second.
-
00:37:42 I have used 512x512.
-
00:37:44 This is the command line arguments I have used only xFormers device id 0.
-
00:37:50 This is the results of CUDA version .py file.
-
00:37:53 You will find this file in the gist repository.
-
00:37:56 The CPU usage is pretty low.
-
00:37:58 The GPU load is only 75%.
-
00:38:01 The GPU used 320 watts while working.
-
00:38:04 It used about 5 gigabyte VRAM memory.
-
00:38:07 These are the other values of the card when generating 512 images.
-
00:38:12 I just used sports car and I used the base model as I have shown in the beginning.
-
00:38:18 So now we are seeing the results of RTX 3060.
-
00:38:22 It has used about 4 gigabytes VRAM.
-
00:38:24 GPU load is higher than the RTX 3090.
-
00:38:28 It used almost all of it.
-
00:38:29 It used about 150 watts.
-
00:38:32 You are seeing the GPU temperature, hotspot temperature, and other values as you are seeing
-
00:38:37 right now.
-
00:38:38 So these are the results for RTX 3060.
-
00:38:42 I have written incorrect arguments here.
-
00:38:44 Actually, I have used --device-id 1 and I am also correcting it in the other tests I will
-
00:38:50 show you.
-
00:38:51 So for image generation with 512x512, RTX 3090 is over 100% faster than RTX 3060.
-
00:38:59 When we do hires fix image generation I have tested all of the resolutions.
-
00:39:07 So I have started testing from 1920 pixels and it caused a NaN error every time.
-
00:39:15 So the highest resolution at which I was able to generate images with high res fix was
-
00:39:22 1216x1216.
-
00:39:23 When we went higher, it was generating NaN images in the UNet.
-
00:39:28 So currently we are seeing the hires result of RTX 3060 with Torch version 1.13.
-
00:39:36 It was taking 18.28 seconds per it when generating the high resolution image.
-
00:39:45 You know first it generates the simple image, then it generates the hires resolution of
-
00:39:49 it.
-
00:39:50 This is the speed of the RTX 3060.
-
00:39:53 It was using 7 gigabyte of VRAM memory.
-
00:39:55 It was using 100% GPU load, drawing 160 watts from the PSU.
-
00:40:03 Still Bus interface load was very low.
-
00:40:05 So PCI Express revision 3 with only 4 lanes was not a problem at all.
-
00:40:11 So these are the values of the RTX 3060 when generating high resolution and hires fix image.
-
00:40:19 When we look at the results of RTX 3090, it was much faster than RTX 3060.
-
00:40:28 The card shined when generating high resolution images.
-
00:40:32 So this time RTX 3090 was able to utilize 100% of the GPU.
-
00:40:38 It was using 8 gigabyte VRAM memory.
-
00:40:41 It was drawing 441 watts from the PSU as you are seeing right now.
-
00:40:46 It is a beast.
-
00:40:47 The CPU usage also was not very high, as you are seeing right now.
-
00:40:51 The load of the CPU is pretty low.
-
00:40:53 So most of the hard work is done by the GPU, not the CPU when generating high resolution
-
00:40:59 images.
-
00:41:00 So these are the other values of the GPU.
-
00:41:03 You can just pause the video and look at them.
-
00:41:05 When both of the cards were generating high resolution images, I have seen that the watt meter
-
00:41:12 was reading 750 watts.
-
00:41:16 So if you want to use two cards at the same time, you should get a good PSU of minimum 800,
-
00:41:22 perhaps 850 watts.
-
00:41:26 If you use only the RTX 3090, perhaps you may get away with a good 650 watt PSU.
-
00:41:33 So for high resolution image generation, the RTX 3090 was almost 200% faster.
-
00:41:42 You are seeing the difference right now.
-
00:41:44 The lower value is better because this is seconds per it, while the above is it per second.
-
00:41:51 Just the opposite.
-
00:41:52 So one iteration took 6.21 seconds for RTX 3090 to process.
-
00:41:58 It took 18.28 seconds per it to process for RTX 3060.
-
00:42:05 So for DreamBooth training, I used the same settings and RTX 3090 performed 3.03 it per
-
00:42:13 second and RTX 3060 performed 2.20 it per second.
-
00:42:19 So this is the image of RTX 3090 when doing DreamBooth training.
-
00:42:24 It was only utilizing 60% of the GPU.
-
00:42:27 It was drawing only 240 watts.
-
00:42:31 So these are the settings used for DreamBooth training.
-
00:42:34 It was performing 3.03 it per second.
-
00:42:38 These are the command line arguments: xFormers device id 0.
-
00:42:43 This is the CPU usage.
-
00:42:44 This is the used CUDA version, Torch version, and the dll file version.
-
00:42:48 I will also quickly show you the DreamBooth settings.
-
00:42:51 These are not important.
-
00:42:53 The important part is I didn't use gradient checkpointing.
-
00:42:56 Set gradients to none when zeroing was checked.
-
00:42:58 This is the learning rate, but this is not also important.
-
00:43:01 Max resolution is 512.
-
00:43:03 This is important.
-
00:43:04 I didn't use EMA because RTX 3060 is not able to do training with EMA.
-
00:43:11 As optimizer I have used 8bit AdamW because with other optimizers, RTX 3060 is not able
-
00:43:18 to do training.
-
00:43:19 They give an out of memory error. Mixed precision is bf16, and xFormers is enabled.
-
00:43:25 Cache latents is enabled.
-
00:43:26 Train UNET is enabled.
-
00:43:28 This is the step rate of text encoder training, so these are the other values.
-
00:43:33 I didn't use offset noise.
-
00:43:35 I also have used classification images, so the bucket size was double the training
-
00:43:40 data set.
-
00:43:41 Actually, the training data set size does not affect the speed a lot, and these are the saving settings.
-
00:43:46 But these are not the best settings.
-
00:43:48 For best settings check out my channel.
-
00:43:51 Actually for best settings for DreamBooth, check out this video.
-
00:43:54 This is the best video that you will find for best settings of DreamBooth training.
-
00:43:58 Okay then next I have installed PyTorch version 2 with xFormers 0.0.18 and cuDNN file 8700.
-
00:44:08 I have followed the same steps that I have shown in the beginning.
-
00:44:13 Also, by watching this video you can learn more about how to install these versions in
-
00:44:20 more details along with the DreamBooth installation.
-
00:44:23 This is the video how to install new DreamBooth and Torch 2 on Automatic1111 web UI.
-
00:44:28 Then we see some difference.
-
00:44:30 The it per second rose from 14.61 to 15.83 for RTX 3090.
-
00:44:37 However, for RTX 3060, it is reduced from 7.07 to 6.80.
-
00:44:48 So therefore RTX 3090 gained over 8% in speed and RTX 3060 lost 4%.
-
00:44:59 When doing high res fix image generation, RTX 3090 sped up from 6.21 seconds per it to 6.04
-
00:45:12 seconds per it, which means around 3% speed increase.
-
00:45:17 For RTX 3060 the speed gain is more dramatic.
-
00:45:22 You see when generating the regular 512x512 image the RTX 3060 lost speed.
-
00:45:29 However when generating high resolution image with high res fix, it gained speed.
-
00:45:35 It increased second per it from 18.28 to 17.37 seconds per it, which is equal to around 5%
-
00:45:47 speed gain.
-
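The percentage figures above can be checked with a couple of lines of Python; remember that it per second is better when higher, while seconds per it is better when lower:

```python
def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0

# 512x512 generation, it per second (higher is better)
print(round(pct_change(14.61, 15.83), 1))  # RTX 3090: 8.4 -> over 8% faster
print(round(pct_change(7.07, 6.80), 1))    # RTX 3060: -3.8 -> about 4% slower

# High res fix, seconds per it (lower is better, so a negative change is a gain)
print(round(pct_change(6.21, 6.04), 1))    # RTX 3090: -2.7 -> around 3% faster
print(round(pct_change(18.28, 17.37), 1))  # RTX 3060: -5.0 -> around 5% faster
```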
00:45:48 So depending on your purpose, you can use either PyTorch version 2 or PyTorch version
-
00:45:54 1.13 with your RTX 3060.
-
00:45:57 But for RTX 3090, always use PyTorch version 2 with the latest version of xFormers.
-
00:46:05 For DreamBooth there is also a speed gain:
-
00:46:08 from 3.03 it per second to 3.24 it per second on RTX 3090.
-
00:46:15 Also with RTX 3060 from 2.20 it per second to 2.27 it per second.
-
00:46:22 So right now we are seeing the screenshot of RTX 3060.
-
00:46:26 When generating 512 images, it is using about 95% of GPU load.
-
00:46:32 Now it is using a little bit more board power.
-
00:46:36 These are the other values.
-
00:46:37 You see.
-
00:46:38 The CUDA version is 11.7.
-
00:46:40 Below, you are also seeing the displayed values, and this is the it per second
-
00:46:46 value.
-
00:46:47 To be able to use the second card with DreamBooth, I have to change the webui-user.bat file.
-
00:46:55 You have to add set CUDA_VISIBLE_DEVICES=1.
-
00:47:00 Otherwise, when doing DreamBooth training, it is using the device zero.
-
00:47:04 So if you have two devices and you want to do DreamBooth training on the second device,
-
00:47:09 you need to add this.
-
00:47:10 This ensures that the second device is used and the first device is not used and as a
-
00:47:17 command line argument I am just using xFormers.
-
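The set CUDA_VISIBLE_DEVICES=1 line goes into webui-user.bat because CUDA only enumerates the devices listed in that variable, and it reads it when it is first initialized. A small Python sketch of the same idea (the device index is just an example):

```python
import os

# Must be set before CUDA is initialized, i.e. before importing torch --
# which is why the video puts it in webui-user.bat instead of changing it
# later at runtime.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # expose only the second physical GPU

# From here on, the second physical GPU appears to the process as cuda:0 and
# the first GPU is invisible:
#   import torch
#   torch.cuda.device_count()  # would report 1 on a two-GPU machine

print(os.environ["CUDA_VISIBLE_DEVICES"])
```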
00:47:20 So this is the screenshot of the RTX 3090 when generating 512 images.
-
00:47:26 This time it is using about 80% of the GPU, higher than before, and these are the other values: board
-
00:47:33 power draw and the other values you are seeing.
-
00:47:36 This time the CUDA version was 11.7.
-
00:47:39 The Torch version is shown here; this is the it per second.
-
00:47:42 These are the other versions used and this is the command line arguments used.
-
00:47:47 So these are the high res values for Torch Version 2 for RTX 3090.
-
00:47:52 It is using GPU 100%.
-
00:47:55 Memory controller load is also pretty high.
-
00:47:59 It is drawing 437 watts from the PSU.
-
00:48:04 Using about 9 gigabyte VRAM memory and these are the other values of the card while generating
-
00:48:10 high res image.
-
00:48:12 It is taking 6.04 seconds per it.
-
00:48:16 This is the resolution and high res fix is enabled and these are the command line arguments
-
00:48:21 used and these are the versions.
-
00:48:24 So these are the DreamBooth training values.
-
00:48:26 And then I have tested --opt-sdp-attention. You may wonder what this is.
-
00:48:31 This is an optimization that is supposed to replace xFormers with Torch version 2.
-
00:48:41 You see currently I am displaying the wiki of the Automatic1111 web UI.
-
00:48:46 It says that --opt-sdp-attention gives faster speeds than using xFormers.
-
00:48:53 Only available for users who manually install Torch version 2 and also this is non-deterministic.
-
00:49:00 There is also a deterministic version, but it is a little bit slower.
-
00:49:04 There is also xFormers. You are seeing --force-enable-xformers, and opt split attention, a cross
-
00:49:11 attention layer optimization significantly reducing memory usage for almost no cost.
-
00:49:16 Some report improved performance with it.
-
00:49:18 This is on by default, so you don't need to add this.
-
00:49:22 This is by default on.
-
00:49:24 Also, the one below disables that optimization, and there are some other optimization parameters.
-
00:49:30 As you are seeing right now.
-
00:49:32 --medvram --lowvram.
-
00:49:34 So this is the Torch version 2 optimization: when using this, I didn't use xFormers.
-
00:49:41 This increased the speed of the image generation for RTX 3090.
-
00:49:48 When generating 512 images you see, it jumped from 15.83 it per second to 16.39 it per second.
-
00:50:01 It didn't change the image generation speed for RTX 3060.
-
00:50:05 So if you are going to generate images with RTX 3090 or newer cards, you can use --opt-sdp-attention:
-
00:50:12 however, there is one very big issue.
-
00:50:18 When generating a high resolution image with high res fix.
-
00:50:23 There was a very weird bug.
-
00:50:25 It was generating image and after the last step when composing the image, it produced
-
00:50:31 memory error.
-
00:50:32 Now let me show you some screenshots.
-
00:50:35 Currently we are seeing the test results of opt SDP Attention device id 0 RTX 3090 so
-
00:50:43 these are the values: power draw is around 350 watts.
-
00:50:48 It is able to generate images with 16.34 it per second.
-
00:50:52 These are the used CUDA version, Torch version and the DLL file version.
-
00:50:57 You are seeing xFormers is N/A, not available, and the Torch version.
-
00:51:01 And now we are seeing the results of RTX 3060.
-
00:51:05 When opt SDP Attention is enabled, it is able to reach 6.80 it per second.
-
00:51:13 It is using 96% of the GPU.
-
00:51:16 CPU usage is never high, so the CPU was never the limiting factor when I am doing experiments.
-
00:51:23 Now the very interesting discovery that I have made.
-
00:51:27 When opt SDP Attention is enabled.
-
00:51:30 I will show you a very interesting thing.
-
00:51:33 So you see it was using a very minimal amount of VRAM, around six gigabytes, during the high
-
00:51:40 res fix image generation with RTX 3090.
-
00:51:45 However, after the steps are done and when composing the final image, you see the memory
-
00:51:51 usage is just skyrocketing and then we are getting an out of memory error.
-
00:51:57 So the message is: in sdp attention block forward, we got a Torch CUDA out of memory error.
-
00:52:04 CUDA out of memory.
-
00:52:06 Tried to allocate 15.91 gigabyte on the GPU-0 and it exceeded the total capacity.
-
00:52:14 So with this attention we were not able to generate high res fix image.
-
00:52:20 So you need to use xFormers if you want to generate high resolution images and also get
-
00:52:26 speed up.
-
00:52:27 So in the DreamBooth also, opt SDP Attention reduced the it per second speed.
-
00:52:33 It reduced from 3.24 to 2.47 it per second for RTX 3090, this is a very dramatic speed
-
00:52:42 decrease actually.
-
00:52:43 It lost around 24% speed by just using opt SDP Attention instead of using xFormers.
-
00:52:52 The same speed decrease also exists in RTX 3060.
-
00:52:56 Then I did tests with the cuDNN 8801 DLL file and there wasn't any significant speed difference
-
00:53:05 compared to the cuDNN 8700 DLL file.
-
00:53:10 While thinking about results, a new discovery came to my mind.
-
00:53:15 If you don't use any optimizations such as xFormers or opt SDP Attention, then you can
-
00:53:23 generate very high resolution images.
-
00:53:25 For example, if you try to generate over 1448x1448 pixels when xFormers is enabled, you will
-
00:53:35 get a NansException:
-
00:53:36 A tensor with all NaNs was produced in Unet.
-
00:53:41 This could be either because there is not enough precision to represent the picture or
-
00:53:45 because your video card does not support half type.
-
00:53:48 However, this is not true.
-
00:53:50 Even if you enable float32 via --no-half, you still won't be able to pass this resolution.
-
00:53:58 However, if you don't use xFormers or opt SDP Attention, then both of the cards, RTX
-
00:54:06 3090 and RTX 3060, are able to generate a 2048x2048 pixel resolution image.
-
00:54:15 This also applies to high res fix as well.
-
00:54:19 So if you are getting the NaNs exception error, let me demonstrate: with 1456 pixel
-
00:54:27 resolution, even on RTX 3090, I am getting this NaNs exception error.
-
00:54:34 However, when no xFormers or other optimizations are used, both of the cards are perfectly
-
00:54:40 able to generate at 2048x2048 resolution and they will also be able
-
00:54:48 to generate even higher with high res fix as well.
-
00:54:52 Let's also see the values.
-
00:54:54 So you see RTX 3060 is using about 11 gigabytes of VRAM memory for this resolution with 100%
-
00:55:03 GPU load, drawing 150 watts.
-
00:55:07 These are also the other values as you are seeing right now.
-
00:55:10 RTX 3090, on the other hand is using 18 gigabyte VRAM memory for the same resolution with 100%
-
00:55:18 GPU load.
-
00:55:19 So with higher resolution we are able to hit 100% with RTX 3090.
-
00:55:25 Memory controller load is also 100% and it is consuming 450 watts from our power
-
00:55:33 supply unit.
-
00:55:34 So this is a new discovery that I have made.
-
00:55:36 This is all for today.
-
00:55:38 If you join, like, subscribe, and leave a comment, I would appreciate that very much.
-
00:55:43 Your joins and your Patreon support are extremely important.
-
00:55:46 You can join our channel from here.
-
00:55:49 Also in the video description and in the comments you will see our Discord link.
-
00:55:53 You will also see our Patreon page link like this.
-
00:55:57 Also in the comments you will see it.
-
00:55:58 If you are looking for the summary of the video,
-
00:56:02 hopefully I will post it on our Patreon page.
-
00:56:06 Let me show you.
-
00:56:07 The summary and the discoveries of this video.
-
00:56:10 All of these benchmarks will be posted on our Patreon page.
-
00:56:14 I am also posting some other useful stuff on our Patreon page.
-
00:56:18 Of course, you don't have to support me on Patreon.
-
00:56:23 But if you support me, I would appreciate that very much.
-
00:56:26 Your support is tremendously important for me.
-
00:56:29 I am spending so much time to make these videos.
-
00:56:32 For example, this training video took about three days of full working time and my views
-
00:56:37 are not that great.
-
00:56:39 So your support is tremendously important for me.
-
00:56:42 I hope that you consider supporting, and from the Patreon you will be able to read the summary
-
00:56:45 and the discoveries of these benchmarks.
-
00:56:46 Hopefully see you later in another awesome video.
-
00:56:47 Thank you very much.
