Deep Dive: Optimizing LLM inference

Open-source LLMs are great for conversational applications, but they can be difficult to scale in production, and their latency and throughput may fall short of your cost-performance objectives. In this video, we zoom in on optimizing LLM inference and study key mechanisms that help reduce latency and increase throughput: the KV cache, continuous batching, and speculative decoding, including the state-of-the-art Medusa approach.

Slides: https://fr.slideshare.net/slideshow/j...

⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. Follow me on Medium at / julsimon or Substack at https://julsimon.substack.com. ⭐️⭐️⭐️

00:00 Introduction
01:15 Decoder-only inference
06:05 The KV cache
11:15 Continuous batching
16:17 Speculative decoding
25:28 Speculative decoding: small off-the-shelf model
26:40 Speculative decoding: n-grams
30:25 Speculative decoding: Medusa
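Of the mechanisms named in the description, speculative decoding with a small draft model is probably the easiest to illustrate in a few lines. The sketch below is not taken from the video or its slides. It is a minimal, greedy-only illustration under stated assumptions: a "model" is reduced to a hypothetical function that maps a token list to its next token, `toy_target` and `toy_draft` are made-up stand-ins, and the verification step loops over candidates where a real system would score them all in one batched forward pass of the large model.

```python
from typing import Callable, List

# Assumption: a "model" is any function returning the greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one call to the large model per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the draft proposes k tokens, the target verifies them."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    while len(tokens) < goal:
        # 1. The cheap draft model proposes k candidate tokens.
        candidates: List[int] = []
        for _ in range(k):
            candidates.append(draft(tokens + candidates))

        # 2. The target checks each candidate in order (batched in a real system).
        accepted: List[int] = []
        for cand in candidates:
            expected = target(tokens + accepted)
            if cand == expected:
                accepted.append(cand)
            else:
                # First mismatch: keep the target's token and drop the rest.
                accepted.append(expected)
                break
        else:
            # Every candidate matched: the verification pass yields one extra token.
            accepted.append(target(tokens + accepted))

        tokens.extend(accepted)
    return tokens[:goal]


# Hypothetical toy models, for illustration only.
def toy_target(tokens: List[int]) -> int:
    return (tokens[-1] * 3 + 1) % 7


def toy_draft(tokens: List[int]) -> int:
    # Agrees with the target except after token 5, where it guesses wrong.
    return 0 if tokens[-1] == 5 else toy_target(tokens)


baseline = greedy_decode(toy_target, [1, 2], 10)
speculative = speculative_decode(toy_target, toy_draft, [1, 2], 10)
```

Because every token that is kept is either confirmed or supplied by the target, the greedy variant sketched here produces the same sequence as plain greedy decoding. The speed-up in practice depends on how often the draft's guesses are accepted.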
