 
Listen to video experts and engineers speak about all things video. From UGC to OTT to Broadcast, we discuss the approaches and algorithms they use to deliver the ultimate video experience, spanning capture, encoding, processing, distribution, streaming, and playback.
 
The Video Insiders (Monthly)
 
Join The Video Insiders hosted by Mark Donnigan and Dror Gill as they wrestle with the hottest topics on the minds of streaming video professionals. Nothing is off limits - video compression, codecs, encoding, transcoding, workflows, technology trends and business models - The Video Insiders and their guests cover it all.
 
Don Melton worked at Netscape on Mozilla and at Apple on WebKit and Safari. Now he's a recovering programmer working on video encoding and whatever else he feels like. These are his stories.
 
Keeping you up to date with the latest trends and best-performing architectures in this fast-evolving field of computer science. Selecting papers by comparative results, citations, and influence, we educate you on the latest research. Consider supporting us on Patreon.com/PapersRead for feedback and ideas.
 
Intel Chip Chat, by Intel Corporation (Monthly)
 
Intel Chip Chat is a recurring podcast series of informal interviews with some of the brightest minds in the industry, striving to bring listeners closer to the innovations and inspirations of the people shaping the future of computing, and in the process share a little bit about the technologists themselves.
 
A daily update on the latest AI research papers. We provide a high-level overview of a handful of papers each day and link all papers in the description for further reading. This podcast is created entirely with AI by PocketPod. Head over to https://pocketpod.app to learn more.
 
 
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
Med42-v2: A Suite of Clinical LLMs
Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers
ControlNeXt: Powerful and Efficient Control for Image and Video Generation
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
FruitNeRF: A Unified Neural Radiance Fiel…
 
While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this…
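The tokenizer idea in the blurb above can be sketched in a few lines: a VQ-style visual tokenizer maps each patch feature vector to the index of its nearest codebook entry, producing discrete tokens an LLM can model. This is an illustrative toy under invented assumptions (the patch vectors, codebook, and function names are made up for the example), not the paper's actual model.

```python
# Toy VQ-style visual tokenizer: each patch vector is assigned the index
# of its nearest codebook entry (squared Euclidean distance).

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def tokenize(patches, codebook):
    """Map each patch (a feature vector) to its nearest codebook index."""
    return [min(range(len(codebook)), key=lambda i: sq_dist(p, codebook[i]))
            for p in patches]

def detokenize(tokens, codebook):
    """Invert tokenization by looking up codebook vectors."""
    return [codebook[t] for t in tokens]

# Toy 2-D codebook with three entries and three patch vectors.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
patches = [[0.1, -0.1], [0.9, 0.2], [0.1, 1.1]]
tokens = tokenize(patches, codebook)
print(tokens)  # → [0, 1, 2]
```

A real tokenizer learns the codebook jointly with an encoder and decoder; the lookup step shown here is the part that turns pixels into LLM-friendly discrete symbols.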
 
We introduce AnyTool, a large language model agent designed to revolutionize the utilization of a vast array of tools in addressing user queries. We utilize over 16,000 APIs from Rapid API, operating under the assumption that a subset of these APIs could potentially resolve the queries. AnyTool primarily incorporates three elements: an API retrieve…
 
Large Language Models (LLMs) employ auto-regressive decoding that requires sequential computation, with each step reliant on the previous one's output. This creates a bottleneck as each step necessitates moving the full model parameters from High-Bandwidth Memory (HBM) to the accelerator's cache. While methods such as speculative decoding have been…
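The sequential dependence described above is easy to see in code. In this toy sketch (the `toy_model` stand-in is invented for illustration), every decoding step consumes all tokens produced so far, so steps cannot run in parallel and a real model's full parameters must be streamed from HBM on each iteration.

```python
# Illustrative auto-regressive decoding loop: each step depends on the
# previous step's output, which is the bottleneck the abstract describes.

def toy_model(context):
    # Stand-in for a full forward pass over the model's parameters.
    return sum(context) % 7

def generate(prompt, steps):
    tokens = list(prompt)
    for _ in range(steps):        # strictly sequential loop
        nxt = toy_model(tokens)   # depends on ALL previous output
        tokens.append(nxt)
    return tokens

print(generate([1, 2], 3))  # → [1, 2, 3, 6, 5]
```

Speculative decoding attacks exactly this loop by letting a cheap draft model propose several tokens that the large model then verifies in one pass.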
 
Large decoder-only language models (LLMs) are the state-of-the-art models on most of today's NLP tasks and benchmarks. Yet, the community is only slowly adopting these models for text embedding tasks, which require rich contextualized representations. In this work, we introduce LLM2Vec, a simple unsupervised approach that can transform any decoder-…
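One building block that decoder-to-embedder approaches of this kind typically rely on is pooling per-token hidden states into a single text embedding. The sketch below shows only that ingredient with invented toy vectors; LLM2Vec itself involves more than pooling, and the truncated abstract stops before the details.

```python
# Mean pooling: average the per-token hidden-state vectors of a sequence
# into one fixed-size embedding. The vectors here are hand-made toys; a
# real system would obtain them from a model's forward pass.

def mean_pool(hidden_states):
    """Average per-token vectors into one fixed-size embedding."""
    dim = len(hidden_states[0])
    n = len(hidden_states)
    return [sum(h[d] for h in hidden_states) / n for d in range(dim)]

states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(mean_pool(states))  # → [0.6666666666666666, 0.6666666666666666]
```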
 
MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
LLaVA-OneVision: Easy Visual Task Transfer
An Object is Worth 64x64 Pixels: Generating 3D Object via Image Diffusion
MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine
IPAdapter-Instruct: Resolving Ambiguity in Image-based Co…
 
Large-scale pretrained transformers have created milestones in text (GPT-3) and text-to-image (DALL-E and CogView) generation. Their application to video generation still faces many challenges: the potentially huge computation cost makes training from scratch unaffordable, and the scarcity and weak relevance of text-video datasets hinder the model …
 
Information seeking and integration is a complex cognitive task that consumes enormous time and effort. Inspired by the remarkable progress of Large Language Models, recent works attempt to solve this task by combining LLMs and search engines. However, these methods still obtain unsatisfying performance due to three challenges: (1) complex requests…
 
SAM 2: Segment Anything in Images and Videos
Gemma 2: Improving Open Language Models at a Practical Size
Coarse Correspondence Elicit 3D Spacetime Understanding in Multimodal Language Model
Improving Text Embeddings for Smaller Language Models Using Contrastive Fine-tuning
OmniParser for Pure Vision Based GUI Agent
SF3D: Stable Fast 3D Mesh Reconstructi…
 
In this episode, Romain Bouqueau, CEO and Founder of Motion Spell gives us a deep look into the contributions of the open-source community in the world of video streaming. Romain also shares his insights into how open-source works, how GPAC/Motion Spell has remained ahead of the curve with its focus on R&D, and how open-source and commercial entiti…
 
Diffusion models have achieved great progress in image animation due to powerful generative capabilities. However, maintaining spatio-temporal consistency with detailed information from the input static image over time (e.g., style, background, and object of the input static image) and ensuring smoothness in animated video narratives guided by text…
 
FinanceBench is a first-of-its-kind test suite for evaluating the performance of LLMs on open book financial question answering (QA). It comprises 10,231 questions about publicly traded companies, with corresponding answers and evidence strings. The questions in FinanceBench are ecologically valid and cover a diverse set of scenarios. They are inte…
 
Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model
LAMBDA: A Large Model Based Data Agent
AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents
BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth Estimation
Very Large-Scale Multi-Agent Simulation in AgentScope
Data Mixture Inference: What do BPE Tok…
 
Tune in to hear Flavio Ribeiro, Sr. Engineering Manager of Netflix’s Live Streaming Technologies, discuss all things video streaming. Starting in the streets of Campina Grande, Flavio shares his journey from contributing to the recreation of Brazil’s digital television system and working on Globo’s live streaming platform for the 2014 FIFA World Cu…
 
Current hair transfer methods struggle to handle diverse and intricate hairstyles, thus limiting their applicability in real-world scenarios. In this paper, we propose a novel diffusion-based hair transfer framework, named Stable-Hair, which robustly transfers a wide range of real-world hairstyles onto user-provided faces for virtual hair …
 
Data science and engineering workflows often span multiple stages, from warehousing to orchestration, using tools like BigQuery, dbt, and Airbyte. As vision language models (VLMs) advance in multimodal understanding and code generation, VLM-based agents could potentially automate these workflows by generating SQL queries, Python code, and GUI opera…
 
OpenDevin: An Open Platform for AI Software Developers as Generalist Agents
VILA^2: VILA Augmented VILA
HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation
PERSONA: A Reproducible Testbed for Pluralistic Alignment
SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency
Scalify: scale propagation for…
 
This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generatio…
 
As Large Language Models (LLMs) achieve remarkable progress in language understanding and generation, their training efficiency has become a critical concern. Traditionally, LLMs are trained to predict the next token in a sequence. Despite the success of token-level training, it suffers from considerable computational costs due to the need to proce…
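Token-level training as described above attaches a cross-entropy term to every position, so compute grows with the number of tokens processed. A minimal sketch of that baseline loss, with an invented toy uniform "model" standing in for a real network:

```python
import math

# Next-token training objective: average negative log-likelihood of each
# token given its prefix. One loss term per position is what makes
# token-level training expensive on long sequences.

def next_token_loss(token_ids, prob_fn):
    """prob_fn(prefix, token) -> model probability (a toy stand-in here)."""
    total = 0.0
    for t in range(1, len(token_ids)):
        p = prob_fn(token_ids[:t], token_ids[t])
        total += -math.log(p)
    return total / (len(token_ids) - 1)

# Toy uniform model over a 4-token vocabulary: every prediction has p = 0.25.
uniform = lambda prefix, tok: 0.25
loss = next_token_loss([0, 1, 2, 3], uniform)
print(round(loss, 4))  # → 1.3863  (= ln 4)
```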
 
We study how to apply large language models to write grounded and organized long-form articles from scratch, with comparable breadth and depth to Wikipedia pages. This underexplored problem poses new challenges at the pre-writing stage, including how to research the topic and prepare an outline prior to writing. We propose STORM, a writing system f…
 
Latest advances have achieved realistic virtual try-on (VTON) through localized garment inpainting using latent diffusion models, significantly enhancing consumers' online shopping experience. However, existing VTON technologies neglect the need for merchants to showcase garments comprehensively, including flexible control over garments, optional f…
 
Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies
Scaling Retrieval-Based Language Models with a Trillion-Token Datastore
Shape of Motion: 4D Reconstruction from a Single Video
Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion
Understanding Reference Policies in Direct Preference Opti…
 
Human video generation is a dynamic and rapidly evolving task that aims to synthesize 2D human body video sequences with generative models given control conditions such as text, audio, and pose. With the potential for wide-ranging applications in film, gaming, and virtual communication, the ability to generate natural and realistic human video is c…
 
The rapid advancement of large language models (LLMs) has paved the way for the development of highly capable autonomous agents. However, existing multi-agent frameworks often struggle with integrating diverse capable third-party agents due to reliance on agents defined within their own ecosystems. They also face challenges in simulating distribute…
 
Qwen2 Technical Report
Learning to Refuse: Towards Mitigating Privacy Risks in LLMs
The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism
Q-Sparse: All Large Language Models can be Fully Sparsely-Activated
GRUtopia: Dream General Robots in a City at Scale
 
With the remarkable advancements in image generation and open-form text generation, the creation of interleaved image-text content has become an increasingly intriguing field. Multimodal story generation, characterized by producing narrative texts and vivid images in an interleaved manner, has emerged as a valuable and practical task with broad app…
 
Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models -- The Story Goes On
Video Diffusion Alignment via Reward Gradients
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model
Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients
MAVIS: Math…
 
While language models (LMs) have shown potential across a range of decision-making tasks, their reliance on simple acting processes limits their broad deployment as autonomous agents. In this paper, we introduce Language Agent Tree Search (LATS) -- the first general framework that synergizes the capabilities of LMs in reasoning, acting, and plannin…
 
Portrait Animation aims to synthesize a lifelike video from a single source image, using it as an appearance reference, with motion (i.e., facial expressions and head pose) derived from a driving video, audio, text, or generation. Instead of following mainstream diffusion-based methods, we explore and extend the potential of the implicit-keypoint-b…
 
Recent advancements in large language models (LLMs) have significantly advanced the automation of software development tasks, including code synthesis, program repair, and test generation. More recently, researchers and industry practitioners have developed various autonomous LLM agents to perform end-to-end software development tasks. These agents…
 
Unveiling Encoder-Free Vision-Language Models
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
AriGraph: Learning Knowledge Graph World Models with Episodic Memory for LLM Agents
RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models
ChartGemma: Visual Instruction-…
 
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases. Leveraging LCLMs' ability to natively ingest and process entire corpora of information offers numerous advantages. It enhances user-friendliness by eliminating the need for speci…
 
Despite Large Language Models (LLMs) like GPT-4 achieving impressive results in function-level code generation, they struggle with repository-scale code understanding (e.g., coming up with the right arguments for calling routines), requiring a deeper comprehension of complex file interactions. Also, recently, people have developed LLM agents that a…
 
Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion
Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models
Planetarium: A Rigorous Benchmark for Translating Text to Structured Planning Languages
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Co…
 
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
ROS-LLM: A ROS framework for embodied AI with task feedback and structured reasoning
MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation
LiteSearch: Efficacious Tree Search for LLM
Wavelets Are All You Need for Autoregressive Image…
 
In this work, we introduce Unique3D, a novel image-to-3D framework for efficiently generating high-quality 3D meshes from single-view images, featuring state-of-the-art generation fidelity and strong generalizability. Previous methods based on Score Distillation Sampling (SDS) can produce diversified 3D results by distilling 3D knowledge from large…
 
We present DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT4-Turbo in code-specific tasks. Specifically, DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 with additional 6 trillion tokens. Through this continued pre-training, DeepSeek-Co…
 
Scaling Synthetic Data Creation with 1,000,000,000 Personas
HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale
LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
Direct Preference Knowledge Distillation for Large Language Models
GaussianDreamerPro: Text to Manipulable 3D Gaussians with Highly Enh…
 
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data
Simulating Classroom Education with LLM-Empowered Agents
SeaKR: Self-aware Knowledge Retrieval for Adaptive Retrieval …
 
Back for a second time on TheVideoVerse, Debargha Mukherjee, Principal Engineer at Google, discusses the upcoming AV2 project, touted as the successor to the popular and powerful AV1 codec. In this podcast, Debargha talks about the advent of the AV2 project and its goals. He delves into the specialized tools newly introduced in AV2, the enhancement…
 
The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large pre-trained models, where fin…
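Step (2) of the conventional recipe described above, selecting the single best run on a held-out validation set and discarding the rest, can be sketched as below. The hyperparameter names and accuracy numbers are invented stand-ins; the work itself goes on to revisit this discard-the-rest step.

```python
# Conventional model selection: evaluate every fine-tuned candidate on a
# held-out validation set and keep only the top scorer.

def pick_best(candidates, validate):
    """Return the (name, model) pair with the highest validation score."""
    return max(candidates.items(), key=lambda kv: validate(kv[1]))

# Hypothetical runs: each "model" is just a dict carrying a fake accuracy.
candidates = {
    "lr=1e-4": {"val_acc": 0.81},
    "lr=3e-4": {"val_acc": 0.84},
    "lr=1e-3": {"val_acc": 0.79},
}
best_name, best_model = pick_best(candidates, lambda m: m["val_acc"])
print(best_name)  # → lr=3e-4
```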
 
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
YouDream: Generating Anatomically Controllable Consistent Text-to-3D Animals
DiffusionPDE: Generative PDE-Solving Under Partial Observation
Aligning Diffusion Models with Noise-Conditioned Perception
Unlocking Continual Learning Abilities in Language Models…
 
There are two common ways in which developers incorporate proprietary and domain-specific data when building applications of Large Language Models (LLMs): Retrieval-Augmented Generation (RAG) and fine-tuning. RAG augments the prompt with the external data, while fine-tuning incorporates the additional knowledge into the model itself. However,…
 
DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Evaluating D-MERIT of Partial-annotation on Information Retrieval
Long Context Transfer from Language to Vision…
 
Software engineers are increasingly adding semantic search capabilities to applications using a strategy known as Retrieval Augmented Generation (RAG). A RAG system finds documents that semantically match a query and then passes those documents to a large language model (LLM) such as ChatGPT to extract the right answer. RAG s…
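The two-stage RAG pipeline described above, retrieve semantically similar documents and then prompt the LLM with them, can be sketched under toy assumptions: the document texts and embedding vectors below are hand-made stand-ins (a real system would use a learned embedding model and a vector index).

```python
import math

# Minimal RAG sketch: cosine-similarity retrieval over toy embeddings,
# followed by prompt augmentation with the retrieved document.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, docs):
    """docs: list of (text, vector). Return the best-matching text."""
    return max(docs, key=lambda d: cosine(query_vec, d[1]))[0]

def build_prompt(question, context):
    return f"Context: {context}\nQuestion: {question}\nAnswer:"

docs = [
    ("The warranty lasts two years.", [1.0, 0.2]),
    ("Shipping takes five days.", [0.1, 1.0]),
]
ctx = retrieve([0.9, 0.1], docs)  # query vector resembling the first doc
print(build_prompt("How long is the warranty?", ctx))
```

The augmented prompt, not the raw question, is what gets sent to the LLM; that is the entire mechanism by which RAG grounds the answer in external data.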
 
LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
Complexity of Symbolic Representation in Working Memory of Transformer Correlates with the Complexity of a Task
Towards Retrieval Augmented Generation over Large Video Libraries
Stylebreeder: Exploring …
 
Language agents perform complex tasks by using tools to execute each step precisely. However, most existing agents are based on proprietary models or designed to target specific tasks, such as mathematics or multi-hop question answering. We introduce Husky, a holistic, open-source language agent that learns to reason over a unified action space to …
 
To extend the context length of Transformer-based large language models (LLMs) and improve comprehension capabilities, we often face limitations due to computational resources and bounded memory storage capacity. This work introduces a method called Recurrent Context Compression (RCC), designed to efficiently expand the context window length of LLM…
 
Retrieval Augmented Generation (RAG) enhances the abilities of Large Language Models (LLMs) by enabling the retrieval of documents into the LLM context to provide more accurate and relevant responses. Existing RAG solutions do not focus on queries that may require fetching multiple documents with substantially different contents. Such queries occur…
 
XLand-100B: A Large-Scale Multi-Task Dataset for In-Context Reinforcement Learning
Make It Count: Text-to-Image Generation with an Accurate Number of Objects
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation
Needle In A Multimodal Haystack
BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Hay…
 