A daily update on the latest AI research papers. We provide a high-level overview of a handful of papers each day and link all papers in the description for further reading. This podcast is created entirely with AI by PocketPod. Head over to https://pocketpod.app to learn more.
 
This week we hope (not) to see natural disasters from a safe, social distance while wearing bike helmets. We dig deep into the next #AnimalCrossing #Lego sets and the new Super Mario Land in Orlando. We also talk about #birds, like a lot. --- Patreon Members Only: View this episode as a Vodcast! --- Join our Patreon! https://patreon.com/the…
 
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
ROS-LLM: A ROS framework for embodied AI with task feedback and structured reasoning
MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation
LiteSearch: Efficacious Tree Search for LLM
Wavelets Are All You Need for Autoregressive Image…
 
Scaling Synthetic Data Creation with 1,000,000,000 Personas
HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale
LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
Direct Preference Knowledge Distillation for Large Language Models
GaussianDreamerPro: Text to Manipulable 3D Gaussians with Highly Enh…
 
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data
Simulating Classroom Education with LLM-Empowered Agents
SeaKR: Self-aware Knowledge Retrieval for Adaptive Retrieval …
 
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
YouDream: Generating Anatomically Controllable Consistent Text-to-3D Animals
DiffusionPDE: Generative PDE-Solving Under Partial Observation
Aligning Diffusion Models with Noise-Conditioned Perception
Unlocking Continual Learning Abilities in Language Models…
 
DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Evaluating D-MERIT of Partial-annotation on Information Retrieval
Long Context Transfer from Language to Vision…
 
LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
Complexity of Symbolic Representation in Working Memory of Transformer Correlates with the Complexity of a Task
Towards Retrieval Augmented Generation over Large Video Libraries
Stylebreeder: Exploring …
 
XLand-100B: A Large-Scale Multi-Task Dataset for In-Context Reinforcement Learning
Make It Count: Text-to-Image Generation with an Accurate Number of Objects
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation
Needle In A Multimodal Haystack
BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Hay…
 
Depth Anything V2
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
Transformers meet Neural Algorithmic Reasoners
Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling
OpenVLA: An Open-Source Vision-Language-Action Model
Alleviating Distortion in Image Generation via Multi-Resolut…
 
NaRCan: Natural Refined Canonical Image with Integration of Diffusion Prior for Video Editing
MotionClone: Training-Free Motion Cloning for Controllable Video Generation
What If We Recaption Billions of Web Images with LLaMA-3?
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing
PowerInfer-2: Fast Large Language Model I…
 
An Image is Worth 32 Tokens for Reconstruction and Generation
McEval: Massively Multilingual Code Evaluation
Zero-shot Image Editing with Reference Imitation
The Prompt Report: A Systematic Survey of Prompting Techniques
TextGrad: Automatic "Differentiation" via Text
 
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning
Vript: A Video Is Worth Thousands of Words
Lighting Every Darkness with 3DGS: Fast Training and Real-Time Rendering for HDR View Synthesis
VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text …
 
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
BitsFusion: 1.99 bits Weight Quantization of Diffusion Model
Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step
Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models
SF-V: Single Forward Video Generation Mo…
 
Apple announced new Siri features and Apple Intelligence today. Interestingly, Apple has already released a paper, titled "Ferret-UI," on how it all works: a multimodal vision-language model capable of understanding widgets, icons, and text on an iOS mobile screen, and reasoning about their spatial relationships and functional meanings. https://arxiv.…
 
Block Transformer: Global-to-Local Language Modeling for Fast Inference
Parrot: Multilingual Visual Instruction Tuning
Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration
Ouroboros3D: Image-to-3D Generation via 3D-aware Recursive Diffusion
LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autore…
 
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
To Believe or Not to Believe Your LLM
I4VGen: Image as Stepping Stone for Text-to-Video Generation
Self-Improving Robust Preference Optimization
Guiding a Diffusion Model with a Bad Version of Itself
 
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
Learning Temporally Consistent Video Depth from Video Diffusion Priors
Show, Don't Tell: Aligning Language Models with Demonstrated Feedback
Artificial Generational Intelligence: Cultural Accumulation in Reinforcement Learning
ZeroSmooth: Training-free Diffuser Adaptati…
 
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models
Kaleido Diffusion: Improving Conditional Diffusion Models with Au…
 
AI Papers Podcast for 06/04/2024
DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation
GECO: Generative Image-to-3D within a SECOnd
PLA4D: Pixel-Level Alignments for Text-to-4D Gaussian Splatting
DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories
Parrot: Efficient Serving of LLM-b…
 
AI Papers Podcast for 06/03/2024
Jina CLIP: Your CLIP Model Is Also Your Text Retriever
Similarity is Not All You Need: Endowing Retrieval Augmented Generation with Multi Layered Thoughts
MotionLLM: Understanding Human Behaviors from Human Motions and Videos
Xwin-LM: Strong and Scalable Alignment Practice for LLMs
MOFA-Video: Controllable Image Animati…
 
MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series
T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback
LLMs achieve adult human performance on higher-order theory of mind tasks
Nearest Neighbor Speculative Decoding for LLM Generation and Attribution
Zipper: A Multi-Tower Decoder Ar…
 
An Introduction to Vision-Language Modeling
Transformers Can Do Arithmetic with the Right Embeddings
Matryoshka Multimodal Models
I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models
Zamba: A Compact 7B SSM Hybrid Model
Looking Backward: Streaming Video-to-Video Translation with Feature Banks…
 
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
Aya 23: Open Weight Releases to Further Multilingual Progress
Stacking Your Transformers: A Close…
 
ReVideo: Remake a Video with Motion and Content Control
Not All Language Model Features Are Linear
RectifID: Personalizing Rectified Flow with Anchored Classifier Guidance
Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation
DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data
Dense Connector for MLLMs…
 
Your Transformer is Secretly Linear
Diffusion for World Modeling: Visual Details Matter in Atari
Face Adapter for Pre-Trained Diffusion Models with Fine-Grained ID and Attribute Control
Reducing Transformer Key-Value Cache Size with Cross-Layer Attention
OmniGlue: Generalizable Feature Matching with Foundation Model Guidance
Personalized Residuals for C…
 
FIFO-Diffusion: Generating Infinite Videos from Text without Training
MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
Imp: Highly Capable Large Multimodal Models for Mobile Devices
Octo: An Open-Source Generalist Robot Policy
Towards Modular LLMs by Building and Reusing …
 
INDUS: Effective and Efficient Language Models for Scientific Applications
Observational Scaling Laws and the Predictability of Language Model Performance
Grounded 3D-LLM with Referent Tokens
Layer-Condensed KV Cache for Efficient Inference of Large Language Models
Dynamic data sampler for cross-language transfer learning in large language models…
 
Chameleon: Mixed-Modal Early-Fusion Foundation Models
LoRA Learns Less and Forgets Less
Many-Shot In-Context Learning in Multimodal Foundation Models
CAT3D: Create Anything in 3D with Multi-View Diffusion Models
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection
Dual3D: Efficient and Consistent Text-to-3D Generation with Dual-mode…
 
ALPINE: Unveiling the Planning Capability of Autoregressive Learning in Language Models
Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model
BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation
Naturalistic Music Decoding from EEG Data via Latent Diffusion Models
No Time to Waste: Squeeze Time into Channel for Mobile Vide…
 
VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models
Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory
Coin3D: Controllable and Interactive 3D Assets Generation with Proxy-Guided Conditioning
Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Unde…
 
What matters when building vision-language models?
RLHF Workflow: From Reward Modeling to Online RLHF
SUTRA: Scalable Multilingual Language Model Architecture
SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts
Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from …
 
BlenderAlchemy: Editing 3D Graphics with Vision-Language Models
Stylus: Automatic Adapter Selection for Diffusion Models
Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations
DressCode: Autoregressively Sewing and Generating Garments from Text Guidance
PLLaVA : Parameter-free LLaVA Extension from Images to V…
 
This week, we're catching up on our spring break activities, including pottery, celestial events, and Passover. We also hop into #AnimalCrossing #NewHorizons to ponder a koala and play vacuuming. And we learn... how to type? --- Patreon Members Only: View this episode as a Vodcast! --- Join our Patreon! https://patreon.com/thepocketpod Visit our We…
 
MotionLCM: Real-time Controllable Motion Generation via Latent Consistency Model
Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation
GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting
SAGS: Structure-Aware 3D Gaussian Splatting
Invisible Stitch: Generating Smooth 3D Scenes with Depth Inpainting…
 
Paint by Inpaint: Learning to Add Image Objects by Removing Them First
Self-Play Preference Optimization for Language Model Alignment
Automatic Creative Selection with Cross-Modal Matching
STT: Stateful Tracking with Transformers for Autonomous Driving
Octopus v4: Graph of language models
 
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
WildChat: 1M ChatGPT Interaction Logs in the Wild
StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation
LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report
LLM-AD: Large Language Model based Audio Description System…
 
Is Bigger Edit Batch Size Always Better? -- An Empirical Study on Model Editing with Llama-3
A Careful Examination of Large Language Model Performance on Grade School Arithmetic
Spectrally Pruned Gaussian Fields with Neural Compensation
SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound
Clover: Regressive Lightweight Speculative …
 
Octopus v4: Graph of language models
InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation
Better & Faster Large Language Models via Multi-token Prediction
GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting
Iterative Reasoning Preference Optimization
 
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
LEGENT: Open Platform for Embodied Agents
Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations
Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting
BlenderAlchemy: Editing 3D Graphics with Vision-Languag…
 
AI Papers Podcast for 04/29/2024
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
HaLo-NeRF: Learning Geometry-Guided Semantics for Exploring Unconstrained Photo Collections
MaPa: Text-driven Photorealistic Material Painting for 3D Shapes…
 
AI Papers Podcast for 04/26/2024
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Interactive3D: Create What You Want by Interactive 3D Generation
Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding
Tele-FLM Technical Report
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Mo…
 
AI Papers Podcast for 04/25/2024
Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image Synthesis
A Multimodal Automated Interpretability Agent
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
MultiBooth: Towards Generating All Your Concepts in an Image from Text
Learning H-Infinity Locomotion Control…
 
AI Papers Podcast for 04/24/2024
OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework
Multi-Head Mixture-of-Experts
Pegasus-v1 Technical Report
Align Your Steps: Optimizing Sampling Schedules in Diffusion Models
SnapKV: LLM Knows What You are Looking for Before Generation…
 
AI Papers Podcast for 04/23/2024
PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
Does Gaussian Splatting need SFM Initialization?
How Far Can We Go with Practical Function-Level Program Repair?
AutoCrawler: A Progressive Understanding Web Agent for Web Crawler…
 
AI Papers Podcast for 04/21/2024
Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model
HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing
Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization
TextHawk: Exploring Efficient Fine-Grained Percept…
 