A daily update on the latest AI research papers. We provide a high-level overview of a handful of papers each day and link all papers in the description for further reading. This podcast is created entirely with AI by PocketPod. Head over to https://pocketpod.app to learn more.
 
This week we hope (not) to see natural disasters from a safe, social distance while wearing bike helmets. We dig deep into the next #AnimalCrossing #Lego sets and the new Super Mario Land in Orlando. We also talk about #birds, like a lot. --- Patreon Members Only: View this episode as a Vodcast! --- Join our Patreon! https://patreon.com/the…
 
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
ROS-LLM: A ROS framework for embodied AI with task feedback and structured reasoning
MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation
LiteSearch: Efficacious Tree Search for LLM
Wavelets Are All You Need for Autoregressive Image…
 
Scaling Synthetic Data Creation with 1,000,000,000 Personas
HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale
LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
Direct Preference Knowledge Distillation for Large Language Models
GaussianDreamerPro: Text to Manipulable 3D Gaussians with Highly Enh…
 
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data
Simulating Classroom Education with LLM-Empowered Agents
SeaKR: Self-aware Knowledge Retrieval for Adaptive Retrieval …
 
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
YouDream: Generating Anatomically Controllable Consistent Text-to-3D Animals
DiffusionPDE: Generative PDE-Solving Under Partial Observation
Aligning Diffusion Models with Noise-Conditioned Perception
Unlocking Continual Learning Abilities in Language Models…
 
DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Evaluating D-MERIT of Partial-annotation on Information Retrieval
Long Context Transfer from Language to Vision…
 
LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
Complexity of Symbolic Representation in Working Memory of Transformer Correlates with the Complexity of a Task
Towards Retrieval Augmented Generation over Large Video Libraries
Stylebreeder: Exploring …
 
XLand-100B: A Large-Scale Multi-Task Dataset for In-Context Reinforcement Learning
Make It Count: Text-to-Image Generation with an Accurate Number of Objects
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation
Needle In A Multimodal Haystack
BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Hay…
 
Depth Anything V2
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
Transformers meet Neural Algorithmic Reasoners
Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling
OpenVLA: An Open-Source Vision-Language-Action Model
Alleviating Distortion in Image Generation via Multi-Resolut…
 
NaRCan: Natural Refined Canonical Image with Integration of Diffusion Prior for Video Editing
MotionClone: Training-Free Motion Cloning for Controllable Video Generation
What If We Recaption Billions of Web Images with LLaMA-3?
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing
PowerInfer-2: Fast Large Language Model I…
 
An Image is Worth 32 Tokens for Reconstruction and Generation
McEval: Massively Multilingual Code Evaluation
Zero-shot Image Editing with Reference Imitation
The Prompt Report: A Systematic Survey of Prompting Techniques
TextGrad: Automatic "Differentiation" via Text
 
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning
Vript: A Video Is Worth Thousands of Words
Lighting Every Darkness with 3DGS: Fast Training and Real-Time Rendering for HDR View Synthesis
VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text …
 
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
BitsFusion: 1.99 bits Weight Quantization of Diffusion Model
Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step
Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models
SF-V: Single Forward Video Generation Mo…
 
Apple announced new Siri features and Apple Intelligence today. Interestingly, Apple has already released a paper, titled "Ferret-UI," on how it all works: a multimodal vision-language model capable of understanding widgets, icons, and text on an iOS mobile screen, and reasoning about their spatial relationships and functional meanings. https://arxiv.…
 
Block Transformer: Global-to-Local Language Modeling for Fast Inference
Parrot: Multilingual Visual Instruction Tuning
Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration
Ouroboros3D: Image-to-3D Generation via 3D-aware Recursive Diffusion
LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autore…
 
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
To Believe or Not to Believe Your LLM
I4VGen: Image as Stepping Stone for Text-to-Video Generation
Self-Improving Robust Preference Optimization
Guiding a Diffusion Model with a Bad Version of Itself
 
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
Learning Temporally Consistent Video Depth from Video Diffusion Priors
Show, Don't Tell: Aligning Language Models with Demonstrated Feedback
Artificial Generational Intelligence: Cultural Accumulation in Reinforcement Learning
ZeroSmooth: Training-free Diffuser Adaptati…
 
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models
Kaleido Diffusion: Improving Conditional Diffusion Models with Au…
 
AI Papers Podcast for 06/04/2024
DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation
GECO: Generative Image-to-3D within a SECOnd
PLA4D: Pixel-Level Alignments for Text-to-4D Gaussian Splatting
DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories
Parrot: Efficient Serving of LLM-b…
 
AI Papers Podcast for 06/03/2024
Jina CLIP: Your CLIP Model Is Also Your Text Retriever
Similarity is Not All You Need: Endowing Retrieval Augmented Generation with Multi Layered Thoughts
MotionLLM: Understanding Human Behaviors from Human Motions and Videos
Xwin-LM: Strong and Scalable Alignment Practice for LLMs
MOFA-Video: Controllable Image Animati…
 
MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series
T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback
LLMs achieve adult human performance on higher-order theory of mind tasks
Nearest Neighbor Speculative Decoding for LLM Generation and Attribution
Zipper: A Multi-Tower Decoder Ar…
 
An Introduction to Vision-Language Modeling
Transformers Can Do Arithmetic with the Right Embeddings
Matryoshka Multimodal Models
I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models
Zamba: A Compact 7B SSM Hybrid Model
Looking Backward: Streaming Video-to-Video Translation with Feature Banks…
 
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
Aya 23: Open Weight Releases to Further Multilingual Progress
Stacking Your Transformers: A Close…
 
ReVideo: Remake a Video with Motion and Content Control
Not All Language Model Features Are Linear
RectifID: Personalizing Rectified Flow with Anchored Classifier Guidance
Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation
DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data
Dense Connector for MLLMs…
 
Your Transformer is Secretly Linear
Diffusion for World Modeling: Visual Details Matter in Atari
Face Adapter for Pre-Trained Diffusion Models with Fine-Grained ID and Attribute Control
Reducing Transformer Key-Value Cache Size with Cross-Layer Attention
OmniGlue: Generalizable Feature Matching with Foundation Model Guidance
Personalized Residuals for C…
 
FIFO-Diffusion: Generating Infinite Videos from Text without Training
MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
Imp: Highly Capable Large Multimodal Models for Mobile Devices
Octo: An Open-Source Generalist Robot Policy
Towards Modular LLMs by Building and Reusing …
 
INDUS: Effective and Efficient Language Models for Scientific Applications
Observational Scaling Laws and the Predictability of Language Model Performance
Grounded 3D-LLM with Referent Tokens
Layer-Condensed KV Cache for Efficient Inference of Large Language Models
Dynamic data sampler for cross-language transfer learning in large language models…
 
Chameleon: Mixed-Modal Early-Fusion Foundation Models
LoRA Learns Less and Forgets Less
Many-Shot In-Context Learning in Multimodal Foundation Models
CAT3D: Create Anything in 3D with Multi-View Diffusion Models
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection
Dual3D: Efficient and Consistent Text-to-3D Generation with Dual-mode…
 
ALPINE: Unveiling the Planning Capability of Autoregressive Learning in Language Models
Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model
BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation
Naturalistic Music Decoding from EEG Data via Latent Diffusion Models
No Time to Waste: Squeeze Time into Channel for Mobile Vide…
 
VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models
Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory
Coin3D: Controllable and Interactive 3D Assets Generation with Proxy-Guided Conditioning
Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Unde…
 
What matters when building vision-language models?
RLHF Workflow: From Reward Modeling to Online RLHF
SUTRA: Scalable Multilingual Language Model Architecture
SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts
Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from …
 
BlenderAlchemy: Editing 3D Graphics with Vision-Language Models
Stylus: Automatic Adapter Selection for Diffusion Models
Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations
DressCode: Autoregressively Sewing and Generating Garments from Text Guidance
PLLaVA : Parameter-free LLaVA Extension from Images to V…
 
This week, we're catching up on our spring break activities, including pottery, celestial events, and Passover. We also hop into #AnimalCrossing #NewHorizons to ponder a koala and play vacuuming. And we learn... how to type? --- Patreon Members Only: View this episode as a Vodcast! --- Join our Patreon! https://patreon.com/thepocketpod Visit our We…
 
MotionLCM: Real-time Controllable Motion Generation via Latent Consistency Model
Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation
GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting
SAGS: Structure-Aware 3D Gaussian Splatting
Invisible Stitch: Generating Smooth 3D Scenes with Depth Inpainting…
 
Paint by Inpaint: Learning to Add Image Objects by Removing Them First
Self-Play Preference Optimization for Language Model Alignment
Automatic Creative Selection with Cross-Modal Matching
STT: Stateful Tracking with Transformers for Autonomous Driving
Octopus v4: Graph of language models
 
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
WildChat: 1M ChatGPT Interaction Logs in the Wild
StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation
LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report
LLM-AD: Large Language Model based Audio Description System…
 
Is Bigger Edit Batch Size Always Better? -- An Empirical Study on Model Editing with Llama-3
A Careful Examination of Large Language Model Performance on Grade School Arithmetic
Spectrally Pruned Gaussian Fields with Neural Compensation
SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound
Clover: Regressive Lightweight Speculative …
 
Octopus v4: Graph of language models
InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation
Better & Faster Large Language Models via Multi-token Prediction
GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting
Iterative Reasoning Preference Optimization
 
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
LEGENT: Open Platform for Embodied Agents
Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations
Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting
BlenderAlchemy: Editing 3D Graphics with Vision-Languag…
 
AI Papers Podcast for 04/29/2024
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
HaLo-NeRF: Learning Geometry-Guided Semantics for Exploring Unconstrained Photo Collections
MaPa: Text-driven Photorealistic Material Painting for 3D Shapes…
 
AI Papers Podcast for 04/26/2024
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Interactive3D: Create What You Want by Interactive 3D Generation
Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding
Tele-FLM Technical Report
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Mo…
 
AI Papers Podcast for 04/25/2024
Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image Synthesis
A Multimodal Automated Interpretability Agent
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
MultiBooth: Towards Generating All Your Concepts in an Image from Text
Learning H-Infinity Locomotion Control…
 
AI Papers Podcast for 04/24/2024
OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework
Multi-Head Mixture-of-Experts
Pegasus-v1 Technical Report
Align Your Steps: Optimizing Sampling Schedules in Diffusion Models
SnapKV: LLM Knows What You are Looking for Before Generation…
 
AI Papers Podcast for 04/23/2024
PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
Does Gaussian Splatting need SFM Initialization?
How Far Can We Go with Practical Function-Level Program Repair?
AutoCrawler: A Progressive Understanding Web Agent for Web Crawler…
 
AI Papers Podcast for 04/21/2024
Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model
HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing
Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization
TextHawk: Exploring Efficient Fine-Grained Percept…
 