Podcast by Amelia Isabel Torres
Dive into the heart of today’s most compelling topics with The Daily Deep Dive. Every day, we take you beyond the headlines to explore the stories shaping our world. From cutting-edge technology and global events to culture, science, and more, we delve deep to bring you insightful analysis and thought-provoking discussions. Whether you’re a curious learner or a seasoned expert, The Daily Deep Dive offers fresh perspectives and in-depth exploration to keep you informed and engaged. Join us ea ...
Keeping you up to date with the latest trends and best-performing architectures in this fast-evolving field of computer science. Selecting papers by comparative results, citations, and influence, we educate you on the latest research. Consider supporting us on Patreon.com/PapersRead for feedback and ideas.
"The Entrepre-Sapien Project," hosted by Will Downey, is a dynamic podcast exploring digital entrepreneurship and personal growth. Delve into the world of 'Entrepre-Sapiens' - innovators and dreamers shaping their future in the digital age. Join Will as he discusses the challenges and triumphs of digital freedom and creative insight, offering practical advice and inspiring stories. Connect with Us: Website: https://downeymediagroup.com/ Facebook: https://www.facebook.com/DowneyMediaGroup Ins ...
Sapiens: Unveiling the Secrets of Our Ancestors (10:24)
All right, buckle up, because this deep dive is going to be a wild ride through human history. Oh, I love a good history deep dive. Who doesn't, right? And today, we're going way back to the time of our early ancestors with Yuval Noah Harari's Sapiens. Ever imagine sharing the planet with other human species? It's mind-blowing, isn't it? This isn't…
LLaMA-Omni: Seamless Speech Interaction with Large Language Models (32:15)
Models like GPT-4o enable real-time interaction with large language models (LLMs) through speech, significantly enhancing user experience compared to traditional text-based interaction. However, there is still a lack of exploration on how to build speech interaction models based on open-source LLMs. To address this, we propose LLaMA-Omni, a novel m…
GeoCalib: Learning Single-image Calibration with Geometric Optimization (19:16)
From a single image, visual cues can help deduce intrinsic and extrinsic camera parameters like the focal length and the gravity direction. This single-image calibration can benefit various downstream applications like image editing and 3D mapping. Current approaches to this problem are based on either classical geometry with lines and vanishing po…
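As a minimal illustration of the intrinsic parameters the abstract mentions (not the paper's optimization method), the pinhole focal length in pixels follows from the horizontal field of view and assembles into the usual 3×3 intrinsics matrix K:

```python
import math

def focal_from_hfov(width_px: float, hfov_deg: float) -> float:
    """Pinhole model: f = (w / 2) / tan(hfov / 2), in pixels."""
    return (width_px / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)

def intrinsics(width_px: float, height_px: float, f: float):
    """Assemble K with the principal point assumed at the image center."""
    cx, cy = width_px / 2.0, height_px / 2.0
    return [[f,   0.0, cx],
            [0.0, f,   cy],
            [0.0, 0.0, 1.0]]

f = focal_from_hfov(1920, 90.0)   # 90 deg horizontal FOV -> f = 960 px
K = intrinsics(1920, 1080, f)
```

GeoCalib's contribution is recovering such parameters (plus the gravity direction) from a single image; this snippet only shows what the recovered quantities are.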
Artificial Immune System of Secure Face Recognition Against Adversarial Attacks (1:10:54)
Insect production for food and feed presents a promising supplement to ensure food safety and address the adverse impacts of agriculture on climate and environment in the future. However, optimisation is required for insect production to realise its full potential. This can be by targeted improvement of traits of interest through selective breeding…
Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model (29:24)
Recent advancements in audio generation have been significantly propelled by the capabilities of Large Language Models (LLMs). The existing research on audio LLM has primarily focused on enhancing the architecture and scale of audio language models, as well as leveraging larger datasets, and generally, acoustic codecs, such as EnCodec, are used for…
rerankers: A Lightweight Python Library to Unify Ranking Methods (15:39)
This paper presents rerankers, a Python library which provides an easy-to-use interface to the most commonly used re-ranking approaches. Re-ranking is an integral component of many retrieval pipelines; however, there exist numerous approaches to it, relying on different implementation methods. rerankers unifies these methods into a single user-frie…
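The unification idea can be sketched as one interface wrapping interchangeable scoring backends. This is a hypothetical sketch, not the library's actual API; `SimpleRanker` and `overlap_score` are invented names:

```python
from typing import Callable, List, Tuple

def overlap_score(query: str, doc: str) -> float:
    """Toy relevance backend: fraction of query words found in the document."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

class SimpleRanker:
    """One interface, pluggable backends: any (query, doc) -> score function."""
    def __init__(self, score_fn: Callable[[str, str], float]):
        self.score_fn = score_fn

    def rank(self, query: str, docs: List[str]) -> List[Tuple[float, str]]:
        scored = [(self.score_fn(query, d), d) for d in docs]
        return sorted(scored, key=lambda x: x[0], reverse=True)

ranker = SimpleRanker(overlap_score)
results = ranker.rank("neural reranking",
                      ["neural reranking methods", "cooking pasta"])
```

A real library would plug cross-encoders, LLM judges, or API rerankers behind the same `rank` call, which is the convenience the paper argues for.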
Researchers are investing substantial effort in developing powerful general-purpose agents, wherein Foundation Models are used as modules within agentic systems (e.g. Chain-of-Thought, Self-Reflection, Toolformer). However, the history of machine learning teaches us that hand-designed solutions are eventually replaced by learned solutions. We formu…
Text2SQL is Not Enough: Unifying AI and Databases with TAG (42:53)
AI systems that serve natural language questions over databases promise to unlock tremendous value. Such systems would allow users to leverage the powerful reasoning and knowledge capabilities of language models (LMs) alongside the scalable computational power of data management systems. These combined capabilities would empower users to ask arbitr…
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders (35:05)
The ability to accurately interpret complex visual information is a crucial topic of multimodal large language models (MLLMs). Recent work indicates that enhanced visual perception significantly reduces hallucinations and improves performance on resolution-sensitive tasks, such as optical character recognition and document analysis. A number of rec…
Sapiens: Foundation for Human Vision Models (25:58)
We present Sapiens, a family of models for four fundamental human-centric vision tasks -- 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction. Our models natively support 1K high-resolution inference and are extremely easy to adapt for individual tasks by simply fine-tuning models pretrained on over 300 milli…
OctFusion: Octree-based Diffusion Models for 3D Shape Generation (33:00)
Diffusion models have emerged as a popular method for 3D generation. However, it is still challenging for diffusion models to efficiently generate diverse and high-quality 3D shapes. In this paper, we introduce OctFusion, which can generate 3D shapes with arbitrary resolutions in 2.5 seconds on a single Nvidia 4090 GPU, and the extracted meshes are…
Writing in the Margins: Better Inference Pattern for Long Context Retrieval (29:22)
In this paper, we introduce Writing in the Margins (WiM), a new inference pattern for Large Language Models designed to optimize the handling of long input sequences in retrieval-oriented tasks. This approach leverages the chunked prefill of the key-value cache to perform segment-wise inference, which enables efficient processing of extensive conte…
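The segment-wise idea (split the long input into chunks, process them in order, and accumulate intermediate "margin" notes) can be sketched at a high level. This is a toy illustration, not the paper's KV-cache implementation; `summarize` stands in for a model call:

```python
def chunk(tokens, size):
    """Split a long token sequence into fixed-size segments for stepwise processing."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def process_with_margins(tokens, size, summarize):
    """Process each segment in order, keeping a running list of margin notes."""
    margins = []
    for seg in chunk(tokens, size):
        margins.append(summarize(seg))  # an LLM call in the real system
    return margins

# Toy run: 10 "tokens" in segments of 4, summarized by their length.
notes = process_with_margins(list(range(10)), 4, summarize=len)
```

In WiM the margin notes are query-relevant summaries produced during chunked prefill, later aggregated into the final answer.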
Fact Finder -- Enhancing Domain Expertise of Large Language Models by Incorporating Knowledge Graphs (19:53)
Recent advancements in Large Language Models (LLMs) have showcased their proficiency in answering natural language queries. However, their effectiveness is hindered by limited domain-specific knowledge, raising concerns about the reliability of their responses. We introduce a hybrid system that augments LLMs with domain-specific knowledge graphs (K…
RAGLAB: A Modular and Research-Oriented Unified Framework for Retrieval-Augmented Generation (18:01)
Large Language Models (LLMs) demonstrate human-level capabilities in dialogue, reasoning, and knowledge retention. However, even the most advanced LLMs face challenges such as hallucinations and real-time updating of their knowledge. Current research addresses this bottleneck by equipping LLMs with external knowledge, a technique known as Retrieval…
RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation (27:28)
Despite Retrieval-Augmented Generation (RAG) showing promising capability in leveraging external knowledge, a comprehensive evaluation of RAG systems is still challenging due to the modular nature of RAG, evaluation of long-form responses and reliability of measurements. In this paper, we propose a fine-grained evaluation framework, RAGChecker, tha…
DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search (47:39)
We introduce DeepSeek-Prover-V1.5, an open-source language model designed for theorem proving in Lean 4, which enhances DeepSeek-Prover-V1 by optimizing both training and inference processes. Pre-trained on DeepSeekMath-Base with specialization in formal mathematical languages, the model undergoes supervised fine-tuning using an enhanced formal the…
LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs (38:53)
Current long context large language models (LLMs) can process inputs up to 100,000 tokens, yet struggle to generate outputs exceeding even a modest length of 2,000 words. Through controlled experiments, we find that the model's effective generation length is inherently bounded by the sample it has seen during supervised fine-tuning (SFT). In other …
ControlNeXt: Powerful and Efficient Control for Image and Video Generation (26:50)
Diffusion models have demonstrated remarkable and robust abilities in both image and video generation. To achieve greater control over generated results, researchers introduce additional architectures, such as ControlNet, Adapters and ReferenceNet, to integrate conditioning controls. However, current controllable generation methods often require su…
OpenResearcher: Unleashing AI for Accelerated Scientific Research (29:59)
The rapid growth of scientific literature imposes significant challenges for researchers endeavoring to stay updated with the latest advancements in their fields and delve into new areas. We introduce OpenResearcher, an innovative platform that leverages Artificial Intelligence (AI) techniques to accelerate the research process by answering diverse…
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation (33:50)
While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this…
AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls (41:29)
We introduce AnyTool, a large language model agent designed to revolutionize the utilization of a vast array of tools in addressing user queries. We utilize over 16,000 APIs from Rapid API, operating under the assumption that a subset of these APIs could potentially resolve the queries. AnyTool primarily incorporates three elements: an API retrieve…
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads (38:55)
Large Language Models (LLMs) employ auto-regressive decoding that requires sequential computation, with each step reliant on the previous one's output. This creates a bottleneck as each step necessitates moving the full model parameters from High-Bandwidth Memory (HBM) to the accelerator's cache. While methods such as speculative decoding have been…
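The sequential bottleneck described here can be seen in a toy auto-regressive loop, where each step must wait for the previous token. This is illustrative only; `next_token` stands in for a full model forward pass (the expensive HBM-to-cache transfer happens once per iteration):

```python
def decode(prompt, next_token, steps):
    """Plain auto-regressive decoding: one token per full forward pass."""
    seq = list(prompt)
    for _ in range(steps):
        seq.append(next_token(seq))  # step t depends on the output of step t-1
    return seq

# Toy "model": the next token is the sum of the last two tokens, mod 10.
out = decode([1, 1], lambda s: (s[-1] + s[-2]) % 10, steps=4)
# -> [1, 1, 2, 3, 5, 8]
```

Medusa's extra decoding heads propose several candidate tokens per forward pass so fewer such sequential iterations are needed.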
LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders (29:11)
Large decoder-only language models (LLMs) are the state-of-the-art models on most of today's NLP tasks and benchmarks. Yet, the community is only slowly adopting these models for text embedding tasks, which require rich contextualized representations. In this work, we introduce LLM2Vec, a simple unsupervised approach that can transform any decoder-…
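One ingredient of turning a decoder into a text encoder, pooling per-token hidden states into a single fixed-size vector, can be sketched with plain lists. This is a toy illustration, not the paper's full recipe (which also enables bidirectional attention and adds unsupervised contrastive training):

```python
def mean_pool(hidden_states):
    """Average per-token hidden vectors into one fixed-size text embedding."""
    dim = len(hidden_states[0])
    n = len(hidden_states)
    return [sum(h[d] for h in hidden_states) / n for d in range(dim)]

# Three token states of dimension 2 (stand-ins for model outputs).
emb = mean_pool([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
# -> [1.0, 1.0]
```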
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers (31:47)
Large-scale pretrained transformers have created milestones in text (GPT-3) and text-to-image (DALL-E and CogView) generation. Their application to video generation still faces many challenges: the potentially huge computation cost makes training from scratch unaffordable, and the scarcity and weak relevance of text-video datasets hinder the model …
MindSearch: Mimicking Human Minds Elicits Deep AI Searcher (26:22)
Information seeking and integration is a complex cognitive task that consumes enormous time and effort. Inspired by the remarkable progress of Large Language Models, recent works attempt to solve this task by combining LLMs and search engines. However, these methods still obtain unsatisfying performance due to three challenges: (1) complex requests…
Cinemo: Consistent and Controllable Image Animation with Motion Diffusion Models (34:03)
Diffusion models have achieved great progress in image animation due to powerful generative capabilities. However, maintaining spatio-temporal consistency with detailed information from the input static image over time (e.g., style, background, and object of the input static image) and ensuring smoothness in animated video narratives guided by text…
FinanceBench: A New Benchmark for Financial Question Answering (41:34)
FinanceBench is a first-of-its-kind test suite for evaluating the performance of LLMs on open book financial question answering (QA). It comprises 10,231 questions about publicly traded companies, with corresponding answers and evidence strings. The questions in FinanceBench are ecologically valid and cover a diverse set of scenarios. They are inte…
Stable-Hair: Real-World Hair Transfer via Diffusion Model (30:25)
Current hair transfer methods struggle to handle diverse and intricate hairstyles, thus limiting their applicability in real-world scenarios. In this paper, we propose a novel diffusion-based hair transfer framework, named Stable-Hair, which robustly transfers a wide range of real-world hairstyles onto user-provided faces for virtual hair …
Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? (31:03)
Data science and engineering workflows often span multiple stages, from warehousing to orchestration, using tools like BigQuery, dbt, and Airbyte. As vision language models (VLMs) advance in multimodal understanding and code generation, VLM-based agents could potentially automate these workflows by generating SQL queries, Python code, and GUI opera…
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs (34:06)
This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generatio…
Patch-Level Training for Large Language Models (24:02)
As Large Language Models (LLMs) achieve remarkable progress in language understanding and generation, their training efficiency has become a critical concern. Traditionally, LLMs are trained to predict the next token in a sequence. Despite the success of token-level training, it suffers from considerable computational costs due to the need to proce…
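The token-level objective the abstract refers to pairs each position with the token that follows it; a minimal sketch of that shift:

```python
def next_token_pairs(token_ids):
    """Standard next-token objective: inputs are seq[:-1], targets are seq[1:]."""
    return token_ids[:-1], token_ids[1:]

inputs, targets = next_token_pairs([5, 7, 9, 11])
# inputs -> [5, 7, 9], targets -> [7, 9, 11]
```

Patch-level training instead groups consecutive tokens into coarser units for part of training, reducing the number of positions the model must process; the snippet above only shows the baseline it improves on.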
Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models (35:12)
We study how to apply large language models to write grounded and organized long-form articles from scratch, with comparable breadth and depth to Wikipedia pages. This underexplored problem poses new challenges at the pre-writing stage, including how to research the topic and prepare an outline prior to writing. We propose STORM, a writing system f…
IMAGDressing-v1: Customizable Virtual Dressing (27:37)
Latest advances have achieved realistic virtual try-on (VTON) through localized garment inpainting using latent diffusion models, significantly enhancing consumers' online shopping experience. However, existing VTON technologies neglect the need for merchants to showcase garments comprehensively, including flexible control over garments, optional f…
A Comprehensive Survey on Human Video Generation: Challenges, Methods, and Insights (36:34)
Human video generation is a dynamic and rapidly evolving task that aims to synthesize 2D human body video sequences with generative models given control conditions such as text, audio, and pose. With the potential for wide-ranging applications in film, gaming, and virtual communication, the ability to generate natural and realistic human video is c…
Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence (49:58)
The rapid advancement of large language models (LLMs) has paved the way for the development of highly capable autonomous agents. However, existing multi-agent frameworks often struggle with integrating diverse capable third-party agents due to reliance on agents defined within their own ecosystems. They also face challenges in simulating distribute…
SEED-Story: Multimodal Long Story Generation with Large Language Model (22:27)
With the remarkable advancements in image generation and open-form text generation, the creation of interleaved image-text content has become an increasingly intriguing field. Multimodal story generation, characterized by producing narrative texts and vivid images in an interleaved manner, has emerged as a valuable and practical task with broad app…
Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models (39:20)
While language models (LMs) have shown potential across a range of decision-making tasks, their reliance on simple acting processes limits their broad deployment as autonomous agents. In this paper, we introduce Language Agent Tree Search (LATS) -- the first general framework that synergizes the capabilities of LMs in reasoning, acting, and plannin…
LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control (39:35)
Portrait Animation aims to synthesize a lifelike video from a single source image, using it as an appearance reference, with motion (i.e., facial expressions and head pose) derived from a driving video, audio, text, or generation. Instead of following mainstream diffusion-based methods, we explore and extend the potential of the implicit-keypoint-b…
Agentless: Demystifying LLM-based Software Engineering Agents (35:54)
Recent advancements in large language models (LLMs) have significantly advanced the automation of software development tasks, including code synthesis, program repair, and test generation. More recently, researchers and industry practitioners have developed various autonomous LLM agents to perform end-to-end software development tasks. These agents…
Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? (36:47)
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases. Leveraging LCLMs' ability to natively ingest and process entire corpora of information offers numerous advantages. It enhances user-friendliness by eliminating the need for speci…
ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code (27:24)
Despite Large Language Models (LLMs) like GPT-4 achieving impressive results in function-level code generation, they struggle with repository-scale code understanding (e.g., coming up with the right arguments for calling routines), requiring a deeper comprehension of complex file interactions. Also, recently, people have developed LLM agents that a…
Unique3D: High-Quality and Efficient 3D Mesh Generation from a Single Image (22:25)
In this work, we introduce Unique3D, a novel image-to-3D framework for efficiently generating high-quality 3D meshes from single-view images, featuring state-of-the-art generation fidelity and strong generalizability. Previous methods based on Score Distillation Sampling (SDS) can produce diversified 3D results by distilling 3D knowledge from large…
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (37:18)
We present DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT4-Turbo in code-specific tasks. Specifically, DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 with additional 6 trillion tokens. Through this continued pre-training, DeepSeek-Co…
Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time (38:01)
The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large pre-trained models, where fin…
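The core operation, a uniform "soup" that averages the weights of several fine-tuned models of the same architecture, reduces to element-wise parameter averaging. A minimal sketch with plain dicts of lists standing in for checkpoints:

```python
def uniform_soup(checkpoints):
    """Average corresponding parameters across fine-tuned models (same architecture)."""
    n = len(checkpoints)
    return {k: [sum(c[k][i] for c in checkpoints) / n
                for i in range(len(checkpoints[0][k]))]
            for k in checkpoints[0]}

# Two toy "fine-tuned checkpoints" with identical parameter shapes.
m1 = {"w": [1.0, 2.0], "b": [0.0]}
m2 = {"w": [3.0, 4.0], "b": [2.0]}
soup = uniform_soup([m1, m2])
# -> {"w": [2.0, 3.0], "b": [1.0]}
```

The averaged model is a single network, so inference cost is unchanged, which is the paper's headline advantage over ensembling.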
RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture (1:06:40)
There are two common ways in which developers incorporate proprietary and domain-specific data when building applications on Large Language Models (LLMs): Retrieval-Augmented Generation (RAG) and fine-tuning. RAG augments the prompt with external data, while fine-tuning incorporates the additional knowledge into the model itself. However,…
Seven Failure Points When Engineering a Retrieval Augmented Generation System (21:27)
Software engineers are increasingly adding semantic search capabilities to applications using a strategy known as Retrieval Augmented Generation (RAG). A RAG system finds documents that semantically match a query and then passes them to a large language model (LLM) such as ChatGPT to extract the right answer. RAG s…
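The basic pipeline described here, retrieve documents matching the query and then hand them to an LLM inside the prompt, can be sketched end to end. Toy keyword retrieval only; the `llm` callable is a stand-in for a real model call:

```python
def retrieve(query, corpus, k=2):
    """Toy retrieval: rank documents by the number of words shared with the query."""
    q = set(query.lower().split())
    scored = sorted(corpus,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def rag_answer(query, corpus, llm, k=2):
    """Augment the prompt with retrieved context, then ask the model."""
    context = "\n".join(retrieve(query, corpus, k))
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return llm(prompt)

corpus = ["RAG augments prompts with retrieved text",
          "Bread recipes need flour"]
top = retrieve("what does RAG do with prompts", corpus, k=1)
```

Each stage in this sketch (indexing, retrieval, prompt assembly, generation) corresponds to a place where the paper's failure points can occur.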
Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning (42:09)
Language agents perform complex tasks by using tools to execute each step precisely. However, most existing agents are based on proprietary models or designed to target specific tasks, such as mathematics or multi-hop question answering. We introduce Husky, a holistic, open-source language agent that learns to reason over a unified action space to …
Recurrent Context Compression: Efficiently Expanding the Context Window of LLM (38:11)
To extend the context length of Transformer-based large language models (LLMs) and improve comprehension capabilities, we often face limitations due to computational resources and bounded memory storage capacity. This work introduces a method called Recurrent Context Compression (RCC), designed to efficiently expand the context window length of LLM…
Multi-Head RAG: Solving Multi-Aspect Problems with LLMs (33:39)
Retrieval Augmented Generation (RAG) enhances the abilities of Large Language Models (LLMs) by enabling the retrieval of documents into the LLM context to provide more accurate and relevant responses. Existing RAG solutions do not focus on queries that may require fetching multiple documents with substantially different contents. Such queries occur…
StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning (40:44)
Simultaneous speech-to-speech translation (Simul-S2ST, a.k.a. streaming speech translation) outputs target speech while receiving streaming speech inputs, which is critical for real-time communication. Beyond accomplishing translation between speech, Simul-S2ST requires a policy to control the model to generate corresponding target speech at the opp…