Best Cinemosity Podcasts (2024)

Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation 33:50

2h ago

While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this…

AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls 41:29

1h ago

We introduce AnyTool, a large language model agent designed to revolutionize the utilization of a vast array of tools in addressing user queries. We utilize over 16,000 APIs from Rapid API, operating under the assumption that a subset of these APIs could potentially resolve the queries. AnyTool primarily incorporates three elements: an API retrieve…

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads 38:55

7d ago

Large Language Models (LLMs) employ auto-regressive decoding that requires sequential computation, with each step reliant on the previous one's output. This creates a bottleneck as each step necessitates moving the full model parameters from High-Bandwidth Memory (HBM) to the accelerator's cache. While methods such as speculative decoding have been…

LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders 29:11

5d ago

Large decoder-only language models (LLMs) are the state-of-the-art models on most of today's NLP tasks and benchmarks. Yet, the community is only slowly adopting these models for text embedding tasks, which require rich contextualized representations. In this work, we introduce LLM2Vec, a simple unsupervised approach that can transform any decoder-…

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers 31:47

9d ago

Large-scale pretrained transformers have created milestones in text (GPT-3) and text-to-image (DALL-E and CogView) generation. Their application to video generation still faces many challenges: the potentially huge computation cost makes training from scratch unaffordable; the scarcity and weak relevance of text-video datasets hinder the model …

MindSearch: Mimicking Human Minds Elicits Deep AI Searcher 26:22

8d ago

Information seeking and integration is a complex cognitive task that consumes enormous time and effort. Inspired by the remarkable progress of Large Language Models, recent works attempt to solve this task by combining LLMs and search engines. However, these methods still obtain unsatisfying performance due to three challenges: (1) complex requests…

Cinemo: Consistent and Controllable Image Animation with Motion Diffusion Models 34:03

13d ago

Diffusion models have achieved great progress in image animation due to powerful generative capabilities. However, maintaining spatio-temporal consistency with detailed information from the input static image over time (e.g., style, background, and object of the input static image) and ensuring smoothness in animated video narratives guided by text…

FinanceBench: A New Benchmark for Financial Question Answering 41:34

14d ago

FinanceBench is a first-of-its-kind test suite for evaluating the performance of LLMs on open book financial question answering (QA). It comprises 10,231 questions about publicly traded companies, with corresponding answers and evidence strings. The questions in FinanceBench are ecologically valid and cover a diverse set of scenarios. They are inte…

Stable-Hair: Real-World Hair Transfer via Diffusion Model 30:25

15d ago

Current hair transfer methods struggle to handle diverse and intricate hairstyles, thus limiting their applicability in real-world scenarios. In this paper, we propose a novel diffusion-based hair transfer framework, named Stable-Hair, which robustly transfers a wide range of real-world hairstyles onto user-provided faces for virtual hair …

Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? 31:03

18d ago

Data science and engineering workflows often span multiple stages, from warehousing to orchestration, using tools like BigQuery, dbt, and Airbyte. As vision language models (VLMs) advance in multimodal understanding and code generation, VLM-based agents could potentially automate these workflows by generating SQL queries, Python code, and GUI opera…

FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs 34:06

20d ago

This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generatio…

Patch-Level Training for Large Language Models 24:02

24d ago

As Large Language Models (LLMs) achieve remarkable progress in language understanding and generation, their training efficiency has become a critical concern. Traditionally, LLMs are trained to predict the next token in a sequence. Despite the success of token-level training, it suffers from considerable computational costs due to the need to proce…

Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models 35:12

23d ago

We study how to apply large language models to write grounded and organized long-form articles from scratch, with comparable breadth and depth to Wikipedia pages. This underexplored problem poses new challenges at the pre-writing stage, including how to research the topic and prepare an outline prior to writing. We propose STORM, a writing system f…

IMAGDressing-v1: Customizable Virtual Dressing 27:37

24d ago

Latest advances have achieved realistic virtual try-on (VTON) through localized garment inpainting using latent diffusion models, significantly enhancing consumers' online shopping experience. However, existing VTON technologies neglect the need for merchants to showcase garments comprehensively, including flexible control over garments, optional f…

A Comprehensive Survey on Human Video Generation: Challenges, Methods, and Insights 36:34

27d ago

Human video generation is a dynamic and rapidly evolving task that aims to synthesize 2D human body video sequences with generative models given control conditions such as text, audio, and pose. With the potential for wide-ranging applications in film, gaming, and virtual communication, the ability to generate natural and realistic human video is c…

Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence 49:58

28d ago

The rapid advancement of large language models (LLMs) has paved the way for the development of highly capable autonomous agents. However, existing multi-agent frameworks often struggle with integrating diverse capable third-party agents due to reliance on agents defined within their own ecosystems. They also face challenges in simulating distribute…

SEED-Story: Multimodal Long Story Generation with Large Language Model 22:27

30d ago

With the remarkable advancements in image generation and open-form text generation, the creation of interleaved image-text content has become an increasingly intriguing field. Multimodal story generation, characterized by producing narrative texts and vivid images in an interleaved manner, has emerged as a valuable and practical task with broad app…

Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models 39:20

1M ago

While language models (LMs) have shown potential across a range of decision-making tasks, their reliance on simple acting processes limits their broad deployment as autonomous agents. In this paper, we introduce Language Agent Tree Search (LATS) -- the first general framework that synergizes the capabilities of LMs in reasoning, acting, and plannin…

LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control 39:35

1M ago

Portrait Animation aims to synthesize a lifelike video from a single source image, using it as an appearance reference, with motion (i.e., facial expressions and head pose) derived from a driving video, audio, text, or generation. Instead of following mainstream diffusion-based methods, we explore and extend the potential of the implicit-keypoint-b…

Agentless: Demystifying LLM-based Software Engineering Agents 35:54

1M ago

Recent advancements in large language models (LLMs) have significantly advanced the automation of software development tasks, including code synthesis, program repair, and test generation. More recently, researchers and industry practitioners have developed various autonomous LLM agents to perform end-to-end software development tasks. These agents…

Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? 36:47

1M ago

Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases. Leveraging LCLMs' ability to natively ingest and process entire corpora of information offers numerous advantages. It enhances user-friendliness by eliminating the need for speci…

ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code 27:24

1M ago

Despite Large Language Models (LLMs) like GPT-4 achieving impressive results in function-level code generation, they struggle with repository-scale code understanding (e.g., coming up with the right arguments for calling routines), requiring a deeper comprehension of complex file interactions. Also, recently, people have developed LLM agents that a…

Unique3D: High-Quality and Efficient 3D Mesh Generation from a Single Image 22:25

1M ago

In this work, we introduce Unique3D, a novel image-to-3D framework for efficiently generating high-quality 3D meshes from single-view images, featuring state-of-the-art generation fidelity and strong generalizability. Previous methods based on Score Distillation Sampling (SDS) can produce diversified 3D results by distilling 3D knowledge from large…

DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence 37:18

1M ago

We present DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT4-Turbo in code-specific tasks. Specifically, DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 with additional 6 trillion tokens. Through this continued pre-training, DeepSeek-Co…

Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time 38:01

2M ago

The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large pre-trained models, where fin…

RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture 1:06:40

2M ago

There are two common ways in which developers are incorporating proprietary and domain-specific data when building applications of Large Language Models (LLMs): Retrieval-Augmented Generation (RAG) and Fine-Tuning. RAG augments the prompt with the external data, while Fine-Tuning incorporates the additional knowledge into the model itself. However,…

Seven Failure Points When Engineering a Retrieval Augmented Generation System 21:27

2M ago

Software engineers are increasingly adding semantic search capabilities to applications using a strategy known as Retrieval Augmented Generation (RAG). A RAG system involves finding documents that semantically match a query and then passing the documents to a large language model (LLM) such as ChatGPT to extract the right answer. RAG s…

Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning 42:09

2M ago

Language agents perform complex tasks by using tools to execute each step precisely. However, most existing agents are based on proprietary models or designed to target specific tasks, such as mathematics or multi-hop question answering. We introduce Husky, a holistic, open-source language agent that learns to reason over a unified action space to …

Recurrent Context Compression: Efficiently Expanding the Context Window of LLM 38:11

2M ago

To extend the context length of Transformer-based large language models (LLMs) and improve comprehension capabilities, we often face limitations due to computational resources and bounded memory storage capacity. This work introduces a method called Recurrent Context Compression (RCC), designed to efficiently expand the context window length of LLM…

Multi-Head RAG: Solving Multi-Aspect Problems with LLMs 33:39

2M ago

Retrieval Augmented Generation (RAG) enhances the abilities of Large Language Models (LLMs) by enabling the retrieval of documents into the LLM context to provide more accurate and relevant responses. Existing RAG solutions do not focus on queries that may require fetching multiple documents with substantially different contents. Such queries occur…

StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning 40:44

2M ago

Simultaneous speech-to-speech translation (Simul-S2ST, a.k.a. streaming speech translation) outputs target speech while receiving streaming speech inputs, which is critical for real-time communication. Beyond accomplishing translation between speech, Simul-S2ST requires a policy to control the model to generate corresponding target speech at the opp…

VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time 39:55

2M ago

We introduce VASA, a framework for generating lifelike talking faces with appealing visual affective skills (VAS) given a single static image and a speech audio clip. Our premier model, VASA-1, is capable of not only producing lip movements that are exquisitely synchronized with the audio, but also capturing a large spectrum of facial nuances and …

“Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models 54:48

2M ago

The misuse of large language models (LLMs) has drawn significant attention from the general public and LLM vendors. One particular type of adversarial prompt, known as jailbreak prompt, has emerged as the main attack vector to bypass the safeguards and elicit harmful content from LLMs. In this paper, employing our new framework JailbreakHub, we con…

Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models 41:58

2M ago

Large Language Models (LLMs) are often described as being instances of foundation models - that is, models that transfer strongly across various tasks and conditions in a few-shot or zero-shot manner, while exhibiting scaling laws that predict function improvement when increasing the pre-training scale. These claims of excelling in different function…

Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models 33:37

2M ago

We introduce Buffer of Thoughts (BoT), a novel and versatile thought-augmented reasoning approach for enhancing accuracy, efficiency and robustness of large language models (LLMs). Specifically, we propose meta-buffer to store a series of informative high-level thoughts, namely thought-template, distilled from the problem-solving processes across v…

GNN-RAG: Graph Neural Retrieval for Large Language Model Reasoning 33:57

2M ago

Knowledge Graphs (KGs) represent human-crafted factual knowledge in the form of triplets (head, relation, tail), which collectively form a graph. Question Answering over KGs (KGQA) is the task of answering natural questions grounding the reasoning to the information provided by the KG. Large Language Models (LLMs) are the state-of-the-art models fo…

AutoCoder: Enhancing Code Large Language Model with AIEV-Instruct 28:24

2M ago

We introduce AutoCoder, the first Large Language Model to surpass GPT-4 Turbo (April 2024) and GPT-4o in pass@1 on the Human Eval benchmark test (90.9% vs. 90.2%). In addition, AutoCoder offers a more versatile code interpreter compared to GPT-4 Turbo and GPT-4o. Its code interpreter can install external packages instead of limiting to built-in pac…

From Sora What We Can See: A Survey of Text-to-Video Generation 1:27:32

2M ago

With impressive achievements made, artificial intelligence is on the path forward to artificial general intelligence. Sora, developed by OpenAI and capable of minute-level world simulation, can be considered a milestone on this developmental path. However, despite its notable successes, Sora still encounters various obstacles that…

The Future of Large Language Model Pre-training is Federated 34:55

3M ago

Generative pre-trained large language models (LLMs) have demonstrated impressive performance over a wide range of tasks, thanks to the unprecedented amount of data they have been trained on. As established scaling laws indicate, LLMs' future performance improvement depends on the amount of computing and data sources we can leverage for pre-training…

Long-form factuality in large language models 37:52

3M ago

Large language models (LLMs) often generate content that contains factual errors when responding to fact-seeking prompts on open-ended topics. To benchmark a model's long-form factuality in open domains, we first use GPT-4 to generate LongFact, a prompt set comprising thousands of questions spanning 38 topics. We then propose that LLM agents can be…

Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head 42:15

3M ago

End-to-end transformer-based detectors (DETRs) have shown exceptional performance in both closed-set and open-vocabulary object detection (OVD) tasks through the integration of language modalities. However, their demanding computational requirements have hindered their practical application in real-time object detection (OD) scenarios. In this pape…

Retrieval-Augmented Generation for AI-Generated Content: A Survey 1:13:57

3M ago

Advancements in model algorithms, the growth of foundational models, and access to high-quality datasets have propelled the evolution of Artificial Intelligence Generated Content (AIGC). Despite its notable successes, AIGC still faces hurdles such as updating knowledge, handling long-tail data, mitigating data leakage, and managing high training an…

MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning 26:36

3M ago

Low-rank adaptation is a popular parameter-efficient fine-tuning method for large language models. In this paper, we analyze the impact of low-rank updating, as implemented in LoRA. Our findings suggest that the low-rank updating mechanism may limit the ability of LLMs to effectively learn and memorize new knowledge. Inspired by this observation, w…

LightAutoML: AutoML Solution for a Large Financial Services Ecosystem 54:50

3M ago

We present an AutoML system called LightAutoML developed for a large European financial services company and its ecosystem satisfying the set of idiosyncratic requirements that this ecosystem has for AutoML solutions. Our framework was piloted and deployed in numerous applications and performed at the level of the experienced data scientists while …

Efficient Multimodal Large Language Models: A Survey 1:12:40

3M ago

In the past year, Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in tasks such as visual question answering, visual understanding and reasoning. However, the extensive model size and high training and inference costs have hindered the widespread application of MLLMs in academia and industry. Thus, studying efficie…

The Platonic Representation Hypothesis 45:05

3M ago

We argue that representations in AI models, particularly deep networks, are converging. First, we survey many examples of convergence in the literature: over time and across multiple domains, the ways by which different neural networks represent data are becoming more aligned. Next, we demonstrate convergence across data modalities: as vision model…

RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment 33:04

3M ago

Generative foundation models are susceptible to implicit biases that can arise from extensive unsupervised training data. Such biases can produce suboptimal samples, skewed outcomes, and unfairness, with potentially serious consequences. Consequently, aligning these models with human ethics and preferences is an essential step toward ensuring their…

LLMs as Hackers: Autonomous Linux Privilege Escalation Attacks 52:21

3M ago

Penetration testing, an essential component of software security testing, allows organizations to proactively identify and remediate vulnerabilities in their systems, thus bolstering their defense mechanisms against potential cyberattacks. One recent advancement in the realm of penetration testing is the utilization of Large Language Models (LLMs). We ex…

CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval 36:53

3M ago

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages…

A decoder-only foundation model for time-series forecasting 19:41

3M ago

Motivated by recent advances in large language models for Natural Language Processing (NLP), we design a time-series foundation model for forecasting whose out-of-the-box zero-shot performance on a variety of public datasets comes close to the accuracy of state-of-the-art supervised forecasting models for each individual dataset. Our model is based…
