Roman Cheplyaka public
 
In this episode, Marie Sadler talks about her recent Cell Genomics paper, Multi-layered genetic approaches to identify approved drug targets. Previous studies have found that the drugs that target a gene linked to the disease are more likely to be approved. Yet there are many ways to define what it means for a gene to be linked to the disease. Perhaps …
 
Today on the podcast we have Tomasz Kociumaka and Dominik Kempa, the authors of the preprint Collapsing the Hierarchy of Compressed Data Structures: Suffix Arrays in Optimal Compressed Space. The suffix array is one of the foundational data structures in bioinformatics, serving as an index that allows fast substring searches in a large text. However, i…
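
To make the idea of a suffix array as a substring index concrete, here is a minimal Python sketch. It is not the compressed structure from the preprint: it assumes a naive sort-based construction and a plain binary search, and the function names are made up for illustration.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of `text`, sorted lexicographically.

    Naive construction (sorting whole suffixes); fine for short texts only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return all start positions of `pattern` in `text`, using binary search on `sa`."""
    m = len(pattern)

    # First suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix in sa[start:lo] begins with the pattern.
    return sorted(sa[start:lo])


text = "banana"
sa = build_suffix_array(text)
print(sa)                                 # [5, 3, 1, 0, 4, 2]
print(find_occurrences(text, sa, "ana"))  # [1, 3]
```

Each query needs only a logarithmic number of prefix comparisons, which is what makes the index fast; the array itself, however, stores one integer per text position.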
 
In this episode, David Dylus talks about Read2Tree, a tool that builds alignment matrices and phylogenetic trees from raw sequencing reads. By leveraging the database of orthologous genes called OMA, Read2Tree bypasses traditional, time-consuming steps such as genome assembly, annotation and all-versus-all sequence comparisons. Links: Inference of phylo…
 
This is the third and final episode in the AlphaFold series, originally recorded on February 23, 2022, with Amelie Stein, now an associate professor at the University of Copenhagen. In the episode, Amelie explains what 𝛥𝛥G is, how it informs us whether a particular protein mutation affects its stability, and how AlphaFold 2 helps in this analysis. A n…
 
This is the second episode in the AlphaFold series, originally recorded on February 14, 2022, with Janani Durairaj, a postdoctoral researcher at the University of Basel. Janani talks about how she used shape-mers and topic modelling to discover classes of proteins assembled by AlphaFold 2 that were absent from the Protein Data Bank (PDB). The bioinform…
 
In this episode, originally recorded on February 9, 2022, Roman talks to Pedro Beltrao about AlphaFold, the software developed by DeepMind that predicts a protein’s 3D structure from its amino acid sequence. Pedro is an associate professor at ETH Zurich and the coordinator of the structural biology community assessment of AlphaFold2 applications projec…
 
In this episode, Jacob Schreiber interviews Žiga Avsec about a recently released model, Enformer. Their discussion begins with life differences between academia and industry, specifically about how research is conducted in the two settings. Then, they discuss the Enformer model, how it builds on previous work, and the potential that models like it have…
 
The Bioinformatics Contest is back this year, and we are back to discuss it! This year’s contest winners Maksym Kovalchuk (1st prize) and Matt Holt (2nd prize) talk about how they approach participating in the contest and what strategies have earned them the top scores. Timestamps and links for the individual problems: 00:10:36 Genotype Imputation 00:21:…
 
In this episode, Apostolos Chalkis presents sampling steady states of metabolic networks as an alternative to the widely used flux balance analysis (FBA). We also discuss dingo, a Python package written by Apostolos that employs geometric random walks to sample steady states. You can see dingo in action here. Links: Dingo on GitHub Searching for COVID-1…
 
In this episode, Jacob Schreiber interviews Da-Inn Erika Lee about data and computational methods for making sense of 3D genome structure. They begin their discussion by talking about 3D genome structure at a high level and the challenges in working with such data. Then, they discuss a method recently developed by Erika, named GRiNCH, that mines thi…
 
In this episode, Michael Love joins us to talk about the differential gene expression analysis from bulk RNA-Seq data. We talk about the history of Mike’s own differential expression package, DESeq2, as well as other packages in this space, like edgeR and limma, and the theory they are based upon. Mike also shares his experience of being the author and…
 
In this episode, Lindsay Pino discusses the challenges of making quantitative measurements in the field of proteomics. Specifically, she discusses the difficulties of comparing measurements across different samples, potentially acquired in different labs, as well as a method she has developed recently for calibrating these measurements without the need…
 
In this episode, we learn about B cell maturation and class switching from Hamish King. Hamish recently published a paper on this subject in Science Immunology, where he and his coauthors analyzed gene expression and antibody repertoire data from human tonsils. In the episode Hamish talks about some of the interesting B cell states he uncovered and shar…
 
In this episode, Jacob Schreiber interviews Molly Gasperini about enhancer elements. They begin their discussion by talking about Octant Bio, and then dive into the surprisingly difficult task of defining enhancers and determining the mechanisms that enable them to regulate gene expression. Links: Octant Bio Towards a comprehensive catalogue of valida…
 
Polygenic risk scores (PRS) rely on the genome-wide association studies (GWAS) to predict the phenotype based on the genotype. However, the prediction accuracy suffers when GWAS from one population are used to calculate PRS within a different population, which is a problem because the majority of the GWAS are done on cohorts of European ancestry. In th…
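
As a rough illustration of how a PRS predicts a phenotype from a genotype, the sketch below uses the simplest additive form: a sum of allele dosages weighted by per-variant GWAS effect sizes. This is a generic textbook formulation, not the method discussed in the episode, and the function name and numbers are hypothetical.

```python
def polygenic_risk_score(dosages: list[float], effect_sizes: list[float]) -> float:
    """Additive PRS: sum over variants of (effect size * effect-allele dosage).

    `dosages` holds the number of effect alleles (0, 1 or 2) per variant for one
    individual; `effect_sizes` holds the matching per-variant GWAS estimates.
    """
    if len(dosages) != len(effect_sizes):
        raise ValueError("dosages and effect_sizes must have the same length")
    return sum(beta * g for beta, g in zip(effect_sizes, dosages))


# Hypothetical individual genotyped at three variants.
score = polygenic_risk_score(dosages=[2, 1, 0], effect_sizes=[0.1, 0.3, -0.5])
print(score)
```

Because the weights come from a GWAS, any bias in the effect-size estimates (for example, estimates that transfer poorly between populations) carries straight through to the score.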
 
In this episode, we chat about phylogenetics with Xiang Ji. We start with a general introduction to the field and then go deeper into the likelihood-based methods (maximum likelihood and Bayesian inference). In particular, we talk about the different ways to calculate the likelihood gradient, including a linear-time exact gradient algorithm recently pu…
 
In this episode, Markus Schmidt explains how seeding in read alignment works. We define and compare k-mers, minimizers, MEMs, SMEMs, and maximal spanning seeds. Markus also presents his recent work on computing variable-sized seeds (MEMs, SMEMs, and maximal spanning seeds) from fixed-sized seeds (k-mers and minimizers) and his Modular Aligner. Links: A…
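
For readers new to the two fixed-sized seed types named above, here is a small Python sketch of k-mers and minimizers. It assumes the common window-based definition (in every window of w consecutive k-mers, keep the lexicographically smallest one, leftmost on ties); real aligners typically order k-mers by a hash instead, and the function names are illustrative only.

```python
def kmers(seq: str, k: int) -> list[str]:
    """All substrings of length k, in order of their start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def minimizers(seq: str, k: int, w: int) -> list[tuple[int, str]]:
    """(position, k-mer) pairs selected as the minimum of some window of w k-mers."""
    km = kmers(seq, k)
    selected = set()
    for start in range(len(km) - w + 1):
        # Smallest k-mer in the window; ties are broken by the leftmost position.
        pos = min(range(start, start + w), key=lambda i: (km[i], i))
        selected.add((pos, km[pos]))
    return sorted(selected)


print(kmers("GATTACA", 3))          # ['GAT', 'ATT', 'TTA', 'TAC', 'ACA']
print(minimizers("GATTACA", 3, 3))  # [(1, 'ATT'), (4, 'ACA')]
```

Adjacent windows often agree on their minimum, so minimizers keep only a subset of the k-mers while still guaranteeing one seed per window.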
 
In this episode, Jacob Schreiber interviews Devin Schweppe about the analysis of mass spectrometry data in the field of proteomics. They begin by delving into the different types of mass spectrometry methods, including MS1, MS2, and MS3, and the reasons for using each. They then discuss a recent paper from Devin, Full-Featured, Real-Time Database Sea…
 
In this episode, Will Freyman talks about identity-by-descent (IBD): how it’s used at 23andMe, and how the templated positional Burrows-Wheeler transform can find IBD segments in the presence of genotyping and phasing errors. Links: Fast and robust identity-by-descent inference with the templated positional Burrows-Wheeler transform (William A. Freyma…
 
In this episode, Jacob Schreiber interviews David Kelley about machine learning models that can yield insight into the consequences of mutations on the genome. They begin their discussion by talking about Calico Labs, and then delve into a series of papers that David has written about using models, named Basset and Basenji, that connect genome sequenc…
 
In this episode, Jacob Schreiber interviews Jill Moore about recent research from the ENCODE Project. They begin their discussion with an overview and goals of the ENCODE Project, and then discuss a bundle of papers that were recently published in various Nature journals and the flagship paper, Expanded encyclopaedias of DNA elements in the human and m…
 
In systems biology, Boolean networks are a way to model interactions such as gene regulation or cell signaling. The standard interpretations of Boolean networks are the synchronous, asynchronous, and fully asynchronous semantics. In this episode, Loïc Paulevé explains how the same Boolean networks can be interpreted in a new, “most permissive” way. Loïc…
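
The three standard semantics can be contrasted with a short Python sketch. It assumes the usual definitions: synchronous updates every component at once, fully asynchronous updates exactly one component per step, and asynchronous updates any non-empty subset. The "most permissive" semantics from the episode is not implemented here, and the toy network and function name are invented for illustration.

```python
from itertools import combinations


def successors(state, rules, semantics):
    """Set of states reachable in one step from `state` (excluding `state` itself).

    `state` is a tuple of 0/1 values; `rules[i]` maps a state to the next value
    of component i.
    """
    target = tuple(rule(state) for rule in rules)
    # Components whose value would change if they were updated now.
    changing = [i for i, (old, new) in enumerate(zip(state, target)) if old != new]
    if not changing:
        return set()  # fixed point: no transition to a different state

    if semantics == "synchronous":
        subsets = [tuple(changing)]
    elif semantics == "fully_asynchronous":
        subsets = [(i,) for i in changing]
    elif semantics == "asynchronous":
        subsets = [c for size in range(1, len(changing) + 1)
                   for c in combinations(changing, size)]
    else:
        raise ValueError(f"unknown semantics: {semantics}")

    result = set()
    for subset in subsets:
        nxt = list(state)
        for i in subset:
            nxt[i] = target[i]
        result.add(tuple(nxt))
    return result


# Toy network of two mutually inhibiting components.
rules = [lambda s: 1 - s[1], lambda s: 1 - s[0]]
for semantics in ("synchronous", "fully_asynchronous", "asynchronous"):
    print(semantics, sorted(successors((0, 0), rules, semantics)))
```

From the state (0, 0), the synchronous semantics reaches only (1, 1), while the fully asynchronous one reaches the two fixed points (0, 1) and (1, 0); the choice of semantics changes which behaviours the model predicts.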
 
In this episode, Jacob Schreiber interviews Marinka Zitnik about applications of machine learning to drug development. They begin their discussion with an overview of open research questions in the field, including limiting the search space of high-throughput testing methods, designing drugs entirely from scratch, predicting ways that existing drugs ca…
 
NGLess is a programming language specifically targeted at next generation sequencing (NGS) data processing. In this episode we chat with its main developer, Luis Pedro Coelho, about the benefits of domain-specific languages, pros and cons of Haskell in bioinformatics, reproducibility, and of course NGLess itself. Links: NGLess on GitHub NG-meta-profiler…
 
In this episode, I continue to talk (but mostly listen) to Sergey Koren and Sergey Nurk. If you missed the previous episode, you should probably start there. Otherwise, join us to learn about HiFi reads, the tradeoff between read length and quality, and what tricks HiCanu employs to resolve highly similar repeats. Links: HiCanu: accurate assembly of s…
 
In this episode, Sergey Nurk and Sergey Koren from the NIH share their thoughts on genome assembly. The two Sergeys tell the stories behind their amazing careers as well as behind some of the best known genome assemblers: Celera assembler, Canu, and SPAdes. Links: Canu on GitHub SPAdes on GitHub If you enjoyed this episode, please consider supporting …
 
Porcupine is a molecular tagging system, a way to tag physical objects with pieces of DNA called molecular bits, or molbits for short. These DNA tags then can be rapidly sequenced on an Oxford Nanopore MinION device without any need for library preparation. In this episode, Katie Doroschak explains how Porcupine works: how molbits are designed and prepar…
 
Will Townes proposes a new, simpler way to analyze scRNA-seq data with unique molecular identifiers (UMIs). Observing that such data is not zero-inflated, Will has designed a PCA-like procedure inspired by generalized linear models (GLMs) that, unlike the standard PCA, takes into account statistical properties of the data and avoids spurious correlatio…
 
In this episode, we hear from Amatur Rahman and Karel Břinda, who independently of one another released preprints on the same concept, called simplitigs or spectrum-preserving string sets. Simplitigs offer a way to efficiently store and query large sets of k-mers or, equivalently, large de Bruijn graphs. Links: Simplitigs as an efficient and scalable re…
 
Kris Parag is here to teach us about the mathematical modeling of infectious disease epidemics. We discuss the SIR model, the renewal models, and how insights from information theory can help us predict where an epidemic is going. Links: Optimising Renewal Models for Real-Time Epidemic Prediction and Estimation (KV Parag, CA Donnelly) Adaptive Estimat…
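
As background for the SIR model mentioned above, here is a minimal Python sketch of the classic deterministic version, integrated with a simple forward Euler scheme. The parameter values, step size, and function name are assumptions chosen for illustration; the renewal models discussed in the episode are a different formulation and are not shown.

```python
def simulate_sir(s0, i0, r0, beta, gamma, days, dt=0.1):
    """Integrate the SIR equations with forward Euler steps of size `dt`.

    dS/dt = -beta * S * I / N
    dI/dt =  beta * S * I / N - gamma * I
    dR/dt =  gamma * I

    Returns a list of (S, I, R) tuples, one per time step, including the start.
    """
    n = s0 + i0 + r0
    s, i, r = float(s0), float(i0), float(r0)
    trajectory = [(s, i, r)]
    for _ in range(int(round(days / dt))):
        new_infections = beta * s * i / n * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        trajectory.append((s, i, r))
    return trajectory


# Hypothetical outbreak: 10 initial cases in a population of 1000.
trajectory = simulate_sir(s0=990, i0=10, r0=0, beta=0.3, gamma=0.1, days=160)
peak_infected = max(i for _, i, _ in trajectory)
print(f"peak number of infected: {peak_infected:.0f}")
```

With these values the ratio beta / gamma is 3, so the number of infected individuals first grows and then declines as the susceptible pool is depleted.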
 
Does a given bacterial gene live on a plasmid or the chromosome? What other genes live on the same plasmid? In this episode, we hear from Sergio Arredondo-Alonso and Anita Schürch, whose projects mlplasmids and gplas answer these types of questions. Links: mlplasmids: a user-friendly tool to predict plasmid- and chromosome-derived sequences for singl…
 
In this episode, Benjamin Callahan talks about some of the issues faced by microbiologists when conducting amplicon sequencing and metagenomic studies. The two main themes are: Why one should probably avoid using OTUs (operational taxonomic units) and use exact sequence variants (also called amplicon sequence variants, or ASVs), and how DADA2 manage…
 
In this episode, Luke Anderson-Trocmé talks about his findings from the 1000 Genomes Project. Namely, the early sequenced genomes sometimes contain specific mutational signatures that haven’t been replicated from other sources and can be found via their association with lower base quality scores. Listen to Luke telling the story of how he stumbled upon …
 
In this episode, I talk with Irineo Cabreros about causality. We discuss why causality matters, what does and does not imply causality, and two different mathematical formalizations of causality: potential outcomes and directed acyclic graphs (DAGs). Causal models are usually considered external to and separate from statistical models, whereas Irineo’s …
 
In this episode, we hear from Romain Lopez and Gabriel Misrachi about scVI (Single-cell Variational Inference). scVI is a probabilistic model for single-cell gene expression data that combines a hierarchical Bayesian model with deep neural networks encoding the conditional distributions. scVI scales to over one million cells and can be used for scRNA-seq …
 
Even though the double-stranded DNA has the famous regular helical shape, there are small variations in the geometry of the helix depending on what exact nucleotides it’s made of at that position. In this episode of the bioinformatics chat, Hassan Samee talks about the role the DNA shape plays in recognition of the DNA by DNA-binding proteins, such as t…
 
An αβ T-cell receptor is composed of two highly variable protein chains, the α chain and the β chain. However, based only on bulk DNA or RNA sequencing it is impossible to determine which of the α chain and β chain sequences were paired in the same receptor. In this episode, Kristina Grigaityte talks about her analysis of 200,000 paired αβ sequences, w…
 
Modern genome assembly projects are often based on long reads in an attempt to bridge longer repeats. However, due to the higher error rate of the current long read sequencers, assemblers based on de Bruijn graphs do not work well in this setting, and the approaches that do work are slower. In this episode, Mikhail Kolmogorov from Pavel Pevzner’s lab j…
 
In this episode, we hear from Jacob Schreiber about his algorithm, Avocado. Avocado uses deep tensor factorization to break a three-dimensional tensor of epigenomic data into three orthogonal dimensions corresponding to cell types, assay types, and genomic loci. Avocado can extract a low-dimensional, information-rich latent representation from the weal…
 
The third Bioinformatics Contest took place in February 2019. Alexey Sergushichev, one of the organizers of the contest, and Gennady Korotkevich, the 1st prize winner, join me to discuss this year’s problems. Timestamps and links for the individual problems: Qualification round 00:07:14 Bee Population 00:14:12 Sequencing Errors 00:30:20 Transposable E…
 
Hi-C is a sequencing-based assay that provides information about the 3-dimensional organization of the genome. In this episode, Simeon Carstens explains how he applied the Inferential Structure Determination (ISD) framework to build a 3D model of chromatin and fit that model to Hi-C data using Hamiltonian Monte Carlo and Gibbs sampling. Links: Bayesian…
 
Long read sequencing technologies, such as Oxford Nanopore and PacBio, produce reads from thousands to a million base pairs in length, at the cost of the increased error rate. Trevor Pesout describes how he and his colleagues leverage long reads for simultaneous variant calling/genotyping and phasing. This is possible thanks to a clever use of a hidden …
 
This time you’ll hear from Fabio Cunial on the topic of Markov models and space-efficient data structures. First we recall what a Markov model is and why variable-order Markov models are an improvement over the standard, fixed-order models. Next we discuss the various data structures and indexes that allowed Fabio and his collaborators to represent the…
 
In this episode, HoJoon Lee and Seung Woo Cho explain how to perform a CRISPR experiment and how to analyze its results. HoJoon and Seung Woo developed an algorithm that analyzes sequenced amplicons containing the CRISPR-induced double-strand break site and figures out what exactly happened there (e.g. a deletion, insertion, substitution etc.) Links: C…
 
Relief is a statistical method to perform feature selection. It could be used, for instance, to find genomic loci that correlate with a trait or genes whose expression correlates with a condition. Relief can also be made sensitive to interaction effects (known in genetics as epistasis). In this episode, Trang Lê joins me to talk about Relief and her ver…
 
Kaushik Panda and Keith Slotkin come on the podcast to educate us about repetitive DNA and transposable elements. We talk LINEs, SINEs, LTRs, and even Sleeping Beauty transposons! Kaushik and Keith explain why repeats matter for your whole-genome analysis and answer listeners’ questions. Links: Keith’s paper: The case for not masking away repetitive D…
 
Antoine Limasset joins me to talk about NGS read correction. Antoine and his colleagues built the read correction tool Bcool based on the de Bruijn graph, and it corrects reads far better than any of the current methods like Bloocoo, Musket, and Lighter. We discuss why and when read correction is needed, how Bcool works, and why it performs better but …
 
In this episode, I talk to Fernando Portela, a software engineer and amateur scientist who works on RNA design, the problem of composing an RNA sequence that has a specific secondary structure. We talk about how Fernando and others compete and collaborate in designing RNA molecules in the online game EteRNA and about Fernando’s new RNA design algorithm,…
 
In this episode I’m joined by Chang Xu. Chang is a senior biostatistician at QIAGEN and an author of smCounter2, a low-frequency somatic variant caller. To distinguish rare somatic mutations from sequencing errors, smCounter2 relies on unique molecular identifiers, or UMIs, which help identify multiple reads resulting from the same physical DNA fragmen…
 
Linear mixed models are used to analyze GWAS data and detect QTLs. Andrey Ziyatdinov recently released an R package, lme4qtl, that can be used to formulate and fit these models. In this episode, Andrey and I discuss linear mixed models, genome-wide association studies, and strengths and weaknesses of lme4qtl. Links: Paper: lme4qtl: linear mixed models…
 