When Song Lyrics and British Lit Meet Tidy Text


When Julia Silge's personal interests meet her professional proficiencies, she discovers new meaning in Jane Austen's literature and gauges the cultural influence of the places mentioned in pop songs. Even more impressive than these finds, though, are the new, efficient ways she and her collaborator, Dave Robinson, have developed for mining text data. Check out their book, Text Mining with R: A Tidy Approach. Below is a partial transcript. For the full interview, listen to the podcast episode by selecting the Play button above, or listen through Apple Podcasts, Google Play, Stitcher, or Overcast.

Transcript

Julia Silge: "One that I worked on that was really fun was about song lyrics. For the last 50 years or so of pop songs, we have all these lyrics, so all this text data, and I wanted to ask the question: what places are mentioned more or less often in these pop songs?"

Ginette: "I'm Ginette."

Curtis: "And I'm Curtis."

Ginette: "And you are listening to Data Crunch."

Curtis: "A podcast about how data and prediction shape our world."

Ginette: "A Vault Analytics production."

Curtis: "Brought to you by data.world, the social network for data people. Discover and share cool data, connect with interesting people, and work together to solve problems faster at data.world. Whether you're already a frequent dataset contributor or totally new to data.world, there are several resources you can use to stay in the loop on the latest features, learn new skills, and get support. Check out docs.data.world for up-to-date API documentation, tutorials on SQL and other query techniques, and much more!"

Ginette: "We hope you're enjoying some vacation time this summer. We just did, and now Data Crunch is back! To hear the latest from us, follow us on Twitter, @datacrunchpod. Today we hear from an exciting guest, someone who is on the cutting edge of data science tool creation, someone exploring and developing new ways to slice and dice difficult data."

Julia: "My name is Julia Silge, and I'm a data scientist at Stack Overflow. My academic background is in physics and astronomy; I've worked in academia, teaching and doing research, I worked at an ed-tech startup, and I've now made the transition into data science."

Ginette: "Stack Overflow, where Julia works, is the largest online community for programmers to learn, share knowledge, and build their careers. It's a great resource when you need to solve a coding problem or develop new skills."

Curtis: "Now, there are basically two main camps in data science: people who program with R, a statistical programming language, and people who program with Python, a high-level, general-purpose language. Both languages have devoted followers, and both do excellent work. Today, we're looking at R, and Julia is a big name in this space, as is her collaborator Dave Robinson."

Julia: "Text is an increasingly important part of our work as people who are involved in data. Text is being generated all the time, at ever faster rates, and this unstructured data is becoming a really important part of what we do. My academic background is not in text or literature or natural language processing or anything like that, but I've always been a reader and always been interested in language, and this collection of circumstances all converged, so Dave and I decided to develop some tools that make text mining something people can do within the idiom of the R programming language. So we've developed a package called tidytext."

Ginette: "Now, this particular tool is based on tidy data principles, which is basically organizing data in a uniform way so it's ready for you to ferret out insights."

Julia: "There's a section of people who use tools that are built for dealing with tidy data principles, …"
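The tidy text workflow Julia describes, reshaping raw text into a one-token-per-row table and then analyzing it with ordinary data-manipulation verbs, can be sketched as follows. This is a minimal illustration, assuming the tidytext and dplyr packages are installed; the two example "lyrics" are invented here, not drawn from the episode's dataset.

```r
library(dplyr)
library(tidytext)

# One row per song, with the full lyric text in a single column.
lyrics <- tibble(
  song = c("Song A", "Song B"),
  text = c("I left my heart in San Francisco",
           "New York, New York, a city that never sleeps")
)

# unnest_tokens() reshapes the text into one token (word) per row,
# lowercased and stripped of punctuation: the "tidy" format.
tidy_lyrics <- lyrics %>%
  unnest_tokens(word, text)

# Once tidy, ordinary dplyr verbs do the analysis, e.g. word counts.
word_counts <- tidy_lyrics %>%
  count(word, sort = TRUE)
```

With the text in this shape, a question like "which places are mentioned in these lyrics" reduces to filtering the `word` column against a list of place names and counting the matches.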
