Automated Machine Learning with TransmogrifAI


Would you rather take a year to develop a proprietary algorithm for your company that has an accuracy of 95% or use an open source platform that takes a day to develop an algorithm that has nearly the same accuracy? In most business cases, you’d choose the latter. In this episode, we talk to Till Bergmann, who works on the team that developed TransmogrifAI, an open source project that helps you build models quickly.

Intro: Instead of running one model and running the next model and then comparing it and then running another model and comparing it, it just basically runs multiple models at the same time and chooses the best model automatically, so you as a data scientist, you don’t have to do anything. It just does it for you automatically.

Ginette: I’m Ginette.

Curtis: And I’m Curtis.

Ginette: And you are listening to Data Crunch.

Curtis: A podcast about how applied data science, machine learning, and artificial intelligence are changing the world.

Ginette: A Vault Analytics production.

If having some basic skills in predictive analytics and statistics would help you in your job, and you don’t have a programming or data science background, we just updated our book on how to use simple predictive analytics in a business setting, and it’s available now on Amazon. This is the launch of the book’s second edition, and for the first three months, we’re keeping the price low so you can pick it up for $5. It goes over some basic statistics and predictive models, including regression, ANOVA, and Chi-Square tests, and shows you step by step how to do them in Excel. It warns you of common pitfalls and was written to be accessible and actionable for a business audience. If that’s interesting to you, the book is called “simple predictive analytics: Using Excel to solve business problems,” written by Curtis Seare, and you can head over to Amazon and just search for “simple predictive analytics” to pull it up. You could also go to datacrunchpodcast.com/books and there will be a link there you can follow. Now let’s head to today’s episode.

Curtis: We’re talking to Till Bergmann, who works on Salesforce’s Einstein, which is the group charged with making machine learning and AI a success at the largest CRM company in the world.

Till: I did a PhD in cognitive science, and while I was doing that, I noticed and realized that the data part was the most interesting part to me, and so I finished my PhD and then I joined Salesforce a few years ago, and I’ve been here ever since.

I’m more of an applied data scientist, so I work on the team that has built TransmogrifAI, which is our open source machine learning library that sits on top of Spark machine learning, and that’s the internal tool and now also external tool that we built to power machine learning in Salesforce, so mostly it’s both used by other teams that work in the Einstein space to build their machine learning apps, and then it’s also used by our own team to build our machine learning apps, and at the same time, we developed the whole library also for open source.

Ginette: Salesforce is a huge company that has lots of data. Companies around the world use Salesforce to keep track of all the data on their customers and prospects, among other things. As you can imagine, this data is very valuable to their clients, and Salesforce is working on ways to find value and efficiencies in working with it.

Till: The one thing that we noticed when we were dealing with all the data that we had here at Salesforce is the variety of data. We have customers who are Fortune 500 customers that have massive amounts of data, and we have nonprofits that have small amounts of data, so we really had to build a variety of models that shared some similarities but had to deal with all these details of different data structures, different complexities, and we realized if we just followed traditional pipeline building, we would never really be able to build a model for each customer. You would really have to just pick and choose a few customers, and that’s not what we wanted to do. The features that we added, they really enable us to build models much, much more quickly and iterate quicker.

So to give you an example, the usual machine learning pipeline is you read in some data, you transform the features that you have there to gain more information, and then you build some models and you iterate on the models to find the best model for your data, and usually that takes weeks to go through the pipeline process. With TransmogrifAI, you can really do this in a few hours because it really abstracts a lot of these complexities away. Instead of running one model and running the next model and then comparing it and then running another model and comparing it, it just basically runs multiple models at the same time and chooses the best model automatically.
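To make the “run several models and keep the best one” idea concrete, here is a minimal Python sketch. It is not TransmogrifAI’s API (the library itself is written in Scala on top of Spark). The toy data, the candidate names, and the helpers `fit_majority`, `fit_stump`, and `select_best` are all hypothetical and stand in for real model families such as logistic regression or random forests.

```python
def accuracy(predict, rows):
    """Fraction of (features, label) rows that `predict` gets right."""
    return sum(predict(x) == y for x, y in rows) / len(rows)

def fit_majority(train):
    """Baseline candidate: always predict the most common training label."""
    majority = int(sum(y for _, y in train) * 2 >= len(train))
    return lambda x: majority

def fit_stump(feature):
    """Candidate factory: a one-feature threshold rule tuned on the training rows."""
    def fit(train):
        thresholds = sorted({x[feature] for x, _ in train})
        best_t = max(
            thresholds,
            key=lambda t: accuracy(lambda x: int(x[feature] > t), train),
        )
        return lambda x: int(x[feature] > best_t)
    return fit

def select_best(candidates, train, validation):
    """Fit every candidate, score it on held-out rows, and return the winner."""
    scores = {name: accuracy(fit(train), validation) for name, fit in candidates.items()}
    return max(scores, key=scores.get), scores

# Toy data (made up): the label is 1 whenever more than 5 emails were sent.
rows = [
    ({"emails_sent": e, "account_age": a}, int(e > 5))
    for e in range(11)
    for a in range(11)
]
validation = [row for i, row in enumerate(rows) if i % 3 == 0]
train = [row for i, row in enumerate(rows) if i % 3 != 0]

candidates = {
    "majority": fit_majority,
    "stump_emails_sent": fit_stump("emails_sent"),
    "stump_account_age": fit_stump("account_age"),
}
best_name, scores = select_best(candidates, train, validation)
print(best_name, scores)
```

The data scientist only supplies the list of candidates; the comparison on held-out data and the choice of winner happen in one call.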

Curtis: So does TransmogrifAI choose the features that go into the model as well?

Till: With the data that we deal with, every customer has different features, different data, so we can’t really look into each column of the data and figure out, “What should we do here? What kind of feature transformation should we do there?” So we have a very extensive auto machine learning part that does this automatic feature transformation, and we started by doing the standard things. If it’s a categorical variable, we extract all the values and one-hot encode them so that each value is one new column. Now we’re doing smarter things. If it’s an email address, you can extract the domain name automatically. If it’s text, you can figure out that, well, sometimes the data comes in as text, but really what it is, is categorical too. We can automatically detect that. If there are only three or four values in one column, we can treat it as a categorical variable, rather than doing normal text analysis on it.
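The three transformations mentioned here (one new column per categorical value, pulling the domain out of an email address, and treating low-cardinality text as categorical) can be sketched in a few lines of Python. This is an illustration only, not TransmogrifAI’s implementation. The function names, the `max_categories=4` cutoff (taken loosely from the “three or four values” remark), and the sample columns are assumptions.

```python
def infer_kind(values, max_categories=4):
    """Guess how a raw string column should be treated (simplified heuristic)."""
    present = [v for v in values if v]
    if present and all("@" in v for v in present):
        return "email"
    if len(set(present)) <= max_categories:
        return "categorical"
    return "text"

def one_hot(values):
    """Turn each distinct value into its own 0/1 column."""
    categories = sorted({v for v in values if v})
    return {f"is_{c}": [int(v == c) for v in values] for c in categories}

def email_domains(values):
    """Keep only the part after the @ sign, lower-cased; missing values stay None."""
    return [v.split("@", 1)[1].lower() if v and "@" in v else None for v in values]

# Made-up example columns.
plan = ["free", "pro", "free", "enterprise", "pro"]
emails = ["a@acme.org", "b@Example.com", None]
notes = ["called twice", "left voicemail", "asked for demo",
         "renewal due", "no answer", "wants pricing"]

print(infer_kind(plan), one_hot(plan))
print(infer_kind(emails), email_domains(emails))
print(infer_kind(notes))
```

A real system would need sturdier type detection (for example, validating email syntax rather than looking for an `@`), but the shape of the logic is the same: inspect the column, pick a treatment, emit derived columns.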

In TransmogrifAI, from the beginning, we built this whole concept of model explainability into the system. We talked about how we transform features automatically, and we make sure that the features that we actually create can be traced back to the original feature. Some features that are really great at giving you a good model are really hard to interpret, so we make sure that at the end of the model, we can trace it back to the original feature, the original column of the data, so that whatever person is on the other side who created this data actually understands where this feature is coming from. At the end, you can put this in normal English, basically. You can say something like, “This person hasn’t contacted anyone through email in three months, and that’s why they’re more likely to be at risk.”
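One simple way to picture this tracing is a lookup from every derived feature to the raw column it came from, so that model weights can be rolled up per original column. The Python sketch below assumes a linear model and uses invented feature names and weights. It illustrates the bookkeeping only and does not describe how TransmogrifAI stores feature history.

```python
from collections import defaultdict

# Hypothetical lineage: derived feature -> original column it was built from.
lineage = {
    "plan=free": "plan",
    "plan=pro": "plan",
    "email_domain=acme.org": "email",
    "days_since_last_email": "last_email_date",
}

# Hypothetical weights learned by a linear model on the derived features.
weights = {
    "plan=free": 0.5,
    "plan=pro": -0.25,
    "email_domain=acme.org": 0.125,
    "days_since_last_email": 2.0,
}

def importance_by_source(weights, lineage):
    """Sum absolute weights per original column, largest first."""
    totals = defaultdict(float)
    for feature, weight in weights.items():
        totals[lineage[feature]] += abs(weight)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

ranking = importance_by_source(weights, lineage)
print(ranking)
```

Because the ranking is expressed in terms of columns the customer created, it can be reported back in terms they recognize.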

Ginette: AutoML is an interesting topic because it could save companies a lot of time and money. So is it all it’s cracked up to be, according to Till?

Till: I think that this automatic machine learning really can take you much, much further than a lot of people realize, especially with the problems that a lot of companies have. The problems themselves are not unique. The data is unique, and you can really spend a lot of time focusing on the data and extracting all the things there automatically. If you have your own data science team and your own infrastructure team that backs the data science team and so on, and your whole devops team, the whole data science team can spend a year to build a model just for your company. That model will probably beat the auto machine learning model, but you will have spent a lot of time and a lot of money rather than just having a model that’s probably nearly as good in a few hours. So I would say that it gets really close to a custom-built data science model.

Curtis: What Till is talking about here is mostly for structured customer data.

Till: The data that we deal with is almost exclusively structured data. That means you can imagine it as giant spreadsheets with columns and rows. And the underlying models that we throw at it are very straightforward, like logistic regression, random forest, very traditional models that have been around for a while, but because of the added complexity of the data, there are definitely a lot of novel methods that we’ve developed for TransmogrifAI to deal with all the other issues that come with the problem space.

I think every company, every business has a problem that they can frame as, for example, a binary classification, where they just want to ask a question, and the answer is “yes” or “no.” So in that sense, yes, it’s a little more traditional, it’s a little more common than having some image recognition problem.

Ginette: But even with structured data and the autoML system they’ve put into place, there still are a lot of unique data science situations that come up.

Till: So even though we have this awesome platform, there are still a lot of small things that pop up with every new customer that we have. Every new customer brings their own data, their own complexity in the data, so even though we are in really good shape in the platform, there are still always very interesting data science problems to solve. A lot of what we have to do in terms of data is automatic data cleaning and data prep, so that’s one part that we also worked into TransmogrifAI. Customers really customize the data in a lot of ways, so sometimes they add a new column, and they use it for three months, and then they no longer use it. Then, you know, if you just dump that into your machine learning model, it’ll use that and try to make some predictions out of it. That’s something that we realized: if it’s really only used for three months, is that something that you should actually have as part of your model? Again, we abstracted that all away and put it into TransmogrifAI so we can detect those things automatically without you as the data scientist having to really do this for every single one of the 600 columns that a customer might have.
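A check for the “used for three months, then abandoned” column could look roughly like the Python sketch below. It compares the time span over which a column was filled in with the time span of the whole dataset. The record layout, the `created` field, the 0.5 cutoff, and the function names are assumptions for illustration, and TransmogrifAI’s own checks may work differently.

```python
from datetime import date

def usage_span_fraction(records, column, date_field="created"):
    """Share of the dataset's time range during which `column` was filled in."""
    all_dates = [r[date_field] for r in records]
    filled = [r[date_field] for r in records if r.get(column) not in (None, "")]
    if not filled:
        return 0.0
    total_days = (max(all_dates) - min(all_dates)).days or 1
    return (max(filled) - min(filled)).days / total_days

def short_lived_columns(records, columns, min_fraction=0.5):
    """Columns whose values cover less than `min_fraction` of the time range."""
    return [c for c in columns if usage_span_fraction(records, c) < min_fraction]

# Made-up data: one record per month of 2018; promo_code is only used March to May.
records = [
    {
        "created": date(2018, month, 1),
        "region": "west",
        "promo_code": "SPRING" if month in (3, 4, 5) else None,
    }
    for month in range(1, 13)
]

flagged = short_lived_columns(records, ["region", "promo_code"])
print(flagged)
```

Running the same check over every column is cheap, which is the point: nobody has to eyeball hundreds of columns by hand.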

Ginette: So what are some examples of customers that the Einstein team actually does step in and help?

Till: So one customer that I actually worked with personally, and I was working with them on their model too, they’re a nonprofit called College Forward, and they’re trying to predict whether high school students go to college or not. We built them a model through our product called Einstein Prediction Builder to basically give them some insight into which high school students end up going to college and which ones are probably at risk of maybe not going. For me, it was really cool because it was something where they can now help people that maybe don’t go, for whatever reason, to actually go to college, and figure out what kind of factors are at play with people going to college or not.

The cool thing about the more traditional methods that we use is that you kind of get two for one goals here, so you can kind of predict, so these people are more likely at risk to not go to college, so they can just go and talk to them maybe, you know, intervene a little bit earlier and contact them and just give them a call, and the other thing is the second thing you get with our models is you get some model insights, and that means, we don’t just give you a prediction of saying this person is at 90 percent risk of not going to college, we also give you the reasons behind it, so we can say this person is likely to not go to college because they haven’t checked in by email in the last three months or something, and that really helps them to really understand what is happening and then also builds trust in our models because there’s still some kind of like element of magic involved in machine learning sometimes where the customers that we give the models to, they aren’t data scientists, so if you just give them numbers, they want to know okay why is this prediction the way it is, so we have a heavy focus on that too.
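With a traditional model such as logistic regression, getting both a prediction and its reasons is straightforward, because each feature’s contribution to the score is just its weight times its value. The Python sketch below shows that idea. The weights, bias, and feature names are invented for the example and are not taken from the College Forward model or from Einstein Prediction Builder.

```python
import math

def predict_with_reasons(weights, bias, features, top_n=2):
    """Return the predicted probability and the features that raise it most."""
    contributions = {name: weights[name] * value for name, value in features.items()}
    score = bias + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-score))
    ranked = sorted(contributions, key=contributions.get, reverse=True)
    reasons = [name for name in ranked if contributions[name] > 0][:top_n]
    return probability, reasons

# Hypothetical model of the risk of NOT going to college.
weights = {
    "months_since_last_email": 0.8,
    "attended_workshop": -1.0,
    "missed_deadlines": 0.5,
}
bias = -2.0

student = {"months_since_last_email": 3, "attended_workshop": 0, "missed_deadlines": 1}
probability, reasons = predict_with_reasons(weights, bias, student)
print(round(probability, 2), reasons)
```

The list of reasons is what lets the result be phrased in plain language, for example that the student has not checked in by email for three months.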

Curtis: The Einstein group is trying to design systems to help non-data scientist clients interact with their data in a meaningful way and get the benefit of machine learning, without having to understand all the nuts and bolts that go into it.

Till: One of the things that we also really try very hard to work on, and I think we’ve achieved it to a large degree is that we don’t want data science or machine learning to be something that the user has to really actively interact with. It should really just kind of integrate into the problem, so a lot of these things, we purposely didn’t want to have the customer decide, use this column or that column. We want to basically solve all of this automatically so that the customer at the end just gets the best model that they can get without having themselves to understand data science and really go through it. They can of course tell us not to use certain columns to make sure that you know that if there’s some data in there you shouldn’t be using for whatever reason, whether it’s, it’s regulations, or they know it’s bad data and also you know, these automatic things, that they aren’t using anything that could bias the data, and by that I mean just using columns with gender and race in it.

Ginette: Using the tools they are building, non technical users can interact with the data and models in a way that makes sense to them, and they can have a say in what data gets used in an effort to help remove data they know would be against regulations to use, or that would add unwanted bias to their models. They’re on trend here, as many others are also working on ways to bring data science to the masses, and keep models from having unwanted bias.

Curtis: We’d like to give a thank you to Till Bergmann and Salesforce for being on our show today. And as a reminder, you can head over to datacrunchpodcast.com/books, or search for “simple predictive analytics” on Amazon and pick up a copy for $5 for the next three months.

Links

datacrunchpodcast.com/books

Attributions

Music

“Loopster” Kevin MacLeod (incompetech.com)
Licensed under Creative Commons: By Attribution 3.0 License
http://creativecommons.org/licenses/by/3.0/
