Transcript
>> Okay, I think we're ready to go, you ready?
>> I am.
>> So I'll just count us down and we'll jump into it. Three...
Hey, welcome everybody. Jeff Frick here, coming to you from the home office here in Palo Alto, excited for another episode of Turn the Lens. And we've got a really interesting character coming to us from all the way on the other side of the world. He's been involved in the data science space from both an educational and an academic point of view, in the life sciences. And now he's got a new role. So we're happy to welcome, from across the pond, through the magic of the internet and Zoom, Hugo Bowne-Anderson, the Head of Data Science Evangelism and Marketing for Coiled. Hugo, great to see you.
>> Jeff, thank you so much for having me on the show. It's great to be here.
>> Absolutely. So just a quick check-in, how are you getting by during these crazy times? I think the end is starting to come into sight, but how are you getting by?
>> I'm doing pretty well, yeah. We've not done so well on the vaccine rollout here, but it's getting a lot better. Currently back in lockdown, but high hopes for the future as well and staying optimistic.
>> Right, great. Well, let's jump into it. So for the people that didn't do the research before the interview, give them a little breakdown of what Coiled is all about. Give them the Coiled 101.
>> So at Coiled, we're building products to help data scientists and organizations burst to the Cloud. Essentially, we were born from the open source software ecosystem in Pythonic data science. We understand the needs of the enterprise, which involves deploying and productionizing a lot of large models on large datasets. There are a lot of challenges associated with this, and we want to abstract those challenges away so that data scientists and organizations can get back to the work that they do best. And the product we're building is around the Pythonic data science space, and a package called Dask in particular.
>> Okay, so for people that aren't familiar with kind of the progression between, kind of what is Python, what is Dask, what is Coiled and how do those pieces all fit together? Give us a quick kind of overview.
>> So, Python is a programming language that has grown very rapidly over the past five to 10 years, in particular for both data science and for web frameworks, but we're focusing on the data science aspect here. So what we have now is hundreds of thousands, if not millions, of people using Python for data science work, and they're prototyping all of their stuff on their laptops, essentially. And the big disconnect is between getting stuff prototyped on your laptop and getting it into production, getting applications running on the Cloud or on big clusters on-prem. So the whole point is that once you want to scale your deployed data analysis, data science, Machine Learning, AI, once you want to work with bigger data, bigger models, more cores, this involves a lot more headache in terms of bursting to the Cloud, getting all your Docker containers set up there, or your Kubernetes, that type of stuff.
And that's the point where Coiled comes in. I should step back a bit and say, Dask comes in at the point where you want to scale your workflow to bigger data and bigger models. And then when you want to do it on the Cloud at scale, that's when Coiled comes in.
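To make that progression concrete, here is a minimal sketch of what scaling a pandas-style workflow with Dask looks like on a laptop; the file path and column names are hypothetical, purely for illustration.

    import dask.dataframe as dd
    from dask.distributed import Client, LocalCluster

    # Start a local cluster; the same code later runs unchanged against
    # a remote cluster, for example one provisioned through Coiled.
    cluster = LocalCluster()
    client = Client(cluster)

    # Hypothetical dataset path and columns, for illustration only.
    df = dd.read_csv("data/events-*.csv")            # lazily treats many files as one DataFrame
    result = df.groupby("user_id")["value"].mean()   # familiar pandas-style API, built lazily
    print(result.compute())                          # executes in parallel across cores/workers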
>> Okay, so let me make sure I understand. So you've got Python, which is kind of the fundamental open source software package for doing data science. Then you've got...
>> It's a particular subset of Python. It's really the PyData stack, so your NumPy, your (indistinct), scikit-learn for Machine Learning, exactly.
>> Okay, thank you for that clarification. And then Dask is really designed to parallelize Python for bigger jobs.
>> Absolutely.
>> Okay, and then Coiled is then to basically help enterprises implement Dask, is that accurate?
>> Exactly, enterprises and organizations, more generally.
>> Organizations, now it's interesting, right? Because before you came into the business world, you were actually in the academic world, and you were doing research, I think in molecular biology.
>> Yes, cell biology, biophysics.
>> Yeah, so from a data scientist's perspective, you talk about the problems that you guys are fixing. Now, if you can step back into your role as a data scientist, talk about what these things were and how they manifested in your kind of day-to-day job. Were you running into these types of restrictions?
>> I was running into them, and everyone I worked with was running into them. And that's kind of why I'm so excited about the work we're doing now. So to go back to that story, I was working in cell biology and biophysics, first at the big Max Planck Institute for cell biology and genetics in Dresden. Then I moved to the east coast to work on similar questions at Yale University in New Haven, Connecticut. And in all honesty, working with cell biologists, I was seeing that there were all these tools they could use, but they didn't know how to use them, how to access them, how to choose them. As we know, the attention economy has created a content landscape in which it's virtually impossible to cut through the noise and get a clear signal on where to find the tools you need. And I saw this happen time and time again. I saw grad students leave their programs.
I saw tenured professors not be able to publish papers because of the timelines involved and the lack of access to tooling. So what I thought would be a great project was to figure out how to get the right tools to the right people. And so I started running in-person workshops at both these institutions. Around the same time, I met a few entrepreneurs who had just started a startup called DataCamp. They were building online data science education, particularly in the R programming language, and they were looking for someone to build out the Python curriculum, as well as to do a lot of internal data science and product management and that type of stuff. That seemed like an incredible opportunity to reach a lot of people, particularly with the open source Pythonic data science stack. So I jumped on board, and I was there just over four years.
I was very fortunate to be in the right place at the right time, at a company which enabled the courses I created, along with the ones I created with external instructors, to reach over half a million learners worldwide. And the next step, of course, after these types of B2C offerings, is figuring out how to get this tooling and ecosystem into the hands of the enterprise. And when I started chatting with Matt Rocklin, who created Dask and was thinking about building a company that turned out to be Coiled, it was an opportunity that was too good not to take up, particularly as it happened when COVID really started hitting last March and April.
>> Right, well, I want to dig into that a little bit. You talked about the frustration of the data scientists, and people even leaving their programs and not being able to publish their papers. I mean, what was the fundamental problem? Were the tools just not available? Did they not have access to them? Had they not even been purchased? Were the rights and restrictions too hard, or the process to get onto them? I mean, what was the big problem that was getting in the way, and how are you guys at Coiled helping to make sure that doesn't happen in the future? Because data science is such a huge part of our future. AI is always talked about as the most important thing since the internet in terms of transformational technology, and you are actually talking about the people that write the algorithms at the heart of this stuff. So what was this big problem that was frustrating them so badly that they couldn't use the tools that they needed?
>> The first problem is awareness. So for example, the PSF, the Python Software Foundation, did a survey last year in which around 5% of Python users responded that they use Dask. That's one in 20 Python users, and Python users number in the millions. And people say that's amazing. And my question is, why is it only 5%? If you think maybe half the people are doing data analysis, data sciency type of stuff, then half of them probably need to parallelize their CPU-intensive workflows or work with medium-sized datasets. And the answer is, not everyone knows about Dask still. And similarly with a lot of the tooling out there. So firstly, there's an awareness problem. Then on top of that, there's an education problem.
Even if you have heard about Dask, where do you find the curriculum to actually learn it? They don't teach it at college, right? And a lot of working professionals aren't in college anymore. They're trying to upskill while working full-time jobs as well. And the third challenge is infrastructure, making it as easily usable as possible, which is exactly what we're working on at Coiled. So the awareness and education part I worked on a lot at DataCamp, and the education and infrastructure part is what I'm really interested in at Coiled. And what I mean by that is, running Dask on your own laptop is a really smooth experience. If you want to get up and running on the Cloud, on AWS, GCP or Azure, you've got to set up your account, your credentials, authentication. Then you've got to deal with Kubernetes and Docker, authorization. And then suddenly you're a part-time software engineer or DevOps engineer, and you're not even doing the job that you're paid to do, right? So we abstract all of that away.
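As a rough illustration of what "abstracting that away" can look like in practice, here is a minimal sketch of spinning up a cloud cluster with the coiled package; the exact arguments and defaults depend on your account setup and library version.

    import coiled
    from dask.distributed import Client

    # Provision cloud workers; Coiled handles the VMs, networking,
    # and software environment behind the scenes.
    cluster = coiled.Cluster(n_workers=20)
    client = Client(cluster)   # existing Dask code now runs on those workers

    # ... run the same workload you prototyped locally ...

    cluster.close()            # tear the cluster down so nothing keeps billing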
>> Right, I was going to say, how has the Cloud changed the game? Is the Cloud the big motivation, so people can run really big jobs and get them off their laptop? Is that what you see as really the big catalyst here?
>> Cloud and on-prem HPC as well. So Cloud, I think, is a general term for computation that doesn't happen here locally. That may be one of the common Cloud providers, or you might work for an organization that has huge on-prem clusters. So, exactly right.
>> So one of the big frustrations that comes up, when you're talking about these poor people that can't get their stuff going, is the simple, ugly stuff like provisioning and getting allocated resources and all these things to get a piece of infrastructure to run your big job. And I talked to Matt before, and he talked about kind of a classic use case where someone will test an algorithm on a relatively small data set and get it to where they want it, and now they want to pump in a massive data set and apply that model. A potential problem with the Cloud is governance, right? And we hear all the time of people accidentally leaving a switch open; it's usually not a physical problem, but a user error. So there are big potential cost implications of misuse of Cloud infrastructure. So I presume that from the IT point of view, these are some of the features that they really want to make sure that they've got in place in your product.
>> Jeff, you are speaking my language. So our three major stakeholders in terms of the product, our Coiled Cloud product, are individual contributors, IT, and then team leads. And maybe I can just break it down into each of those very briefly. We've talked a bit about individual contributors, so a lot of people may think that's your number one stakeholder, and in a sense, that's absolutely right. But if IT isn't happy, then it's very unlikely that your product will be adopted by that organization. So as we've discussed for individual contributors, and as you discussed with Matt Rocklin, they'll be prototyping on a laptop, right? Then they either want to deploy or productionize their models and data applications, or they'll want to move to bigger data and bigger models and leverage more cores.
And essentially, in the end, they'll want to burst to the Cloud. And this introduces a lot of headache, the type of things I talked about, which turn them into de facto MLOps engineers or DevOps engineers. Then we have IT, and as you spoke to, when an individual contributor is trying to deploy a data application or put it in production, suddenly we need security, authentication. You mentioned costs, which is one of the biggest concerns, the need for IT to be able to shut down something that's running and draining a huge amount of budget. And for that reason, of course, we've made Coiled enterprise-ready to solve these security concerns, authentication concerns, and of course the huge one, the shutdown concerns, as well. The third stakeholder I spoke to was team leads, who have some similar concerns to IT, from a slightly different perspective; they also require visibility into everything that's happening.
They need to keep an eye on costs. They're also very interested in enabling collaboration. So what's happening as data science teams scale, and as data functions in organizations scale, is that you get a lot of people duplicating work in different teams. Your growth team may be working on a database and creating features that another team is creating as well. That's why we're seeing a lot of feature store action happening at the moment as well, actually. But the whole point is that team leads require visibility into everything that's happening, and for that reason, we provide advanced telemetry for them to be able to do so. So to recap, the three stakeholders are ICs, individual contributors, IT, and team leads.
>> Yeah and I would imagine IT is the most important in terms of you guys getting paid and you guys getting implemented and you guys getting deployed.
>> Anyone can block a deal, Jeff.
>> What's that?
>> Anyone can block a deal, right?
>> Yeah, exactly.
>> IT is key but anyone can block a deal.
>> That's true. Well, let's shift gears a little bit to something more positive than deal blocking. Talk about your role as an evangelist. I think it's one of the coolest titles in tech, to have that opportunity to be an evangelist. So talk a little bit about what you do day to day. What's kind of the role of the evangelism? And you also have the marketing hat as well, so that's pretty good, because a large part of marketing is telling stories and being out talking to the people. So tell us a little bit about your day to day and the role of an evangelist at Coiled.
>> Look, evangelism, and what's also referred to as developer relations or DevRel, I think is one of the most important things happening at the moment. We see the long tail of data infrastructure companies; you can go and look at the data and AI landscapes that Matt Turck has put out and see how crowded the space is becoming. And developer relations and evangelism is a way to give people who are looking for tools higher signal in this really full-on information landscape we have currently, right? So my job is to increase the signal-to-noise ratio in content for people who may be interested in using Coiled. So what does that involve? That involves telling the stories of Dask users and Coiled users. And that's really cool for me because, as you know, I love this space, I love all the stories.
I love the scientists. So being able to do live streams or webinars or write blog posts about all the cool stuff people are using these tools for, whether it's NASA or Harvard Medical School or Capital One or Walmart, right? That part of my job is super exciting. On top of that, correlated with that, is generally advocating for the open source ecosystem, which, as I think your viewers now know, is something that really excites me and gets me out of bed every morning with a smile on my face. The other key part is breaking down the space into digestible chunks, so making sure that if someone is trying to figure out how to repartition a Dask DataFrame, right? And I appreciate a lot of your viewers might not know what that means; well, that's part of the point, right? It's a pretty technical thing to do, but if someone's trying to find that, I want to make sure that that information is available. So thinking about people all through the Pythonic data science stack funnel, so to speak, and that's where my marketing hat comes in. I've started using terms like funnel, Jeff.
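For readers curious what repartitioning a Dask DataFrame actually looks like, here is a minimal sketch; the dataset path and sizes are hypothetical, just for illustration.

    import dask.dataframe as dd

    # Hypothetical Parquet dataset, for illustration only.
    df = dd.read_parquet("s3://example-bucket/events/")

    print(df.npartitions)                         # how many chunks the data is currently split into
    df = df.repartition(npartitions=100)          # rebalance into 100 roughly even partitions
    # or target a partition size instead of a count:
    df = df.repartition(partition_size="100MB")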
>> The funnel is your friend.
>> Conversion rates. But all that aside, breaking down the space into digestible chunks: vertical use cases, individual use cases, different parts of the ecosystem. When do you use Machine Learning, and when do you use statistical inference? What does Anaconda do? What does Coiled do? What do these other companies do? Trying to describe the space when most people can't see the forest for the trees. So part of my job is to describe the forest, which is pretty exciting.
>> So do you focus mainly on the giant Python ecosystem, of which you said a relatively small percentage is even using something like Dask, to go after them and enlighten them on the potential? Or is it people that are kind of outside the Pythonic ecosystem, as you say, that maybe you're trying to evangelize: hey, maybe you should consider this technology stack versus a different technology stack for your AI and ML needs?
>> I think in the end everything's fair game. At the moment, though, we're particularly interested in targeting users and companies who already use Python; they may not be super Dask-sophisticated. Let's say they're using scikit-learn for Machine Learning, which is a really popular and fascinating and incredible open source package for Machine Learning. A lot of them are doing these large-scale ensemble models and hyperparameter sweeps and big grid searches and that type of stuff. And when they're doing that, they come up against compute-intensive issues where their workflow could run for hours. And so part of my job is to let them know that Dask is an option for them to distribute it locally and in the Cloud. So it's really the Pythonic users currently, but in the end, if there are R users, for example, or people using MATLAB, who would find this useful, I'd love to have those conversations at some point.
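As a concrete illustration of that scikit-learn use case, here is a minimal sketch of distributing a grid search across a Dask cluster via the joblib backend; the model, data, and parameter grid are made up for the example.

    import joblib
    from dask.distributed import Client
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    client = Client()  # local cluster here; point it at a remote cluster to scale out

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
    search = GridSearchCV(SVC(), param_grid, cv=5)

    # All of the fits inside the grid search are farmed out to Dask workers.
    with joblib.parallel_backend("dask"):
        search.fit(X, y)

    print(search.best_params_)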
>> Okay, well, so share with us. I know you're passionate about the topic. Share with us a couple of stories of where this technology is being used to make a difference in the world.
>> Yeah, so I'll give two very different examples. One is at Walmart, and I love this example, because supply chain analytics and product forecasting are as old as data science, as far as I'm concerned; in the mid-20th century, Walmart was already doing a lot of exciting stuff there. The other example is from cell biology and microscopy, so we'll get to that in a second. But Dask is used at Walmart for product forecasting of over 500 million store-item combinations over a 52-week horizon. So they use Machine Learning, something called XGBoost in particular, they use GPUs, they use Dask, massive datasets, and RAPIDS, which is from NVIDIA. And what this has done is make their product forecasting 100 times faster. So what used to happen on the order of several weeks can happen in under a day, which allows them to make decisions far more quickly, but also allows them to respond to one of the big things, external events impacting product forecasting, the weather, for example. So if the weather changes, they can redo that in a split second, as opposed to waiting several days to get those results in.
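To give a flavor of what that kind of pipeline can look like in code, here is a minimal sketch of training XGBoost on a Dask cluster; the bucket path, column names, and parameters are hypothetical, and this is not Walmart's actual pipeline.

    import dask.dataframe as dd
    import xgboost as xgb
    from dask.distributed import Client

    client = Client()  # or a client connected to a cloud cluster

    # Hypothetical sales dataset, for illustration only.
    df = dd.read_parquet("s3://example-bucket/sales/")
    X = df.drop(columns=["units_sold"])
    y = df["units_sold"]

    # DaskDMatrix keeps the data distributed across the workers.
    dtrain = xgb.dask.DaskDMatrix(client, X, y)
    output = xgb.dask.train(
        client,
        {"objective": "reg:squarederror", "tree_method": "hist"},
        dtrain,
        num_boost_round=100,
    )
    booster = output["booster"]  # trained model; training history is in output["history"]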
>> Interesting, and then what was the other one? The other one was more scientific, I believe.
>> Yeah, yeah, so this is really exciting for me. It's a tool called napari, and the people who use it don't even need to know anything about Dask, which is awesome; it even abstracts over Dask. So the challenge is that you have cell biologists who need to do interactive exploration of high-resolution cellular microscopy images, which can be hundreds of gigabytes, right? So they need to do all these transformations and pre-processing, looking at the data from different angles, that type of stuff. The datasets are so big that previously, if they wanted to do one of these transformations or pre-processing steps, and when I was at the Max Planck Institute I saw people do this, they'd run all their pre-processing, then come back the next day and look at it, and then do a bit more pre-processing or another experiment and come back the next day and look at it.
Now, with Dask, all these operations are massively parallelized in the backend. So in napari they have sliders; they can just move a slider and do this in real time, which essentially means they stay in a flow state with their work, and they don't need to wait days or break up their work into those chunks. So the gains in productivity there for them are exceptional. And I mean, napari is used all over the place now. They use it at Harvard Medical School on the East Coast, as well as the Chan Zuckerberg Initiative on the West, and a whole variety of places in between as well. So that's an example that's close to my heart and impacting basic research today.
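Here is a minimal sketch of the pattern Hugo describes: viewing a larger-than-memory microscopy volume in napari backed by a Dask array. The Zarr path and contrast limits are hypothetical.

    import dask.array as da
    import napari

    # Hypothetical Zarr store of a large time-lapse volume, e.g. shaped (t, z, y, x).
    stack = da.from_zarr("data/embryo_timelapse.zarr")

    # napari accepts Dask arrays directly; chunks are only computed lazily as you
    # move the sliders, so the full volume never has to fit in memory.
    viewer = napari.view_image(stack, contrast_limits=[0, 2000])
    napari.run()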
>> Right. I wonder if you could drill a little bit into this role of open source with a commercial entity, because open source is an unparalleled driver of innovation. We know that across all the different open source projects, things can just run. But as you mentioned, open source is free like a puppy; you still have training and integration and all kinds of other stuff on top of pure open source, so there's a really great opportunity for a company like Coiled. But you're still very active in the open source community, there's still a ton of innovation in the open source community, and you've got this giant library, or excuse me, population of libraries that you can pull from. So I know you're passionate about kind of the magic that can happen with the combination of open source and a commercial entity. I wonder if you can give a little bit more color there?
>> Yeah, absolutely, there are several ways to slice this. And the first is the fact that I feel, on average, and this is a massive generalization, but on average, open source is incredible at meeting the needs of individual users, but not equipped to meet the needs of enterprises. So you can imagine, if you think of an organization or a company as a graph, it meets the needs of the nodes, but not necessarily the edges, such as communication between ICs and IT and team leads, as we've discussed. To drill a bit deeper into that, one of the reasons I think that's the case, and this is one of the reasons Python has been so impactful, is that the open source packages in Python data science weren't, for the most part, built by software engineers; they were built by research scientists and users who needed them for particular tasks.
And so, Matthew Rocklin, his background is in physics. Brian Granger and Fernando Perez, who built IPython and Project Jupyter, Jupyter notebooks, they're academic basic-science researchers, among other things. Wes McKinney was working on his own problems when he came up with pandas, right? Travis Oliphant, similarly, for NumPy. So the point of that slight detour was to demonstrate that this is the reason open source is so good at meeting the needs of individuals, but we don't necessarily have large-scale organizational adoption, because it won't meet all of those needs that we've discussed.
>> Right, it's funny you say that. I saw a TED talk with Linus Torvalds, of Linux fame, and he specifically said just what you said. He said, I've never built anything that I didn't just want to build for myself. And then he was fascinated when everybody else happened to have the same need and invested time and energy in the same problem.
>> That is so key, and that raises another point that I think is very key in this conversation, which is the bus factor of these packages: the number of people you would not put on the same bus, in case it went down. There are maybe, I can't remember what the number is, but on the order of 10 people who have contributed so much to the PyData stack, right? And if an enterprise requires certain things from that community, how do they actually get them? And it's absolutely unclear whether the open source community should be responsive to enterprises like that. So that's why I think one solution is the evolution of companies such as Coiled to be the connective tissue between open source and the enterprise. And that's why it's key that such companies, as we are, be born from open source, with people who are very much a part of the open source community.
>> Great, right, very exciting. So you guys recently put some more money in the bank, which is great, so you've got a little bit more powder to work with. So as you look forward, I can't believe we're already halfway through 2021, which is ridiculous, but what are some of your goals for the balance of the year? You've got some extra funding, and you're at kind of another step function in your guys' process. What are you looking forward to? What are your priorities? What are you working on?
>> So one of the things I'm most excited about is that Coiled has doubled down on the way it's contributing back to the open source community, which speaks to this kind of tension between open source and enterprise that we've discussed. So we're hiring a bunch of people who are pretty much full-time working on Dask and related technologies, which is incredible, and a lot more engineers to work on the product as well. What I'm really excited about is getting out there as we enter the post-COVID world and actually meeting a lot of users and speaking with them, and developing content and material for them and by them as well. And once again, telling more of those stories.
>> Well, Hugo, your passion comes through. Your excitement for the space and for the technology, and more importantly, for the solutions it can deliver, really shines through. So congratulations to you and the team for getting through 2020. You started the company right at the beginning of 2020, right? It's crazy.
>> Exactly.
>> That probably wasn't part of the plan, but now you've got some money, so I think you guys are in a very, very good position. As I said, everyone, John Chambers, the list goes on and on, says AI is the most important thing that's ever happened since the internet, and it's going to be used all over the place in so many applications. But as you said, somebody's got to write the algorithm. It just doesn't happen by accident or by magic. Someone's got to sit down and actually do the work.
>> Yeah, and we want to help the people who have to write the algorithm write it, so they don't have to deal with all the other stuff.
>> I love it, I love it, and test it and this and that. Yeah, all right. Well, Hugo, thanks for taking a few minutes out of your busy day, and as always, great to catch up and great to see you.
>> Appreciate it, thanks so much for having me on. It's always great to chat.
>> All right, me too and stay safe and be well. All right, he's Hugo, I'm Jeff. You're watching Turn the Lens with Jeff Frick. Thanks for watching, we'll see you next time.