Data Science, Open Source Software, and Their Communities

About

In this episode, I was fortunate enough to speak with Peter Wang, the CEO of Anaconda. Anaconda is a distribution for the Python and R programming languages for scientific computing. It’s recognized as the heart of the open source data science community.

Peter was an incredibly engaging guest who provided really unique and insightful answers. We managed to cover a lot in this episode, including:

– The challenges Anaconda has faced using OSS as part of their business model

– The importance of giving back to OSS communities you rely on

– Two common misconceptions the business world has about software

– The maturity of adoption of data science versus open source software

– Whether data science is a fad or a phase

Transcript

hello and welcome to open source for business brought to you by open teams i’m henry badgery and i’m the growth
marketer here at open teams in this episode of open source for business i was fortunate enough to speak
with peter wang who is the ceo of anaconda anaconda is a distribution for the
python and r programming languages for scientific computing and it’s known as the heart of the data science open
source community in the world peter was an incredibly engaging guest who gave really really unique and
insightful answers and we managed to cover a lot in this episode including the challenges anaconda has faced using
open source software as part of their business model the importance of giving back to
communities that you rely on two common misconceptions that the business world has about software
the maturity of adoption of data science versus open source software and finally
whether data science is a fad or a phase this podcast is sponsored by
open teams the first market network where users of open source software can find vet and contract with service
providers now that the introductions are out of the way let’s cue the music
[Music] [Applause]
i thought we could start by going through a bit about your background and how anaconda got started
yeah sure um so my background is uh i guess my educational background is
formally in the field of physics um i studied uh physics and sort of quantum physics at cornell
and then shortly well just after graduating from college i decided to go into the software industry and i
went through a couple startups and small companies and as i started doing consulting in uh
using python and the scientific and numerical python tools i started realizing that there was a
potential for much greater impact for those tools and so around 2012 i started the
company what was called continuum analytics when we founded it but i started that company with uh travis oliphant who’s the creator of
numpy and scipy um and um and uh travis and i you know really i think
shared a vision about the potential impact of python for data analysis in general
um and so um that’s really i guess the journey in a nutshell we’ve done many things since then but
but really it was the um i think a period of time in the mid to late 2000s
when i really saw that something um there was much greater potential for python and that’s why we
started the company to really put python on the map okay and for people that don’t know travis oliphant is the ceo and
co-founder of open teams the company behind this podcast and there you go yeah so you go a long way back
and i’ve seen you two interact at um at conferences i know the pydata conference in austin is quite funny
uh travis was giving a talk and then peter was at the back kind of asking these fantastic questions and i could just tell they
were good friends from a long time um i also listened to a few of the videos on youtube and found that it was kind of a bit of a
thing that you two do you both see each other’s talks and and spark debates
yeah travis and i um it’s really quite a blessing um he and i are different people in many
ways but we share many many of the same perspectives um to have two people who are quite
different go through such different life journeys but then have so many points of commonality in terms of the technical
perspective perspectives on the world and and business and things like that um it’s really uh truly a
um i consider myself greatly uh privileged and and very humbled to have this friendship all these years
yes and can you talk about the evolution from continuum analytics to what is now anaconda
well yeah that there was the evolution of the company itself i see it as one company of course we renamed ourselves
to anaconda i can tell you a bit about why we did that um and and the reason was because
the product that we created uh very early on we created the anaconda distribution three months
into the um uh company into into continuum analytics it was march
of 2012. and so we created this product uh we created this i mean as a tool it’s a distribution whatever what have you
but the idea was to bundle all the different dependencies you needed in order to run uh really get started with python for
data analysis um and uh and really python for big data as well so it wasn’t just
python for data it was rather big data so that’s why we call it anaconda and um we really hit a nerve with that
like a lot of people really were just like oh well this is this is fantastic because now i don’t have to figure out all of these different
dependencies and packages and how to install on these different platforms you know these guys have done the hard
work and we can just use it so that took off like gangbusters and by about 20
you know 2017 time frame we’re like you know what um we got tired of going to conferences
and saying hi we’re continuum analytics and people are like hey uh okay that’s great and then we say oh
we’re the ones that make anaconda and then everyone’s like oh we use that yeah we love it and that’s the
once it happens to you like a few thousand times you’re like maybe the world’s telling me something maybe we should change it maybe we can
change it yeah um many many companies have done that you know so we’re we’re just um one of a long list of
good companies that have done that so and how many users does anaconda have today
you know it’s really hard to put a precise number on it but to give you some sense of it um every month we have about a million
unique new downloaders of anaconda and miniconda based on looking at ip addresses um every month on a every month we also
when we look at how many people are using our package repositories and how many people are sort of repeat downloaders
that’s about four and a half million uh active monthly active users when you go and you look back 12 months
on a trailing 12-month basis we have over 22 million unique ip addresses that hit us for
package downloads for uh installers and whatnot and it’s been particularly interesting because you know ip address is a fairly
crude measure some ip addresses hide 10,000 users some people bounce from
coffee shop to home to office right it’s a very crude measure but as we do some data analysis on you know
the average number of packages people download what their usage patterns look like what they’ve looked like over covid when
people are less mobile and more working from home the numbers give us some confidence that we have in the tens of millions of users
across the world wow that is incredible and what an exciting journey to be a part of seeing that grow and to what
it is today but what are some of the challenges that you faced using open source as part of anaconda’s business
model yeah it’s really it’s interesting um uh travis and i both are big believers
in the open source community however we’re not like um like there’s some people who are
what i would call open source zealots right who have a very very strict like everything must be free and open
kind of approach you know no proprietary you can’t charge for it there’s people who are in that camp and
travis and i are not in that camp we are i think it’s important to make that distinction about how much we love and support the
community and the growth of the innovation and maintainer community around the use of open source
but the open source software itself is um really it’s an artifact it’s a means to an end it’s something that that
community produces so the business model that we have at anaconda is to um you know we have a couple of
different things we we sell uh commercial products to businesses who have users of our open source
um we create a lot of innovation we’ve incubated technologies like numba
uh the compiler bokeh the visualization library um and uh and dask distributed computing
and there’s many others besides that we’ve funded over the years but um we don’t charge for those
libraries right we make we do we do the innovation work we work with the community and we continue to
try to shepherd the open and free development of those things what we do charge for is we charge for
the um uh the you know the commercial servers and just recently this year
we um well we i say we but me really i looked around and i realized that we
needed to fundamentally change the economics around this open source community um and uh and so we made the decision to
change the terms of service for our um package repository service to where now um people who use it for
commercial purposes who are um at companies of a certain size you know more than 200 people
uh we ask them to pay uh a small a modest fee like $15 a month and then the price goes down if
you buy in a volume if you you know buy it for your company um but the idea there was that i
realized that the open source community although it was really good at creating early innovation and um uh
you know 10 15 years ago at this point in time the world has changed a little bit and
so people are um the open source community has gotten very
it’s a whole swirl of different things so there are big big companies who use open source to try to capture uh
users and capture devs into proprietary apis right there’s other people who use open source
smaller companies maybe not big companies but they they use open source as a loss leader so that you get hooked on using this open source thing
and then the only place you could possibly then go for the premium features is this one company which they’re the only ones
that maintain that open source right that i call that sole vendor open source um and uh and i realized that all of
these kinds of ways of using open source they are by all you know measures they are truly
they are bona fide open source but there’s something missing in them and the missing thing is that they’re
not actually generatively reinvesting into a community of innovation there’s a
commons of innovation that’s not being invested in and the only way that i could defend
the pydata and the scipy the scientific python community that i so cherish and love is if we started
this process of getting businesses that use these tools to just pay a small amount
but pay in if we get everyone who uses it to pay in we’d have more than enough to fund all the maintenance to fund tremendous
amounts of wonderful innovation um and so it’s really at this point um
you know i think i think that’s that was that’s one of the ways that we make money now and is is that we did change
the terms of service and we coupled that with a dividend program which we can talk about in more detail
but that’s our commitment to giving back to the community so okay and yeah just since we’re on that
topic now i thought it was a great initiative when i recently saw at the end of october you released
anaconda had released uh that they started this anaconda dividend program
um can you talk a little bit about that what what is that yeah so it’s very simple we are simply taking a portion of our
revenues and we are um we’re giving that to uh we’re working
with numfocus to administer those funds but we’re going to give those to open source uh foundations so
uh obviously the numfocus foundation is one of those we may make contributions to psf or to other foundations but um we
we wanted to uh you know as we change the terms of service um we wanted people to understand that
this was done in conjunction with this commitment and this covenant to the community so we made a commitment to
um you know through the end of the year 10 percent of our individual user commercial subscriptions 10 percent of that revenue um not
not profit but 10 percent of revenue would go directly to this dividend program and then next year
we’re going to increase that we’re going to increase the scope of that we’re going to give one percent of all company revenue
and i know some people may think one percent isn’t that much but um it’s uh when you look at companies and
donations uh it’s actually it’s it’s a non-trivial amount my goal is to actually increase
that to be even larger over time but i right now i can make the commitment to one percent um but it’s um yeah it’s it’s you know
it’s a start and i think you know one of the things we see in this world um well one of the things i’ve learned in my 40
years here is that um oftentimes the best way to criticize
is to create right if we can show people an example of what good behavior looks like then we can ask other people to kind of
get get in line so um so that’s that’s one of the the motivations there and and i think the
response to that has been overwhelmingly positive i’m very very encouraged to see that definitely and i think that’s you are
you one of the first or the first in the industry to do something of that scale uh one percent of revenue
i think there’s um in the software industry i don’t know i haven’t actually done a deep dive on that
there are some companies like i said who are like sole vendor uh open source kind of things like if you have then there’s mongodb
right or redis and the redis company or you know redis labs but uh but in the case of a company like
us where we supported a community of software developers to actually go and put a
portion of revenue directly and to give it to the nonprofit to administer i don’t know
um others like that off the top of my head obviously there’s companies like red hat uh which do a lot of open source you
know and the big cloud vendors google microsoft amazon they all they all do a certain amount of open source of course uh of course it’s not one
percent of their revenue let’s be very clear um but uh but yeah we may we may be one
of the first i don’t i don’t know if we are not if you if we are great if we’re not you know that’s fine um
but uh but i do think it’s something more people should do definitely and i think this is gonna
just really drive innovation even open source has seemed to get to the point it’s at today without that kind of
contribution and participation from companies but i can just imagine how big it i can’t
actually imagine how big it’s going to grow if we can get money behind this and maybe help people one of travis’s uh
missions is to help people be able to turn their hobby into a career and so if people would do that if people
could work on open source full time then i think it’s a very exciting future ahead so
one thing i was going to ask is why is it why do you think it’s important to give back to the open source communities that
companies rely on um
well if if you don’t those communities languish and the companies they pay they have to
pay more ultimately for innovation i think that open source
dollars and funding open source innovation is the most singularly most effective use of
capital i can’t think of a more effective return on investment
but the problem is that it’s a commons and so uh even though um the return the value is there it’s
hard for people to attribute okay which dollars paid for which things that produced then which
outcomes right it’s very hard to do that kind of um spreadsheet tracking and therefore
corporate mentalities have a hard time understanding why they should put dollars into it because they can’t close
the loop on how those dollars got spent but when you look at it in bulk and in aggregate and you we have 20 years of history to
look at now it’s very clear that if you give bright people
the space to form communities to work with each other and to try new things and quickly and
rapidly iterate then you get this incredible pace of innovation
um that uh that everyone benefits from so yeah i mean i use the term innovation
commons a lot i don’t know that many other people do but it’s certainly the way i think about it it’s more than just paying a
developer it’s more than just paying for software artifacts it really is about sustaining
a community well the community the um some of the compute infrastructure and some of the things like conferences and
other ways that that helps support the community’s uh vitality but investing in that infrastructure and
investing in those innovation commons um is is just really really important i mean i’m
very pleased to see like chan zuckerberg do so much innovation and um and lots of other foundations sloan and others have
done really great work there so um you know we’re just hoping more corporate people would show up but my experience
has been with large companies even the biggest companies the wealthiest companies on the planet
the way that companies grow and the way they are constructed every dollar in every budget is already
spoken for so for someone to come along and say you know what hey we need to put up one percent of
everyone’s budget we need to take that away and we need to put that into this untrackable untraceable unaccountable
just we just need to sprinkle that like topsoil uh people are going to get very mad right everyone’s going to be like well
you know yada yada like my kpis and this and that roi
you’re like yeah i don’t know you know there’s it’s easy you know it’s really interesting um the convexity of the human of human
cognition if we’re all um hurting ourselves by doing something we can you know usually come to an
understanding and say oh you know what we should stop doing this because this hurts everyone right we’re all polluting in the water and now we can see the water is brown it
tastes yucky that’s easy to see and even then we have a hard time doing the right thing but in a case like the innovation
commons when it’s like you know we should all go and tend this verdant forest that’s yielding uh
unbelievable incredibly rich and luscious fruit for us it’s very hard for people to understand
that so i see myself as the steward um you know i run a little cooperative farm stand on the edges of like a great
amazon jungle and i’m just trying to encourage everyone to kind of give back into that that ecosystem that biome
and one thing we discussed the other day in a pre-call uh was this idea that the industry has a misconception around
where the value comes from in open source most people a lot of companies see that it comes from the source code
but really it’s the community you were saying yeah i mean the source code
is um the source code is the fruit and it’s fairly raw fruit um the community is the tree is the soil
right the water that bears that fruit ultimately and um i uh i gave a talk at pydata
uh berlin um about a year and a half well it was a year ago um and i
i said um that software isn’t just code software is a
relationship and it’s a relationship between the users and the developers of the software
and if you don’t understand that as a user of the software um well if you use like consumer
software it’s sort of like oh i get a code drop i double click on a thing it runs and that’s what it is but if you’re
especially if you’re a business user or if the software is something integral to how you think or how like your airplanes fly or how
your trading systems run you know you really want to understand where does this come from how does this
how is this sustained what happens next with this thing right um so you have to see that the software
is like a river right it’s just this ongoing thing it’s a continuous flow of
change responding to either changes in the underlying needs as well as up you know seeking
upstream innovation it’s a river and an actual piece of software like a code
uh release that’s a scoop of a scoop of water out of that river and of course it’s important you have to
get a scoop of water if you’re gonna drink any of that water you have to kind of scoop it up but you have to understand that it’s
actually this this flow it’s a river right um and you have to ask about what is my relationship with this river
so with open source software it’s this incredibly abundant uh flow of innovation that comes down
very delicious tasting water and so i think the people who benefit from it it would behoove them to do a
little bit of thinking about strategically how do i make sure that this water keeps flowing so
um yeah i do think that unfortunately the business world in general has two great misconceptions around the
world of software right like one i just talked about it views software as an artifact not as
um merely the result of a relationship between the user and the developer and the second misconception
is that developers’ time is somehow fungible and is um
um that it’s proper to treat developer time as simple labor wage labor economics
um just like if they were a blacksmith you know hammering on some iron or if they were a you know lumberjack out
in the woods cutting a cord of firewood um and the thing that’s true that i’ve seen about software innovation
is that all these different people all their minds are different and there is an extremely large
gradient of skill of insight creativity genius artistry
craftsmanship uh among different kinds of developers and so the business world
seeking to impose labor economics on what is ultimately actually kind of an
artisanal craft um is it’s it yields very very low
dividends um it’s incredibly poor if you were to get cios ctos together in a room get them a
little drunk and really talk like no bs about how effective are your dollars spent
in IT especially in the area of software development for like internal business applications i would be shocked if it was more than
15 percent effective i would be shocked if anyone would claim it was more than 50 percent effective because
um ultimately the you know the the industry as it’s
evolved the last 20 years it’s been more about managing risk than about seeking excellence
the open source community however and we see this tremendously in the pydata community
for instance and it’s only one of many but the the um these open source communities when they’re when they’re managed well
and when they’re positive you know engagements there these are craftsmen communities and
almost you could say they’re artisanal guilds that yield that really are able to find promising
new talent they’re able to then hone uh new tools new software new methodologies
approaches that are way better and they’re able to do it with not many people not many people at all not much
dollars invested they’re able to yield much much better outcomes because they can harness that high variance you know the
five sigma sort of positive outcomes and in sort of most corporate managed labor
economics you’re managing to avoid the minus two sigma kind of downsides right so um
anyway that’s that’s a lot of pontificating about that but i think those are two great misconceptions the idea that software is a static artifact as opposed to a
relationship and the idea that software developers are somehow just wage laborers and not
um actually the kinds of craftsmen that they really could be and i really love the analogy from the fruit
down to the soil and even software as a flowing river i thought that was such a great way of describing it so i’m gonna
add that to my analogy bank thank you peter um since you have done such a fantastic
job of growing such a strong community around anaconda and the projects like you said bokeh
numba i thought we could go through um would you be able to give some advice
to the listeners the companies that are trying to grow strong open source communities or even just whoever’s listening and
they want to grow a strong open source community what have you found to be some key learnings that
you’ve taken from your last few years in anaconda oh yeah that’s a great question um
well there’s there’s many things i think number one i’ve always you know i’ve always been pounding my fist on the table about
about community right it really is important to um think about community but
it’s also important to recognize that um you know there’s different modes when a project is first starting out um
you it really requires kind of that that singular vision um and so the most
successful projects have you know a single person or a small very small group usually no more than three or four people that are able to
work very fast together so if you’re just starting out on a project i would say rather let’s say
you’re a single developer and you love this thing you’re doing and you want to get it out there and get more people to use it
and get more people to help you work on it um your focus shouldn’t be on trying to get a ton of people to come
and use it initially your focus should be on trying to make your project’s core
vision distinctive enough cogent enough that you can really find the people who
deeply resonate with it find find your find your your your tribe right your champions
right so so make it you know really focus on the communication of the project’s vision what it is
and you know oftentimes it’s just as important to articulate what it isn’t and find the people who love it love
that vision can help you once you get the champions together find the one or two and they can be
they can come from very unlikely backgrounds you never really know what you’re gonna get and that’s the beauty and the wonder of of open source in an
internet era right you can find incredible developers that could be 12 12 hours offset from you in time zones but
they’ve got for whatever reason exactly the same ideas as you’ve got and they love what you’re doing nourish
like work with those and really water those little sprouts and once you get your first few
going um you know work to build yourselves into a small clan right into a small
tribe and then at some point once you start getting your first few you know first hundred or so serious
users once you start seeing a certain amount of traction uh at that point you should
start thinking about the onboarding process how do you get a few new community members in how do you you know put in the work on your
documentation really start thinking about documentation as a product unto itself
and then realize i would say the other thing i would say to to people starting projects and thinking
about this problem is that realize that if you’re going to be successful it is going to be a journey not only for your project but for yourself
because you’re going to be thrust into a leadership role whether you like it or not and if you’re a natural extrovert like
myself you have some advantages there if you’re not a natural extrovert or if for instance and this happens a lot you
know let’s be honest a lot of the software development community um is in the western world it’s anglicized if you
you know i know many people for whom english is not their uh primary or their first language and they can feel reticent or they can
feel shy about you know um speaking and whatnot so if in any case you feel like you have
challenges being a leader for your project then uh recognize it you know and um and
and you know address it tactically so build uh a small coterie of lieutenants that you
trust right to be your inner circle to be your voice um and uh but but of course if you want to
continue to be the leader of your project maintain the moral authority maintain um the technical vision um
and and learn how to be vulnerable you know i think leading through servant leadership and vulnerability is uh really great um
once you start getting to where you have a dozen developers working on your project you have to start thinking about what are the values of this community
right what are the technical values of the project but what are also the the the human values of this community
that i’m building and these are questions that are really really important because um i i can i can imagine just myself 20
year old self listening to this and being like oh whatever a bunch of fluff but really but really it’s it’s important because these
are your scaling problems you know all all problems ultimately become human problems and if you want your software project to
be successful you’re going to need um to manage the people who help
shepherd it to success um and if you don’t want to do that again you can you just need to recognize that
and say well what i’m going to do is ultimately always going to be a small project that’s my little side project and it’s going to be glommed onto the side
of someone else’s thing or someone else will maybe five years down the road have my idea and take it to great success and
raise millions of dollars and whatnot and i have to understand that right and i have to understand that i’m
okay with that right so a lot of it is just self-awareness in the leadership of the project you know lead or do not lead but there’s no
wavering in the middle you’re gonna make no one happy um i think what else is there i think those
are some of the key things the other thing i would say one one really interesting learning on the pydata ecosystem
is that we have just um i wanna i’m gonna try to avoid waxing too poetical
about this but i think the fact that it started with the scientific computing community was really quite a blessing because the
scientists that we managed to pull into this ecosystem this community the scipy community um
many of them uh are by nature fairly humble people and that humility allowed
them to work on projects to sort of have their own scope and then to also
recognize the importance of working with other people and so there wasn’t you know sometimes the software developers well people in
general i guess you’ll find a certain megalomania like well my project could get extended to do this and if i
just wrote that i could do what your project does we don’t need your project right um but instead what i saw in the scipy
community was that both at the technical level at the api level as well as at the community level between projects and groups
there was generally a camaraderie and there was a sense of like oh these people are doing this cool
thing we should be aware of that how do we make it so that our tool is easier to work with their thing how do we build compatibility
right and and of course there would be competing projects with different visions on things that’s okay um and everyone i think that that
that blessing that we had of having some really egoless leadership early on in the overall community has been
um fundamental in making it scale if it wasn’t for that then we’d have a few projects that want
to take over the world they would fail as all projects that try to take over the world do and we wouldn’t be able to have this
cellular scalable decentralized kind of ecosystem of libraries of tools there
are hundreds and hundreds of libraries in the pydata scipy ecosystem that people rely on
on a daily basis and there are thousands and thousands more in the greater you know um uh ecosystem that play with
those things we can we could not have been this successful if it wasn’t that big of a tent and that tent did not
have that many tent poles so i would say that um that’s another learning i don’t know how applicable
that is to any random open source project but that’s something i would say is relatively unique based on what i’ve seen in the pydata ecosystem
that was a very rich answer and thank you for that i appreciate it
but it was a great answer i have seen and just been amazed by the pydata
ecosystem is there are there other communities around the world that are like that that share similar qualities
say for web development or embedded systems is that any open source community is really strong
and close-knit really really close-knit as my data i haven’t really stumbled i i’m sure
there are i mean i i just hang out in the pi data space um but uh i think there’s a lot of gamer
and mod kind of communities that are also very see i think the thing to look for is
generativity so i think in in generative spaces
you can find these kinds of communities because people have an abundance mentality right um you’ll find you know smart
people let’s say you have two really smart devs right and they can either like butt heads or
they could you know collaborate and kind of each work on their own different vision but in conjunction with each other if
they believe in a finite game if they believe in a scarcity mindset they’re going to butt heads because like hey i want all the pie no i
want all the pie yeah but if you if you both believe the pie is growing very very fast if we work
together we can each get more of it right yeah we each get 1.5 of 3 versus each of us getting
you know point eight of one point six right so um so i think you have to look for
generative spaces and generative communities and i think uh gaming ones and ones where you know
people are modding games just trying to create beautiful new wonderful things there’s less natural inclination to to view people as
being competition with each other maybe so i don’t know that’s an intuition i have about some of that i think
the traditional software development community you know there’s many many of them um but they all get fairly corrupted
with money fairly early on um there’s a lot of money glory glamour
the vc lottery there’s all these dynamics that impose a scarcity mindset
onto people right and um and that that corrupts the tree at the
root i think is yes so is data science and data analysis or open source software in that
space fundamentally different from open source software for developers and infrastructure
well i think at this point we have made it so because the software community that that we built right is fundamental to
this practice area if we hadn’t done this i’m sure it would be a couple of big companies
that were funded you know to go and make a ton of money to create proprietary walled gardens i mean it
would be sort of like that i mean i you know i think i think we’re very lucky that our the tools that we loved and the
community that we we built uh was um well i guess it wasn’t just lucky it was quite intentional right i
really did quite believe i sort of said look we should we should create this and we should push this into
the space so it was an intentional thing um but i’m glad it worked out as well as it has
i think it’s really important for the future of the world that these underlying numerical tools remain open um so um
but i do think that the software what one thing that’s really fundamentally different about the open source software
for data analysis versus software for developers is that um we’re
hopefully you know we’re trying to cater to an audience of people who are not software developers they are more users than they are
developers the software development community it you know the maker community at least in the open
source world they’re trying to serve kind of each other right their audience is other open source devs
um and if they go and produce a really nice usable piece of software that’s that’s great but that can be really hard and um and i
mean this without really much criticism but just my observation is the linux open source community for instance has been
it is the most successful you know by far of any open source community but the linux open source community
actually the gnu linux open source community has been far better at creating server-side software tools
and developer tools and things like that than tools for end-end users right yeah um
because the gulf is simply so large you’re building this thing you you you’re you’re a wizard you’re
chanting all this code into a text editor right and then you compile it and you wrap it all up and you give it to somebody
and you hope it works for them and it doesn’t zap them and that’s that takes that takes a lot of skill and
craft um it’s much more fun to make like cool enchantments for other wizards to use
because they know how to not hurt themselves too much right um but i think the the data science community well
certainly i can just speak for the pydata community i think the r community may be a little different but it’s related but the pydata community um what’s
important about this to understand is that the people who make the software predominantly don’t come from a
background of computer science most of them don’t have professional software development experience certainly the ones who laid the
groundwork uh the fundamental libraries so matplotlib numpy scipy uh jupyter ipython right um
pandas like all of these things are made by people who needed to make something for
themselves and python was just good enough and just powerful enough to where they could
customize it and they kind of got sucked into it and they kind of created a second career for themselves to some extent
but um you know maybe i’m one of those people but but really they come from a place of
great empathy for the end user but very specifically what the end user is trying to do is something also of a
higher order than merely someone clicking around on a webpage to order some food or to order a taxi or something right
i mean someone trying to do some like deep uh numerical analysis of some complex
multi-dimensional data set that is a that’s that’s a different kind of software in fact most software
developers unless they are in this niche of numerical simulation they would have a hard time writing
performant code that’s correct to do that kind of thing i would say 99 percent of your modern sort of full stack web
dev application developers don’t even know how to approach these kinds of things that
we’re building in the numerical world uh not that they can’t be taught it but just saying that in their work history
and in their uh it’s not just like oh just write a for loop here pull a field from a database there slap it in that
text display over there and you’re good to go it’s not that kind of work at all it’s much deeper work that we’re doing in the data
science kind of infrastructure space so no i think it’s definitely a different world it’s very different um
and and but it’s it’s it puts the end user much closer to the process of making the
software okay that makes sense and and where are we at in terms of uh
industry maturity for the data science space relative to the adoption of open source software
um well i think in general um the broad industry at large what i
see is that they’re fairly early still which is shocking to me because i’ve been advocating for open
source since like 1995 and everyone runs linux and everyone’s like doing all these things like they
use so much open source but i think it’s the um the mentality of most it managers
and certainly most corporate people up the chain who see technology as sort of like this
gnarly area to be managed right i think those people think of like i i don’t
think they think about open source in the correct way i don’t think they understand what’s really happening so i think that
that is that is still fairly um early or i’d say it’s maybe it’s
crossed the chasm but it’s still people have have concerns about it it’s certainly not
you know broadly like everyone’s doing it the right way i mean i think that people are still
pretty immature in their adoption of of open source yeah unfortunately yeah
i think it’s starting to change though and definitely we’re seeing a shift not only in the adoption but the attitude in terms with regards to giving back a
lot of companies now they’ve set up open source program offices they’re contributing to the projects they use so i think
what does the next phase look like in your opinion is that going to be driven by enterprises i think that if
enterprises show up in a good way they can have an incredible impact on this um
i think if they show up in the wrong way they could really slow things down for quite a while they provide some real
headwinds for i think people trying to do the genuine incredible things so um to make that more concrete what i
would say is that um uh well well two two things
number one the phenomenon of software itself being a separate thing from hardware is
itself kind of a historical accident now for us the idea of like software is a metaphysical object that exists like
it’s not it’s like how could someone say software could not exist but from a business computing perspective it’s
only been the last you know 35 or so years that we’ve really treated software as a distinct and separate kind
of thing from the underlying hardware storage the overall information system right i
think um as we steer into um the era of broad machine learning and
cybernetics and ai people are going to start viewing these information systems again in a coherent holistic way
and they’re going to see software as merely one part of the overall system which is good i think that’s the
appropriate way to think about it it also then puts challenges on the
software industry because basically ever since larry ellison and bill gates decided that software could be an industry and said
hey you should pay for the software bit um you know a lot of people have just uh i would say axiomatically adopted that
perspective to say of course we pay for software and all the investors say well of course software is this massive part of the
value chain highly scalable 80 percent margins um that era may be drawing to a
close as we build systems information systems that need to be very tailor made
for specific problems they’re trying to solve that depending on the data set and the algorithm um you may have very different
set of software and hardware combination so i think people are going to be very surprised all of a sudden
that it becomes harder and harder to sell enterprise software the way it used to be
and um in fact all the innovation being open source compresses that space even more so we’re
going to see companies then adopting open source to make their own solutions they’re going to build great solutions
and if they just understand that they need to participate in the open source uh human ecology via the open source
program offices and things like that i think those companies could do really well they could really help um you know put more water and and
fertilizer back into the into the soil so but if they continue to view it as an
asset as this if they view this thing as like a a piece of some competitive advantage
you know i see that as well i’ve seen that mentality come from people um like why should we fund this piece of
open source innovation if that means my competitor gets it too it’s like well it means you get it as well though like and if you’re truly a
better company then you should you you would you’re simply compounding your advantages right
um i know it’s just like stupid things like that where it’s like they don’t seem to understand uh i i shouldn’t dismiss it quite so
much as stupid it’s there’s a mentality of like open source as being a very um scarce
good uh around software right it’s a scarce it’s like software is scarce and open source is
is therefore a way to get scarcity but really open source is a way to tap into a far more abundant uh faster pace
of innovation and if your company is willing is ready to adopt that innovation then you can go faster you
can really harness that wind so that’s that’s the way i think about it i think that companies can
i i’m very hopeful about it i really am and i think um a lot of these companies
they’re either accountable to investors or to shareholders and then like we briefly touched on at the beginning of the episode if they can’t show what
the roi is then i can see why they have historically had this mentality but i think it definitely needs to change
um yeah i mean i think look it’s it’s uh enterprise messaging to stakeholders and investors is all
dressed in like 10 layers of corporate bs right so you just say well um you know in order to accelerate our
our digital transformation of the ai era we decided to make these investments into um open source and sorry we made these
investments into technology innovation right and we’re now part of these incredibly innovative
um collaborations with these universities and these other things and research centers and of course they don’t mention the
word open source anywhere in there right but but they can talk about the dollars they’re spending as them being part of
this you know touching innovation and touching the what comes next that’s easy to dress that up
and that’s that’s you know corporate marketing 101 right so yeah and as we get to the end i’ll ask
one one final question and it’s a debate that’s happening at the moment and it’s been happening for a while now
and that debate is whether data science is a fad or a phase so which end of that debate do you
lie on um i think yeah so i’ve definitely heard that you know people some people think that
the the current need for for data scientist programmers will be obviated as we have
easier point-and-click tools um uh or maybe software developers will just learn the appropriate stats and
then we’re good to go um i think that we are currently in a certain
phase of uh what i call data intensive computing and so
data intensive computing is what it’s going to be for the next at least 10 or 15 years
it’s going to be what is i think moving into the future until we get to maybe the singularity if we ever get there
we’ve just had a weird exception for the last 20 years where we could do data
non-intensive computing but now that we’re back in data intensive computing um it’s more and more important to bring
the people who understand the business problem who understand uh the algorithms and the mathematics
and the people who unders and and and the knowledge of the compute and and information systems bringing that
closer and closer together data scientists are people who happen to be able to hold all three of those in their brains at the same time
so it’s very very frictionless interaction as the field matures i think we’ll still have those data
scientists becoming more and more masterful they’ll be able to harness these tools they’ll be able to you know
have better understanding of techniques so there’ll always be kind of an elite class of data scientists but at the same time as we create this
world into or as we as we uh become more and more of uh you know uh everyone becomes doing more
and more data intensive computing it um we’ll see specializations and we will see everyone imbued with a
certain amount of data literacy um so that’s what i see it as i think that data science
in its current form the obsession over single unicorns that know all these three things um that that will that pressure may fade
a little bit um because a lot of people are learning this and they’re upskilling and they’re they’re going to be able to you know
meet that market demand but at the same but also the way we practice this kind of thing we call data science
will start becoming bigger and bigger we’ll start maturing and we’ll start specializing into sub areas we already see it now
you have ml engineers that are different than data engineers who are then you know skilling up into becoming data
scientists you have business analysts and data analysts who are trying to use automl tools to take you know to kind of learn
a bit more about the actual statistical techniques so all of these things are happening in this space i don’t think
that it’s going to suddenly just poof resolve and then we settle back into the world
of 2010 when you have a database admin and a java developer and then a business analyst sitting in front of
a tableau or some you know graphical tool i don’t think we’re gonna go to that i don’t think we’re going back to that world ever again
i think that we’re going to see a greater and greater acceleration of businesses adopting these new data
intensive techniques and we’re going to see a rapid
stratification between the businesses who are making the transition and the businesses who are not so it’s going to be really fascinating
the next 10 years gonna be pretty intense um and uh but that’s my fundamental
thesis that we’re in the era when the first phase of the era of data intensive computing
anyone who makes it by anyone i mean any business that makes it they’re going to see that they need to
infuse all of their knowledge workers with data literacy and so a lot of what we
call data science today maybe data science 101 stuff today will be the minimum data literacy you
need moving forward um and that’s just what it is uh that’s my i’m calling that shot i
mean i believe that’s fundamentally what is going to happen we’ll come back to this in 10 years we’ll come back in ten years and watch
the episode we’ll see you see what happens we’ll we’ll grab a beer and we’ll see what happens but i’m pretty sure that
this will be this my prediction will be will be proven uh true um and two other predictions
while we’re calling it while we’re drinking a beer i will also say that we are now entering a period of rapid data heterogeneity
uh rapid information oh sorry hardware heterogeneity so we’re going to see more and more kinds of
chips storage systems all sorts of weird architectures and all of your software developers who
thought they could just basically write a bunch of java code deploy it to a jvm on some vanilla x86 box running in a
data center they’re going to have to relearn a lot of things if they’re going to have their skills be relevant in this new world
order um the second thing i’m going to that we’re going to see
but ultimately a lot of those new heterogeneous uh compute architectures are really going
to fundamentally be about moving as much computational capability as
possible to data and then having the data storage and the compute mechanism
be as close to the sensor as possible so we’re going to see massive amounts of sensor networking
edge compute all sorts of things happening that we’re going to basically at the end of 10
years be in a world of totally pervasive computing we’ll look back at this era of cloud and everyone put
all their stuff into the data center and we’re just going to crank on the data center we’re going to see that as an incredibly primitive and wasteful
approach but instead we’re going to see a much more pervasive data storage compute and
sensory sensor fabric that is going to be the infrastructure for um for computing um so those are the
things that i would call yes i’ll call those shots we’ll come back maybe we’ll need two or three beers to go through those it’ll take two or
three thank you so much for joining us tonight peter i’ve really enjoyed thank you for having me thank you for having me for the
excellent questions thank you so much peter and for everyone who is listening or watching this video
it really would help if you can leave a review on apple podcasts or leave a comment on youtube letting us know
what you think that really does help and support the podcast so thank you very much everyone and until
next time goodbye