Kedro Open Source Development

About

In this episode of Open Source Directions, we were joined by Yetunde Dada, who talked about the work being done on Kedro. Kedro is an open-source Python framework that applies software engineering best practices to data and machine-learning pipelines. You can use it, for example, to optimize the process of taking a machine learning model into a production environment. You can use Kedro to organize a single-user project running in a local environment, or to collaborate in a team on an enterprise-level project. For the source code, take a look at the Kedro repository on GitHub.

Do you use open source software?

Find trusted open source experts for any project or question today: https://openteams.com/

Transcript

Hello, the internet! Welcome to Open Source Directions, hosted by OpenTeams, the first business-to-business marketplace for open source services. Open Source Directions is the webinar that brings you all of the news about the future of your favorite open source projects. I'm a mathematician; my name is Melissa, and I'm a software engineer based in Brazil, working with open source software and communities at Quansight, and I'll be your host for Open Source Directions today.

Co-hosting with me today is: hi, I'm Madicken, and I'm really excited to be co-hosting this episode of Open Source Directions. I'm a postdoc at the University of Illinois at Urbana-Champaign, I work on an open source project called the yt project, and I'm based in Urbana, Illinois.
I should perhaps go next. Hi everyone, my name is Yetunde and I'm the product manager at QuantumBlack, part of McKinsey and Company. QuantumBlack is an advanced analytics company that was acquired by McKinsey a few years ago, and what we do within my unit is build products for data engineers and data scientists, including one called Kedro that we're going to be talking about. It's a Python library that makes it easy to build production-ready data pipelines. Kedro is also McKinsey and QuantumBlack's first open source product, which is pretty cool; we'll probably get into that. In my past you'll find that I'm an MBA graduate from Oxford, my background includes being a data product manager in the banking industry, and I've also worked a lot in the non-profit space. If we go way back, I'm actually a mechanical engineer by degree and through work. I'm from South Africa, so if any South Africans are online, say hello; I'm based in London now.
Oh, cool. So, well, hello everyone. First of all, it's a pleasure to be here, thank you very much for having us. My name is Laís and I'm a developer advocate for QuantumBlack, so I don't need to introduce QuantumBlack, as Yetunde already did an amazing job of that. My background is in civil and environmental engineering and in customer service. I work with some Python, I'm a Python activist, I volunteer a lot in the Python community, I'm also an undergrad, and I only recently started building a community around Kedro. I'm originally from Brazil, but right now I've been living in Dublin, Ireland.
thank you so much we’re so happy to have uh italy and nais today with us
but first before we start we will go to our famous tweet of the week section where
each of our panelists will present a tweet that they have been enjoying recently your first app sure um so i’m actually
gonna show you two we’ll talk about two um because i spend a lot of time on twitter but
quite a bit is it says friday fun fact in 1974 ramesses the second was sent on
a flight to paris for preservation and maintenance work so you know ramesses second was a famous egyptian um
but since french law requires every person living or dead to fly with a valid passport egypt was forced to issue a passport to
the pharaoh three thousand years after his death and the actual tweet includes like an image of his passport
um of what was used um but the second tweet that made me laugh was that um our product’s name kedrow um we do have
an active community of like uh people that tweet about kendra all the time but it’s also a person’s name um so when i saw a tweet that said
kedra probably um probably gotten only fans um i was i was laughing a little bit about that
naming things is hard so yeah life your first you’re you’re next
Yes, so for me the tweet of the week was this one that I saw; I don't know if everyone is familiar with the new Christopher Nolan movie called Tenet, which is apparently super confusing. The tweet says: if you're confused about recursion, go watch Tenet; you're going to be even more confused, but at least you'll have seen a good movie.
that’s awesome um i’ll share mine uh this is something that made me laugh it’s a video of a guy who was uh playing
the piano and singing about 168 aws services in two minutes
it’s hilarious he’ll just go mention every service he can and then he edited the video to have the
actual logos of each project showing up as he sings them and it’s just great you should check it out
I love it, all of these are so good, and I love the passport photo, by the way, it's amazing. So mine is a link to a profile, but it's really active this month, and it's the Mineral Cup. This is a contest going on on Twitter where people debate over their favorite minerals, and slowly it narrows down to a final match between two minerals. People vote on their favorite mineral in each matchup, and you get to see some amazing science communication, but also some extremely nerdy discussion justifying why somebody might like, you know, uraninite over another cool mineral. I obviously vote for the radioactive ones, but mine are not winning. Anyway, check it out, it's really cool.

That's great, yeah, thanks for that.
And I think we can jump into our main subject, which is Kedro. Kedro is an open source Python framework that applies software engineering best practice to data and machine learning pipelines, and helps optimize the process of taking a machine learning model into a production environment. It has about 3,000 stars on GitHub, and about 70,000 downloads a month across PyPI and conda, which is super awesome, and I'll be happy to hear about it today.
Yeah, so starting off, I'm really curious if one of you could tell me who started it, why it was started, and what need it fills.

Sure. Kedro has been around within QuantumBlack for the last two and a half years; we only open sourced it some time last year, in June. It was originally designed by Aris and other machine learning engineers within QuantumBlack to solve problems they were facing on project work with different clients. You know, our model is that we're part of McKinsey and we deliver advanced analytics solutions to clients, and they ran into some problems with how they chose to collaborate with each other when they were trying to produce production code, or you can think of it as machine learning products that actually function. That early version of Kedro, back then called Carbon AI, was then redesigned and rebuilt as an internal product by a team that included Ivan Danov and others, with Nikos back on the team again. The reason it was rebuilt as an internal product was that quite a few teams found they had the same problems the original designers had. Actually, I'll tell you what our users say.
The need it fills for them is standardized workflows. This comes about because when you're working in the enterprise data science space, you're working in a company that produces code as the final deliverable. If you're not working in a standard way with your teammates, things become hard to keep track of: everyone sets up their project in their own way, and handovers become terrible, because you have to go and find the person who actually wrote the code base forever ago and try to get them to explain what was going on. Then there are also problems around collaboration. We obviously talk about the use of Jupyter notebooks primarily for data science work, but when you're creating a machine learning product, it's hard, for instance, for two people to work on the same Jupyter notebook at a time instead of working in Python scripts. So Kedro solves a lot of issues around this whole thing of standardized workflows, and then makes it easy for us to collaborate while we build data science code that needs to be turned into software.
Laís, do you want to complement that?

Well, that was very well summarized. I just wanted to say that, while that is what we're trying to do, we're not trying to make Jupyter notebooks extinct; we're trying to integrate with them. We're just trying to give everyone a better workflow, to enable communication within teams and between teams as well, and to make sure that everyone works in a streamlined way and produces very good, production-ready code right from the start.

That's awesome. So can you explain the history of the name and the logo for Kedro?

We've had so many names, because naming things is hard and renaming things is hard. The original name of Kedro was actually Carbon AI, then it became Kernel AI; that was all internal, before we open sourced. We even tried to drop the AI so it was just Kernel, but then we couldn't call it Kernel because there are many, many things called kernel, right? The process of renaming Kedro just before open sourcing meant that at one point the team generated 100 names, which had to meet certain criteria before we got down to the top five, and Kedro was the winner. I can say Kedro was almost called Burano at one point, but I'm glad we settled on Kedro in the end.
In terms of how we think of the logo: we have an incredible design unit within QuantumBlack, and they decided to prioritize shapes for how the Kedro logo was built. Kedro was one of a few internal products that have this kind of shape-based, very distinct logo branding, so that's what we went with.

And then the name is a Greek metaphor for a core; it means the centre of the earth. The way we see it, Kedro is the centre of your analytics project, so it ties up very well. And I love that when Yetunde was talking about the logo, our lead designer just showed up in the chat as well. So Gabriel, thank you very much for the work that you put into the design, we love you. He's also the DJ: he's the one that has been making the Kedro playlists.

You have a Kedro playlist? That's amazing.

Edition 2 was launched today; it's the Friday gift for all the releases we've been doing this week.

Okay, this is an amazing practice; I feel like other open source projects should have playlists. I would like to know, for example, what the Jupyter playlist would be. Okay, so can either of you describe to me how you differentiate yourselves from alternative projects out there?
So I think this is actually where we get to the real intent of why Kedro exists. We think about it as standardized workflows, we think about it as collaborative data science. We focus on the problem of how we write data science code that is deployable, and how we work together so that it's high-standard, well-tested code and a proper machine learning product before we deploy it. Everyone gets really excited when they hear that we have a pipeline abstraction in Kedro; they're like, "oh my gosh, it's a pipelining tool", and we're like, not quite, because the other pipelining tools that exist in industry have a different focus. They prioritize orchestration and scheduling: if we know that we have some code we want to run, how do we make sure it runs on Sunday at 5 a.m., and if one database fails, that it will retry until it's successful. So we see tools in that range, like Airflow, Luigi, Dagster and Prefect, which focus on that problem specifically, whereas we focus a full step before that: we know that you need to do some sort of experimentation while writing this production code, so how do you work in a structured way so that you have something deployable when it's time to move? The other tools don't really look at that part; they just assume that you already have something that's deployable.
that’s that’s deployable yeah and i think it’s it’s interesting
to to say as well that like um we we try to
cover everything on your project so we try we go from the exploratory phase on your jupiter
notebooks we have on our template we have a folder with notebooks that all your notebooks go there and
there’s like a little bit of a tweak that you can do and use your cli as well and you can convert that notebook straight into nodes and
they go straight into your your workflow and we have testing them there as well
like we’ve done everything that you can think of that you you need uh when you’re buying you’re
when you’re building your data pipelines uh like it’s it’s from a to z
and then we have a little bit more on the on the back as well because there’s other also casual hooks that you can hook in anything that you need
uh and make your your project even bigger and expand it with more easiness i think
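(As a rough illustration of the hook mechanism mentioned above, here is a minimal sketch. It assumes the `hook_impl` decorator as exposed in roughly the 0.16-era Kedro releases; the class name and log message are made up, and hooks have to be registered in the project's settings or context, which differs between Kedro versions.)

```python
# A minimal sketch of a Kedro hook: run extra behaviour after every node.
# Import path follows 0.16-era Kedro and may differ in newer releases.
import logging

from kedro.framework.hooks import hook_impl


class LoggingHooks:
    """Illustrative hook that logs a message after each node finishes."""

    @hook_impl
    def after_node_run(self, node, outputs):
        # `node` is the Kedro node that just ran; `outputs` maps dataset
        # names to the data it produced.
        logging.getLogger(__name__).info(
            "Node %s finished and produced outputs: %s", node.name, list(outputs)
        )
```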
maybe because there’s one more point about the hooks thing um that i could actually mention um this has
actually come from our users in the open source world they were the first ones to actually tell us this and we’re like oh this is interesting
um but they call this kind of like the django for data scientist for data science data science or the
react for data science so we come into space where there are no frameworks for how data scientists
should collaborate together and create great code um so we’re among the first
in the space and we’re actually fulfilling a lot of needs around standardization of like how those
workflows look so sometimes i think to some users we do appear like crazy because everyone’s like why do i need this thing um but then
they run into so many issues which kedra solves for and then they eventually come around and they’re like actually we get why this thing exists um because
we run in we try to what what what actually this is actually the user journey it’s really cool
you run into issues then you try to build your own framework to try and solve it and then you look around and you realize
oh wait there’s this kind of thing that has everything um already and then they eventually pick up pedro um which is quite cool to see
I just saw someone commenting in the chat about the visualization, so I think it's also worth mentioning that we have Kedro-Viz, our visualization tool. It's driven from the CLI: you can just type kedro viz and see exactly what's going on in your pipelines. We actually have a few users who use that feature to debug their code: they look at how they've actually structured things and go, "okay, wait a second, we're building this big system in our heads; no, I just need to type kedro viz on the CLI and see what's going on with the code I'm writing."
Oh, I can definitely see why you have so many fans; it sounds awesome, and I definitely think there's a need for it. I just want to remind the audience that you can ask questions in the Livestorm app, and we'll try to answer them in the Q&A section at the end, so please type your questions in. So, I think you've discussed this a bit already, but what technology is Kedro built on? Can you describe a bit how that's done?
cool um so we’re actually completely um python like it’s a completely python library
and we actually used to use a makefile within the library template to actually do our cli commands
and everything because we okay let me actually talk about the components of kedrow we talk about a project template um
generated by cookie cutter data science but modified to be like kind of like with the best practice of all the teams that have ever used kedra
before we’ve got 170 client projects that have used kedrow so it’s across industries and we’ve
basically built that feedback in we talk about a data catalog which is um kind of like our series of like data
connectors to connect to any data source local um cloud storage
um hadoop file systems if you’re using a pi spark workflow um and that uses um either the python or
yaml api as well so there is support for polyamol inside of kedro at least for the data catalog
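(For a flavour of the Python side of that data catalog, here is a minimal sketch. The import path for the pandas CSV dataset follows the 0.16-era releases and may differ in newer ones; the dataset name and file path are illustrative, and the equivalent catalog.yml entry would declare the same type and filepath.)

```python
# A minimal sketch of Kedro's DataCatalog used directly from Python.
# Import paths follow 0.16-era Kedro; dataset name and path are illustrative.
from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import CSVDataSet

catalog = DataCatalog(
    {
        # Equivalent to a YAML entry of type pandas.CSVDataSet in catalog.yml
        "example_iris_data": CSVDataSet(filepath="data/01_raw/iris.csv"),
    }
)

iris = catalog.load("example_iris_data")   # loads the CSV as a pandas DataFrame
catalog.save("example_iris_data", iris)    # writes it back through the same connector
```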
From there, there's the pipeline abstraction, which is purely Python, and then Kedro-Viz, our pipeline visualization tool, built purely on top of the pipeline structure; it's a React app. It works quite nicely: it opens a localhost server and you can visualize your pipeline as you work. Those are essentially the components of Kedro. We do support other things: if you're deploying Kedro projects there's a lot of flexibility to do that in many ways, and think of it as part of our business model; we're consultants, we work with whatever the client has, so we obviously have to deploy in many ways. But you can check out the Kedro-Docker plugin, which will package your Kedro project into a Docker container, so you've got your whole project packaged in there, and there are a few other options like that.

I don't really have anything to add on the tech side, because, well, basically QB is a Python house. I learned that on my first day, and that's what we're sticking with.
um that’s perfect thank you uh for describing that i know we’ve talked a little bit about who started kedrow can um can you tell
me a little bit about who maintains the project because you’ve been using a lot of wiis uh like we are doing this right so i’m very
curious about who is maintaining it yeah there is there is no there’s no ways we actually have like a
large team um that actually sit behind kedrow um so i’m product manager um evan donov is the
ketter tech lead um on the kind of like python core side of the library and
some of the primary components there we’ve got richard um with tanner who is the tech lead on the pipeline
visualization tool he’s a front-end engineer um dimitri derriban derrabin who is a software engineer
lauren poland software engineer um and then we’ve got
lim um huang um who’s also an incredible um software engineer in the team um andriy ivanyuk another software
engineer um and we’ve got joe stitchberry who i guess you guys have been raving about our documentation
um you can thank her for that um we’ve got laissez um up as our developer advocate kind of
like as our interface between um kedrosa library and our open source community
with all our users um and then we’ve got meryl thiessen as well who’s another one of our software engineers and she comes in
um as we introduce like kind of like new features when we talk about the roadmap um i’ll be able to go into that one a bit more
um and that’s essentially the kid routine we’re family um and yeah it’s it’s been it’s a an
absolute joy um working with the group yeah like so i’m quite new to the team i
started working keep could be like around three or four months ago and the thing for me that it was was very
obvious to see right from like i think first week was how meticulous about code quality and
specifications everyone is like there is this very high standard of communication very high standard of
documentation internally as well and it’s it’s amazing to be able to be
part of a team like that because everyone is super nice especially with the newbie
here that breaks circle ci um checks that breaks github sometimes
as well yeah but uh yeah it’s been it’s been
really really cool and the team is absolutely amazing i i couldn’t ask work on a better team
well if you haven’t broken the ci you’re not really working right yeah i feel like all of us have broken
the ci at some point you see dimitri one of the the older
software engineers behind pedro told me when i submitted one of the when i merged one of the commits that wasn’t
supposed to be merged at all and i just went there and clicked merge squash
his question merged and i went back to him and i was like i am so sorry he was like don’t do that
again but again if you never broke get help um you’re not a developer like okay
i take that with me for life that’s it so can you describe a bit about uh what
That's it. So, can you describe a bit which communities and users your contributors come from?

Sure, and I can actually talk about our support model as well. In terms of how the Kedro team maintains the repo: every week it's the same person who will be commenting on GitHub issues and pull requests, accepting pull requests and helping users through them, because we have a rotating role in the team called the Kedro wizard, and it's your job to make sure that everyone coming to the repo has a great time and gets their questions answered; and if you're posting on Stack Overflow because you're stuck, we'll be looking at those too. But the really cool thing is that our network has grown beyond us, beyond just the Kedro team doing this. We have superstar users across the web: you'll see Waylon Walker is quite active, blogging and answering people's questions; you'll see Tam, who is also known as DataEngineerOne (if you ever google that on YouTube you'll find an entire stack of amazing introductory videos about Kedro), and he also publishes Kedro plugins and answers user questions. All the way to, I guess, maybe one of my favourite communities, the Kaggle Japan community, where Kedro has been used by the grandmasters of Japan because it solves a problem around reproducible workflows. I actually see Tam is online, oh my goodness, so you can check that one out. It's a growing community of users with different use cases: you'll find Kedro is also used in academia, because there are problems around reproducible data science for published papers, so students pick it up as a viable choice for making that happen; but of course we know it evolved out of industry applications. It's been really, really cool; I think Laís has been the pioneer of this work as we connect with different companies that are picking up Kedro across the globe. I think we've counted over 200 users just in separate companies, but within QuantumBlack and McKinsey we also have hundreds and hundreds of users that we have access to, so there are lots of places to get user feedback from.
Yeah, and there is one project, I can't say favourite because that's too biased, but one of the projects I'm excited about that has been using Kedro, which is Open Source Latin America. It's basically this huge community of coders, data scientists and data engineers who have been working with public data from Argentina and some other countries in South America to find solutions to everyday issues. So they use open data, in a non-profit organization, and they use Kedro to find solutions for those problems, and that's wonderful; it's such a big social impact. It's so awesome to have a user base that is so engaged with their own community, on another continent, and they message us on LinkedIn sometimes, like, "we're doing this and we're so excited", and, "okay, can you help, we would like to talk to you because we need a little bit of help over here", and we're like, yes. Everyone is super engaged and trying to do the best they can all the time, to make sure that we actually get the results we're looking for, because Kedro is going to become the standard for data science and machine learning data pipelines; we're the React of data science, that's it.

I'm loving this, and it's amazing how vibrant both your user and developer communities are; it's really fun to hear this from both of you. So, okay, I'm really curious: is the project participating in any diversity and inclusion efforts, and if so, what are they?
Sure, I can maybe speak about how we do community management and then also how we've participated in sprints, because those would be the clearest ways to see it. This would be from before your time, Laís, but when we open sourced Kedro we were worried about how to present a good open source project to the world; being McKinsey's first open source project, these things are important. So there was one weekend where I think I must have trawled the web looking at what people say is best practice for community engagement, what you should do and what you shouldn't do, and that turned into a kind of communication guideline for how we talk to our users: how we gracefully and kindly turn down pull requests that maybe aren't aligned with the project, and also how we answer questions in the best way. We've had a few trolls as well, but we still deal with them with the proper kindness and respect that everyone is due on the library. So, in terms of trying to create that environment across Kedro, we try to make it welcoming for anyone who has to interact with us at this layer.

The second place would be sprints, which we obviously love participating in. One example would be working with the London Python Sprints group, where one of the organizers, Chuck, came in and said, well, we're going to have a lot of newbies coming who have never contributed to open source before; how will we support them on their journey of learning, firstly, how to contribute to open source, and then how to contribute to the Kedro project? So we had a great time actually teaching people how to create their pull requests for the first time, because we know that typically, in this space, Black women and minoritized ethnic people are very, very underrepresented, so it was super exciting to see that the majority of the first-time contributors there were women and minoritized ethnic people participating in this project. It also means we do things like creating GitHub issues that have just a single typo fix. I mean, it's faster for me to just fix it in the docs myself, but going and writing a nice explanation of exactly what the change is, even when it's just a typo fix, is important to me, because it means there's low-hanging fruit for people to contribute and still make our docs better, whatever that arrangement looks like. So whenever we have done sprint participation since then, and I think Laís can talk about the last sprint that we did, this is something we specifically focus on, so that we can get people sunk into open source and what it means.
powerful actions that we’re trying to do as well that we did in the last trimester and they were going strong again there
is making sure that we we in um we increase diversity and inclusion in
caterer collaboration so we participated on uh the europait on springs so we had
a weekend where they’re sitting the entire weekend uh helping people making their first pr and taking
basically taking them by the hand and showing them exactly what are the steps and
sharing to them and showing how to use schedule and demoing one-to-one um and it was so
awesome to see to interact with them and you had to actually receive there was
one that wrote this defto blog post uh saying that yet when i we were both
he’s his knight on shining armors because we’re just helping him getting
through the battle of making prs it was like it was adorable
um but yeah so and we’re going strong as well again with the sprints we’re participating
with on pi data global now next month with another open spring then we’re part of
um tycoon india as well with more sprints and there is hacktoberfest next month as
well that we also have mentored sprints so we’re trying we’re trying as much as we can to get everyone to try to try pedro and
to get those new those newbies to do their first beard um then if i think i don’t know if i can
but if i could i would like to talk about the uh some initiatives that quantum black also has on inclusion diversity
uh because they also help us um on that so a few weeks ago i think we
had the we had an initiative called uh codefest girls that we organized was
like there was a kind of meetup like all social distance of social distance of course
uh but we had two of our female software engineers going live and talking to them and
talking to girls to young girls and telling them how is the the journey throughout it
and how is to be how is to work with data science how is to work with analytics answering questions
and i think that’s so important as well because there’s so many so many girls that want to get into data science but
since well it’s still a very male-dominated field uh they still feel like they can’t um so
it’s it’s it’s amazing to have to be able to to have the opportunity to be part of that
um and also there’s one more thing this in qb every month we have this uh diversion
and inclusion event that is basically like having discussions about subjects such as
um female presence in night heroes impostor syndrome uh privilege and they actually got
people to come to go on the spot and to tell their journey to talk to talk to us and to tell us exactly how how they got
to where they are today and well especially for me that i’m starting it’s it’s really really good to have that
Well, that all sounds wonderful, and I think it's a real goal for other open source communities to achieve, you know, being welcoming and having newcomers feel safe and respected in the community; I think that's an amazing goal to have.

Yeah, you're doing amazing community building; it feels really robust and so welcoming. I'm curious, actually, I have a follow-up question: you mentioned that you did some research on best practices for how to be kind; is that available in your documentation anywhere, so that other projects could perhaps be inspired by it? I would definitely like to see that as well.

If you have that, I could actually just literally copy and paste the document into the documentation. Or maybe it's a GitHub issue that we've linked, on how we think about community management as a whole, but yeah, I'm happy to share that, because it was essentially a summary of everyone's amazing advice on how to do this, and that was what that document became.

Yeah, sure. I mean, your community really seems very vibrant, and I would love to learn a little bit more about the practices you all use.

Definitely; I would definitely copy that model for NumPy.

Exactly, like, how do I use this in my communities? [Laughter] The best spirit of open source, that's what it all means.

That's it. Yeah, exactly.
right so i guess it’s time for us to go into the project demo uh we’ll get to see some of the cool
features of pedro and how it works uh so i would like to ask is you are you
getting ready so while utility is getting set up uh we
would like to take this opportunity to thank our sponsor quanside for sponsoring this episode of open
source directions one side creating value from data
so whenever you’re ready feel free
i’m gonna be able to share the correct window hold on a second
ah sorry yeah it’s no problem it’s just life as
it is now sharing screens and muting and unmuting yourself it’s just like
there’s so many times i forget to unmute and then i just see myself silently talking to myself that’s constantly constant that’s so
true recording um okay there’s
some weird uh i want a moment
let me actually see if i can start with um kedro this and the meanwhile
uh okay it seems like my sharing commissions are strange on my computer let me see if i
can try to resolve that you guys can just talk over me while i do that quickly
So maybe, Laís, do you want to say something about the project?

So, let's see, I can share something I've been working on for the last few weeks: an advanced tutorial on integrating Kedro and Great Expectations, I don't know if you're familiar with Great Expectations. Tam, who is actually there in the chat, has helped me a lot with this; he built a plugin that integrates the two and makes it super, super easy to use them together, but I didn't want to use the plugin, I wanted to do it by hand, so he's been helping me with that. I would love for anyone who wants to try it out to just send me a message on Twitter, and I would love to share it with you; it's open on my GitHub page, and I would love to have some feedback.

And I think Yetunde is ready now.

I am indeed, we're good to go.
i’m going to um what i’m going to show you now is essentially how you can actually access a demo for kidro
quite easily um with this one you’ll see that there’s a virtual environment activated i we use conda you can use
whatever you want um and we really got like um kedro installed as well
um so i really pip installed kidro so i’m just going to jump straight into this whole concept of creating a project and then
actually walk you through the code base and what that looks like so over here what you see over here is a cli command um that says cadre new
which means it’s a new project um and we’ve got this thing called starter over here um which is our
you can think of them like wordpress templates you know um how when i need to create a new wordpress blog i could choose like whatever template i’d want
to set my blog up with um but in this case kidro actually supports being able to do that with project templates that you’d use for
your analytics code um so you can choose ones with examples you can choose maybe uh set up one for an aws setup if you want
or froze your um whatever flavor you want on it and this one just has a simple example that
uses the iris dataset example in it so i’m going to just press enter here um
and we’re going to get some interactive prompts um which will actually walk us through the whole thing of setting up a project
so um it asks me to enter in a human readable name for my project i’m going to say demo
open source directions actually it’s cool with that
yes we’re a demo now um and i’m gonna i’m gonna actually just
limit this screen view i’m just going to press enter because i’m going to accept the default name
um and then we have a new project created so i’m actually just going to open up
So I'm actually just going to open up this folder over here, and now you get to see the structure of the Kedro project template. We see we have a place for configuration. Configuration in Kedro world means: how do I keep my hard-coded file paths for loading and saving data out of my code base, in a way that is completely changeable? How do I also keep my parameters outside of my code base, making it easier for me to experiment, but also giving me a single place to control my data science experiments? We'll go through configuration in more detail. Data is essentially a place for me to store data, but remember, we don't recommend committing data to Git. You will see that there's a certain folder structure present in the data folder, from raw data all the way to reporting data, and this is just a workflow that was recommended within QuantumBlack for how you think about processing data at different stages. It allows you to work with your teammates quite easily, because you know, at every single stage, okay, let's say for the intermediate layer, that we've only really cleaned up the column types in that layer, so we shouldn't expect to see any other type of data transformation there, which is really good for reproducibility if you need to go back because you've made a mistake. In this case you will see that we've got the iris dataset example embedded in it, because that came with this project starter.
so close that there’s a space for you to include documentation um so if you use uh google doc strings
in your code base we have sphinx integration that will automatically create documentation so that your code
um is is well documented data science code so everyone knows what’s going on um i’ll briefly show you logs which he
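(As an illustration of the docstring style that feeds that Sphinx integration, a node function might be documented roughly like this; the function, its arguments and its behaviour are made up for the example.)

```python
import pandas as pd


def clean_columns(raw: pd.DataFrame) -> pd.DataFrame:
    """Standardize the column names of a raw table.

    Args:
        raw: The raw input table, straight from the data catalog.

    Returns:
        A copy of the table with lower-cased, snake_case column names.
    """
    cleaned = raw.copy()
    cleaned.columns = [c.strip().lower().replace(" ", "_") for c in cleaned.columns]
    return cleaned
```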
I'll briefly show you the logs folder, which essentially uses the Python logging library, so you have a record of what's happening in your Kedro runs. Laís touched on the fact that we do have integration with Jupyter notebooks to support exploration; I guess there are maybe three reasons why you would use Jupyter notebooks in Kedro world. First, exploratory data analysis, because notebooks are really great for that whole initial workflow. We'd also perhaps use them for creating the Python functions that you need before you move them into Python scripts. And the third is that you might use a notebook for reporting and presentation at the end, once you have your workflow. But we believe everything else should be in Python scripts, because there are so many benefits you get from that. And then we've got src, which is essentially your Python source code: if I open up open_source_directions you'll see a Python package there, and pipelines; we'll be walking through the data engineering and data science pipelines here.
So what I'm actually going to do now is go a bit deeper into each of the folders, so you can see what's in each of them, and then we're going to do a kedro run, because kedro run now works out of the box, and you're essentially going to walk through your first Kedro pipeline run with us. You will see a range of supporting hidden files referenced here; for instance, isort is used when we run the kedro lint command, so that you can lint your code, and you'll see things like our cli.py, which is essentially your command line interface, with commands that you can add to the kedro run command.
Let's go into configuration, and into base. You'll see a range of boilerplate which explains exactly what's happening in the data catalog. This is where we talk about being able to specify file paths in configuration. Kedro uses what we call the data catalog, our series of data connectors, and you'll see that we support many different file types here. In this case the iris dataset is a CSV, so we'll just load it using the pandas API and say we're loading it as a pandas CSV dataset. You'll see that we've specified where this data is going to be loaded from, and it's a relative file path to this file over here: the data folder, 01_raw, and the iris file that we're going to be loading.
I mentioned that configuration also has support for parameters, so when we go through the data science experiment you'll see that we've referenced the different parameters used in the setup right here. I'm going to close that. You'll see there's an additional folder called local; that's essentially where you keep your secrets. If you've got any credentials, or any configuration specific to your IDE, you keep it in here, because it's git-ignored, which means that no one else is accessing those things; remember, it's not best practice to have credentials in a project. Laís, to your comment about seeing data in the project, you were shocked: it's just the way this example is set up. In real practice people normally use files in cloud storage, because they're using S3 or Azure Blob Storage, so you'll never actually find any data populated in that folder at all; but the contents of the folder still get ignored, because that stuff shouldn't be committed to Git, ideally. So I'm going to close that. There's the docs folder; I might run through that command if we have enough time.
But the next one is probably the most interesting, the bread and butter of Kedro: it's essentially how we think about constructing a pipeline. All you need to know is how to write Python functions. I'm going to open up the data engineering pipeline, and we're going to have a look at a file called nodes. Here's where I introduce a concept to you in Kedro land: a node is a Python wrapper that has space for an input and an output, and you'll see how that works when we construct the pipeline together. You see in the nodes file that all we do is essentially specify a Python function; that's it, this is all you need to get rolling in Kedro. When I open up the pipeline.py file, you now get introduced to the pipeline abstraction itself. This node takes in that Python function called split_data; its inputs are the example iris data that we referenced in our data catalog, in configuration, plus the parameters we were talking about, in this case a test data ratio for the split; and the outputs are train and test splits for X and Y. That's all we do with this very simple pipeline.
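(Put together, the pipeline being described boils down to something like the following. This is a simplified sketch rather than the exact code of the iris starter: the split logic, the column name and the output dataset names are illustrative, while the input names mirror the catalog entry and parameter mentioned in the demo.)

```python
# A simplified sketch of Kedro's node and pipeline abstraction.
# Not the exact starter code: the split logic and names are illustrative.
from kedro.pipeline import Pipeline, node


def split_data(data, example_test_data_ratio):
    """Split the iris data into train and test sets."""
    n_test = int(len(data) * example_test_data_ratio)
    test = data.sample(n=n_test, random_state=42)
    train = data.drop(test.index)
    return dict(
        train_x=train.drop(columns="species"),
        train_y=train["species"],
        test_x=test.drop(columns="species"),
        test_y=test["species"],
    )


def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                split_data,
                # "example_iris_data" comes from catalog.yml,
                # "params:..." from parameters.yml
                inputs=["example_iris_data", "params:example_test_data_ratio"],
                # map the keys returned by split_data onto named datasets
                outputs=dict(
                    train_x="example_train_x",
                    train_y="example_train_y",
                    test_x="example_test_x",
                    test_y="example_test_y",
                ),
            )
        ]
    )
```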
If I have a look at the data science workflow as well, you'll see some more Python functions under nodes: one that trains a model, another that predicts, and another that reports on accuracy. When we look at the pipeline for this one, we've got one node right here: it took in the Python function for training the model, and it had certain inputs specified here, basically our train datasets for both X and Y and a series of parameters needed for those datasets to work, and we output an example model. That example model is then taken in as an input to the predict node: we run the predict function, taking the example model and one of the test sets, the X one, and example predictions come out. And then, when we want to report on the accuracy of this, we have another Python function, report_accuracy, which we saw in the previous file; it takes in the example predictions and our test Y split, and it doesn't really have an output, it's None, because we're just going to be checking that from the logs.
So let me show you what that looks like, in terms of the actual logs for this one. I'm going to do a kedro run. Oh wait, I need to change into the project directory, open-source-directions. Cool, and now I'm going to do a kedro run. So now we see some logs output here: we see that in our data catalog we loaded that example iris dataset from configuration, it was the CSV dataset that we needed, and we loaded all sorts of parameters that we needed for this experiment to run. But the only part that's perhaps useful to you is this one: the model accuracy is 100%, which is essentially how this pipeline runs; it's a very silly one.
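(For readers following along in code, the kedro run command is doing roughly what this hand-rolled sketch does with Kedro's Python API: resolve a catalog, feed in parameters and hand the pipeline to a runner. It is a simplified illustration only; the real command additionally wires in the project's configuration, logging and hooks, the CSVDataSet import path follows the 0.16-era releases, and create_pipeline() is the sketch shown a little earlier.)

```python
# A rough illustration of what `kedro run` does under the hood.
from kedro.extras.datasets.pandas import CSVDataSet
from kedro.io import DataCatalog
from kedro.runner import SequentialRunner

catalog = DataCatalog(
    {"example_iris_data": CSVDataSet(filepath="data/01_raw/iris.csv")}
)
# Parameters are exposed to nodes as "params:<name>" entries in the catalog.
catalog.add_feed_dict({"params:example_test_data_ratio": 0.2})

# Run the pipeline node by node; unsaved outputs come back as a dict.
outputs = SequentialRunner().run(create_pipeline(), catalog)
```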
One additional thing I want to show you now: this pipeline is actually pretty simple; we tried to stretch it as much as possible to show as much detail as we can, but with it I can't show you what things look like when we talk about more complex pipeline visualizations. So I'm going to open up a Kedro-Viz pipeline visualization; that's kind of the last step in this workflow. While Kedro has applications for simpler workflows, it definitely has space to work for projects that have a thousand nodes and ten-person teams working on them, but it's also still good for cases where it's just you and your university project, and you just want to make sure that you have a reproducible workflow. So what you're seeing over here is an example, you could say an example retail application pipeline. If I filter down to the data engineering side of it, when I refer to this process of data engineering I mean purely the data processing stages: cleaning data, transforming it and creating features. So we take in some shopper data, we load it, we do some data cleaning in the intermediate layer, remember we spoke about that whole thing of the layered workflow, and then eventually we create some features at the end. If I look at the data science pipeline for this, let's break it down a little further: we have some model explanations that are done, so we take in some features because we want to use some form of explainable AI on this pipeline, and we work with that; we also want to implement some form of performance monitoring; we do some model training, obviously, because we need to put some outputs out; I don't know, this pipeline is really random; you have some optimization steps too; and then we have a reporting layer where the outputs land in dashboards. This is essentially a tool that you can use for communicating with different teammates on your team, maybe non-technical teammates who still need to understand how the data pipeline works, because you can have a conversation with them at this level about what your pipeline is doing, and you don't have to scare them by showing them code for them to get what's going on. We find that Kedro-Viz is used that way, but some teams will also use it to onboard team members onto how their pipeline is structured. So yeah, that is essentially Kedro in a nutshell, ending with you looking at pipeline visualization. You'll be able to find all of this online if you have any more questions.
That was awesome, thank you so much. I think it was extremely polished, and, like I said in the chat, you can see the love that went into the design of this thing; you can see that it's pretty simple to use and pretty human in the interface and everything. I found it awesome. So, just quickly, I don't know if either of you wants to discuss a bit about the roadmap, and, broadly speaking, where does the project go from here?
Cool. With this one, I also saw some questions about Dask. We spoke about the data connectors and how extensible they are; you will find a Dask dataset embedded in there, and if you want to create more in that set, go for it, definitely go for it in terms of contributions. And we mentioned that Kedro fits in the space of: how do I get stuff to deploy, how do I get good quality, high-quality code that I'm proud of and that is deployable. So when we look at how your Kedro pipeline runs on different systems, we leave that flexibility to you, but you will see things like the Kedro-Airflow plugin, which converts your Kedro pipeline into an Airflow DAG, so you can take advantage of Airflow's amazing ability to do good orchestration and scheduling for you.
Now, in terms of the roadmap, you will see that we've been expanding how we think about hooks in Kedro. The concept is kind of borrowed from React, but it allows for more extensibility across the Kedro framework, letting you plug into the different parts of the framework, and you will see that we are adding new things to this. Then, really building on that whole thing of people using Kedro-Viz to talk to their teammates about how their pipeline is structured, look out for an amazing side panel, when it's eventually built, that will show you the code for your different workflows and also what configuration you were using alongside them. We have another internal product that will be renamed; I'm not even going to give you the name, because, you know, we're bad at names when it's on the inside, but when it's on the outside the name will be fixed. This product specifically helps us with the concept of experiment tracking, where, say, as a data scientist I used a random forest model here, with these parameters, and maybe it had an accuracy of 92, but then I changed things and it dropped to 67 and I want to go back. Instead of taking notes somewhere, I just use something that has logged these things for me, so it's easy to go back and revisit my old workflow. That's essentially where this functionality fits, and we will be releasing it in some form in Kedro as well.
And then there's our work with Great Expectations; we have another internal plugin there. Great Expectations is this amazing library that does data validation. Think of it like this: when I built my data pipeline, I was using a table on AWS S3 in some test environment, and it had eight columns that I needed to run my pipeline; when I deployed it in production, the table had six columns, because someone had removed columns from the dataset. How do I know that my pipeline is failing because of a data error? That's where Great Expectations plays a role, because it will essentially tell you it's failing because these two columns are missing, and you know exactly where to go and fix the error. So if you haven't checked out Great Expectations, it's a great project. We have an internal plugin that's being dogfooded internally and that we hope to release as well, so you can look out for that.
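(As a rough illustration of the kind of check being described, here is a minimal sketch using Great Expectations' classic pandas API of that era; the file path and column names are made up, and this is not the internal Kedro plugin mentioned above.)

```python
# A minimal sketch of the data validation idea described above.
# Not the internal Kedro plugin; column names and path are illustrative.
import great_expectations as ge

shoppers = ge.read_csv("data/01_raw/shoppers.csv")

# Fail fast, with a clear message, if columns the pipeline needs have gone missing.
for column in ["customer_id", "basket_value"]:
    result = shoppers.expect_column_to_exist(column)
    if not result.success:
        raise ValueError(f"Input data is missing required column: {column}")
```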
things like we’re looking at how we further position ourselves in the open source data science community because
i think someone um waylon actually says so one of our really key users he’s like
um we’re caterers in a space where everyone needs a framework for how they work but people don’t know that they need a
framework yet and they’re searching for kedros so really how do we get awareness out in that space is like really really
important to us so yeah you should see very exciting things um coming on the keterwind
and yeah we’re very excited for the
future so i don’t know if please wants to say something
Oh, sorry, I was a little bit conscious of the time. Yeah, I think Yetunde covered all the expansions, all the roadmap that we're trying to develop for Kedro's future. I guess, well, I'm not really the one developing things that much; I know where we're going, but if I could give a shout-out to everyone, I would love to know from the community which kinds of tutorials you'd like to see, what it is that everyone is looking for, because we're trying to make Kedro more and more accessible for everyone. There is a lot of love put into making this framework, and we absolutely love Kedro, and we want the community to love Kedro as well, so help us.

So, would you like them to file an issue if they want to suggest something? Is that your preferred pathway?

Yes, please do, yes.
So, going over the questions: I guess we only have time for one question, but it fits really nicely into what you were saying, because the question is, do you have a page with specific examples, like, if you have this problem, here is how you solve it with Kedro? I think that's a great idea for a tutorial, actually.

Yes, I agree. Go ahead.

I'd say we'll put it on the backlog, but not like, you know, when PMs say that, they normally mean it's not going to get done; this one is actually a ticket on our backlog, and it's supposed to be done in the sprint, so we will have that page up for you, and we'll share the link with you.
Awesome. So I guess we're coming to the end of the episode, so we get to our "rant or rave" section, where each person gets a 15-second soapbox to rant or rave about whatever topic. Yetunde, you go first.

I'm ranting about the end of summer. As a South African, when I see this encroaching nighttime, and it's going to be dark at 4 p.m., I get very nervous, and it's coming for me.

Laís?

I'm going to be insensitive and I'm going to rant about COVID. Oh, that's okay, I'm going to rant about social distancing and how all the human contact I've been having for the last three months has been remote and Zoom-based; that's what I want to rant about. It's making my days not as great as they could be, but we'll survive, it's all right.

Madicken?

Okay, I'm going to rave about how woodchucks run. A woodchuck is a small, you know, forest animal; there are some that live outside my house, and sometimes I see them walking around. They don't really like humans, so when they notice me they run, and they're like a little blob of fur; they sort of look like an otter, but running, a really chunky otter, and when they run it's extremely cute. So check them out, they're hilarious and amazing.

My rave is going to be the opposite of Yetunde's, because it's starting to get hot here in Brazil; it's almost spring, so the weather is nice and it's getting warmer, and I'm happy about that. I think we can maybe enjoy some time outside, now that things are getting a little better.

Wow, you're rubbing it in for us northern hemisphere people. I'm sorry. [Laughter]

And the food. Now we just need to come and visit.

Yeah, this is like a sandwich of making us jealous about Brazil.

I feel like you'd all be very welcome here.

I can't back you up on this one, though; I could back you up on the food one, but this one, no. I'm on the island, there is rain every day, even when the forecast says it's going to be 22 Celsius; that's our summer, it's rainy and it's cold.

Well, I'm so sorry for you, but that's how things go, unfortunately. But I'm in Brazil, so I'm okay. [Laughter]
Anyway, that's all the time we have for today, and I thank you all so much for watching, for listening, and also for participating. Yetunde and Laís, that was awesome. You can find us on Twitter at OpenTeams Inc and at Quansight AI. Yetunde, where can people find you and Kedro?

So, Kedro is easily accessible if you search on GitHub for us; you'll find everything related to the project there. If you want to ask more questions, head over to Stack Overflow, and definitely do that; otherwise you'll find us on Twitter as well. I'm there, and I ask questions and harass users in the nicest possible way to learn and get feedback about the project, and I'm pretty sure you can also find Laís that way as well.

Yes, indeed, and we posted both of our handles in the chat. We're both on Twitter, we spend a lot, well, a little, of time on Twitter, so you can find us there; you can find us on LinkedIn; you can find us on GitHub, under the maintainers and collaborators on the Kedro GitHub page. If you google our names, I'm pretty sure we're going to be there somewhere as well. So just send us DMs, come and say hello; we love talking to users all over the internet.

So, if you liked what you saw today, please go to our YouTube channel and like and subscribe to see more of this content. We look forward to you joining us next episode, where you can drop in for a discussion on Drupal.

Sounds great.