Kedro Open Source Development

About

In this episode of Open Source Directions, we were joined by Yetunde Dada, who talked about the work being done on Kedro. Kedro is an open-source Python framework that applies software engineering best practices to data and machine-learning pipelines. You can use it, for example, to optimize the process of taking a machine learning model into a production environment. You can use Kedro to organize a single-user project running in a local environment, or to collaborate in a team on an enterprise-level project. For the source code, take a look at the Kedro repository on GitHub.

Do you use open source software?

Find trusted open source experts for any project or question today: https://openteams.com/

Transcript

Hello, the internet! Welcome to Open Source Directions, hosted by OpenTeams, the first business-to-business marketplace for open source services. Open Source Directions is the webinar that brings you all of the news about the future of your favorite open source projects. I'm a mathematician; my name is Melissa, and I'm a software engineer based in Brazil, working with open source software and communities at Quansight, and I'll be your host for Open Source Directions today.

Co-hosting with me today is: hi, I'm Madicken, and I'm really excited to be co-hosting this episode of Open Source Directions. I'm a postdoc at the University of Illinois at Urbana-Champaign, I work on an open source project called the yt project, and I'm based in Urbana, Illinois.
I should perhaps go next. Hi everyone, my name is Yetunde and I'm the product manager at QuantumBlack, part of McKinsey and Company. QuantumBlack is an advanced analytics company that was acquired by McKinsey a few years ago, and what we do within my unit is build products for data engineers and data scientists, including one called Kedro that we're going to be talking about. It's a Python library that makes it easy to build production-ready data pipelines. Kedro is also McKinsey and QuantumBlack's first open source product, which is pretty cool; we'll probably get into that. In my past you'll find that I'm an MBA graduate from Oxford, my background includes being a data product manager in the banking industry, and I've also worked a lot in the non-profit space. If we go way back, I'm actually a mechanical engineer by degree and through work. I'm from South Africa, so if any South Africans are online, say hello; I'm based in London now.
Oh, cool. So, well, hello everyone. First of all, it's a pleasure to be here, thank you very much for having us. My name is Laís and I'm a developer advocate for QuantumBlack, so I don't need to introduce QuantumBlack, as Yetunde already did an amazing job of that. My background is in civil and environmental engineering and in customer service. I work with some Python, I'm a Python activist, I volunteer a lot in the Python community, I'm also an undergrad, and I only recently started building a community around Kedro. I'm originally from Brazil, but right now I've been living in Dublin, Ireland.
thank you so much we’re so happy to have uh italy and nais today with us
but first before we start we will go to our famous tweet of the week section where
each of our panelists will present a tweet that they have been enjoying recently your first app sure um so i’m actually
gonna show you two we’ll talk about two um because i spend a lot of time on twitter but
quite a bit is it says friday fun fact in 1974 ramesses the second was sent on
a flight to paris for preservation and maintenance work so you know ramesses second was a famous egyptian um
but since french law requires every person living or dead to fly with a valid passport egypt was forced to issue a passport to
the pharaoh three thousand years after his death and the actual tweet includes like an image of his passport
um of what was used um but the second tweet that made me laugh was that um our product’s name kedrow um we do have
an active community of like uh people that tweet about kendra all the time but it’s also a person’s name um so when i saw a tweet that said
kedra probably um probably gotten only fans um i was i was laughing a little bit about that
naming things is hard so yeah life your first you’re you’re next
Yes, so for me the tweet of the week was this one that I saw; I don't know if everyone is familiar with the new Christopher Nolan movie called Tenet, which is apparently super confusing. The tweet says: if you're confused about recursion, go watch Tenet; you're going to be even more confused, but at least you'll have seen a good movie.
that’s awesome um i’ll share mine uh this is something that made me laugh it’s a video of a guy who was uh playing
the piano and singing about 168 aws services in two minutes
it’s hilarious he’ll just go mention every service he can and then he edited the video to have the
actual logos of each project showing up as he sings them and it’s just great you should check it out
I love it, all of these are so good, and I love the passport photo, by the way, it's amazing. So mine is a link to a profile, but it's really active this month, and it's the Mineral Cup. This is a contest going on on Twitter where people debate over their favorite minerals, and slowly it narrows down to a final match between two minerals. People vote on their favorite mineral in each matchup, and you get to see some amazing science communication, but also some extremely nerdy discussion justifying why somebody might like, you know, uraninite over another cool mineral. I obviously vote for the radioactive ones, but mine are not winning. Anyway, check it out, it's really cool.

That's great, yeah, thanks for that.
And I think we can jump into our main subject, which is Kedro. Kedro is an open source Python framework that applies software engineering best practice to data and machine learning pipelines, and helps optimize the process of taking a machine learning model into a production environment. It has about 3,000 stars on GitHub, and about 70,000 downloads a month across PyPI and conda, which is super awesome, and I'll be happy to hear about it today.
Yeah, so starting off, I'm really curious if one of you could tell me who started it, why it was started, and what need it fills.

Sure. Kedro has been around within QuantumBlack for the last two and a half years; we only open sourced it some time last year, in June. It was originally designed by Aris and other machine learning engineers within QuantumBlack to solve problems they were facing on project work with different clients. You know, our model is that we're part of McKinsey and we deliver advanced analytics solutions to clients, and they ran into some problems with how they chose to collaborate with each other when they were trying to produce production code, or you can think of it as machine learning products that actually function. That early version of Kedro, back then called Carbon AI, was then redesigned and rebuilt as an internal product by a team that included Ivan Danov and others, with Nikos back on the team again. The reason it was rebuilt as an internal product was that quite a few teams found they had the same problems the original designers had. Actually, I'll tell you what our users say.
The need it fills for them is standardized workflows. This comes about because when you're working in the enterprise data science space, you're working in a company that produces code as the final deliverable. If you're not working in a standard way with your teammates, things become hard to keep track of: everyone sets up their project in their own way, and handovers become terrible, because you have to go and find the person who actually wrote the code base forever ago and try to get them to explain what was going on. Then there are also problems around collaboration. We obviously talk about the use of Jupyter notebooks primarily for data science work, but when you're creating a machine learning product, it's hard, for instance, for two people to work on the same Jupyter notebook at a time instead of working in Python scripts. So Kedro solves a lot of issues around this whole thing of standardized workflows, and then makes it easy for us to collaborate while we build data science code that needs to be turned into software.
Laís, do you want to complement that?

Well, that was very well summarized. I just wanted to say that, while that is what we're trying to do, we're not trying to make Jupyter notebooks extinct; we're trying to integrate with them. We're just trying to give everyone a better workflow, to enable communication within teams and between teams as well, and to make sure that everyone works in a streamlined way and produces very good, production-ready code right from the start.

That's awesome. So can you explain the history of the name and the logo for Kedro?

We've had so many names, because naming things is hard and renaming things is hard. The original name of Kedro was actually Carbon AI, then it became Kernel AI; that was all internal, before we open sourced. We even tried to drop the AI so it was just Kernel, but then we couldn't call it Kernel because there are many, many things called kernel, right? The process of renaming Kedro just before open sourcing meant that at one point the team generated 100 names, which had to meet certain criteria before we got down to the top five, and Kedro was the winner. I can say Kedro was almost called Burano at one point, but I'm glad we settled on Kedro in the end.
In terms of how we think of the logo: we have an incredible design unit within QuantumBlack, and they decided to prioritize shapes for how the Kedro logo was built. Kedro was one of a few internal products that have this kind of shape-based, very distinct logo branding, so that's what we went with.

And then the name is a Greek metaphor for a core; it means the centre of the earth. The way we see it, Kedro is the centre of your analytics project, so it ties up very well. And I love that when Yetunde was talking about the logo, our lead designer just showed up in the chat as well. So Gabriel, thank you very much for the work that you put into the design, we love you. He's also the DJ: he's the one that has been making the Kedro playlists.

You have a Kedro playlist? That's amazing.

Edition 2 was launched today; it's the Friday gift for all the releases we've been doing this week.

Okay, this is an amazing practice; I feel like other open source projects should have playlists. I would like to know, for example, what the Jupyter playlist would be. Okay, so can either of you describe to me how you differentiate yourselves from alternative projects out there?
So I think this is actually where we get to the real intent of why Kedro exists. We think about it as standardized workflows, we think about it as collaborative data science. We focus on the problem of how we write data science code that is deployable, and how we work together so that it's high-standard, well-tested code and a proper machine learning product before we deploy it. Everyone gets really excited when they hear that we have a pipeline abstraction in Kedro; they're like, "oh my gosh, it's a pipelining tool", and we're like, not quite, because the other pipelining tools that exist in industry have a different focus. They prioritize orchestration and scheduling: if we know that we have some code we want to run, how do we make sure it runs on Sunday at 5 a.m., and if one database fails, that it will retry until it's successful. So we see tools in that range, like Airflow, Luigi, Dagster and Prefect, which focus on that problem specifically, whereas we focus a full step before that: we know that you need to do some sort of experimentation while writing this production code, so how do you work in a structured way so that you have something deployable when it's time to move? The other tools don't really look at that part; they just assume that you already have something that's deployable.
that’s that’s deployable yeah and i think it’s it’s interesting
to to say as well that like um we we try to
cover everything on your project so we try we go from the exploratory phase on your jupiter
notebooks we have on our template we have a folder with notebooks that all your notebooks go there and
there’s like a little bit of a tweak that you can do and use your cli as well and you can convert that notebook straight into nodes and
they go straight into your your workflow and we have testing them there as well
like we’ve done everything that you can think of that you you need uh when you’re buying you’re
when you’re building your data pipelines uh like it’s it’s from a to z
and then we have a little bit more on the on the back as well because there’s other also casual hooks that you can hook in anything that you need
uh and make your your project even bigger and expand it with more easiness i think
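(As a rough illustration of the hook mechanism mentioned above, here is a minimal sketch. It assumes the `hook_impl` decorator as exposed in roughly the 0.16-era Kedro releases; the class name and log message are made up, and hooks have to be registered in the project's settings or context, which differs between Kedro versions.)

```python
# A minimal sketch of a Kedro hook: run extra behaviour after every node.
# Import path follows 0.16-era Kedro and may differ in newer releases.
import logging

from kedro.framework.hooks import hook_impl


class LoggingHooks:
    """Illustrative hook that logs a message after each node finishes."""

    @hook_impl
    def after_node_run(self, node, outputs):
        # `node` is the Kedro node that just ran; `outputs` maps dataset
        # names to the data it produced.
        logging.getLogger(__name__).info(
            "Node %s finished and produced outputs: %s", node.name, list(outputs)
        )
```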
maybe because there’s one more point about the hooks thing um that i could actually mention um this has
actually come from our users in the open source world they were the first ones to actually tell us this and we’re like oh this is interesting
um but they call this kind of like the django for data scientist for data science data science or the
react for data science so we come into space where there are no frameworks for how data scientists
should collaborate together and create great code um so we’re among the first
in the space and we’re actually fulfilling a lot of needs around standardization of like how those
workflows look so sometimes i think to some users we do appear like crazy because everyone’s like why do i need this thing um but then
they run into so many issues which kedra solves for and then they eventually come around and they’re like actually we get why this thing exists um because
we run in we try to what what what actually this is actually the user journey it’s really cool
you run into issues then you try to build your own framework to try and solve it and then you look around and you realize
oh wait there’s this kind of thing that has everything um already and then they eventually pick up pedro um which is quite cool to see
I just saw someone commenting in the chat about the visualization, so I think it's also worth mentioning that we have Kedro-Viz, our visualization tool. It's driven from the CLI: you can just type kedro viz and see exactly what's going on in your pipelines. We actually have a few users who use that feature to debug their code: they look at how they've actually structured things and go, "okay, wait a second, we're building this big system in our heads; no, I just need to type kedro viz on the CLI and see what's going on with the code I'm writing."
Oh, I can definitely see why you have so many fans; it sounds awesome, and I definitely think there's a need for it. I just want to remind the audience that you can ask questions in the Livestorm app, and we'll try to answer them in the Q&A section at the end, so please type your questions in. So, I think you've discussed this a bit already, but what technology is Kedro built on? Can you describe a bit how that's done?
cool um so we’re actually completely um python like it’s a completely python library
and we actually used to use a makefile within the library template to actually do our cli commands
and everything because we okay let me actually talk about the components of kedrow we talk about a project template um
generated by cookie cutter data science but modified to be like kind of like with the best practice of all the teams that have ever used kedra
before we’ve got 170 client projects that have used kedrow so it’s across industries and we’ve
basically built that feedback in we talk about a data catalog which is um kind of like our series of like data
connectors to connect to any data source local um cloud storage
um hadoop file systems if you’re using a pi spark workflow um and that uses um either the python or
yaml api as well so there is support for polyamol inside of kedro at least for the data catalog
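(For a flavour of the Python side of that data catalog, here is a minimal sketch. The import path for the pandas CSV dataset follows the 0.16-era releases and may differ in newer ones; the dataset name and file path are illustrative, and the equivalent catalog.yml entry would declare the same type and filepath.)

```python
# A minimal sketch of Kedro's DataCatalog used directly from Python.
# Import paths follow 0.16-era Kedro; dataset name and path are illustrative.
from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import CSVDataSet

catalog = DataCatalog(
    {
        # Equivalent to a YAML entry of type pandas.CSVDataSet in catalog.yml
        "example_iris_data": CSVDataSet(filepath="data/01_raw/iris.csv"),
    }
)

iris = catalog.load("example_iris_data")   # loads the CSV as a pandas DataFrame
catalog.save("example_iris_data", iris)    # writes it back through the same connector
```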
From there, there's the pipeline abstraction, which is purely Python, and then Kedro-Viz, our pipeline visualization tool, built purely on top of the pipeline structure; it's a React app. It works quite nicely: it opens a localhost server and you can visualize your pipeline as you work. Those are essentially the components of Kedro. We do support other things: if you're deploying Kedro projects there's a lot of flexibility to do that in many ways, and think of it as part of our business model; we're consultants, we work with whatever the client has, so we obviously have to deploy in many ways. But you can check out the Kedro-Docker plugin, which will package your Kedro project into a Docker container, so you've got your whole project packaged in there, and there are a few other options like that.

I don't really have anything to add on the tech side, because, well, basically QB is a Python house. I learned that on my first day, and that's what we're sticking with.
um that’s perfect thank you uh for describing that i know we’ve talked a little bit about who started kedrow can um can you tell
me a little bit about who maintains the project because you’ve been using a lot of wiis uh like we are doing this right so i’m very
curious about who is maintaining it yeah there is there is no there’s no ways we actually have like a
large team um that actually sit behind kedrow um so i’m product manager um evan donov is the
ketter tech lead um on the kind of like python core side of the library and
some of the primary components there we’ve got richard um with tanner who is the tech lead on the pipeline
visualization tool he’s a front-end engineer um dimitri derriban derrabin who is a software engineer
lauren poland software engineer um and then we’ve got
lim um huang um who’s also an incredible um software engineer in the team um andriy ivanyuk another software
engineer um and we’ve got joe stitchberry who i guess you guys have been raving about our documentation
um you can thank her for that um we’ve got laissez um up as our developer advocate kind of
like as our interface between um kedrosa library and our open source community
with all our users um and then we’ve got meryl thiessen as well who’s another one of our software engineers and she comes in
um as we introduce like kind of like new features when we talk about the roadmap um i’ll be able to go into that one a bit more
um and that’s essentially the kid routine we’re family um and yeah it’s it’s been it’s a an
absolute joy um working with the group yeah like so i’m quite new to the team i
started working keep could be like around three or four months ago and the thing for me that it was was very
obvious to see right from like i think first week was how meticulous about code quality and
specifications everyone is like there is this very high standard of communication very high standard of
documentation internally as well and it’s it’s amazing to be able to be
part of a team like that because everyone is super nice especially with the newbie
here that breaks circle ci um checks that breaks github sometimes
as well yeah but uh yeah it’s been it’s been
really really cool and the team is absolutely amazing i i couldn’t ask work on a better team
well if you haven’t broken the ci you’re not really working right yeah i feel like all of us have broken
the ci at some point you see dimitri one of the the older
software engineers behind pedro told me when i submitted one of the when i merged one of the commits that wasn’t
supposed to be merged at all and i just went there and clicked merge squash
his question merged and i went back to him and i was like i am so sorry he was like don’t do that
again but again if you never broke get help um you’re not a developer like okay
i take that with me for life that’s it so can you describe a bit about uh what
That's it. So, can you describe a bit which communities and users your contributors come from?

Sure, and I can actually talk about our support model as well. In terms of how the Kedro team maintains the repo: every week it's the same person who will be commenting on GitHub issues and pull requests, accepting pull requests and helping users through them, because we have a rotating role in the team called the Kedro wizard, and it's your job to make sure that everyone coming to the repo has a great time and gets their questions answered; and if you're posting on Stack Overflow because you're stuck, we'll be looking at those too. But the really cool thing is that our network has grown beyond us, beyond just the Kedro team doing this. We have superstar users across the web: you'll see Waylon Walker is quite active, blogging and answering people's questions; you'll see Tam, who is also known as DataEngineerOne (if you ever google that on YouTube you'll find an entire stack of amazing introductory videos about Kedro), and he also publishes Kedro plugins and answers user questions. All the way to, I guess, maybe one of my favourite communities, the Kaggle Japan community, where Kedro has been used by the grandmasters of Japan because it solves a problem around reproducible workflows. I actually see Tam is online, oh my goodness, so you can check that one out. It's a growing community of users with different use cases: you'll find Kedro is also used in academia, because there are problems around reproducible data science for published papers, so students pick it up as a viable choice for making that happen; but of course we know it evolved out of industry applications. It's been really, really cool; I think Laís has been the pioneer of this work as we connect with different companies that are picking up Kedro across the globe. I think we've counted over 200 users just in separate companies, but within QuantumBlack and McKinsey we also have hundreds and hundreds of users that we have access to, so there are lots of places to get user feedback from.
Yeah, and there is one project, I can't say favourite because that's too biased, but one of the projects I'm excited about that has been using Kedro, which is Open Source Latin America. It's basically this huge community of coders, data scientists and data engineers who have been working with public data from Argentina and some other countries in South America to find solutions to everyday issues. So they use open data, in a non-profit organization, and they use Kedro to find solutions for those problems, and that's wonderful; it's such a big social impact. It's so awesome to have a user base that is so engaged with their own community, on another continent, and they message us on LinkedIn sometimes, like, "we're doing this and we're so excited", and, "okay, can you help, we would like to talk to you because we need a little bit of help over here", and we're like, yes. Everyone is super engaged and trying to do the best they can all the time, to make sure that we actually get the results we're looking for, because Kedro is going to become the standard for data science and machine learning data pipelines; we're the React of data science, that's it.

I'm loving this, and it's amazing how vibrant both your user and developer communities are; it's really fun to hear this from both of you. So, okay, I'm really curious: is the project participating in any diversity and inclusion efforts, and if so, what are they?
Sure, I can maybe speak about how we do community management and then also how we've participated in sprints, because those would be the clearest ways to see it. This would be from before your time, Laís, but when we open sourced Kedro we were worried about how to present a good open source project to the world; being McKinsey's first open source project, these things are important. So there was one weekend where I think I must have trawled the web looking at what people say is best practice for community engagement, what you should do and what you shouldn't do, and that turned into a kind of communication guideline for how we talk to our users: how we gracefully and kindly turn down pull requests that maybe aren't aligned with the project, and also how we answer questions in the best way. We've had a few trolls as well, but we still deal with them with the proper kindness and respect that everyone is due on the library. So, in terms of trying to create that environment across Kedro, we try to make it welcoming for anyone who has to interact with us at this layer.

The second place would be sprints, which we obviously love participating in. One example would be working with the London Python Sprints group, where one of the organizers, Chuck, came in and said, well, we're going to have a lot of newbies coming who have never contributed to open source before; how will we support them on their journey of learning, firstly, how to contribute to open source, and then how to contribute to the Kedro project? So we had a great time actually teaching people how to create their pull requests for the first time, because we know that typically, in this space, Black women and minoritized ethnic people are very, very underrepresented, so it was super exciting to see that the majority of the first-time contributors there were women and minoritized ethnic people participating in this project. It also means we do things like creating GitHub issues that have just a single typo fix. I mean, it's faster for me to just fix it in the docs myself, but going and writing a nice explanation of exactly what the change is, even when it's just a typo fix, is important to me, because it means there's low-hanging fruit for people to contribute and still make our docs better, whatever that arrangement looks like. So whenever we have done sprint participation since then, and I think Laís can talk about the last sprint that we did, this is something we specifically focus on, so that we can get people sunk into open source and what it means.
powerful actions that we’re trying to do as well that we did in the last trimester and they were going strong again there
is making sure that we we in um we increase diversity and inclusion in
caterer collaboration so we participated on uh the europait on springs so we had
a weekend where they’re sitting the entire weekend uh helping people making their first pr and taking
basically taking them by the hand and showing them exactly what are the steps and
sharing to them and showing how to use schedule and demoing one-to-one um and it was so
awesome to see to interact with them and you had to actually receive there was
one that wrote this defto blog post uh saying that yet when i we were both
he’s his knight on shining armors because we’re just helping him getting
through the battle of making prs it was like it was adorable
um but yeah so and we’re going strong as well again with the sprints we’re participating
with on pi data global now next month with another open spring then we’re part of
um tycoon india as well with more sprints and there is hacktoberfest next month as
well that we also have mentored sprints so we’re trying we’re trying as much as we can to get everyone to try to try pedro and
to get those new those newbies to do their first beard um then if i think i don’t know if i can
but if i could i would like to talk about the uh some initiatives that quantum black also has on inclusion diversity
uh because they also help us um on that so a few weeks ago i think we
had the we had an initiative called uh codefest girls that we organized was
like there was a kind of meetup like all social distance of social distance of course
uh but we had two of our female software engineers going live and talking to them and
talking to girls to young girls and telling them how is the the journey throughout it
and how is to be how is to work with data science how is to work with analytics answering questions
and i think that’s so important as well because there’s so many so many girls that want to get into data science but
since well it’s still a very male-dominated field uh they still feel like they can’t um so
it’s it’s it’s amazing to have to be able to to have the opportunity to be part of that
um and also there’s one more thing this in qb every month we have this uh diversion
and inclusion event that is basically like having discussions about subjects such as
um female presence in night heroes impostor syndrome uh privilege and they actually got
people to come to go on the spot and to tell their journey to talk to talk to us and to tell us exactly how how they got
to where they are today and well especially for me that i’m starting it’s it’s really really good to have that
Well, that all sounds wonderful, and I think it's a real goal for other open source communities to achieve, you know, being welcoming and having newcomers feel safe and respected in the community; I think that's an amazing goal to have.

Yeah, you're doing amazing community building; it feels really robust and so welcoming. I'm curious, actually, I have a follow-up question: you mentioned that you did some research on best practices for how to be kind; is that available in your documentation anywhere, so that other projects could perhaps be inspired by it? I would definitely like to see that as well.

If you have that, I could actually just literally copy and paste the document into the documentation. Or maybe it's a GitHub issue that we've linked, on how we think about community management as a whole, but yeah, I'm happy to share that, because it was essentially a summary of everyone's amazing advice on how to do this, and that was what that document became.

Yeah, sure. I mean, your community really seems very vibrant, and I would love to learn a little bit more about the practices you all use.

Definitely; I would definitely copy that model for NumPy.

Exactly, like, how do I use this in my communities? [Laughter] The best spirit of open source, that's what it all means.

That's it. Yeah, exactly.
right so i guess it’s time for us to go into the project demo uh we’ll get to see some of the cool
features of pedro and how it works uh so i would like to ask is you are you
getting ready so while utility is getting set up uh we
would like to take this opportunity to thank our sponsor quanside for sponsoring this episode of open
source directions one side creating value from data
so whenever you’re ready feel free
i’m gonna be able to share the correct window hold on a second
ah sorry yeah it’s no problem it’s just life as
it is now sharing screens and muting and unmuting yourself it’s just like
there’s so many times i forget to unmute and then i just see myself silently talking to myself that’s constantly constant that’s so
true recording um okay there’s
some weird uh i want a moment
let me actually see if i can start with um kedro this and the meanwhile
uh okay it seems like my sharing commissions are strange on my computer let me see if i
can try to resolve that you guys can just talk over me while i do that quickly
So maybe, Laís, do you want to say something about the project?

So, let's see, I can share something I've been working on for the last few weeks: an advanced tutorial on integrating Kedro and Great Expectations, I don't know if you're familiar with Great Expectations. Tam, who is actually there in the chat, has helped me a lot with this; he built a plugin that integrates the two and makes it super, super easy to use them together, but I didn't want to use the plugin, I wanted to do it by hand, so he's been helping me with that. I would love for anyone who wants to try it out to just send me a message on Twitter, and I would love to share it with you; it's open on my GitHub page, and I would love to have some feedback.

And I think Yetunde is ready now.

I am indeed, we're good to go.
i’m going to um what i’m going to show you now is essentially how you can actually access a demo for kidro
quite easily um with this one you’ll see that there’s a virtual environment activated i we use conda you can use
whatever you want um and we really got like um kedro installed as well
um so i really pip installed kidro so i’m just going to jump straight into this whole concept of creating a project and then
actually walk you through the code base and what that looks like so over here what you see over here is a cli command um that says cadre new
which means it’s a new project um and we’ve got this thing called starter over here um which is our
you can think of them like wordpress templates you know um how when i need to create a new wordpress blog i could choose like whatever template i’d want
to set my blog up with um but in this case kidro actually supports being able to do that with project templates that you’d use for
your analytics code um so you can choose ones with examples you can choose maybe uh set up one for an aws setup if you want
or froze your um whatever flavor you want on it and this one just has a simple example that
uses the iris dataset example in it so i’m going to just press enter here um
and we’re going to get some interactive prompts um which will actually walk us through the whole thing of setting up a project
so um it asks me to enter in a human readable name for my project i’m going to say demo
open source directions actually it’s cool with that
yes we’re a demo now um and i’m gonna i’m gonna actually just
limit this screen view i’m just going to press enter because i’m going to accept the default name
um and then we have a new project created so i’m actually just going to open up
So I'm actually just going to open up this folder over here, and now you get to see the structure of the Kedro project template. We see we have a place for configuration. Configuration in Kedro world means: how do I keep my hard-coded file paths for loading and saving data out of my code base, in a way that is completely changeable? How do I also keep my parameters outside of my code base, making it easier for me to experiment, but also giving me a single place to control my data science experiments? We'll go through configuration in more detail. Data is essentially a place for me to store data, but remember, we don't recommend committing data to Git. You will see that there's a certain folder structure present in the data folder, from raw data all the way to reporting data, and this is just a workflow that was recommended within QuantumBlack for how you think about processing data at different stages. It allows you to work with your teammates quite easily, because you know, at every single stage, okay, let's say for the intermediate layer, that we've only really cleaned up the column types in that layer, so we shouldn't expect to see any other type of data transformation there, which is really good for reproducibility if you need to go back because you've made a mistake. In this case you will see that we've got the iris dataset example embedded in it, because that came with this project starter.
so close that there’s a space for you to include documentation um so if you use uh google doc strings
in your code base we have sphinx integration that will automatically create documentation so that your code
um is is well documented data science code so everyone knows what’s going on um i’ll briefly show you logs which he
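(As an illustration of the docstring style that feeds that Sphinx integration, a node function might be documented roughly like this; the function, its arguments and its behaviour are made up for the example.)

```python
import pandas as pd


def clean_columns(raw: pd.DataFrame) -> pd.DataFrame:
    """Standardize the column names of a raw table.

    Args:
        raw: The raw input table, straight from the data catalog.

    Returns:
        A copy of the table with lower-cased, snake_case column names.
    """
    cleaned = raw.copy()
    cleaned.columns = [c.strip().lower().replace(" ", "_") for c in cleaned.columns]
    return cleaned
```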
I'll briefly show you the logs folder, which essentially uses the Python logging library, so you have a record of what's happening in your Kedro runs. Laís touched on the fact that we do have integration with Jupyter notebooks to support exploration; I guess there are maybe three reasons why you would use Jupyter notebooks in Kedro world. First, exploratory data analysis, because notebooks are really great for that whole initial workflow. We'd also perhaps use them for creating the Python functions that you need before you move them into Python scripts. And the third is that you might use a notebook for reporting and presentation at the end, once you have your workflow. But we believe everything else should be in Python scripts, because there are so many benefits you get from that. And then we've got src, which is essentially your Python source code: if I open up open_source_directions you'll see a Python package there, and pipelines; we'll be walking through the data engineering and data science pipelines here.
So what I'm actually going to do now is go a bit deeper into each of the folders, so you can see what's in each of them, and then we're going to do a kedro run, because kedro run now works out of the box, and you're essentially going to walk through your first Kedro pipeline run with us. You will see a range of supporting hidden files referenced here; for instance, isort is used when we run the kedro lint command, so that you can lint your code, and you'll see things like our cli.py, which is essentially your command line interface, with commands that you can add to the kedro run command.
Let's go into configuration, and into base. You'll see a range of boilerplate which explains exactly what's happening in the data catalog. This is where we talk about being able to specify file paths in configuration. Kedro uses what we call the data catalog, our series of data connectors, and you'll see that we support many different file types here. In this case the iris dataset is a CSV, so we'll just load it using the pandas API and say we're loading it as a pandas CSV dataset. You'll see that we've specified where this data is going to be loaded from, and it's a relative file path to this file over here: the data folder, 01_raw, and the iris file that we're going to be loading.
I mentioned that configuration also has support for parameters, so when we go through the data science experiment you'll see that we've referenced the different parameters used in the setup right here. I'm going to close that. You'll see there's an additional folder called local; that's essentially where you keep your secrets. If you've got any credentials, or any configuration specific to your IDE, you keep it in here, because it's git-ignored, which means that no one else is accessing those things; remember, it's not best practice to have credentials in a project. Laís, to your comment about seeing data in the project, you were shocked: it's just the way this example is set up. In real practice people normally use files in cloud storage, because they're using S3 or Azure Blob Storage, so you'll never actually find any data populated in that folder at all; but the contents of the folder still get ignored, because that stuff shouldn't be committed to Git, ideally. So I'm going to close that. There's the docs folder; I might run through that command if we have enough time.
But the next one is probably the most interesting, the bread and butter of Kedro: it's essentially how we think about constructing a pipeline. All you need to know is how to write Python functions. I'm going to open up the data engineering pipeline, and we're going to have a look at a file called nodes. Here's where I introduce a concept to you in Kedro land: a node is a Python wrapper that has space for an input and an output, and you'll see how that works when we construct the pipeline together. You see in the nodes file that all we do is essentially specify a Python function; that's it, this is all you need to get rolling in Kedro. When I open up the pipeline.py file, you now get introduced to the pipeline abstraction itself. This node takes in that Python function called split_data; its inputs are the example iris data that we referenced in our data catalog, in configuration, plus the parameters we were talking about, in this case a test data ratio for the split; and the outputs are train and test splits for X and Y. That's all we do with this very simple pipeline.
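(Put together, the pipeline being described boils down to something like the following. This is a simplified sketch rather than the exact code of the iris starter: the split logic, the column name and the output dataset names are illustrative, while the input names mirror the catalog entry and parameter mentioned in the demo.)

```python
# A simplified sketch of Kedro's node and pipeline abstraction.
# Not the exact starter code: the split logic and names are illustrative.
from kedro.pipeline import Pipeline, node


def split_data(data, example_test_data_ratio):
    """Split the iris data into train and test sets."""
    n_test = int(len(data) * example_test_data_ratio)
    test = data.sample(n=n_test, random_state=42)
    train = data.drop(test.index)
    return dict(
        train_x=train.drop(columns="species"),
        train_y=train["species"],
        test_x=test.drop(columns="species"),
        test_y=test["species"],
    )


def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                split_data,
                # "example_iris_data" comes from catalog.yml,
                # "params:..." from parameters.yml
                inputs=["example_iris_data", "params:example_test_data_ratio"],
                # map the keys returned by split_data onto named datasets
                outputs=dict(
                    train_x="example_train_x",
                    train_y="example_train_y",
                    test_x="example_test_x",
                    test_y="example_test_y",
                ),
            )
        ]
    )
```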
If I have a look at the data science workflow as well, you'll see some more Python functions under nodes: one that trains a model, another that predicts, and another that reports on accuracy. When we look at the pipeline for this one, we've got one node right here: it took in the Python function for training the model, and it had certain inputs specified here, basically our train datasets for both X and Y and a series of parameters needed for those datasets to work, and we output an example model. That example model is then taken in as an input to the predict node: we run the predict function, taking the example model and one of the test sets, the X one, and example predictions come out. And then, when we want to report on the accuracy of this, we have another Python function, report_accuracy, which we saw in the previous file; it takes in the example predictions and our test Y split, and it doesn't really have an output, it's None, because we're just going to be checking that from the logs.
So let me show you what that looks like, in terms of the actual logs for this one. I'm going to do a kedro run. Oh wait, I need to change into the project directory, open-source-directions. Cool, and now I'm going to do a kedro run. So now we see some logs output here: we see that in our data catalog we loaded that example iris dataset from configuration, it was the CSV dataset that we needed, and we loaded all sorts of parameters that we needed for this experiment to run. But the only part that's perhaps useful to you is this one: the model accuracy is 100%, which is essentially how this pipeline runs; it's a very silly one.
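(For readers following along in code, the kedro run command is doing roughly what this hand-rolled sketch does with Kedro's Python API: resolve a catalog, feed in parameters and hand the pipeline to a runner. It is a simplified illustration only; the real command additionally wires in the project's configuration, logging and hooks, the CSVDataSet import path follows the 0.16-era releases, and create_pipeline() is the sketch shown a little earlier.)

```python
# A rough illustration of what `kedro run` does under the hood.
from kedro.extras.datasets.pandas import CSVDataSet
from kedro.io import DataCatalog
from kedro.runner import SequentialRunner

catalog = DataCatalog(
    {"example_iris_data": CSVDataSet(filepath="data/01_raw/iris.csv")}
)
# Parameters are exposed to nodes as "params:<name>" entries in the catalog.
catalog.add_feed_dict({"params:example_test_data_ratio": 0.2})

# Run the pipeline node by node; unsaved outputs come back as a dict.
outputs = SequentialRunner().run(create_pipeline(), catalog)
```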
One additional thing I want to show you now: this pipeline is actually pretty simple; we tried to stretch it as much as possible to show as much detail as we can, but with it I can't show you what things look like when we talk about more complex pipeline visualizations. So I'm going to open up a Kedro-Viz pipeline visualization; that's kind of the last step in this workflow. While Kedro has applications for simpler workflows, it definitely has space to work for projects that have a thousand nodes and ten-person teams working on them, but it's also still good for cases where it's just you and your university project, and you just want to make sure that you have a reproducible workflow. So what you're seeing over here is an example, you could say an example retail application pipeline. If I filter down to the data engineering side of it, when I refer to this process of data engineering I mean purely the data processing stages: cleaning data, transforming it and creating features. So we take in some shopper data, we load it, we do some data cleaning in the intermediate layer, remember we spoke about that whole thing of the layered workflow, and then eventually we create some features at the end. If I look at the data science pipeline for this, let's break it down a little further: we have some model explanations that are done, so we take in some features because we want to use some form of explainable AI on this pipeline, and we work with that; we also want to implement some form of performance monitoring; we do some model training, obviously, because we need to put some outputs out; I don't know, this pipeline is really random; you have some optimization steps too; and then we have a reporting layer where the outputs land in dashboards. This is essentially a tool that you can use for communicating with different teammates on your team, maybe non-technical teammates who still need to understand how the data pipeline works, because you can have a conversation with them at this level about what your pipeline is doing, and you don't have to scare them by showing them code for them to get what's going on. We find that Kedro-Viz is used that way, but some teams will also use it to onboard team members onto how their pipeline is structured. So yeah, that is essentially Kedro in a nutshell, ending with you looking at pipeline visualization. You'll be able to find all of this online if you have any more questions.
That was awesome, thank you so much. I think it was extremely polished, and, like I said in the chat, you can see the love that went into the design of this thing; you can see that it's pretty simple to use and pretty human in the interface and everything. I found it awesome. So, just quickly, I don't know if either of you wants to discuss a bit about the roadmap, and, broadly speaking, where does the project go from here?
Cool. With this one, I also saw some questions about Dask. We spoke about the data connectors and how extensible they are; you will find a Dask dataset embedded in there, and if you want to create more in that set, go for it, definitely go for it in terms of contributions. And we mentioned that Kedro fits in the space of: how do I get stuff to deploy, how do I get good quality, high-quality code that I'm proud of and that is deployable. So when we look at how your Kedro pipeline runs on different systems, we leave that flexibility to you, but you will see things like the Kedro-Airflow plugin, which converts your Kedro pipeline into an Airflow DAG, so you can take advantage of Airflow's amazing ability to do good orchestration and scheduling for you.
Now, in terms of the roadmap, you will see that we've been expanding how we think about hooks in Kedro. The concept is kind of borrowed from React, but it allows for more extensibility across the Kedro framework, letting you plug into the different parts of the framework, and you will see that we are adding new things to this. Then, really building on that whole thing of people using Kedro-Viz to talk to their teammates about how their pipeline is structured, look out for an amazing side panel, when it's eventually built, that will show you the code for your different workflows and also what configuration you were using alongside them. We have another internal product that will be renamed; I'm not even going to give you the name, because, you know, we're bad at names when it's on the inside, but when it's on the outside the name will be fixed. This product specifically helps us with the concept of experiment tracking, where, say, as a data scientist I used a random forest model here, with these parameters, and maybe it had an accuracy of 92, but then I changed things and it dropped to 67 and I want to go back. Instead of taking notes somewhere, I just use something that has logged these things for me, so it's easy to go back and revisit my old workflow. That's essentially where this functionality fits, and we will be releasing it in some form in Kedro as well.
And then there's our work with Great Expectations; we have another internal plugin there. Great Expectations is this amazing library that does data validation. Think of it like this: when I built my data pipeline, I was using a table on AWS S3 in some test environment, and it had eight columns that I needed to run my pipeline; when I deployed it in production, the table had six columns, because someone had removed columns from the dataset. How do I know that my pipeline is failing because of a data error? That's where Great Expectations plays a role, because it will essentially tell you it's failing because these two columns are missing, and you know exactly where to go and fix the error. So if you haven't checked out Great Expectations, it's a great project. We have an internal plugin that's being dogfooded internally and that we hope to release as well, so you can look out for that.
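(As a rough illustration of the kind of check being described, here is a minimal sketch using Great Expectations' classic pandas API of that era; the file path and column names are made up, and this is not the internal Kedro plugin mentioned above.)

```python
# A minimal sketch of the data validation idea described above.
# Not the internal Kedro plugin; column names and path are illustrative.
import great_expectations as ge

shoppers = ge.read_csv("data/01_raw/shoppers.csv")

# Fail fast, with a clear message, if columns the pipeline needs have gone missing.
for column in ["customer_id", "basket_value"]:
    result = shoppers.expect_column_to_exist(column)
    if not result.success:
        raise ValueError(f"Input data is missing required column: {column}")
```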
things like we’re looking at how we further position ourselves in the open source data science community because
i think someone um waylon actually says so one of our really key users he’s like
um we’re caterers in a space where everyone needs a framework for how they work but people don’t know that they need a
framework yet and they’re searching for kedros so really how do we get awareness out in that space is like really really
important to us so yeah you should see very exciting things um coming on the keterwind
and yeah we’re very excited for the
future so i don’t know if please wants to say something
Oh, sorry, I was a little bit conscious of the time. Yeah, I think Yetunde covered all the expansions, all the roadmap that we're trying to develop for Kedro's future. I guess, well, I'm not really the one developing things that much; I know where we're going, but if I could give a shout-out to everyone, I would love to know from the community which kinds of tutorials you'd like to see, what it is that everyone is looking for, because we're trying to make Kedro more and more accessible for everyone. There is a lot of love put into making this framework, and we absolutely love Kedro, and we want the community to love Kedro as well, so help us.

So, would you like them to file an issue if they want to suggest something? Is that your preferred pathway?

Yes, please do, yes.
So, going over the questions: I guess we only have time for one question, but it fits really nicely into what you were saying, because the question is, do you have a page with specific examples, like, if you have this problem, here is how you solve it with Kedro? I think that's a great idea for a tutorial, actually.

Yes, I agree. Go ahead.

I'd say we'll put it on the backlog, but not like, you know, when PMs say that, they normally mean it's not going to get done; this one is actually a ticket on our backlog, and it's supposed to be done in the sprint, so we will have that page up for you, and we'll share the link with you.
Awesome. So I guess we're coming to the end of the episode, so we get to our "rant or rave" section, where each person gets a 15-second soapbox to rant or rave about whatever topic. Yetunde, you go first.

I'm ranting about the end of summer. As a South African, when I see this encroaching nighttime, and it's going to be dark at 4 p.m., I get very nervous, and it's coming for me.

Laís?

I'm going to be insensitive and I'm going to rant about COVID. Oh, that's okay, I'm going to rant about social distancing and how all the human contact I've been having for the last three months has been remote and Zoom-based; that's what I want to rant about. It's making my days not as great as they could be, but we'll survive, it's all right.

Madicken?

Okay, I'm going to rave about how woodchucks run. A woodchuck is a small, you know, forest animal; there are some that live outside my house, and sometimes I see them walking around. They don't really like humans, so when they notice me they run, and they're like a little blob of fur; they sort of look like an otter, but running, a really chunky otter, and when they run it's extremely cute. So check them out, they're hilarious and amazing.

My rave is going to be the opposite of Yetunde's, because it's starting to get hot here in Brazil; it's almost spring, so the weather is nice and it's getting warmer, and I'm happy about that. I think we can maybe enjoy some time outside, now that things are getting a little better.

Wow, you're rubbing it in for us northern hemisphere people. I'm sorry. [Laughter]

And the food. Now we just need to come and visit.

Yeah, this is like a sandwich of making us jealous about Brazil.

I feel like you'd all be very welcome here.

I can't back you up on this one, though; I could back you up on the food one, but this one, no. I'm on the island, there is rain every day, even when the forecast says it's going to be 22 Celsius; that's our summer, it's rainy and it's cold.

Well, I'm so sorry for you, but that's how things go, unfortunately. But I'm in Brazil, so I'm okay. [Laughter]
Anyway, that's all the time we have for today, and I thank you all so much for watching, for listening, and also for participating. Yetunde and Laís, that was awesome. You can find us on Twitter at OpenTeams Inc and at Quansight AI. Yetunde, where can people find you and Kedro?

So, Kedro is easily accessible if you search on GitHub for us; you'll find everything related to the project there. If you want to ask more questions, head over to Stack Overflow, and definitely do that; otherwise you'll find us on Twitter as well. I'm there, and I ask questions and harass users in the nicest possible way to learn and get feedback about the project, and I'm pretty sure you can also find Laís that way as well.

Yes, indeed, and we posted both of our handles in the chat. We're both on Twitter, we spend a lot, well, a little, of time on Twitter, so you can find us there; you can find us on LinkedIn; you can find us on GitHub, under the maintainers and collaborators on the Kedro GitHub page. If you google our names, I'm pretty sure we're going to be there somewhere as well. So just send us DMs, come and say hello; we love talking to users all over the internet.

So, if you liked what you saw today, please go to our YouTube channel and like and subscribe to see more of this content. We look forward to you joining us next episode, where you can drop in for a discussion on Drupal.

Sounds great.