PyJanitor Book Open Source Development

About

Originally a port of the R package, pyjanitor has evolved from a set of convenient data cleaning routines into an experiment with the method chaining paradigm. Data preprocessing usually consists of a series of steps that involve transforming raw data into an understandable/usable format.

Transcript

[Music] Hello, the internet! Welcome to Open Source Directions, hosted by OpenTeams, the webinar that brings you all the news about the future of your favorite open-source projects. I'm Madicken Munk, and I'm excited to be the host for this episode of Open Source Directions. I'm a postdoc at the University of Illinois at Urbana-Champaign, I work on the yt project, and I'm based in Urbana, Illinois. Shocker. Co-hosting with me today is...

Hi everyone, my name is Henry Badgery and I'll be a co-host for this episode. I recently moved from Sydney, Australia to Austin to join OpenTeams as the growth marketer, so that's very exciting, and while I'm new to open source, I'm very excited for this episode. So on that note, I'd like to introduce our amazing guest for the episode, Eric.

All right, hi everybody, I'm Eric, and I do research at the Novartis Institutes for BioMedical Research in Cambridge, Massachusetts. I work in a team called Scientific Data Analysis. It's technically part of IT; we're simultaneously a special ops team and a spare tire for colleagues. If we want to talk about that in another setting, I'm happy to do so as well. My current specialty is bringing Bayesian machine learning and statistics to our work in biomedical research. My manager's a frequentist, but we get along very well.

Well, thank you for introducing yourself, Eric. Now, before we dive too much into the meat of the episode, let's go to our famous tweet of the week section, where each of our panelists presents a tweet that they've been enjoying recently. Eric, what do you have to share with us today?
Definitely. So the tweet that I have is one about communicating math in a reader-friendly fashion. You have tweets that look like this, and the key point is that equations are very, very abstract, so making concrete what the terms in an equation are is really good. Even better if we can pair code with the equation; then that really makes it come home for those of us who are programming types.

Yeah, this seems like a really nice way to also explain it to somebody who doesn't necessarily understand the syntax of the math, so they can understand the physical application that the equation is applying to. Okay, what do we have next?

So, yeah, my tweet of the week is something a bit different, but I stumbled upon it and I thought, okay, it's kind of creepy, but it kind of is this new functionality with the AirPods. Essentially, if you go to your settings and go to Control Center, you can turn on a hearing-aid kind of thing for your AirPods. Say you leave your phone in one room and go to a different room: if you have your AirPods in, you can hear what is being said in that room through your phone. So that was something to share, in case any of you out there use that. Just something a bit sneaky.

Yeah, definitely sneaky.
I also saw in the answers for this that you could use it as a walkie-talkie, which seems kind of cool. That seems like a really interesting feature. So for my tweet of the week, I'm a little biased here; as some of you know, I'm into this. This is something that came up last week related to the yt project, which is the project I work on. This is an image that is composed of four different fields from simulation data. This person used yt to visualize each field and then stitched them together using ImageMagick, and I just think it's such a beautiful visualization. I really love how the fields fit together, and you can see how, in the same region of the data, the different fields can be visualized so differently. I don't know, I found it really fun.

That's something you'd definitely just put on a big poster and stick on your wall.

I know. Some people have such cool simulation data, and I'm incredibly jealous, because sometimes you just have to do a 2D plot, and it might be interesting, but it's not as beautiful; it's not a circus, right?

So colors are a way to bring it up, right?

It's true, and good colormaps, as we all know. Okay, well, now that we've talked about the tweet of the week, let's dive a little bit more into pyjanitor. pyjanitor is a project that extends pandas with a verb-based API, providing convenient data cleaning routines for repetitive tasks. You can find the source code for pyjanitor at github.com/ericmjl/pyjanitor. It has four hundred and seventy stars on GitHub as of this morning, and across PyPI and conda it has about two thousand downloads a month.

Awesome. Yeah, Eric, it's very, very interesting, because it's obviously a very topical thing now: with AI and machine learning, everyone knows how much of a pain data cleaning can be. So I'm interested to know why the project was started and what need it fills, in your opinion.

Yeah, definitely.
I basically found myself using pandas pretty heavily in grad school, and then when I started working I realized that I'd been doing the same data cleaning routines over and over, writing code that was essentially doing stuff I had done in a previous project. One thing that I didn't have much experience with back then was putting stuff together into a Python package that you could distribute and let other people use. I also didn't realize how common some of these data cleaning functions might have been across other people's work too. So I had these things, I knew how to do them, they were all in my head, but I didn't know that they could be useful to other people. Then, when a colleague at work showed me the R janitor package, that's when a light bulb went off. It was like: oh yeah, this guy made this package, janitor; he's sharing functions that he's used for the whole R data frame, dplyr world. Maybe you should do the same for pandas. Maybe you should have a shared library that puts a bunch of data cleaning and processing routines into a single package that we can distribute, so that more people can benefit. So initially it just started as porting over the R package functionality. I remember the one that I first started with, which was the clean_names function. It does one very specific thing: it puts all of the column names in a data frame into a clean format. That is to say, there are no spaces and no special characters, so you can access those columns with the df.something attribute syntax that pandas enables.
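As a rough illustration of the column-name cleaning described above, here is a minimal sketch using only pandas. The helper name sketch_clean_names and the exact cleaning rules (lowercase, whitespace to underscores, drop other special characters) are assumptions made for this example; pyjanitor's own clean_names may behave differently.

```python
import re

import pandas as pd


def sketch_clean_names(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy of df with tidied column names.

    This is an illustration only, not pyjanitor's implementation.
    """

    def clean(name: object) -> str:
        text = str(name).strip().lower()
        text = re.sub(r"\s+", "_", text)          # whitespace -> underscore
        return re.sub(r"[^0-9a-z_]", "", text)    # drop other special characters

    # DataFrame.rename accepts a callable that is applied to every column label.
    return df.rename(columns=clean)


raw = pd.DataFrame({"Item Name": ["shoes", "bag"], "MRP ($)": [220, 450]})
cleaned = sketch_clean_names(raw)

print(list(cleaned.columns))
# Cleaned names are valid identifiers, so attribute access works.
print(cleaned.item_name.tolist())
```

Because the helper returns a new data frame rather than modifying its input, it can be dropped into a longer chain of cleaning steps.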
So it also seems to be the one that really scratched an itch for many people, because it received a lot of attention, and a lot of people made contributions to it as well. In any case, the need that the project fills is basically a library of common data cleaning functions that we might all need at some point in our projects: functions that wrap two to seven lines of commonly written pandas code that we can now method chain and read off as a single step in our data cleaning routine. Apart from that, it also serves one other need, and this is a point that I've been trying to promote at work: data scientists write code, right? I write code, and writing code equals writing software, and writing software means that at some point I'm definitely going to need to organize, document and test, because someone else is going to be looking at my Jupyter notebooks. All of those correspond to basic software engineering practices. pyjanitor solves a need for me in that I can practice being a software engineer so that I can be more effective in my day job as a scientist.

I loved hearing this story of how it progressed. It's amazing to reflect on how much of the SciPy and PyData stacks have been influenced by grad students having to solve problems for their data, right? I definitely wrote some not amazingly useful code in grad school, but it's what really got me into software and into finding all the tools to help me visualize my data. So it's really cool to hear that you kept this moving forward as well. I was looking over your documentation, and you have this really cool logo and the scripted name on your docs. Can you tell me the history of the name and the logo?

Yeah, definitely. So as I
mentioned just now, it's a port of the R package janitor, so the natural name would have been to import it as janitor. But someone else had another package registered on PyPI as janitor, so we had to go with pyjanitor; otherwise it would have been python-janitor. The R world is good at coining names, and in Python we're not really that good; we just do py-something. So, pyjanitor. Now, I recognize that this is a copycat package, so this is a nod towards the R community. There are going to be some R users who are now embedded in a primarily Python environment, and if they've used the janitor package, coming over to the Python world, I've tried to make sure that there's some form of parity between the Python version and the R version. Where that's not possible, because there are some things that Python can't do, like non-standard evaluation, some functions can't be reimplemented in exactly the same way in Python. In those cases we default to: how do we make the code readable, with all of the arguments readable? So you can read off something like transform_column: this is the starting column, that's the function, that's the destination column. That's sort of how we approach that. Now, for the logo, it's really cool. The logo came out of welcoming community collaboration. This was PyCon 2019. I wanted to lead a sprint for pyjanitor, so I went up on the sprint stage and said: all right, we're leading a sprint, it's data cleaning routines, and if you work with pandas and you want to make life easier for other pandas users, come to the sprint room. So I put up a sprint, and it was really cool, because about twenty-odd people came by and we got a lot done in that sprint. But in order to make sure that people could find which of the sprint rooms we were in, I needed to have this big, humongous thing that would be iconic. So I drew a broomstick, just like this, on a 3M sticky pad. It was a completely ugly broomstick; actually, some people said it looked like a rocket, a broomstick rocket taking off. But it served the purpose. During the sprint, one of the contributors had just finished his pull request and was waiting for the build to complete. So he looked at that logo, Lucas Kushner is his name, whipped out his tablet, took a photo of it, and then just started digitizing it. Then he made a second PR.

Yeah, I hope they're a contributor forever. What a contribution to the project. It's awesome.
Are there any alternative projects out there?

Yeah, so for alternative projects, there were quite a lot, and a lot of them stemmed from pandas users who used to be R users and who wanted dplyr. We saw a bunch: there's pdply, dfply, plydf; they're all some variant of that name. We detailed a lot about related projects in our SciPy conference proceedings paper, and that link can probably go into the chat in a moment, so there's a lot of detail there. But the main finding was that everybody tried to replicate the pipe syntax and tried to replicate dplyr, but nobody was putting together all these convenience functions. Plus, yes, R has the pipe syntax, which is very idiomatic there, but in Python the idiom is to do method chaining if you want to do something similar to piping. So we chose a design, well, we sort of evolved the design, towards using method chaining with verbs for names, which is again a thing borrowed from the R world, and decided that's how we'll do it. And now it's grown into a community collection, really, of data...
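To make the method chaining idiom mentioned above concrete, here is a small sketch that uses only pandas. The function add_total and the column names are hypothetical; the point is only that each step takes a data frame and returns a data frame, so steps can be read top to bottom much like a pipe in R.

```python
import pandas as pd


def add_total(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical cleaning step: takes a DataFrame, returns a new DataFrame.
    return df.assign(total=df["price"] * df["quantity"])


raw = pd.DataFrame({"Price": [2.0, 3.5], "Quantity": [3, 2]})

# Each line is one "verb"; the result of one step feeds the next.
chained = (
    raw
    .rename(columns=str.lower)               # pandas-native step
    .pipe(add_total)                         # custom step, inserted with .pipe
    .sort_values("total", ascending=False)   # pandas-native step
    .reset_index(drop=True)
)

print(chained)
```

DataFrame.pipe is the pandas-native way to insert a custom function into a chain; the transcript goes on to describe how pyjanitor instead makes such functions callable directly as data frame methods.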
Thanks for explaining that. You've hinted at some packages that it's related to a little bit, but can you tell us what technology pyjanitor is built on?

Yeah, definitely. The primary thing that makes pyjanitor work, especially the method chaining of functions that look like they're native to a data frame, is made possible by pandas-flavor. pandas-flavor is a package by Zach Sailer. I think, if I remember correctly from the last time we chatted, he's still part of the Jupyter team. The key idea of that package is that we can design functions that take a data frame as the first argument and return a data frame, and then we can dynamically monkey-patch those functions onto the data frame at runtime. So suddenly transform_column, which is not a native pandas DataFrame method, can behave as if it were one: you can do df.transform_column, and because it returns a data frame, you can then chain on another transform_column, or a rename_column, or a clean_names, et cetera. Now, I'm aware that the pandas devs prefer namespaces, so they'd prefer that we do something like df.janitor.something. But think about it: if I have seven janitor functions and I need to keep writing janitor.function, it's much more work than just writing df.function. So it's with that design in mind that we decided not to adopt the namespacing, because the intent here is that we're going to continually method chain and compose pandas-native and janitor-native functions together. So that's the key core of how this works. There are some extensions that have been made, and they are for xarray and PySpark data frames. Those are two packages that, by the nature of my work, I don't use much, so other people have actually made those contributions, and they're fledgling sub-libraries right now. I can't wait to see them graduate and become independent libraries on their own.
I'd love to see more users port stuff over to work with these other data frames.

That's awesome. Yeah, one thing, I think: if anyone watching isn't a contributor already, definitely have a look at the GitHub page and see if you can help out. Also, were you the person who started pyjanitor, Eric?

Yeah, so first off, I can't introduce who created pyjanitor without first introducing the original creator of janitor, the R package. His name is Sam Firke. I think he lives out in the Midwest as well; avid biker as well. He's the original, and we're the copycat serving Python people, just to make sure that's clear. So, a little brief history of how it came to be: a colleague showed me on his laptop, hey, look at this, and he had found the R janitor package. I saw the clean_names function in the R janitor package and I was like, yeah, I can reimplement it. That's how it got started, with one humble function that everybody ends up using. Then it grew. I showed it to a colleague at work and he took to it just like that, because he loved the method chaining paradigm. Soon after I started tweeting a little bit about it, people from all over the States and the world started using it, and that was, for me, very interesting and gratifying.

Yeah, and it's also fun to get responses from people when you hear what they're doing, and if they do something totally different with the package you developed, it's such an amazing feeling. I want to remind our listeners that right now we're talking a little bit about pyjanitor, and we're going to go into the roadmap discussion a little bit more about where you can contribute. But if you are hearing things that interest you and you have any questions, please post them in the questions section on the right-hand side of the webinar. We'll ask some of your questions at the end of the episode in the Q&A section, so please ask anything that interests you or anything related to pyjanitor. Okay, so now that we've talked a little bit about who started the project, can you tell us who maintains the project right now? You talked a little bit about contributors; what is the difference here?
Definitely. So maintainers are drawn from the contributor pool. Mainly it's myself plus a bunch of other people, and there are actually some I've never met in person, but I've given them commit rights to the repository because they've been really good. The way I decide to give commit rights... maybe I'll back up a little bit. I give commit rights to people who have contributed in a diversity of ways, so not only code contributors, and I'm always looking for people in the community to help. An example of some of the maintainers: one is a pair, JK and Sally, from Korea and America, a husband-and-wife team, and they actually came to both the SciPy and PyCon sprints last year. They really helped out a lot: they first joined at PyCon, then they came to SciPy, and I asked them, could you help me monitor the PRs? Because I needed them to be able to just merge as soon as they saw them pass, I gave them commit rights. They also helped out in other ways at the sprint as well, but that was the catalyst for me giving them commit rights. Then there's another colleague, Zach Barry. He's a colleague of mine; he helped lead the SciPy sprint, he's the one who really took to it, and he helped with the xarray extensions. He's in Cambridge with me. Hector is a grad student in California, and we've actually done more stuff together beyond just talking about pyjanitor. He's in the bioengineering field, which used to be my old life, and when he wrote a paper I did a sort of pre-peer review, trying to see whether reviewer number two would shoot down the paper. So I just tried to make the paper as robust as possible. So yeah, you make
friends that way. Shan Dou, she lives in Northern California, and she helped a lot with the documentation. In fact, there was one thing that I just learned about last year, called semantic line breaks. Writing documentation in plain text with semantic line breaks really helps with the maintainability of the written docs: you minimize line changes when ideas change. So it's really cool. There are two people, Tia and Paul: one's in, I think, Singapore, Southeast Asia for sure, and the other's in California. Paul is an ex-Caltech grad student, if I remember. They helped a lot with designing the PySpark extensions, and right now I think there are one or two functions that have been ported over. As more contributors decide they might want this, I'm totally happy to see that grow. And then, of course, I can't forget Sam Zuckerman. Of everybody that I've listed so far, Tia and Sam I've not met in person ever, but they're two that also have commit rights to the repository. He helped out a lot early on with code and docs, and he is, I think, a bit better as a software engineer than I am, so I actually learned stuff from him too. He was also, I think, the first to propose putting in-depth examples into the documentation. At first we did an experiment with Sphinx-style long documentation, but then it got hard to maintain the consistency of the docs across functions. I hope he's not too offended by this, but I commented them out and left them in the source code, so that anyone else who wants to port them over to a Jupyter notebook, which I think is the better setting for this, can; there's an easy path to porting them over. That said, though, he did kickstart a bunch of good practices in the project, and I'm really thankful to him.

It's really great to hear when you give people this recognition, because one thing that we at OpenTeams believe is that it's just not given out enough in the open source community. That's why I really loved it when you said that it isn't just the lines of code that you write that show a contribution. Most work on open source projects isn't coding; it's other things. That's really great to hear. That's a problem we're trying to solve with OpenTeams. So you previously mentioned that you were flabbergasted because you had users from around the world. What communities, would you say, use this, and where do contributors come from?

Yeah, let's see.
Contributors have come from all over the place. Most of the work has been happening at the conference sprints, so PyCon and SciPy; those two sprints were where a lot of development and documentation work happened. Those contributors came from all sorts of organizations. Some were grad students, some were working software engineers. In fact, I think there's someone from Bloomberg who put a few lines of code in there, and their colleagues took a photo and asked if they could tweet about it. I was very happy about that.

As I mentioned, the contributors come from all over the place. One thing I've been conscious of is not putting any tracking code inside there, so that definitely makes it a little harder to know where people are using it, but hopefully it's being put to good use.

As for usage statistics from the downloads on conda and PyPI, if there are geospatial stats on there, I'd be happy to look at them, but most of the time I've just found that the things that matter don't really have to be counted. So if people find a good use for it, stars on GitHub are more than enough for me.

And you definitely have some good stars
on GitHub, so that's great. And it kind of does echo what Henry was saying: it's really nice that you include people who contribute not just code as maintainers of the project and give them commit rights. It allows people to contribute who maybe aren't as comfortable with code, or who maybe just want to get their feet wet a little bit with documentation. It's all different people, so that's really nice. Related to that, is pyjanitor participating in any diversity and inclusion efforts? Because this is certainly an inclusive sort of thought process.

Right, so not officially, but I've been influenced quite heavily by being part of the PyMC developer group. They're part of NumFOCUS, and NumFOCUS has its own initiatives and the like. The reason it's not official is that this project really is a federated side project for all of us, so there are no official affiliations for the project and therefore no official participation in anything. We haven't even done Google Summer of Code, ever. My guess is, since the repo is under my name, I probably have some unofficial benevolent-dictator-for-life powers. So whatever intersectional axes you can think of where we can look... I definitely recognize that the gender balance among those who have commit rights is skewed, so I've been looking carefully for more contributions. But I'm definitely not forcing things, because it's a volunteer-driven project, and so it's up to those who have, I guess, the time and the willingness to make an impact. However, I do have to clarify this too: it doesn't all have to be code. I think sometimes new contributors are a little intimidated by the fact, sorry, the perception, that you have to be a coder to contribute to an open-source project. Definitely not. I would love to see more contributions that are non-code, like writing docs. One thing that was really, really cool: if you remember Shan, who is one of those who have commit rights to the repository, she and a few other people actually took the time to sit down and write out first-contributor docs: how do you get set up, line-by-line instructions, with even a section contributed by someone who is a PyCharm user on how to get set up with PyCharm and conda environments and how to make them talk to each other. It's not even specific to the pyjanitor project, but they sat down and wrote it together, and that really helped me a lot when I was leading the sprint. It helped other newcomers come to the sprint and be less intimidated, because about 60 to 70 percent of first-timers to the project could solve most of their environment setup issues just by following that set of docs. I was very thankful for that, because I don't know how to talk about how to be a beginner.

I'm certainly biased; for so many maintainers of
open-source projects, you forget some of those workflows that you've shifted over to, especially going from beginner to maybe a little more intermediate. So that sounds like such an invaluable contribution to the docs, and it's great for super beginners. That's really nice.

It's very important. I find, just from talking with code reviewers, or even people who have programming skills and want to get into open source, there are just so many barriers, so having that for beginners is amazing, it's fantastic. And I think you also put tags on your GitHub issues to say that they're beginner-friendly. I think it's just really breaking down these barriers for these first-timers to feel comfortable coming in, because it's like walking into a large project room, just sitting down and being like, I'm going to help in this way, this is what I'm going to fix. No one really does that for the first time. And really the only difference is that we're doing it online, where you never really meet the person. So I think breaking down those barriers is a fantastic thing. That's great to hear.

Yeah, thanks a lot.

So now we're going to shift into the project demo, where we'll get to see some of the cool features of pyjanitor and how it works. While Eric's getting set up here, we'd like to take the opportunity to thank our sponsor, Quansight, for sponsoring this episode of Open Source Directions. Quansight: creating value from data. So with that, Eric, when you're ready and you've got your screen shared, take it away.
Okie dokes, let me see. Can you all see my screen?

Yeah, we see your screen. You might want to make it a little bigger.

Definitely, definitely. So that looks good? How does the font size look now?

Great.

Okay, so I've got up here a bunch of notebooks, two notebooks primarily, that I will show. These come from the Jupyter notebook examples that are part of the pyjanitor documentation. In case you're curious, the URL for the pyjanitor docs is pyjanitor.readthedocs.io, and if you go to Examples, there's a whole bunch of Jupyter notebooks that have been converted to Sphinx docs that you can look at to see how to use certain functions and the like. For today I'm going to showcase two notebooks, which I think give a pretty good flavor for how the syntax of the project works.
All right, so first off we have this notebook that talks about this function called groupby_agg. The way that groupby_agg works is, let's say I've got a data set that looks something like this: item, MRP, and number sold. But I want to attach, as a new column, the average MRP for shoes to the shoes rows and the average MRP for bags to the bag rows. If you were to do this with regular pandas, my guess is you might do something like, let's see, I'm doing this live: df.groupby on item, then .agg, where the average MRP is the mean of MRP, and then you'd have something like that. So grouped_means is equal to that, and then you'd have to do df.merge with grouped_means.reset_index(). I think that should get you to where we're supposed to be. Yeah, there we go: shoe, shoe, shoe, bag, bag. That's pretty much how we would do it.
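The two-step pandas approach narrated above might look roughly like the following. The data values and the column name avg_MRP are made up for illustration, since the notebook's actual contents are not in the transcript. The last statement shows a transform-based variant, which is one way (an assumption, not a description of pyjanitor's internals) that a helper like groupby_agg could attach the group mean in a single step while keeping the original row order.

```python
import pandas as pd

# Illustrative data; the real notebook's values are not in the transcript.
df = pd.DataFrame(
    {
        "item": ["shoe", "shoe", "bag", "shoe", "bag"],
        "MRP": [220, 450, 320, 200, 305],
        "number_sold": [100, 40, 56, 38, 25],
    }
)

# Step 1: compute the mean MRP per item (named aggregation).
grouped_means = df.groupby("item").agg(avg_MRP=("MRP", "mean"))

# Step 2: merge the per-item means back onto the original rows.
merged = df.merge(grouped_means.reset_index(), on="item")

# Single-step variant: transform broadcasts each group's mean back to its rows,
# so the result lines up with the original index and row order.
one_step = df.assign(avg_MRP=df.groupby("item")["MRP"].transform("mean"))

print(merged)
print(one_step)
```

Both versions attach the same per-item averages; the transform-based version never leaves the original index, which is why row order is preserved without any extra work.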
Right, but that's two lines, which is probably one line too many for what I'm interested in writing. So someone who is lazy enough in their day-to-day, and I mean lazy in the Bill Gates "I want to hire lazy people" kind of lazy, someone who's lazy enough in their day-to-day actually put in a groupby_agg function that allows us to perform that operation. So first off, actually, here's, by the way, the clean_names function, that famous starting point. If you notice, the column names have changed, and they're now all lowercase. That's the effect that clean_names has. And if you do a groupby_agg, yes, I'll admit it's more lines of code, but it's a lot easier to read: you group by the item, you aggregate by mean, you aggregate over MRP, and you create a new column name for that, and then suddenly they're all there. We also have the added bonus that the original order is actually preserved, so there's a little bit of magic that went on underneath the hood. That's one example of a convenient data cleaning function that we now have as part of our tools. Okay, one thing that I do encounter a lot when I'm doing Bayesian statistical modeling is that if I have an item that is categorical, usually, if I'm, say, fitting a linear mixed effects model or doing some hierarchical modeling, I need these string terms to be encoded as integers so that I can group and index things correctly. What you might do is bust out, oh, let's get the scikit-learn LabelEncoder.
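The encoding step described above, turning string categories into integer codes for grouping and indexing, can be sketched with pandas alone. This example swaps in pandas.factorize rather than scikit-learn's LabelEncoder to keep it dependency-free, and the column name item_enc is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"item": ["shoe", "shoe", "bag", "shoe", "bag"]})

# factorize returns integer codes plus the array of unique labels.
# sort=True assigns codes in sorted label order (bag -> 0, shoe -> 1).
codes, labels = pd.factorize(df["item"], sort=True)

encoded = df.assign(item_enc=codes)

print(encoded)
print(list(labels))
```

The integer column can then be used directly as a group index in a hierarchical model, and the labels array maps each code back to its original string.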
I mean sure right four more lines of code but what if you could just do label not labeling code and get them all done
like that so that’s sort of like a convenience again it’s all about convenience right like it’s it’s and
putting in verbs as names so group by AG clean names group by aggregate and then label in code this particular column and
get back and we made column right that is that that’s sort of like the whole
mantra behind here so that’s one example the other quick example that I’ll show
is actually an example that I copied directly over from the R package. So once
again, this is the copycat; the R package is the original. So you have this
Excel file, and I know I’m gonna sound like a snob saying this, but Excel is horrible, especially if you deal
with the biosciences: gene names get clobbered by Excel. Don’t use Excel, I know, but also if you use Excel you can
get users who do crazy things like this: they keep an entire column that is empty and an entire row that is empty just for
the sake of visually separating things. But tabular data was
never really meant to be like that, right? So there are some things that we might want to do with the data frame
that would help us clean the data conveniently, right?
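The two conveniences described just before the Excel example (a group-by-aggregate that keeps the original row order, and label encoding of a categorical column) can be approximated in plain pandas. The following is only a sketch under stated assumptions, not pyjanitor’s implementation: the toy values and the column names `mean_mrp` and `item_code` are made up for illustration, and only `item` and `MRP` come from the talk.

```python
import pandas as pd

# Toy data; the values are invented purely for illustration.
df = pd.DataFrame(
    {
        "item": ["shoe", "shoe", "bag", "shoe", "bag"],
        "MRP": [220.0, 450.0, 320.0, 200.0, 305.0],
    }
)

result = df.assign(
    # transform() broadcasts each group's mean back onto its own rows,
    # so the original row order is kept and no merge step is needed.
    mean_mrp=df.groupby("item")["MRP"].transform("mean"),
    # factorize() maps each distinct string to an integer code, numbered
    # in order of first appearance ("shoe" -> 0, "bag" -> 1).
    item_code=pd.factorize(df["item"])[0],
)

print(result)
```

Note that `pd.factorize` numbers categories in order of first appearance, whereas scikit-learn’s `LabelEncoder` numbers them in sorted order; passing `sort=True` to `factorize` gives the sorted behaviour.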
In this example, the famous clean_names function comes in. It looks
for a certain subset of special characters, but some of them it still
might preserve over there. But as a first approximation, the column
names are already a little better. Now there is a function, again ported over from the R package, called remove_empty,
that removes this “do not edit” column and removes row number seven. So if you do
that, then it removes that empty row and then reindexes everything correctly, and
that “do not edit” column is suddenly gone. It was previously in between full-time and certification, and so now
between full-time and certification we don’t have it. Then there’s renaming columns. Usually if you wanted to rename
a column in pandas, you do df.rename and then you need to specify an axis,
but because this is explicitly rename_column, we sort of just wrap that in a
convenience function that you can just call, right? So you rename columns. Or if
you instead want to mix and match this with the pandas native function, which
is totally possible, you can do something like rename, and then you pass in a dictionary, “% allocated” to “percent allocated”, etc.,
and then axis is equal to columns, if I remember that API correctly. So one thing
about the pyjanitor project is that we try not to clash with native pandas, so you actually can mix and match very
easily, using pandas class methods alongside
janitor functions, which have been monkey-patched in. One thing I want to... so if you
can show maybe one more highlight, I’ll go into the roadmap so we get time to talk about where you want to go with
the project. Definitely. I like this one:
convert_excel_date, and it’s something that converts that thing that was a
clobbering of the Excel date into the right thing. And in fact, I
think Filipe Fernandes, who is one of the conda-forge folks, contributed a few other things,
like convert Unix date and convert some other date, because he had to do that, and so he helped a lot, just getting
conversion of dates into the package too, so that was really nice. Yes, and dates are such a
challenge, all right: data frame operations, or any type of operation in Python, I don’t know, it’s
such a headache. So now we’re gonna move into the roadmap discussion.
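The cleaning steps from the demo above (cleaning column names, dropping the empty row and the empty “do not edit” column, renaming a column with native pandas, and converting an Excel serial date) can be sketched using pandas alone. This is a hedged illustration, not pyjanitor’s code: the data values, the helper `tidy_name`, and the exact column names are assumptions, and the date step assumes Excel’s default 1900 date system, for which 1899-12-30 is the commonly used origin.

```python
import re

import numpy as np
import pandas as pd

# Toy stand-in for the messy spreadsheet; all values are invented.
raw = pd.DataFrame(
    {
        "Full Time?": ["Yes", "No", np.nan, "Yes"],
        "do not edit! --->": [np.nan, np.nan, np.nan, np.nan],
        "Certification": ["Physics", "Music", np.nan, "Art"],
        "Hire Date": [43831.0, 43832.0, np.nan, 43862.0],  # Excel serial dates
        "% Allocated": [0.75, 1.0, np.nan, 0.5],
    }
)


def tidy_name(name: str) -> str:
    """Hypothetical helper: lowercase, collapse non-alphanumerics to '_'."""
    return re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")


cleaned = (
    raw.rename(columns=tidy_name)          # roughly what a clean-names step does
    .dropna(axis="index", how="all")       # drop rows that are entirely empty
    .dropna(axis="columns", how="all")     # drop columns that are entirely empty
    .reset_index(drop=True)                # renumber rows after the drop
    .rename({"allocated": "percent_allocated"}, axis="columns")  # native pandas rename
    .assign(
        # Excel serial day counts -> timestamps (1900 date system assumed).
        hire_date=lambda d: pd.to_datetime(d["hire_date"], unit="D", origin="1899-12-30")
    )
)

print(cleaned)
```

Here `rename(..., axis="columns")` is the native pandas form the speaker was recalling; `rename(columns={...})` is equivalent.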
Another fun segment of this webinar, where we’ll talk about where pyjanitor is going and future directions that it’ll be
taking. For those of you who are listening, these items are places that pyjanitor is going to be looking for code
or funding or people to participate in the project. Those are all things that
pyjanitor would like to see happen moving forward. So with that, Eric, can you tell me about what directions you want to see
pyjanitor go, what new features it could have, things like that? Yeah, definitely. So the first thing
that I’m hoping to see is more people contributing examples, particularly if
you’re not using it in work, right? If you use pyjanitor for work,
keep your proprietary stuff proprietary. But if you use
pyjanitor for work and you have some spare time and you do some side projects that use data cleaning methods, I’d love
to see them as examples in the examples directory, because that is a
great way to show people how the package gets used. So one particular thing
that I’m hoping to see is more contributions of examples of how the package is used. The other part then, and
and part of the pen is one thing I recognize is like it’s still stuck to my
username right now and I’d like that to be more like PI MC 3 where it’s stuck under a PI MC devs github organization
rather than my own personal user username that said there’s a bunch of
stuff that’s sticky right now so like like continuous integration systems all under my account pi PI is all under my
account I get hit by a bus someone else has to know Best Actor touchwood I don’t want that to happen no
bus factor oh I definitely want to keep
one thing I do want the project to stay on track with is that it should not be
monopolized by a single organization, the way, say, some open-source projects are,
even some of the ones that we really love, right? But I’d prefer to have it not monopolized by a single organization,
so there’s no implicit glass ceiling that comes from internal discussions
that happen at work that nobody else outside of the org is part of. So that’s definitely one thing that I’d
like to see the project stay on. I have been extremely
horrible at doing promotion of the project and social media work recently
because I’ve been completely swamped at work, so I definitely would appreciate
volunteers helping with that aspect, with social media promotion of the
project, helping with documentation, encouraging usage, and of course if there are new data cleaning functions that I think might be
gold to other people, I think that’s a great thing that we would love to see inside there. To be sort of
organization agnostic, maybe, so that you’re not dominated by a particular organization, do you have a formalized
governance structure? Would that be a place that somebody could help out, like,
you know, coming up and proposing something? Yeah, yes, absolutely, absolutely.
I am actually not very well versed in this stuff, so sticking around as
BDFL is probably not the long-term thing for me to do. I think I remember Matt Rocklin’s blog
post on the seven stages of open source, and the last one
is retiring, because the project lives on without you. Yes,
that’s right. So okay, you’ve talked a little bit about some places that contributors can come in and
help. Are there sort of big new features that it would be helpful, if somebody wanted to
help fund the project or fund developer time to work on the project, that you can see would be a good avenue
for that? Yeah, I’d say so. We have
an unfortunate fragmentation in the state of the ecosystem of data frame implementations. I do wish
everybody would just settle on one data frame and move with that, build on top of
that, but that’s not the state we’re in right now. So definitely if
there’s interest in xarray, PySpark, and my favorite, Dask data frames, porting
the functions over such that they’re compatible with those, that would be a great place
to be. For me, I use Dask, but I’ve used Dask not the data frame part as much,
because my data hasn’t been that big, but I’m pretty sure those who deal with data frames at scale probably might benefit
from a few of the data cleaning functions that are available here. Yeah,
and that would really, I think, change a lot of people’s data operations, in industry or
in academia, whatever they’re doing their analysis in. Okay, well,
this sounds really good, and lots of cool avenues again to contribute: you have new
features that would be helpful, documentation, things like that. So with
that, let’s switch into our Q&A section. We have two, and maybe Henry
can read the first one. Yes: what are some of the interesting use
cases that you’ve seen for pyjanitor? Yeah, let’s see, interesting use cases. I have
to think about that. Well, if you’re okay with this,
Henry, I might want to broaden the question a little bit, because I can’t immediately think of an interesting
use case, but I have seen the package used by people who are in finance.
So Sam Zuckerman is in finance, so he’s contributed a number of
finance things, and he actually also helped with contributing, if I
remember, it was him or one of two people who contributed a function that allows us
to do inflation adjustment of prices, so
you can adjust some old historical price into modern times. That was an
interesting one that I saw. I myself have had use cases in
chemistry, cheminformatics for example, so there is a cheminformatics
submodule over there, and so I use it for chemistry machine learning purposes, like
baseline modelling and the like. So maybe those do count as
interesting use cases. Okay, so when I
was looking through pyjanitor’s docs, I noticed you had API documentation for a
lot of different domains, like biology and chemistry and machine learning. Are there any domains that you think
haven’t realized the power of pyjanitor, or that you would like to see
additions in the API documentation from?
Yeah, one thing that I know could be pretty handy is geospatial analysis. So
when you have long-lat coordinates, you might want to calculate a column that goes like “distance to
another reference coordinate”, for example, and so rather than reimplement the math every single time for every
single user, why don’t we just put in a function that does something like that? So geospatial analysis is one
I think where people might find it handy. In network science, you can represent a graph as an adjacency
matrix, so an adjacency matrix with attributes for the edges can be
represented in basically one data frame. So I’m pretty excited to see whether we
can do something automatic, like if you give me a data frame, I give you one more
function that calculates something like,
what do you call it, centrality metrics, without you needing to jump out into
NetworkX and then come back into a data frame, right? Stuff like that might be pretty handy. Yeah, that would
be extremely cool. Okay, so related to what you just answered, you were talking about
calculating distances and stuff. Do you already have an internal unit handler, or would that be a place
somebody could contribute? I think that’s something someone could
contribute, especially since there are, I think, at least two unit handling
packages that are available for the Python world, one being unyt and the other being Astropy, if I remember
correctly. So definitely seeing some interoperability there would be pretty
cool. Okay, cool. So another sort of pathway, not going back to our... yeah, okay.
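The “distance to another reference coordinate” column mentioned in the geospatial answer could be computed with the haversine formula. The sketch below uses only the standard library and is not part of pyjanitor: the function name `haversine_km`, the Earth radius constant, the sample coordinates, and the `distance_to_ref_km` key are all assumptions for illustration. It returns plain kilometres and leaves the unit-handling question raised above open.

```python
import math

EARTH_RADIUS_KM = 6371.0  # approximate mean Earth radius (assumed constant)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    # Clamp to guard against tiny floating-point overshoot above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# Hypothetical reference point and rows, chosen so the answers are easy to check.
reference = (0.0, 0.0)
rows = [
    {"lat": 0.0, "lon": 0.0},
    {"lat": 0.0, "lon": 90.0},
    {"lat": 90.0, "lon": 0.0},
]

# Add the derived "distance to reference" value to every row.
for row in rows:
    row["distance_to_ref_km"] = haversine_km(row["lat"], row["lon"], *reference)

distances = [row["distance_to_ref_km"] for row in rows]
print(distances)
```

The same function could be applied row-wise to a data frame column pair; wrapping the result in a units package would be one way to make the kilometre assumption explicit.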
So thanks for taking the time to sit with us. So,
to round out the episode, let’s go into our rant/rave section, where each person gets a 15-second soapbox to rant or
rave about whatever topic they feel passionate about this week. So Eric, why
don’t you start? Okay: don’t touch your face, do the corona shake, so bump your
elbows, and we need more roundabouts to slow traffic in cities. That’s my rant. Couldn’t agree more.
There are roundabouts everywhere in Australia, and here there are no roundabouts, just red lights, and
[inaudible] turn right at the red light, on a red light; that makes a difference. [inaudible]
So, to share my experience: this morning was a pretty bad start. This morning, before the webinar, I looked
at my computer battery and I had about 20%, and I was like, oh no. I went to grab my charger and realized that I left it in the
Quansight office yesterday. I don’t have a mode of transport, so I quickly jumped on the bike, and they’re
pretty fast, these things. I get there, I get my charger, luckily, and then I
was running out of time a bit, looking at like 10 minutes, so I quickly grabbed some food from Starbucks. I thought I could manage it in the little bike
handle thing, the basket at the front, and I quickly loaded it
up and everything, put two lids on it, and the first little ditch that I go down, it just spurts and goes everywhere, all over
my croissant. My croissant’s now soaked in coffee, and I’m looking at this coffee like, I was really needing
this, really looking forward to it. Yeah, I
tried to wipe it down but it’s still sticky. This is a rant about potholes and
how they destroy your life. [inaudible] So, well, it seems like actually a lot of
our discussions today have centered around biking and cars, so my rant, which I planned
before this happened, is: I get really frustrated when I’m driving and I need
to get into a turn lane, and the turn lane is only so long, and a car slows down because it’s a red light, and they
slow down way before they even get to the light, and so they’re just kind of coasting so they don’t have
to use their brakes, but then I miss the turn because I can’t get in the turn lane and I have to wait a whole ’nother
because they were just, like... it was infuriating. I get so angry, and it’s
ridiculous, it’s only 90 extra seconds, but [inaudible]. Yeah, yeah, just like speaking, like me, in
the mic, but you know. Anyway, that’s all the time we have for today. Yeah, yes.
[inaudible] for today. Thanks for watching, thanks for listening. You can find us on Twitter at @OpenTeamsInc
and @quansightai. If you’re interested in funding open-source projects, including pyjanitor, you can find all
the project roadmaps at openteams.com/projects. Eric, where can people
find you and pyjanitor on GitHub? So the
links are available in the chat: github.com/ericmjl and then /pyjanitor
for the repository, and I’m on Twitter under the same handle as well. So you can also ask your follow-up questions, maybe
on Twitter, if you really have burning pyjanitor questions. Join us again next episode, where we’ll be doing a
book-worthy discussion on the novel package Jupyter Book. Okay, thanks,
everybody, have a great day.
[Music]