Pandas Open Source Development

About

We were joined by Joris Van den Bossche, who talked about the work being done on pandas. Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

In 2008, pandas development began at AQR Capital Management. By the end of 2009 it had been open sourced, and is actively supported today by a community of like-minded individuals around the world who contribute their valuable time and energy to help make open source pandas possible. Thank you to all of our contributors.

Since 2015, pandas has been a NumFOCUS sponsored project. This has ensured the continual development of pandas as a world-class open-source project.

Transcript

Hello the internet, welcome to Open Source Directions, hosted by OpenTeams, the first business-to-business marketplace for open source services. Open Source Directions is the webinar that brings you all the news about the future of your favorite open source projects. My name is Melissa, I'm a mathematician and software engineer based in Brazil, working with open source software and communities at Quansight, and I'll be your host for Open Source Directions today. Co-hosting with me is Tony Fast: hi, I'm a developer advocate at Quansight, and I'm a huge Jupyter fan and a huge pandas fan, so I'm super excited to have Joris with us today. Joris, could you introduce yourself, please? Yeah, so hi all. I'm one of the maintainers of the pandas library; that's the reason I'm here today.
I have a bioscience engineering background, and I did some research related to air quality, but after science I moved to software engineering. I'm currently working part time at Ursa Labs, where we have a team supporting the development and community of Apache Arrow, something we will come back to later in the chat. I'm also a maintainer of GeoPandas, which extends pandas to work with geospatial data, and I do some teaching. But yeah, I'm looking forward to chatting about pandas this hour. Great, we're glad to have you!
This week we have our famous tweet of the week section before we start talking about pandas, and each of our panelists will present a tweet that they have been enjoying recently. Here is your first up. Yeah, is the tweet shown, or do I need to show it? We'll share it in the chat, so folks can follow along. So it's a tweet about an interview with Sanne Blauw. It's not directly related to pandas or open source, and the tweet itself is from this week, but it's more an excuse to mention her and her book, which was already released earlier this year. Sanne Blauw is a journalist at The Correspondent, which is an online platform for what they call unbreaking news: they don't want to do the breaking stories, the urgent news, but longer-running, more in-depth stories. Sanne writes in general about numbers, and the book is about how numbers have a big impact on our lives, but also about how it's very easy to mislead with numbers. She really writes in a very entertaining and easy-to-understand way, with a lot of stories. I read the book this year and really enjoyed it, and in the end it's about data, so there is certainly a link with the topic of today.
Well, that was awesome. I think we should add a whole new topic to Open Source Directions, which is book recommendations, because I want to read this book. But nevertheless, I'm going to go with something a little lighter here and actually do a tweet. This one is very cute, and it really convinces me that it's turtles all the way down. So if you think you're having a hard time programming, imagine what life is like for this little turtle writing punch cards, trying to sustain open source at the speeds we have to work at. Oh goodness, you pity that little guy. But yeah, that's my tweet, just a little bit of nice cuteness for the week. [Laughter] I really love that, that's awesome. So my tweet of the week is a tweet by Vicki Boykis, and I really like her Twitter profile. She has a thread where she mentions that she wishes she had learned more computer science history in class, and so there's a bunch of resources in the replies about computer science history, the history of programming languages, and some apocryphal stories. I found it really interesting and thought it was nice to share. She's really good at Twitter. Yeah, she is, I like it.
Anyway, I guess we're ready to talk about pandas. So pandas, if you don't know the project, aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis and manipulation tool available in any language. It provides a fast and efficient DataFrame object for data manipulation with integrated indexing, tools for reading and writing data between in-memory data structures and different formats, and other features for data analytics and manipulation. It has about 27,000 stars on GitHub and had about 28 million downloads last month across PyPI and conda, which is amazing.
Well, the pressure is really on in maintaining this project, there are so many users. It had to start somewhere, though. Joris, could you tell us where it started, why it started, and what needs it filled at the time, please? Yeah, so pandas was started by Wes McKinney in 2008, so already quite a long time ago. At that time, libraries that we still know and use today, like NumPy, SciPy and matplotlib, already existed, so you already had a scientific Python ecosystem, but a tool to work with a tabular data structure didn't really exist; there were some first attempts, and that's the gap that pandas filled at the time. Something like R's data.frame, for example, already existed, so people doing statistics or working with tabular data might have used R, or Stata, or SPSS, the statistical toolboxes of the world. That's what pandas provided for Python: working with these data frames, and also a lot of functionality to work with them. I remember during my master's thesis I was doing time series analysis with just plain NumPy, and in the years afterwards, during my PhD, I started to use pandas, and it was a big relief that it made it so much easier to work with certain kinds of data workflows, and a big boost to my productivity. So that's the reason why pandas was started. Yeah, I can imagine; if you have messed with structured arrays in NumPy, you'll know that it hurts. Not easy indeed.
Can you explain a bit about the history of the name and logo of the project? Yeah, the name itself originally comes from "panel data". I suppose not many people know that or associate it with the name; I'm also not very familiar myself with the concept of panel data, but it's a term used in, I think, econometrics for multi-dimensional data, for example where you measure multiple variables over time. Pandas originally also had a Panel object, which was kind of a 3D data frame, but in the meantime we actually removed that, because we thought it's good to focus a little bit. Pandas is already a huge project, and we decided to focus on the classical, more typical 2D table and not go into multi-dimensional tables. But that's the origin of the name.
Well, we know that pandas has a huge API, so there are tons of questions folks can ask, but if anybody watching does have questions, please feel free to drop them in the chat, and we'll try to get to them throughout the episode, or at the end of the episode, where we'll definitely have dedicated time for that. So Joris, can you tell us a bit about what technology pandas is built on? Yeah, so actually most of pandas is written in Python, but a considerable part is also written in Cython, and a little bit in C. Many of the fast custom algorithms implemented in pandas, like groupby or certain time series and moving window operations, are written in Cython. But of course we also rely a lot on NumPy, which is itself written mostly in C. So that's a bit the technology stack.
Oh, pardon me. So this is a big project and a lot of people are using it, so who maintains the project? Is there a large team around this? Nowadays it's actually starting to be quite a big team; it has increased. When I started, there were only a few core developers, or maintainers as we call them; nowadays we are around 20 core developers, plus a huge number of more casual contributors who pass by or contribute for a while. I think in total we have more than 2,000 contributors, but that number is not very important if many of those people only make one contribution; the more important thing is that we nowadays have quite a big set of people actively contributing. I looked up a few numbers: for example, since the last release this summer, more than 20 people made at least five pull requests each. So it's quite a nice team of people contributing, mostly volunteering. We have a few people now who are supported financially a little bit; last year, for the first time, we had some actual funding through the CZI program, like some of the other Python projects as well. So, for example, Simon Hawkins, who has been doing the last releases, is paid partly with that grant for the maintenance work he's doing. The good thing is certainly that we are starting to have a bigger team; of course, that also comes with coordination overhead, and there are still so many pull requests that need to get reviewed, and people always prefer to actually write code rather than do reviews. I mean, it's already a good thing that we have a lot of active people, but there are still certainly many challenges in maintaining such a large project. Yeah, I think that's not a solved problem: how do you onboard and keep maintainers around to sustain a project in a healthy way? That's a hard problem to solve. And what communities are the users and contributors of pandas from?
I think nowadays it's very diverse. Originally, for example, Wes McKinney himself came more from the finance world and statistics; he did a PhD in statistics and worked at a finance company, where he started pandas, and I assume it's still used a lot in the financial sector. But nowadays I think basically everybody who works with data, which covers many fields, and who is using Python will often also use pandas, both in industry and academia, so it's quite a variety of backgrounds. Oh goodness, everybody's got data now; if it wasn't for pandas, I wouldn't be able to manage it, I know that. So there are a lot of contributors and a large community: is pandas participating in any specific diversity and inclusion efforts as the community grows? Yeah, so first, we certainly have to, and we are very well aware, as are other projects, that the diversity in our core development team is not what it should be, unfortunately, and not only in our core team, but I think also in general among people contributing. It's a difficult problem to solve, but there are a few things we're doing to try to help. For example, in 2018 there was a worldwide documentation sprint coordinated by Marc Garcia, from a lot of locations, so not only the typical Western countries but around 30 locations worldwide, and we had more than 200 pull requests, which was a really great success. There are also the Pandanistas, a group trying to provide mentorship and organize sprints; they are organizing a few sprints this year targeted at minorities, using a small development grant from NumFOCUS. We are also thinking about, for example, taking on an Outreachy intern; I know that NumPy did that as well. But a big problem here is finding mentors who have time, because if you want to mentor somebody well, that takes a lot of time, and since we're mostly volunteers, that's also a hard problem. So for sure we should try to do more, but that's a few things we're doing.
Nice, yeah. I think that's part of the funding problem as well; some people and some institutions are working on that, and I think we have a long way to go, but these actions are important. Now we're going to shift into the project demo, where we'd like to see some real cool features of pandas. So while Joris is getting set up and sharing his screen, we would like to take this opportunity to thank our sponsor, Quansight, for sponsoring this episode of Open Source Directions: Quansight, creating value from data. So Joris, whenever you're ready, we are excited to see your demo. I really appreciated the honesty about the diversity and inclusion efforts; it's always an ongoing thing, and some of the steps forward sound promising. Do you see a JupyterLab notebook? Yes, we do. Okay, so I prepared a few notebooks to demo some things. Would you mind doing two things for us? Maybe closing the file browser and cranking up the font a little bit, perhaps? Yeah. Thank you, that should already do it. Awesome, beautiful. And I can also remove this.
So, this first notebook actually starts with a very basic intro, at least for people who are not familiar with it. Melissa, you already introduced pandas, and it provides a DataFrame, so here you see a small example of a DataFrame. What is typical about the DataFrame, certainly compared with NumPy for example, is that we have columns, and columns have a certain dtype, but we can also have columns with different data types, so you have heterogeneous data types. And then there is a lot of functionality that we provide; I just gave a small example here: you can do grouped aggregations, like taking the mean of a column grouped by other columns, and do some reshaping to see, for example, how many people survived the Titanic disaster. So that's, yeah, what pandas is.
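The kind of operations just described can be sketched on a tiny hand-made stand-in for the Titanic data (the column names and values here are illustrative, not the actual dataset used in the demo):

```python
import pandas as pd

# A tiny hand-made stand-in for the Titanic data: one DataFrame,
# heterogeneous column dtypes (strings, integers, floats)
df = pd.DataFrame({
    "sex": ["female", "male", "male", "female", "male"],
    "pclass": [1, 3, 2, 3, 1],
    "age": [29.0, 25.0, 30.0, 22.0, 40.0],
    "survived": [1, 0, 0, 1, 1],
})

# Grouped aggregation: mean of one column, grouped by another
mean_age = df.groupby("sex")["age"].mean()
print(mean_age)

# Reshaping: survival rate per sex/class combination
table = df.pivot_table(values="survived", index="sex", columns="pclass", aggfunc="mean")
print(table)
```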
Pandas is constantly being improved by a large team, as we already talked about. There are many new features and bug fixes in each release, way too many to show, but I wanted to show two things from the last releases.
One is an integration with Numba. Numba, for those not aware, is a just-in-time compiler, so it can speed up Python code by compiling it on the fly, and it's specifically targeted at numerical code, so working with arrays. It's used a lot in the ecosystem nowadays, and we developed some integrations with pandas; Matthew Roeschke especially worked on that over the last releases. There are certain situations, for example when you do a groupby operation or a moving window operation (the rolling function here is a moving window operation), where you want to apply a function on each group.
So typically, let me show here: if you do a rolling window operation, you can say, okay, I want a moving window of 10 rows, so the first 10 rows, then the second to the 11th row, the third to the 12th row, et cetera. And we provide a set of very fast built-in functions; for example, for each window I can take the sum, just a dummy example, but those built-in operations are very fast, mostly implemented in Cython. But sometimes you want to calculate something that isn't just a sum or a mean, something that isn't provided by pandas, and in those cases it can be useful to write your own function and apply it; you would typically call that a user-defined function. So I made a small example here; it's a very dummy example, and you could do this much more efficiently, but just for illustration: take the sum of the window and add five to it. If you do that on the same data, you can see that instead of being almost instant, around 26 to 28 milliseconds, it now takes four seconds on this relatively small dataset, because now it needs to loop in Python and call this Python function on each of the windows, and it's not easy to optimize that in Cython, because each time we need to call this Python function on each of the windows.
Now, the new feature is that there is a Numba integration here. If your user-defined function is a function that can be compiled by Numba, then you have the option to say, okay, use Numba to compile my function and apply it to all of the windows. The first time it still takes some time; it's already faster, but one of the gotchas with Numba is that the first time you call a function it needs to compile it, so it takes a bit longer than the second time you run it. So if I do it a second time, it's already much faster; I will also time it. Here you see we go from four seconds to apply a user-defined function on each window of my moving window operation down to, using Numba, around 150 milliseconds, plus or minus something, so a very nice speedup, I think. Numba is used here to compile the actual user-defined function, but not only that; it's also used internally: the full loop where we loop over all the windows, pass the data for each window into the function, and collect the results is also just-in-time compiled using Numba, and that's the reason it can provide this speedup. So in general, you should still always first try to use the built-in functions if possible, but if you need to write your own user-defined function, then this is a very nice way to speed up your computations.
And this is a big improvement too, because the apply function can be slow sometimes, right? So this definitely speeds it up. Yeah, in general the rule is always to try to avoid apply, and in many cases that's actually possible; I think people sometimes like to reach for apply quickly, while in many cases you can actually use one of the built-in functions, which are optimized a lot. But if you do need it, then this is a very nice speedup, because indeed, in general, apply is very slow: it's a generic function that is passed to pandas, so pandas doesn't really know how to optimize it. It's by using Numba that we can actually do this on the fly.
actually uh do this on the fly is this a bad time to interject i see uh
there was a question from paul hobson uh asking what raw equals false would do maybe i we’re getting ahead of the curve
here no no that’s a totally valid question the the reason i am
passing raw equals true here is actually to
whether you get past a series object or an actual numpy array and if i would
use files here and actually so passing series objects to my function it would even be
slower because it takes time to wrap each window into its own series and pass it
to the function but you could write in principle a function that expects a series
and uses spandex functionality but if you’re only using numpy functionality or
or in this so in in this case only numpy functionality you it’s a bit faster by using that and also
in case of of number uh numbers can work very well with uh numpy rays and understands numpy rays
so that’s the reason also that i’m using here the raw equals true
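The pattern from the demo can be sketched like this (timings will differ from what was shown on screen; the engine="numba" line is commented out because it requires Numba to be installed):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(100, dtype="float64"))

# Built-in rolling aggregations are fast (implemented in Cython)
built_in = s.rolling(10).sum()

# A user-defined function: with raw=True each window arrives as a plain
# NumPy array instead of being wrapped in a Series, which is cheaper
def sum_plus_five(window):
    return window.sum() + 5

udf = s.rolling(10).apply(sum_plus_five, raw=True)

# With Numba installed, the same call can be JIT-compiled on the fly:
# udf = s.rolling(10).apply(sum_plus_five, raw=True, engine="numba")

print(built_in.iloc[9], udf.iloc[9])  # first complete window
```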
Well, that's really cool. And I guess speed is the thing you're worried about then, like the big thing of the next version? This is also a good time to get into the roadmap discussion, maybe: what do you see for the future, what's coming ahead for pandas, and what are you thinking of doing next? Yeah, indeed. The roadmap, or where pandas is going, is always a bit of a difficult topic, in the sense that we are a community-backed project, with a lot of contributors putting some time in, but it's also a huge project, so if you want to make substantial changes to pandas, it's not an easy change; it's a big change, and it's very difficult to do that based only on people volunteering. It's again a bit related to funding: if you don't have significant financial backing, it's very difficult to say, okay, we have a vision to change this, for example to make pandas five times faster. I think it would be possible, with a lot of changes, and we have the ideas, but it's very hard to actually do it in practice with many contributors working here and there on the project. We actually do have a roadmap.
I can show it here; it's a page in the documentation with several topics on it. Depending on the time allowed, I will say something about the string data type and how it relates to Apache Arrow, and also about missing values; those are a few of the topics I would like to talk about. So I can maybe just start with the string data type. That would be really cool. Man, the new pandas docs look so nice, they're so nice to read. Yeah, we have actually adopted that theme in NumPy as well. It's great to see NumPy adopting it too, so you can collaborate on it; that's great.
So, string data. I have a notebook here to say something about strings, but first a little bit of background. I suppose, Tony, you said you also teach pandas: something that always pops up when you explain pandas to new people is the data types, and there's something I at least always need to explain. We have a small DataFrame here with string columns, but if you look at the data types, you see "object", and most of the time I need to say, yeah, okay, if you see object it typically means you have string data. The technical background for why it's object is that NumPy doesn't really have support for variable-size strings: NumPy has a string data type, but it's for fixed-size strings, and in pandas we want to support variable-length strings. So we just use an object NumPy array, which means you put Python objects into the NumPy array, and the Python objects in this case are strings. But first, it's not very user-friendly, or not very explicit, that this is actually a string column, and second, it's not strict about it, so you're never sure that they are actually only strings; it could also be a mixture of strings and numbers and other Python objects. So what did we do?
Something that was already included in the 1.0 release is a dedicated string data type. I'm using convert_dtypes here for the moment, a helper function to convert some data types; I will explain more about it in the next notebook as well. But most importantly, what you now see in the data types, compared with the object columns I had before, is that I actually have "string", so we have a string data type. To show a little bit about it, I can also create it manually, specifying here that I want the string data type. And here is a difference compared to the default: with the default object dtype, you could actually just put a number into your object column, and if you looked at the values you would see it's now a mixture of numbers and strings, so you no longer have only strings. That's something that having this dedicated string data type specifically prevents: it will actually raise an error if you assign something that is not a string, saying I cannot set a value that is not a string into this string Series.
But it's mostly just a more user-friendly interface around this object-dtype NumPy array. If you look at the values now, or at least if I ask for the NumPy array, under the hood it's still a NumPy array of object dtype; we just wrapped it in our own StringArray object, which adds some constraints and ensures that everything in there is actually a string. So implementation-wise it's more or less the same as we have always had for strings, but the intent is much clearer, and certainly for users I think it's much friendlier to have an actual string data type than having to explain what this object dtype is. So that's something that is already in the release. It's still experimental; it's not yet the default. At some point we want to make it the default, but for the moment you still need to either explicitly opt in at construction or convert to this data type.
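A short sketch of the difference just described, assuming pandas 1.0 or later (the exact error type and message for the rejected assignment may vary between pandas versions, hence the broad except clause):

```python
import pandas as pd

# Default: string data ends up with the catch-all object dtype
s_obj = pd.Series(["a", "b", "c"])
print(s_obj.dtype)  # object

# The dedicated, opt-in string dtype (added in pandas 1.0)
s_str = pd.Series(["a", "b", "c"], dtype="string")
print(s_str.dtype)  # string

# An object column happily accepts a non-string value...
s_obj[0] = 1

# ...while the string dtype is strict about it
try:
    s_str[0] = 1
    rejected = False
except (TypeError, ValueError) as exc:
    rejected = True
    print("rejected:", exc)
```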
That is very cool. Does the convert_dtypes function do stuff with datetimes? Not yet, but I will come back to that function in the next notebook; I will use it there as well. The main thing I specified here is that I don't want to convert integers and floating-point numbers, because that's also something it can do, but that's for the next notebook.
But to keep explaining something about the strings: as I was saying, it's still the same implementation, still using those Python objects, and this is actually not a very efficient way to store strings in memory, because each string is wrapped in a Python object, so you waste some memory that way. They're also not aligned in memory, so you can't run very efficient algorithms in C on them. So something else we have, which is work in progress (we hope it lands in 1.2, the upcoming release, but it might otherwise be in 1.3), is a native string data type using Apache Arrow. I will first show what it does.
I created some random data with some strings in it, I get the column out of it, and I create the string data type as I explained before, but still using the object dtype. Then, something experimental (the interface will still improve, this is only development work): the same data, but now using an Arrow memory representation under the hood. The first thing is that it actually has quite a bit better memory usage. The reason is that in the Arrow memory, all the strings are stored in one long array, in consecutive memory, while this is not the case for the Python strings, where each string is its own Python object. So this gives better memory usage; the exact difference will depend a lot on the length of your strings. But besides memory, it also gives faster operations, at least the ones we have already implemented; I'm showing a few of them. For example, we have the function lower, which converts your strings to all lowercase.
If you're doing that with Python strings (maybe for the demo I should have taken a smaller dataset, it would go faster), you can see it takes two and a half seconds on this dataset with the default Python strings, but only half a second using Arrow. There are a few other comparisons: here I'm doing an equality comparison, checking where my string is equal to "a1", and again you see a very nice speedup for this kind of operation. The last example is a contains, so a match: you want to check where this "a1" string matches some substring in the full string, and also here you can see that we obtain a nice speedup by using this Arrow memory. So how does this work? As I already explained, Apache Arrow has a more efficient memory representation for this kind of data, and it has also implemented a lot of kernels, computational algorithms that work on this memory representation, written in C++, a compiled language. In pandas, we can then use the PyArrow library to store this data in a PyArrow array and use those computational algorithms from Arrow.
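At the time of this talk the Arrow-backed string dtype was still in development; it eventually shipped under the "string[pyarrow]" spelling (pandas 1.3 and later, with pyarrow installed). A hedged sketch that falls back to the object-backed string dtype when the Arrow-backed one is unavailable:

```python
import pandas as pd

data = ["A1", "b2", "A1", "c3"]

# Arrow-backed strings via the "string[pyarrow]" dtype (pandas 1.3+ with
# pyarrow installed); fall back to the object-backed string dtype otherwise
try:
    s = pd.Series(data, dtype="string[pyarrow]")
except (ImportError, TypeError):
    s = pd.Series(data, dtype="string")

# The string operations behave the same either way
print(s.str.lower().tolist())
print((s == "A1").tolist())
print(s.str.contains("A").tolist())
```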
It's not only good news; there are some drawbacks, because you store everything together in memory, so a few things, like setitem operations, are less efficient. To end this, a few notes: this specific work was funded by CZI, and Maarten Breddels has been doing a lot of the work on the Arrow side related to those algorithms, like the contains matching, the lowercasing, et cetera. So is he using his vaex project with these string algorithms? Because I know that is a really awesome library. Not directly: in pandas we are not using vaex, but he previously wrote a lot of those algorithms in vaex, and Maarten has been funded to port them to Arrow, so that pandas can make use of those algorithms through Arrow. In the future, vaex will then also no longer have its own custom algorithms for this specific type of function, but will use Apache Arrow as well. So indeed, he started that work in vaex, but now we have moved it to Apache Arrow so we can all use those algorithms.
That's really cool, and amazing to see that you can integrate stuff like that; very interesting. Yeah, and this is something we will certainly try to do more of in the future. Apache Arrow defines a standard memory representation for tabular data. It's already used in a variety of use cases: pandas uses it, for example, to read Parquet files, and it's used to communicate data between pandas and PySpark, or between pandas and R. A lot of big data tools are starting to interface with Arrow, so I think we will see more and more that pandas starts using functionality from Apache Arrow. Given the time, I certainly want to show my third notebook,
about missing data, very briefly. That's also something ongoing that is on the roadmap; it's partly already implemented as an experimental feature, partly not yet. So I created a small DataFrame here with some different columns: int, float, bool, string, and a timestamp column. As you can see, my int column has actually become a float column, because integers don't support NaN, or missing values in general, so if you have missing data in an integer column, it gets cast to float. The same is actually true for the boolean column: booleans in NumPy also don't support something like NaN, so pandas kept it as an object column. We can also see that for float and int we are using np.nan, the not-a-number floating point value, but for string, for object-dtype data, you can actually also use None, and in datetime data we have NaT, for "not a time". So it's a quite messy state: not all types support missing values, and the missing value indicator differs depending on the type you have.
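The messy state described above can be reproduced with a few one-liners (this is the behavior of the default, non-nullable dtypes):

```python
import pandas as pd

# Integers can't hold NaN, so a missing value silently casts the column to float
s_int = pd.Series([1, 2, None])
print(s_int.dtype)  # float64

# Booleans with a missing value fall back to object dtype
s_bool = pd.Series([True, False, None])
print(s_bool.dtype)  # object

# Datetime data uses its own missing value sentinel, NaT ("not a time")
s_dt = pd.Series(pd.to_datetime(["2020-01-01", None]))
print(s_dt.iloc[1])  # NaT
```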
So, to try to improve this situation, we have been working on what we call nullable dtypes. It's a new, experimental feature in pandas 1.0, and I'm again using this convert_dtypes function from before on my original DataFrame. What convert_dtypes does is look at each column and see if it can use one of those new data types: either the string data type I explained in the previous notebook, or one of the nullable integer, float or boolean data types that are already supported. So you can see I have now converted my DataFrame to use those dtypes, and if I look at my DataFrame and its dtypes, I now actually have an integer column with a missing value, and a boolean column with a missing value. My float column now uses the same missing value indicator, the pd.NA object, as does my string column. Timestamp is not yet implemented, but making the datetime data consistent with this in some way is also something we want to do in the future.
one of the um yeah i think big road map items is to we have this new
missing value sentinel pd.na and we want this to provide the that you
can work with all the different data types i with missing values
consistently it’s very experimental so those new uh data types are um only
you won’t see them by default they’re opt-in but we hope that at some point they are yeah
fully passing the test suite for all functionality of panda so that we can switch
by default to use uh those data types so you all are like the standard for
data frames and there’s a large ecosystem that co-develops around pandas how do these changes
affect um the other communities um perhaps for uh maybe you could even tell
us some of the other communities that are dependent on these technologies um
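the convert_dtypes workflow described above can be sketched as follows (assuming pandas >= 1.0, where the nullable dtypes are experimental; column names are illustrative):

```python
import pandas as pd

# the same kind of frame as before, with numpy-backed dtypes
df = pd.DataFrame({
    "int_col": [1, 2, None],          # float64 because of the missing value
    "bool_col": [True, False, None],  # object
    "str_col": ["a", None, "c"],      # object
})

# convert_dtypes picks a nullable dtype per column: Int64, boolean, string
converted = df.convert_dtypes()
print(converted.dtypes)
print(converted["int_col"])  # the missing value now displays as <NA> (pd.NA)
```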
yeah good question i'm going to stop my screen share so i can just chat that way so there are certainly a lot of projects related to pandas some of them rely on pandas some of them provide more of an alternative i think many of the projects try to provide a similar api as pandas because many people are familiar with the pandas api but they then focus on some of the areas where pandas is not very strong for example having better performance or being able to distribute the work or run it on a gpu some examples here are dask dataframe which uses pandas itself under the hood but lets you work with a partitioned data frame and run it in parallel or on a cluster in a distributed way and in the rapids project you have cudf to have a data frame on the gpu there is for example also koalas which provides a similar api as pandas but under the hood uses spark through pyspark so there are several of those projects that provide a similar user experience as pandas but bring it to places where you can't currently use pandas directly
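a toy single-machine sketch of the partitioning idea behind dask dataframe (this is not dask's actual api, just the concept of splitting a pandas frame into chunks, working per chunk, and combining the results):

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})

# split into partitions of up to 4 rows, as dask would under the hood
partitions = [df.iloc[i:i + 4] for i in range(0, len(df), 4)]

# apply the work per partition (dask would schedule these in parallel
# across threads, processes, or a cluster) and concatenate the results
result = pd.concat(part.assign(x_squared=part["x"] ** 2) for part in partitions)
print(len(partitions), len(result))
```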
and there are certainly more than that for example vaex is something we already mentioned vaex is less of an identical pandas api but it's also a data frame library that focuses more on being able to work on big data on a single machine for example by use of memory mapping and having lazy or virtual columns and things like that it's also using apache arrow a lot under the hood to memory map big data sets you also have ibis which provides a somewhat similar api as pandas tony i think you are familiar with that but uses different engines or back-ends under the hood for example if your data is in a database then instead of getting it out of the database to work with pandas you can let ibis convert your pandas-like api call into a sql query and run it in your sql database instead so there are a lot of different alternative projects i think many of them focus on an area where pandas is lacking for example running distributed or using a different engine under the hood and it's good that this ecosystem exists if pandas doesn't cover your use case it's good that there are those alternatives and that's what makes the python ecosystem so diverse and able to cover a lot of use cases for people
i just love that it's such a huge thing and we call it an ecosystem and then there's pandas there's koalas and there's ibis so many animals i love it i now wonder if vaex or tools like that are also named after animals but i don't think so
yeah i just want to be honest i previously thought about how if we make a new version of pandas then instead of breaking everybody's code we could use a new name and i was already thinking of good animal names we could use but i wanted something pandas related there are different species of pandas but yeah i wasn't there yet
yes that requires some research so i guess we've come to the point where we answer user questions we have a couple of questions but there's still time if you want to ask something please feel free to use the livestorm interface so i'll just go to the first question which is
would contains plus regex equals true be faster so i'm guessing this is from one of your notebook commands i don't know
yeah i showed a contains operation it's actually typically slower because simple matching of the actual characters of a string is easier and cheaper to do than a full regular expression match and in this case i also didn't need it i just wanted to match a one which wasn't a regular expression so the default is actually to use regex but often you don't need it and it's a bit faster to turn it off as in this case
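the point about str.contains can be illustrated like this (a small example with made-up data):

```python
import pandas as pd

s = pd.Series(["room 1", "room 2", "hall 1"])

# regex=True is the default, but matching a plain substring like "1"
# doesn't need a regular expression engine
with_regex = s.str.contains("1")
# regex=False does a plain substring match, typically a bit faster
without_regex = s.str.contains("1", regex=False)
print(with_regex.equals(without_regex))  # same result either way
```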
oh no here i am i apologize
sorry the next question is from brendan ward the memory and speed up of using arrow data types looks great exclamation point i agree will this require optional dependencies for pandas or will it be a built-in dependency
for now we are keeping it as an optional dependency and it's certainly something we need to discuss in the future if we want to integrate more and more with apache arrow so this is for the string data type but apache arrow could also provide other data types for example they also have decimals which is something not really provided by pandas although you can in principle use python decimals so that's another data type that arrow could provide a good memory representation and fast algorithms for but also things like nested data types lists or dictionaries in the cells of a data frame so i think there is a lot of potential to expand the feature set related to data types in pandas but the more we integrate with apache arrow the more the question will keep popping up whether we actually want a hard dependency on it for now it's an optional dependency so at the moment the string data type will actually be opt-in you need to choose it but if we enable it by default it will still check if arrow is installed and use it and otherwise fall back to the default python implementation
so we have another question but i just want to go back we had a follow-up from francis who asked the first question and he just said sorry i meant faster than using python strings so i don't know if you want to comment on that quickly
so i think what i said about it being slower when using a regular expression is the case both when using python strings and when using arrow
yeah great so i'll just go to the last
question by david and he said i work in the geospatial world and use geopandas almost daily what is the relationship between the geopandas and pandas core development groups are they the same group different groups are these improvements in pandas going to be available in geopandas also
yeah good question how is the relationship i would say the relationship is mostly me since i'm both a pandas core developer and a geopandas core developer which certainly ensures that the projects are aware of each other and know what they're doing and it's certainly useful for geopandas that there is someone on the team with a lot of knowledge of what's going on in pandas but there are other people working on geopandas as well who are not that involved in pandas
and related to the last part all those things that i showed are in principle also directly available in geopandas because of the way geopandas subclasses pandas it adds a few things by making a subclass such as the additional methods to work with your spatial data but whenever you have a string column in a geopandas data frame it could also be this new string data type just as it can in a pandas data frame so many of those new features and developments in pandas will also benefit geopandas
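a toy illustration of the subclassing pattern described here (MyFrame and its extra method are hypothetical stand-ins for geopandas' GeoDataFrame and its spatial methods):

```python
import pandas as pd

class MyFrame(pd.DataFrame):
    @property
    def _constructor(self):
        # keep the subclass type through pandas operations like head()
        return MyFrame

    def describe_columns(self):
        # hypothetical extra method; geopandas adds spatial methods instead
        return {col: str(dtype) for col, dtype in self.dtypes.items()}

mf = MyFrame({"name": ["ada", "alan"], "born": [1815, 1912]})
print(type(mf.head()).__name__)  # still MyFrame
print(mf.describe_columns())     # ordinary columns keep their pandas dtypes
```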
yeah that's really cool and thank you so much for those demos it was really great to see the development being done unfortunately we're getting close to the end so now it's time for our world famous rant and rave section where we each get 15 seconds to rant or rave about whatever topic we want so joris can go first
i've recently been working with dask and geopandas to try to have them play together and to be able to have geopandas and spatial analysis also potentially distributed or parallelized and i just want to say how fantastic dask actually is it's a really nice project in how they enable a lot of other projects to integrate and improve themselves it's not just we are dask and we have a product no they really integrate with and enable a lot of new things in the full ecosystem which is great i think
oh that was great yeah so for my rave you know you forget what it's like to be a first time learner sometimes and i forgot how miserable datetime was when i first started learning python and how i don't have those problems anymore with pandas so i really just want to say if you're a scientist and you've got datetime stuff just work in pandas life is a lot easier and i forgot that that was one of the main reasons i really latched on to pandas at the time
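a small example of the kind of datetime convenience being praised here (dates are made up):

```python
import pandas as pd

# pandas parses date strings and exposes datetime components via the
# .dt accessor, which is much friendlier than hand-rolled parsing
s = pd.Series(pd.to_datetime(["2020-01-15", "2020-02-01", "2020-03-20"]))
print(s.dt.month.tolist())   # [1, 2, 3]
print(s.dt.day_name().tolist())
```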
oh that's cool i'm gonna rant and rave at the same time about time because 2020 has completely warped my sense of time and sometimes it looks like it's going really fast so we're close to the end of the year but at the same time i feel like time is dragging and it's so weird so yeah i'm confused right now my brain is trying to understand time again i don't know if y'all have the same feeling but it's been a weird year
i can't imagine what it's like maintaining a datetime package right now with all the time changes right
oh anyway thank you so much that's all the time we have for today thanks for watching thanks for listening and thank you joris for being here i think that was pretty great you can find us on twitter at openteamsinc and at quansightai and joris where can people find you and pandas
so you can find pandas on twitter at pandas underscore dev i think and myself i'm joris van den bossche it's a bit difficult to spell probably for most but there will be some links
yeah your handle is in the chat so if people are in the chat they can see it if you liked what you saw today please go to our youtube channel and like and subscribe to see more of this content we look forward to you joining us next episodes and so we don't repeat ourselves we'll be talking about dry python next episode thanks tony for that
that was fun thank you thanks for joining everyone
thank you all bye