My Project’s Takeaway from PyCon’s Pandas Session by S.Anand

Having been a regular Rails guy for the last 5 years, my appearance at the Pandas session was an exclusively project-related one. I had already heard high praise about the speaker of the session, S.Anand of Gramener. My Python experience amounts to about 6 months of work, and that was around 5 and a half years ago, so I consider myself a beginner when it comes to Python.
 
The project I was working on had a huge data ETL (Extract, Transform and Load) running on ~350 databases, with ~40 tables in each. The visits table alone averages nearly 2 million rows in each database. My project deals with moving all of this data into one single database. A lot of number crunching has already gone into populating the visits table, and the data in some of the tables dates back 5-6 years. The new implementation of visits is complex enough that the number-crunching pattern itself has been improved, so a huge transformation is involved.
 
Above everything, we are running on a tight deadline, with a very short window of downtime for the migration.
 
My team had already put everything else aside and started the work. But I wanted to explore whether we were missing a better implementation by any means, and that is what brought me to PyCon.
 
To talk about a moderately complicated scenario: where earlier we had a single table, we have now divided its columns into two different tables with a has-one relationship between them, and its rows are moved into three different tables. Evidently, various changes were introduced to depict this big difference in the ER (Entity Relationship) model. The shape of the change is below, followed by a rough pandas sketch of the same split.
 
(OLD) TableABCDetails 
 
(NEW)
 
TableA \
TableB – DetailsHasOneTable
TableC /
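Here is a minimal pandas sketch of that split. The frame, column and key names (legacy_df, detail_cols, record_type) and the connection string are hypothetical stand-ins; only the shape of the transformation mirrors our schema change.

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection; the real source database details differ.
source_engine = create_engine('mysql+pymysql://user:password@localhost/legacy_db')

# The legacy table with everything in one place.
legacy_df = pd.read_sql('SELECT * FROM table_abc_details', source_engine)

# Columns that now live in the has-one details table (hypothetical names).
detail_cols = ['detail_x', 'detail_y', 'detail_z']

# The new details table keeps the legacy primary key as its foreign key.
details_df = legacy_df[['id'] + detail_cols].rename(columns={'id': 'detail_id'})

# The remaining columns are partitioned row-wise into the three new tables,
# here by a hypothetical record_type column.
base_df = legacy_df.drop(columns=detail_cols)
table_a = base_df[base_df['record_type'] == 'A']
table_b = base_df[base_df['record_type'] == 'B']
table_c = base_df[base_df['record_type'] == 'C']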
 
What makes Pandas a perfect fit for our need?
 
Pandas has read/write (IO) facilities for CSV, JSON, Excel, HDF, Stata, HTML, the clipboard, pickle and, of course, our own SQL.
 
The session, though, used CSV as the data source. With the fantastic ORM-like interface and DSL that Pandas provides, the only difference we need to know is:

read_csv(‘…’) vs read_sql(‘…’)
to_csv(‘…’) vs to_sql(‘…’)
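A minimal sketch of that symmetry, assuming a SQLAlchemy engine; the file, table and connection names below are placeholders, not our real schema.

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string.
engine = create_engine('mysql+pymysql://user:password@localhost/source_db')

# Reading: the same DataFrame comes back either way.
df_from_csv = pd.read_csv('visits.csv')
df_from_sql = pd.read_sql('SELECT * FROM visits', engine)

# Writing: swap to_csv for to_sql and the surrounding code stays identical.
df_from_sql.to_csv('visits_out.csv', index=False)
df_from_sql.to_sql('visits_out', engine, if_exists='replace', index=False)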
 
 
Updating an existing column with data, or adding a new column whose data is derived through an equation, is a single step. No iteration is involved, at least in the developer’s code. All of this then goes back into MySQL as a bulk SQL insert.
 
Complex calculations turn into simple one-liners this way.
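A tiny sketch of what that looks like, with made-up column names (duration_minutes, pages_viewed, engagement_score):

import pandas as pd

visits = pd.DataFrame({
    'duration_minutes': [3.0, 12.5, 0.8],
    'pages_viewed': [2, 9, 1],
})

# One line instead of a per-row for loop; pandas applies the arithmetic
# across every row at once.
visits['engagement_score'] = visits['pages_viewed'] / visits['duration_minutes']

# The derived column rides along when the frame is bulk-written back, e.g.
# visits.to_sql('visits', engine, if_exists='append', index=False).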
 
…a lot more of its features will be covered in my later posts…
 
Implementation
 
Get the database dump, load it into Pandas, do the transformation, and then export it as SQL. Then source the SQL output from inside the target database. Tada.
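A rough sketch of that pipeline, with one simplification: instead of exporting a SQL dump and sourcing it, this version writes straight into the target with to_sql. The connection strings, the transform() stub and the loop over ~350 shards are illustrative assumptions, not our production code.

import pandas as pd
from sqlalchemy import create_engine

# Placeholder target connection.
target = create_engine('mysql+pymysql://user:password@localhost/consolidated_db')

def transform(df):
    # Stand-in for the real column splits and derived fields.
    return df

for shard in range(1, 351):  # ~350 source databases
    source = create_engine(f'mysql+pymysql://user:password@localhost/source_db_{shard}')
    visits = pd.read_sql('SELECT * FROM visits', source)
    transform(visits).to_sql('visits', target, if_exists='append', index=False)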
 
 
What are the technical gains?
 

Python has a fantastic GC (garbage collector) compared to Ruby, if I’m not wrong.

Python execution is fast compared to Ruby.
 
Directly doing a bulk insert into SQL, rather than updating by iteration inside a for loop in Ruby using ActiveRecord’s create method, is much faster and consumes far less memory. I will provide a benchmark in my next post.
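For the pandas side of that comparison, a minimal sketch of the bulk write; the table name, input file and engine are placeholders.

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection and input file.
engine = create_engine('mysql+pymysql://user:password@localhost/consolidated_db')
visits = pd.read_csv('visits_transformed.csv')

# One call writes the whole frame in batches of 10,000 rows, instead of the
# application issuing one create call per row.
visits.to_sql('visits', engine, if_exists='append', index=False, chunksize=10000)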
 
The overhead of holding huge data in memory is also smaller when run through pandas than through Ruby on Rails. As Anand mentioned, pandas is the fastest and most efficient way of dealing with big data in Python, and probably compared with many other languages too. In any case, my experience at the workshop confirms that it is brilliantly fast.
 
In terms of code development time,
 – since no for loops and similar iterations are written the traditional way, the developer saves a lot of time while coding the data transformation part.

In terms of process execution time,
 – The absence of Ruby for loops itself adds efficiency, and pandas rests on fast underlying C implementations.
 – Above all of that, a lot of thought by data scientists and mathematics evangelists has gone into pandas, making it cutting edge at what it does. It is built expressly for big-data manipulation, and it easily outmaneuvers custom code developed by a regular web/software developer.
 
 
…Currently our timeline doesn’t allow such a huge change in technology usage, but I’m so impressed with this (and resting high hopes on it) that I will definitely make time to re-implement my project with Pandas/Python, and that might well get acceptance too, given the whole lot of advantages this method holds.
 
All of that will be another post for sure…
 
Thanks,
 
A big thanks to Anand for his patience, brilliant presentation and technical skill. His knowledge is vast, and his humility is deep. I was thoroughly impressed by his talk.
 