[Integrate Cerberus] Final Work Report @ Google Summer of Code 2019

It’s 4 o’clock on a rainy Friday morning, and one more Google Summer of Code has ended. My second, to be exact. Time for me to hang up my boots and start writing another report, probably my last on this subject. There’s a lot to write.

Google Summer of Code is a global program focused on bringing more student developers into open source software development. Students work with an open source organization on a three-month programming project. This year I worked with The Scrapy Project (under the Python Software Foundation) on the project to integrate Cerberus.

I have known and participated in the program since early 2016. This year, I helped 9 people from my community, ALiAS, get into the program as well. If an interviewer ever asks me about the proudest moment of my life, that would be close to it. You can read more about it here.

I have always liked contributing to FOSS, and GSoC gives students the perfect opportunity to work with the best organisations in the world, build their skills, and get paid for their efforts. I wrote a comprehensive guide on it, answering every question I have been asked about the program since 2016.

So, you know this year was a tad bit special for me.

Introduction

The Scrapy Project is a free and open-source web-scraping framework written in Python. Originally designed for web scraping and data extraction, it can also be used to extract data using APIs or as a general-purpose web crawler (or spider).

Spidermon is the recommended tool for monitoring spiders created using Scrapy. Currently, users can choose between two libraries for item validation rules: jsonschema and schematics. We want to provide a third option: Cerberus.

Cerberus provides powerful yet simple and lightweight data validation functionality out of the box and is designed to be easily extensible, allowing for custom validation. It has no dependencies and is thoroughly tested on several Python versions.
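
For readers new to Cerberus, here’s a minimal sketch of the kind of validation it provides, using a made-up schema and items purely for illustration:

    from cerberus import Validator

    # A schema describing one scraped item: field types and constraints.
    schema = {
        "name": {"type": "string", "required": True},
        "price": {"type": "float", "min": 0},
    }

    validator = Validator(schema)

    print(validator.validate({"name": "A book", "price": 9.99}))  # True
    print(validator.validate({"price": -1.0}))                    # False
    print(validator.errors)
    # {'name': ['required field'], 'price': ['min value is 0']}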

Objective

To build, test, and enable Cerberus as a validation method alongside the existing item validation options in Spidermon, the spider monitoring framework for the Scrapy web crawling framework.

Stack Used

A small glimpse of the technologies we worked with:

  1. Programming language: Python 3.6, using pytest, Python packaging, virtualfish, unittest, Scrapy, Spidermon, Cerberus, and more.
  2. Documentation generation: reStructuredText using Sphinx.
  3. Code styling & VCS: Black, PEP8, Git, GitHub.

Results and Work Done

To keep the jargon to a minimum, here’s how the project was divided:

  1. validate() – Import Cerberus as a new sub-package available in the validation package alongside jsonschema and schematics, starting with the implementation of the validate() method. (PR)
  2. translate() – With the validate() method complete, the errors from the new package need to be parsed so that they are uniform across the API. To accomplish that, we built a translator that does just that. (PR)
  3. With the tool nearly complete in its implementation, we move on to making it available in the toolbox, that is, the ItemValidationPipeline(). Pipelines are where data can be validated: schemas are passed through a network of scripts to the tool that works on them and produces the needed output. See the sketch after this list. (PR) (Read More)
  4. With integration complete, we move on to the crucial steps of refactoring and optimizing each module, script, and data file before pushing to production, along with the necessary documentation. (PR)
  5. Packaging/shipping to PyPI (done when this gets merged – https://github.com/scrapinghub/spidermon/pull/201).
  6. One blog each week regarding Spidermon, my project, and my experience learning through the project, on Mixster.
  7. For the community to track progress, a tracker was maintained with my latest developments, containing week-to-week updates and minutes of mentor meetings. This helped maintain accountability and transparency, and kept things on track.
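
To make step 3 concrete, here’s a minimal sketch of how a Cerberus schema could be wired into a Scrapy project through Spidermon’s ItemValidationPipeline. The pipeline path and the SPIDERMON_VALIDATION_ADD_ERRORS_TO_ITEMS setting exist in Spidermon today; the SPIDERMON_VALIDATION_CERBERUS setting name is my assumption, modelled on the existing SPIDERMON_VALIDATION_SCHEMAS (jsonschema) and SPIDERMON_VALIDATION_MODELS (schematics) settings, so check the PR above for the final interface.

    # settings.py of a Scrapy project (a sketch, not the final interface)

    SPIDERMON_ENABLED = True

    # Spidermon's existing item validation pipeline.
    ITEM_PIPELINES = {
        "spidermon.contrib.scrapy.pipelines.ItemValidationPipeline": 800,
    }

    # A Cerberus schema for the items the spider should produce.
    # Hypothetical setting name, mirroring SPIDERMON_VALIDATION_SCHEMAS.
    SPIDERMON_VALIDATION_CERBERUS = [
        {
            "title": {"type": "string", "required": True},
            "price": {"type": "float", "min": 0},
        },
    ]

    # Annotate failing items with their validation errors (existing setting).
    SPIDERMON_VALIDATION_ADD_ERRORS_TO_ITEMS = True

The translator from step 2 is what makes this work: it turns Cerberus’s nested errors dictionary into the same flat “field: message” format that the jsonschema and schematics backends already report, so the pipeline can treat all three libraries uniformly.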

My Journey

What did you love about working with ScrapingHub?

The thing that I truly loved about ScrapingHub is the feeling of working remotely, with the necessary discipline, commitment, and dedication towards my work. Google Summer of Code provides a great opportunity to truly improve our work and skills and to push us out of our comfort zone. I feel great being able to learn so many things on the fly, as well as getting guidance from my awesome mentors, Renne Rocha and Júlio César Batista from ScrapingHub.

If there was work that needed to be done for the week, I would often do it on the weekend, as someone who loved chasing deadlines under pressure. But all that changed: now I work consistently throughout the week, pushing commits, refactoring code, and learning new things, thanks to my time in Google Summer of Code. This has been a great change in my work ethic.

The Scrapy Project and ScrapingHub have been a great part of my life over the last 6 months. I have been getting unique, innovative challenges to work on and wrap my head around. Lately, I have also resolved my shortcomings related to communication, alongside the work that needed to be done.

I don’t work that hard, yet I seem to have more time, as I distribute my work evenly over the day, still write a lot of blogs, break my tasks down into smaller bits, and look for feedback wherever possible. Life’s good working with ScrapingHub.

What did GSoC teach me professionally and mentally?

Going through each work period helped me realize that things are almost never as simple as they seem. The more time I spent reading the code and documentation, trying to build a bigger picture in my head, the more I understood how big a task I was undertaking. This also helped me reassess the time and recalibrate the effort that was being put into it.

I learned about debugging, testing, documentation, module management, Python packaging, absolute and relative imports, defaultdicts, __new__, list comprehensions, code readability, code coverage, logging, and tons of best practices. I am looking forward to learning even more, faster, leveling up my Python one leap at a time. All of this helped me become a better programmer, not just through the code I write, but also by inculcating the right discipline to work remotely.

Conclusion

I have no words to describe how grateful I am to my mentors – Renne Rocha and Júlio César Batista – for all their support and for being extremely responsive and helpful with every one of my problems. I am also thankful to the community of The Scrapy Project, ScrapingHub, and Google for the opportunity to work on this project, which helped me learn a lot over the course of 3 months. I look forward to contributing more to Scrapy’s codebase, and thank you all for trusting me and giving me this opportunity to work.
