Week 1 // The week that has been @ 2048

Week #1 – 21/05 to 27/05 
In the last meeting, my mentors and I decided upon the mini-project that I suggested. Here’s a brief overview of what I decided to work with over the course of the last week.

Main steps

  1. Use Scrapy to scrape data from the given website – https://amity.edu/placement. A scraped item looks like this:
data = {
    'Link': 'https://amity.edu/placement/Popup.asp?Eid=3895',
    'name': 'Impetus - Recruitment Opportunity For 2019 Batch (Apply Now)',
    'year': 2019
}

2. Use data validation tools to filter out invalid data:

  • Schematics
  • JSON Schema
  • Cerberus (should be working by the end of GSoC)

3. Use the same project spec for each tool to try them out fully
4. Store the data in PostgreSQL, for the sake of completeness
5. Present the data on a website, possibly using ReactJS

All in all, this project will help me develop good insight into the validation tools popularly used at ScrapingHub and how they work. Now, coming back to the 3 main questions that we have.

What did you do this week? 

Well, for starters, I am writing this blog post again. Due to some bug, my original post wasn’t saved. But no regrets. These blogs are important and should be written even if I have to write them again.

My week has been busy with this mini-project: 

  • Studied the Schematics validation pipeline, implemented it in my mini-project and worked out a small bug in the documentation. So, good progress.
  • Implemented the JSON Schema validation and ran through the tutorial to understand its various properties and features. Quite powerful.
  • Cerberus will take some time to implement; I still need to research the best way to go about it.
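
To make the JSON Schema step concrete, here is a minimal sketch of validating one scraped item. It assumes the third-party `jsonschema` Python package; the schema contents and the helper name `schema_errors` are hypothetical, chosen to match the sample item from the plan, and are not the actual code from the mini-project.

```python
from jsonschema import Draft7Validator

# Hypothetical schema for one scraped placement notice.
ITEM_SCHEMA = {
    "type": "object",
    "properties": {
        "Link": {"type": "string", "pattern": "^https?://"},
        "name": {"type": "string", "minLength": 1},
        "year": {"type": "integer"},
    },
    "required": ["Link", "name", "year"],
}

validator = Draft7Validator(ITEM_SCHEMA)


def schema_errors(item):
    """Return sorted (field, message) pairs; an empty list means the item is valid."""
    return sorted(
        (".".join(str(part) for part in error.path) or "<item>", error.message)
        for error in validator.iter_errors(item)
    )


good_item = {
    "Link": "https://amity.edu/placement/Popup.asp?Eid=3895",
    "name": "Impetus - Recruitment Opportunity For 2019 Batch (Apply Now)",
    "year": 2019,
}
# Same item, but the year was left as a string by the scraper.
bad_item = dict(good_item, year="2019")

print(schema_errors(good_item))
print(schema_errors(bad_item))
```

Collecting every error with `iter_errors` rather than stopping at the first one makes it easier to see everything that is wrong with an item in a single pass.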

This project is ongoing. Once it is finished, it should really help me with the development of the Cerberus pipeline. I have also been reading about the PostgreSQL pipeline for Scrapy and learned new things.
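
As a rough sketch of the storage step, the class below follows the shape of a Scrapy item pipeline (`open_spider`, `close_spider`, `process_item`). To keep it runnable without a database server, it uses the standard-library `sqlite3` module as a stand-in for PostgreSQL; the class name, table name and columns are hypothetical. With a PostgreSQL driver such as psycopg2 the query placeholders would be `%s` instead of `?`, and the upsert syntax would differ.

```python
import sqlite3


class PlacementStoragePipeline:
    """Scrapy-style item pipeline; sqlite3 stands in for PostgreSQL here."""

    def __init__(self, database=":memory:"):
        self.database = database
        self.connection = None

    def open_spider(self, spider):
        # Called once when the spider starts: connect and create the table.
        self.connection = sqlite3.connect(self.database)
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS placements ("
            "link TEXT PRIMARY KEY, name TEXT NOT NULL, year INTEGER)"
        )

    def close_spider(self, spider):
        # Called once when the spider finishes: persist and disconnect.
        self.connection.commit()
        self.connection.close()

    def process_item(self, item, spider):
        # INSERT OR REPLACE keeps a re-crawl from duplicating a notice.
        self.connection.execute(
            "INSERT OR REPLACE INTO placements (link, name, year) VALUES (?, ?, ?)",
            (item["Link"], item["name"], item["year"]),
        )
        return item


data = {
    "Link": "https://amity.edu/placement/Popup.asp?Eid=3895",
    "name": "Impetus - Recruitment Opportunity For 2019 Batch (Apply Now)",
    "year": 2019,
}

# Drive the pipeline by hand; Scrapy would normally make these calls.
pipeline = PlacementStoragePipeline()
pipeline.open_spider(spider=None)
pipeline.process_item(data, spider=None)
pipeline.process_item(data, spider=None)  # same link again, row is replaced
rows = pipeline.connection.execute("SELECT name, year FROM placements").fetchall()
pipeline.close_spider(spider=None)
print(rows)
```

Using the link as the primary key is a design choice for this sketch: each notice has its own URL, so re-running the spider updates existing rows instead of piling up duplicates.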

I also went to a Google Summer of Code meetup in New Delhi to meet and network with other GSoC’ers here. It was a good time.

What is coming up next? 

Next up, I am working on 2 PRs and fixing an issue related to the Slack actions that has been open for quite a while. I will also be coding a draft Cerberus pipeline to figure out what goes where. This will be a big Lego project with small parts that need to be stuck together to give a better picture. Looking forward to it.

I am also working towards better issue tracking for my project through a GitHub project, and on improving the documentation of Spidermon.

Did you get stuck anywhere?

I did, regarding the JSON Schema validation implementation. I researched the issue, found several solutions and ran them by my mentors. It turns out the implementation is not covered in the documentation. I will add that too. Busy week ahead.

That’s that from my side, this is Vipul Gupta signing out. 
