Role and responsibilities
- Understands the process flow and its impact on the project module's outcome.
- Works on coding assignments for specific technologies based on the project requirements and available documentation.
- Debugs basic software components and identifies code defects.
- Focuses on building depth in project-specific technologies.
- Expected to develop domain knowledge alongside technical skills.
- Communicates effectively with team members, project managers, and clients, as required.
- A proven high performer and team player, with the ability to take the lead on projects.
- Design and create S3 buckets and folder structures (raw, cleansed_data, output, script, temp-dir, spark-ui); see the S3 sketch after this list
- Develop AWS Lambda functions (Python/Boto3) to download the Bhav Copy via REST API and ingest it into S3 (see the Lambda sketch below)
- Author and maintain AWS Glue Spark jobs (see the Glue sketch below) to:
  - partition data by scrip, year, and month
  - convert CSV to Parquet with Snappy compression
- Configure and run AWS Glue Crawlers to populate the Glue Data Catalog (see the crawler sketch below)
- Write and optimize AWS Athena SQL queries to generate business-ready datasets (see the Athena sketch below)
- Monitor, troubleshoot, and tune data workflows for cost and performance
- Document architecture, code, and operational runbooks
- Collaborate with analytics and downstream teams to understand requirements and deliver on SLAs
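For the S3 layout, a minimal Boto3 sketch of what bucket and prefix creation could look like; the bucket name and region handling are assumptions, not project values:

```python
import boto3

# Hypothetical bucket name; actual naming follows project conventions.
BUCKET = "bhavcopy-data-lake"
PREFIXES = ["raw/", "cleansed_data/", "output/", "script/", "temp-dir/", "spark-ui/"]

s3 = boto3.client("s3")

# create_bucket with no LocationConstraint targets us-east-1; other regions
# need CreateBucketConfiguration={"LocationConstraint": "<region>"}.
s3.create_bucket(Bucket=BUCKET)

# S3 has no real folders: zero-byte objects with trailing-slash keys make
# the prefixes visible in the console before any data lands.
for prefix in PREFIXES:
    s3.put_object(Bucket=BUCKET, Key=prefix)
```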
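The Bhav Copy ingestion Lambda could follow this shape; the endpoint URL, date format, and bucket are placeholders, since the real exchange API and any auth details are project-specific:

```python
import datetime
import urllib.request

import boto3

# Placeholder endpoint and bucket; the real Bhav Copy URL, date format,
# and auth headers come from the exchange's API documentation.
BHAV_URL = "https://example.com/bhavcopy/{date}.csv"
BUCKET = "bhavcopy-data-lake"

s3 = boto3.client("s3")


def lambda_handler(event, context):
    # Allow an explicit date in the event payload; default to today.
    trade_date = event.get("date", datetime.date.today().isoformat())

    # Download the daily Bhav Copy CSV from the REST endpoint.
    with urllib.request.urlopen(BHAV_URL.format(date=trade_date)) as resp:
        body = resp.read()

    # Land the raw file under the raw/ prefix, keyed by trade date.
    key = f"raw/bhavcopy/{trade_date}.csv"
    s3.put_object(Bucket=BUCKET, Key=key, Body=body)
    return {"status": "ok", "key": key}
```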
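A Glue Spark job for the partition-and-convert step might look like the following; the paths reuse the hypothetical bucket above, and the SYMBOL/TIMESTAMP column names and date format are assumptions about the Bhav Copy schema:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Paths reuse the hypothetical bucket layout sketched earlier.
src = "s3://bhavcopy-data-lake/raw/bhavcopy/"
dst = "s3://bhavcopy-data-lake/cleansed_data/bhavcopy/"

df = spark.read.option("header", "true").csv(src)

# Assumed columns: SYMBOL (scrip) and TIMESTAMP (trade date, assumed to be
# ISO yyyy-MM-dd). Derive year/month partition columns from the trade date.
trade_date = F.to_date(F.col("TIMESTAMP"))
df = df.withColumn("year", F.year(trade_date)).withColumn("month", F.month(trade_date))

# Append each day's increment as Parquet partitioned by scrip, year, and
# month; Snappy is Spark's default Parquet codec, set explicitly here.
(df.write.mode("append")
   .option("compression", "snappy")
   .partitionBy("SYMBOL", "year", "month")
   .parquet(dst))

job.commit()
```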
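Crawler setup can also be scripted with Boto3; the crawler name, IAM role ARN, and database name below are placeholders:

```python
import boto3

glue = boto3.client("glue")

# All names and the role ARN are placeholders.
glue.create_crawler(
    Name="bhavcopy-cleansed-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="bhavcopy_db",
    # Point the crawler at the cleansed Parquet prefix so the Data Catalog
    # registers the table and its scrip/year/month partitions.
    Targets={"S3Targets": [{"Path": "s3://bhavcopy-data-lake/cleansed_data/bhavcopy/"}]},
    SchemaChangePolicy={"UpdateBehavior": "UPDATE_IN_DATABASE", "DeleteBehavior": "LOG"},
)

glue.start_crawler(Name="bhavcopy-cleansed-crawler")
```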
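For Athena, queries can be driven from Python via Boto3; the database, table, and column names assume the crawler sketch above, and the WHERE clause illustrates the partition pruning that keeps scan costs down:

```python
import boto3

athena = boto3.client("athena")

# Table and column names assume the crawler registered the cleansed Parquet
# data as bhavcopy_db.bhavcopy with symbol/year/month partition columns.
SQL = """
SELECT symbol, year, month, AVG(CAST(close AS double)) AS avg_close
FROM bhavcopy_db.bhavcopy
WHERE symbol = 'INFY' AND year = 2024
GROUP BY symbol, year, month
ORDER BY month
"""

resp = athena.start_query_execution(
    QueryString=SQL,
    QueryExecutionContext={"Database": "bhavcopy_db"},
    # Results land under the output/ prefix from the bucket layout above.
    ResultConfiguration={"OutputLocation": "s3://bhavcopy-data-lake/output/athena/"},
)
print(resp["QueryExecutionId"])
```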

Technical skills requirements
The candidate must demonstrate proficiency in:
- 3+ years' hands-on experience with AWS data services (S3, Lambda, Glue, Athena)
- PostgreSQL basics
- Proficiency in SQL and data partitioning strategies
- Experience with the Parquet file format and compression techniques (Snappy)
- Ability to configure Glue Crawlers and manage the AWS Glue Data Catalog
- Understanding of serverless architecture and best practices in security, encryption, and cost control
- Good documentation, communication, and problem-solving skills

Nice-to-have skills
- SQL databases
- Experience in Python (or Ruby) scripting to integrate with AWS services
- Familiarity with RESTful API consumption and JSON processing
- Background in financial markets or working with large-scale time-series data
- Knowledge of CI/CD pipelines for data workflows