Role Overview
The role focuses on designing, developing, and optimizing large-scale data processing solutions using Spark Scala and Hadoop ecosystem technologies. The position requires strong expertise in big data components, distributed processing, SQL optimization, and end-to-end pipeline development in both batch and streaming environments.
Key Responsibilities
- Create Spark Scala jobs for data transformation, aggregation, and large-scale data processing
- Design and implement data processing pipelines using Hadoop ecosystem tools such as HDFS, Hive, YARN, MapReduce, and Sqoop
- Write and optimize Spark jobs, Spark SQL queries, and streaming / batch data processing flows
- Develop and optimize complex Hive and SQL queries involving UDFs, joins, views, and large datasets
- Debug Spark code and enhance performance for distributed applications
- Utilize UNIX commands and shell scripting for automation and environment handling
- Work with Autosys and Gradle for job scheduling and build management
- Produce unit tests for Spark transformations and associated helper methods
- Write clear Scaladoc-style documentation for all developed code
- Collaborate with SMEs and stakeholders to meet timelines and ensure accurate status reporting
- Create and maintain detailed documentation for developed mappings and processes
- Work effectively within an agile environment
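To illustrate the unit-testing responsibility above: a common pattern is to keep the row-level logic of a Spark transformation in pure, Spark-free helper functions so it can be tested without a cluster. The sketch below is hypothetical (the `Trade` schema, field names, and delimiter are illustrative, not from this posting); in a real job these helpers would be applied via `rdd.flatMap(parseLine)` or a Dataset `map`, with the aggregation done by `reduceByKey` or Spark SQL.

```scala
// Illustrative only: pure helpers of the kind used inside Spark Scala jobs,
// kept free of SparkSession/SparkContext so they are unit-testable locally.
// Schema and names are hypothetical. Requires Scala 2.13+ (toDoubleOption).

/** One input record, e.g. a row parsed from a Hive extract or delimited file. */
case class Trade(account: String, amount: Double)

object TradeAggregations {
  /** Parse one comma-delimited line; malformed rows yield None,
    * so a Spark job can drop them with flatMap instead of failing. */
  def parseLine(line: String): Option[Trade] =
    line.split(',') match {
      case Array(acct, amt) => amt.trim.toDoubleOption.map(Trade(acct.trim, _))
      case _                => None
    }

  /** Total amount per account — the same logic a Spark job would express
    * with groupBy/agg or reduceByKey over the full dataset. */
  def totalsByAccount(trades: Seq[Trade]): Map[String, Double] =
    trades.groupBy(_.account).view.mapValues(_.map(_.amount).sum).toMap
}
```

Because the helpers take and return plain Scala values, the unit tests called for above are ordinary assertions with no Spark test harness needed.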
Required Experience & Skills
- 5+ years of experience in Spark Scala development
- Strong experience with Hadoop ecosystem components (HDFS, Spark, Hive, Parquet, YARN, MapReduce, Sqoop)
- Experience with batch and streaming data processing
- Strong SQL and Hive query optimization skills
- Experience in debugging and performance tuning Spark applications
- Knowledge of UNIX commands and shell scripting
- Hands-on experience with Autosys and Gradle
- Strong analytical and problem-solving abilities
- Ability to work with multiple teams, manage timelines, and maintain documentation
Skills Required
Hadoop, HDFS, Hive, YARN