Job Responsibilities -
- Architect and implement a scalable, offline Data Lake for structured, semi-structured, and unstructured data in an on-premises, air-gapped environment.
- Collaborate with Data Engineers, Factory IT, and Edge Device teams to enable seamless data ingestion and retrieval across the platform.
- Integrate with upstream systems like MES, SCADA, and process tools to capture high-frequency manufacturing data efficiently.
- Monitor and maintain system health, including compute resources, storage arrays, disk I/O, memory usage, and network throughput (a monitoring sketch follows this list).
- Optimize Data Lake performance via partitioning, deduplication, compression (Parquet/ORC), and effective indexing strategies.
- Select, integrate, and maintain tools such as Apache Hadoop, Spark, Hive, and HBase, along with custom ETL pipelines, all suitable for offline deployment.
- Build custom ETL workflows for bulk and incremental data ingestion using Python, Spark, and shell scripting (an ingestion sketch follows this list).
- Implement data governance policies covering access control, retention periods, and archival procedures with security and compliance in mind.
- Establish and test backup, failover, and disaster recovery protocols specifically designed for offline environments.
- Document architecture designs, optimization routines, job schedules, and standard operating procedures (SOPs) for platform maintenance.
- Conduct root cause analysis for hardware failures, system outages, or data integrity issues.
- Drive system scalability planning for future multi-fab or multi-site expansions.
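For illustration of the ingestion and optimization work described above, here is a minimal PySpark sketch; the paths, column names, and watermark handling are all hypothetical, not part of any actual codebase for this role.

```python
# Illustrative sketch only: paths, schema, and watermark handling are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("incremental-ingest-sketch").getOrCreate()

# Raw tool/sensor dumps landed by an edge collector (CSV assumed for simplicity).
raw = spark.read.option("header", True).csv("/lake/landing/process_tool/")

# Incremental ingestion: keep only rows newer than the last run's watermark,
# which in practice would be persisted in a small state file or table.
last_watermark = "2024-01-01 00:00:00"
incr = raw.filter(F.col("event_ts") > F.lit(last_watermark))

# Deduplicate on a business key, then write compressed, partitioned Parquet.
(
    incr.dropDuplicates(["tool_id", "event_ts"])
        .withColumn("event_date", F.to_date("event_ts"))
        .write.mode("append")
        .partitionBy("event_date", "tool_id")
        .option("compression", "snappy")
        .parquet("/lake/curated/process_tool/")
)
```

Snappy-compressed Parquet partitioned by date and tool keeps scans selective while staying splittable for downstream Spark jobs.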
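In the same spirit, a sketch of the node-level health monitoring mentioned above, assuming the psutil library is available in the offline environment; the 85% alert thresholds are arbitrary.

```python
# Illustrative sketch only; psutil and the alert thresholds are assumptions.
import psutil

def health_snapshot():
    """Collect basic compute, storage, and network metrics for one node."""
    disk = psutil.disk_usage("/")
    mem = psutil.virtual_memory()
    io = psutil.disk_io_counters()
    net = psutil.net_io_counters()
    return {
        "cpu_pct": psutil.cpu_percent(interval=1),
        "mem_pct": mem.percent,
        "disk_pct": disk.percent,
        "disk_read_mb": io.read_bytes / 1e6,
        "disk_write_mb": io.write_bytes / 1e6,
        "net_sent_mb": net.bytes_sent / 1e6,
        "net_recv_mb": net.bytes_recv / 1e6,
    }

if __name__ == "__main__":
    snap = health_snapshot()
    # Flag hosts approaching storage or memory pressure.
    alerts = [k for k in ("disk_pct", "mem_pct") if snap[k] > 85]
    print(snap, "ALERTS:", alerts)
```

In an air-gapped fab, snapshots like these would typically feed a local time-series store rather than any hosted monitoring service.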
Essential Attributes (Tech-Stacks) -
- Hands-on experience designing and maintaining offline or air-gapped Data Lake environments.
- Deep understanding of Hadoop ecosystem tools: HDFS, Hive, MapReduce, HBase, YARN, ZooKeeper, and Spark.
- Expertise in custom ETL design and large-scale batch and stream data ingestion.
- Strong scripting and automation capabilities using Bash and Python.
- Familiarity with data compression formats (ORC, Parquet) and ingestion frameworks (e.g., Flume).
- Working knowledge of message queues such as Kafka or RabbitMQ, with a focus on integration logic (a consumer sketch follows this list).
- Proven experience in system performance tuning, storage efficiency, and resource optimization.
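As a flavor of the message-queue integration logic mentioned above, a minimal consumer sketch assuming the kafka-python client; the topic and broker names are hypothetical.

```python
# Illustrative sketch only; topic, brokers, and the kafka-python client are assumptions.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "scada-events",                      # hypothetical topic
    bootstrap_servers=["broker1:9092"],  # in-fab broker, no external egress
    group_id="lake-ingest",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for msg in consumer:
    record = msg.value
    # Integration logic: route each event toward the appropriate landing zone;
    # in practice this would batch records and hand off to the Spark ingestion job.
    print(msg.topic, msg.partition, msg.offset, record.get("tool_id"))
```

The same loop structure applies to RabbitMQ via a client such as pika.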
Qualifications -
- BE/ME in Computer Science, Machine Learning, Electronics Engineering, Applied Mathematics, or Statistics.
Desired Experience Level -
- 4 years of relevant experience post Bachelor's
- 2 years of relevant experience post Master's
- Experience in the semiconductor industry is a plus
Skills Required -
Python, Data Lake, Hadoop, HBase, YARN, ZooKeeper