Solutions Architect

Confidential
Hyderabad / Secunderabad, Telangana, India
9 days ago
Job description

AI / ML Solution Architect

About Us :

Headquartered in Sunnyvale, with offices in Dallas & Hyderabad, Fission Labs is a leading software development company, specializing in crafting flexible, agile, and scalable solutions that propel businesses forward. With a comprehensive range of services, including product development, cloud engineering, big data analytics, QA, DevOps consulting, and AI / ML solutions, we empower clients to achieve sustainable digital transformation that aligns seamlessly with their business goals.

Fission Labs Website : https://www.fissionlabs.com/

Key Responsibilities :

Architecture & Infrastructure

  • Design, implement, and optimize end-to-end ML training workflows including infrastructure setup, orchestration, fine-tuning, deployment, and monitoring.
  • Evaluate and integrate multi-cloud and single-cloud training options across AWS and other major platforms.
  • Lead cluster configuration, orchestration design, environment customization, and scaling strategies.
  • Compare and recommend hardware options (GPUs, TPUs, accelerators) based on performance, cost, and availability.

Performance & Optimization

  • Conduct performance benchmarking, hardware comparisons, and cost-performance trade-off analysis.
  • Implement real-time monitoring and control systems with metrics collection, observability, and custom performance tracking.
  • Optimize cost models, budget predictability, and resource utilization.

Data & Training Pipelines

  • Architect and validate data pipelines with storage, persistence, and throughput optimization.
  • Oversee data quality validation, pre-processing, and long-term experiment tracking.
  • Support framework flexibility for diverse training techniques (supervised, unsupervised, fine-tuning, reinforcement learning).

Integration & Deployment

  • Ensure seamless deployment across multi-cloud environments with security, compliance, and regional availability considerations.
  • Collaborate with DevOps and MLOps teams for automation, fault tolerance, job scheduling, and orchestration testing.
  • Provide technical guidance on integration with existing enterprise systems.

Analysis & Recommendations

  • Lead result analysis, insight generation, and actionable recommendations for training performance and user experience improvements.
  • Present performance claims, benchmarking reports, and speculative decoding insights to stakeholders.

Technical Expertise Requirements :

Technical Expertise

  • 10+ years in architecture roles with at least 5 years in AI / ML infrastructure and large-scale training environments.
  • Expert in AWS cloud services (EC2, S3, EKS, SageMaker, Batch, FSx, etc.) and familiar with Azure, GCP, and hybrid / multi-cloud setups.
  • Strong knowledge of AI / ML training frameworks (PyTorch, TensorFlow, Hugging Face, DeepSpeed, Megatron, Ray, etc.).
  • Proven experience with cluster orchestration tools (Kubernetes, Slurm, Ray, SageMaker, Kubeflow).
  • Deep understanding of hardware architectures for AI workloads (NVIDIA, AMD, Intel Habana, TPU).

Performance & Cost Management

  • Demonstrated expertise in performance benchmarking, reliability testing, and training speed optimization.
  • Skilled in cost modeling, budget forecasting, and cost-performance balancing.

Monitoring & Observability

  • Experience with real-time monitoring tools (Prometheus, Grafana, CloudWatch) and custom metric instrumentation.
  • Familiarity with network performance testing, regional load testing, and multi-region deployment strategies.

Soft Skills

  • Strong problem-solving skills with an analytical mindset.
  • Excellent communication skills to present technical trade-offs and strategic recommendations to executives and engineering teams.
  • Ability to lead cross-functional teams and drive innovation in AI infrastructure.

We Offer :

  • Opportunity to work on technical challenges with global impact.
  • Vast opportunities for self-development, including online university access and sponsored certifications.
  • Sponsored Tech Talks & Hackathons to foster innovation and learning.
  • Generous benefits package including health insurance, retirement benefits, flexible work hours, and more.
  • Supportive work environment with forums to explore passions beyond work.

This role presents an exciting opportunity for a motivated individual to contribute to the development of cutting-edge solutions while advancing their career in a dynamic and collaborative environment.

Skills Required

S3, Prometheus, Grafana, Intel, TensorFlow, PyTorch, EC2, Batch, AMD, CloudWatch, Kubernetes
