Overview

Job Title:- Artificial Intelligence/Machine Learning Engineer

Location:- Orlando, FL / Glendale, CA / Anaheim, CA / Seattle, WA | ONSITE

Duration:- 22 Months (extension possible)

Job description:-

AI/ML Operations
• Manage operational workflows for model deployments, updates, and versioning across GCP, Azure, and AWS
• Monitor model performance metrics: latency, throughput, error rates, token usage, and inference quality
• Track model drift, accuracy degradation, and performance anomalies—escalating to engineering as needed
• Support knowledge base operations including vector embedding pipeline health, chunk quality, and refresh cycles in Vertex AI
• Maintain model inventory and documentation across multi-cloud environments
• Coordinate model evaluation cycles with Responsible AI and Core Engineering teams

Agent & MCP Server Operations
• Monitor AI agent health, performance, and reliability (AutoGen-based agents, MCP servers)
• Track agent execution metrics: task completion rates, tool call success/failure, latency, and error patterns
• Support agent deployment and configuration management workflows
• Document agent behaviors, known issues, and operational runbooks
• Coordinate with Core Engineering on agent updates, testing, and rollouts
• Monitor MCP server availability, connection health, and integration status

FinOps & Cost Management
• Track and analyze AI/ML cloud spend across GCP (Vertex AI), Azure (OpenAI), and AWS (Bedrock)
• Build cost dashboards with breakdowns by model, application team, use case, and environment
• Monitor token consumption, inference costs, and embedding/storage costs
• Identify cost optimization opportunities—model selection, caching, batching, rightsizing
• Provide cost allocation reporting for chargeback/showback to consuming application teams
• Forecast spend trends and flag budget anomalies
• Partner with Infrastructure and Finance teams on AI cost governance

Monitoring, Dashboarding & Reporting
• Build and maintain dashboards for platform performance, model health, agent metrics, and operational KPIs
• Create executive and stakeholder reports on platform adoption, usage trends, and cost allocation
• Develop Responsible AI dashboards tracking hallucination rates, accuracy metrics, guardrail triggers, and safety incidents
• Monitor APIGEE gateway traffic patterns and API consumption trends
• Provide regular reporting to product management on use case performance

Release Operations Support
• Support release management processes with pre/post-deployment validation checks
• Track release health metrics for models, agents, and platform components
• Maintain release documentation, runbooks, and operational playbooks
• Coordinate with QA, Performance Engineering, and Infrastructure teams during releases

Responsible AI Operations
• Monitor guardrail effectiveness and flag anomalies to the Responsible AI team
• Track and report on hallucination detection, content safety triggers, and accuracy trends
• Support LLM Red Teaming efforts by collecting and organizing evaluation data
• Maintain audit logs and compliance documentation for AI governance

Cross-Functional Coordination
• Serve as operational point of contact for application teams consuming DxT AI APIs
• Coordinate with Corporate Security on audit requests and compliance reporting
• Partner with Infrastructure team on capacity tracking and resource utilization
• Support Performance Engineering with load test analysis and results documentation

Basic Qualifications:-
• 2-4 years in an Ops, Analytics, or Technical Operations role (MLOps, AIOps, DataOps, Platform Ops, or similar)
• Understanding of AI/ML concepts: models, inference, embeddings, vector databases, LLMs, tokens, prompts
• Experience with cloud cost management and FinOps: tracking, analyzing, and optimizing cloud spend
• Strong proficiency with dashboarding and visualization tools (Looker, Tableau, Grafana, or similar)
• Working knowledge of GCP (required); familiarity with Azure and AWS a plus
• Comfortable with SQL and basic Python for data analysis and scripting
• Experience with monitoring and observability platforms (Datadog, Prometheus/Grafana, Cloud Monitoring, or similar)
• Understanding of APIs and API gateways; ability to read logs, trace requests, and analyze traffic
• Strong analytical and problem-solving skills with attention to detail
• Excellent communication skills; able to translate technical metrics into stakeholder insights
• College degree in Computer Science, BIS, MIS, EE, ME or similar is required

Preferred Qualifications
• Hands-on experience with LLM platforms: Vertex AI, Azure OpenAI, AWS Bedrock
• Familiarity with AI agents and agentic architectures (AutoGen, LangChain, or similar)
• Exposure to MCP (Model Context Protocol) or agent-tool integration patterns
• Experience with vector databases and RAG (Retrieval-Augmented Generation) operations
• Understanding of MLOps lifecycle: model registry, versioning, deployment patterns, A/B testing
• Experience with APIGEE or similar API management platforms
• Familiarity with Responsible AI metrics: hallucination, bias, content safety, guardrails
• FinOps certification or formal cloud cost management experience
• Experience supporting enterprise platform teams with multiple consuming applications

Nice to Have
• Familiarity with ML pipeline tools (Kubeflow, MLflow, Vertex AI Pipelines)
• Exposure to prompt management and evaluation frameworks
• ITIL or operational process framework experience
• Experience creating runbooks and operational documentation

Required Education:- Bachelor’s Degree in CS, BIS, MIS, EE, ME or similar

Feel free to forward my email to your friends/colleagues who might be available. We do offer a referral bonus! Thank you for your time and consideration. I am looking forward to hearing from you.

Company:

Integrated Resources, Inc. (IRI)

Level of experience: Senior (5+ years of experience)