Model Migration Process
Introduction
LLM models are constantly evolving, and GitLab needs to regularly update our AI features to support newer models. This guide provides a structured approach for migrating AI features to new models while maintaining stability and reliability.
Purpose
Provide a comprehensive guide for migrating AI models within GitLab.
Expected Duration
Model migrations typically follow these general timelines:
-
Simple Model Updates (Same Provider): 2-3 weeks
- Example: Upgrading from Claude Sonnet 3.5 to 3.6
- Involves model validation, testing, and staged rollout
- Primary focus on maintaining stability and performance
- Can sometimes be expedited when urgent, but 2 weeks is standard
-
Complex Migrations: 1-2 months (full milestone or longer)
- Example: Adding support for a new provider like AWS Bedrock
- Example: Major version upgrades with breaking changes (e.g., Claude 2 to 3)
- Requires significant API integration work
- May need infrastructure changes
- Extensive testing and validation required
Timeline Factors
Several factors can impact migration timelines:
- Current system stability and recent incidents
- Resource availability and competing priorities
- Complexity of behavioral changes in new model
- Scale of testing required
- Feature flag rollout strategy
Best Practices
- Always err on the side of caution with initial timeline estimates
- Use feature flags for gradual rollouts to minimize risk
- Plan for buffer time to handle unexpected issues
- Communicate conservative timelines externally while working to deliver faster
- Prioritize system stability over speed of deployment
Scope
Applicable to all AI model-related teams at GitLab. We currently support using Anthropic and Google Vertex models. Support for AWS Bedrock models is proposed in issue 498119.
Prerequisites
Before starting a model migration:
- Create an issue under the AI Model Version Migration Initiative epic with the following:
- Label with
group::ai framework
- Document any known behavioral changes or improvements in the new model
- Include any breaking changes or compatibility issues
- Reference any model provider documentation about the changes
- Label with
-
Verify the new model is supported in our current AI-Gateway API specification by:
- Check model definitions in AI gateway:
- For LiteLLM models:
ai_gateway/models/v2/container.py
- For Anthropic models:
ai_gateway/models/anthropic.py
- For new providers: Create a new model definition file in
ai_gateway/models/
- For LiteLLM models:
- Verify model configurations:
- Model enum definitions
- Stop tokens
- Timeout settings
- Completion type (text or chat)
- Max token limits
- Testing the model locally in AI gateway:
- Set up the AI gateway development environment
- Configure the necessary API keys in your
.env
file - Test the model using the Swagger UI at
http://localhost:5052/docs
- If the model isn’t supported, create an issue in the AI gateway repository to add support
- Review the provider’s API documentation for any breaking changes:
- Check model definitions in AI gateway:
- Ensure you have access to testing environments and monitoring tools
- Complete model evaluation using the Prompt Library
Migration Tasks
Migration Tasks for Anthropic Model
- Optional - Investigate if the new model is supported within our current AI-Gateway API specification. This step can usually be skipped. However, sometimes to support a newer model, we may need to accommodate a new API format.
- Add the new model to our available models list.
- Change the default model in our AI-Gateway client. Please place the change around a feature flag. We may need to quickly rollback the change.
- Update the model definitions in AI gateway following the prompt definition guidelines Note: While we’re moving toward AI gateway holding the prompts, feature flag implementation still requires a GitLab release.
Migration Tasks for Vertex Models
Work in Progress
Feature Flag Process
Implementation Steps
For implementing feature flags, refer to our Feature Flags Development Guidelines.
Model Selection Implementation
The model selection logic should be implemented in:
- AI gateway client (
ee/lib/gitlab/llm/chain/requests/ai_gateway.rb
) - Model definitions in AI gateway
- Any custom implementations in specific features that override the default model
Rollout Strategy
- Enable the feature flag for a small percentage of users/groups initially
- Monitor performance metrics and error rates using:
- Sidekiq Service dashboard for error ratios and response latency
- AI gateway metrics dashboard for gateway-specific metrics
- AI gateway logs for detailed error investigation
- Feature usage dashboard for adoption metrics
- Periscope dashboard for token usage and feature statistics
- Gradually increase the rollout percentage
- If issues arise, quickly disable the feature flag to rollback to the previous model
- Once stability is confirmed, remove the feature flag and make the migration permanent
For more details on monitoring during migrations, see the Monitoring and Metrics section below.
Scope of Work
AI Features to Migrate
-
Duo Chat Tools:
-
ci_editor_assistant/prompts/anthropic.rb
- CI Editor -
gitlab_documentation/executor.rb
- GitLab Documentation -
epic_reader/prompts/anthropic.rb
- Epic Reader -
issue_reader/prompts/anthropic.rb
- Issue Reader -
merge_request_reader/prompts/anthropic.rb
- Merge Request Reader
-
-
Chat Slash Commands:
-
refactor_code/prompts/anthropic.rb
- Refactor -
write_tests/prompts/anthropic.rb
- Write Tests -
explain_code/prompts/anthropic.rb
- Explain Code -
explain_vulnerability/executor.rb
- Explain Vulnerability
-
-
Experimental Tools:
- Summarize Comments Chat
- Fill MR Description