Troubleshooting Geo synchronization

Tier: Premium, Ultimate Offering: Self-managed

Reverify all uploads (or any SSF data type which is verified)

  1. SSH into a GitLab Rails node in the primary Geo site.
  2. Open Rails console.
  3. Mark all uploads as “pending verification”:
caution
Commands that change data can cause damage if not run correctly or under the right conditions. Always run commands in a test environment first and have a backup instance ready to restore.

Reverify all uploads

Upload.verification_state_table_class.each_batch do |relation|
  relation.update_all(verification_state: 0)
end

Reverify failed uploads only

Upload.verification_state_table_class.where(verification_state: 3).each_batch do |relation|
  relation.update_all(verification_state: 0)
end

How the reverification process works

When you reverify all uploads or reverify failed uploads only:

  1. This causes the primary to start checksumming the Uploads depending on which commands were executed.
  2. When a primary successfully checksums a record, then all secondaries recalculate the checksum as well, and they compare the values.

You can perform a similar operation with other the Models handled by the Geo Self-Service Framework which have implemented verification:

  • LfsObject
  • MergeRequestDiff
  • Packages::PackageFile
  • Terraform::StateVersion
  • SnippetRepository
  • Ci::PipelineArtifact
  • PagesDeployment
  • Upload
  • Ci::JobArtifact
  • Ci::SecureFile
note
GroupWikiRepository is not in the previous list since verification is not implemented. There is an issue to implement this functionality in the Admin area UI.

Message: Synchronization failed - Error syncing repository

caution
If large repositories are affected by this problem, their resync may take a long time and cause significant load on your Geo sites, storage and network systems.

The following error message indicates a consistency check error when syncing the repository:

Synchronization failed - Error syncing repository [..] fatal: fsck error in packed object

Several issues can trigger this error. For example, problems with email addresses:

Error syncing repository: 13:fetch remote: "error: object <SHA>: badEmail: invalid author/committer line - bad email
   fatal: fsck error in packed object
   fatal: fetch-pack: invalid index-pack output

Another issue that can trigger this error is object <SHA>: hasDotgit: contains '.git'. Check the specific errors because you might have more than one problem across all your repositories.

A second synchronization error can also be caused by repository check issues:

Error syncing repository: 13:Received RST_STREAM with error code 2.

These errors can be observed by immediately syncing all failed repositories.

Removing the malformed objects causing consistency errors involves rewriting the repository history, which is usually not an option.

To ignore these consistency checks, reconfigure Gitaly on the secondary Geo sites to ignore these git fsck issues. The following configuration example:

The Gitaly documentation has more details about other Git check failures and earlier versions of GitLab.

gitaly['configuration'] = {
  git: {
    config: [
      { key: "fsck.duplicateEntries", value: "ignore" },
      { key: "fsck.badFilemode", value: "ignore" },
      { key: "fsck.missingEmail", value: "ignore" },
      { key: "fsck.badEmail", value: "ignore" },
      { key: "fsck.hasDotgit", value: "ignore" },
      { key: "fetch.fsck.duplicateEntries", value: "ignore" },
      { key: "fetch.fsck.badFilemode", value: "ignore" },
      { key: "fetch.fsck.missingEmail", value: "ignore" },
      { key: "fetch.fsck.badEmail", value: "ignore" },
      { key: "fetch.fsck.hasDotgit", value: "ignore" },
      { key: "receive.fsck.duplicateEntries", value: "ignore" },
      { key: "receive.fsck.badFilemode", value: "ignore" },
      { key: "receive.fsck.missingEmail", value: "ignore" },
      { key: "receive.fsck.badEmail", value: "ignore" },
      { key: "receive.fsck.hasDotgit", value: "ignore" },
    ],
  },
}

GitLab 16.1 and later include an enhancement that might resolve some of these issues.

Gitaly issue 5625 proposes to ensure that Geo replicates repositories even if the source repository contains problematic commits.

You can also get the error message Synchronization failed - Error syncing repository along with the following log messages. This error indicates that the expected Geo remote is not present in the .git/config file of a repository on the secondary Geo site’s file system:

{
  "created": "@1603481145.084348757",
  "description": "Error received from peer unix:/var/opt/gitlab/gitaly/gitaly.socket",
  
  "grpc_message": "exit status 128",
  "grpc_status": 13
}
{  
  "grpc.request.fullMethod": "/gitaly.RemoteService/FindRemoteRootRef",
  "grpc.request.glProjectPath": "<namespace>/<project>",
  
  "level": "error",
  "msg": "fatal: 'geo' does not appear to be a git repository
          fatal: Could not read from remote repository. …",
}

To solve this:

  1. Sign in on the web interface for the secondary Geo site.

  2. Back up the .git folder.

  3. Optional. Spot-check a few of those IDs whether they indeed correspond to a project with known Geo replication failures. Use fatal: 'geo' as the grep term and the following API call:

    curl --request GET --header "PRIVATE-TOKEN: <your_access_token>" "https://gitlab.example.com/api/v4/projects/<first_failed_geo_sync_ID>"
    
  4. Enter the Rails console and run:

    failed_project_registries = Geo::ProjectRepositoryRegistry.failed
    
    if failed_project_registries.any?
      puts "Found #{failed_project_registries.count} failed project repository registry entries:"
    
      failed_project_registries.each do |registry|
        puts "ID: #{registry.id}, Project ID: #{registry.project_id}, Last Sync Failure: '#{registry.last_sync_failure}'"
      end
    else
      puts "No failed project repository registry entries found."
    end
    
  5. Run the following commands to execute a new sync for each project:

    failed_project_registries.each do |registry|
      registry.replicator.sync
      puts "Sync initiated for registry ID: #{registry.id}, Project ID: #{registry.project_id}"
    end
    

Failures during backfill

During a backfill, failures are scheduled to be retried at the end of the backfill queue, therefore these failures only clear up after the backfill completes.

Message: unexpected disconnect while reading sideband packet

Unstable networking conditions can cause Gitaly to fail when trying to fetch large repository data from the primary site. Those conditions can result in this error:

curl 18 transfer closed with outstanding read data remaining & fetch-pack:
unexpected disconnect while reading sideband packet

This error is more likely to happen if a repository has to be replicated from scratch between sites.

Geo retries several times, but if the transmission is consistently interrupted by network hiccups, an alternative method such as rsync can be used to circumvent git and create the initial copy of any repository that fails to be replicated by Geo.

We recommend transferring each failing repository individually and checking for consistency after each transfer. Follow the single target rsync instructions to transfer each affected repository from the primary to the secondary site.

Project or project wiki repositories

Resync all Geo-replicable objects

You can schedule a full resync or reverification of all Geo-replicable objects from the UI:

  1. On the left sidebar, at the bottom, select Admin.
  2. Select Geo > Sites.
  3. Under Replication details, select the desired object.
  4. Select Resync all or Reverify all.

Alternatively, start a Rails console session on the secondary Geo site to gather more information, or execute these operations manually using the snippets below.

caution
Commands that change data can cause damage if not run correctly or under the right conditions. Always run commands in a test environment first and have a backup instance ready to restore.

Find repository verification failures

Get the number of verification failed repositories

Geo::ProjectRepositoryRegistry.verification_failed.count

Find the verification failed repositories

Geo::ProjectRepositoryRegistry.verification_failed

Find repositories that failed to sync

Geo::ProjectRepositoryRegistry.failed

Mark all repositories for reverification

The following snippet marks all project repositories for reverification. After a minute or two, the system should begin to schedule Sidekiq jobs according to your concurrency limits:

Geo::ProjectRepositoryRegistry.update_all(verification_state: 0)

If there’s a very large number of repositories to reverify, the single update query can time out. If this happens, you should run update queries in batches of rows using the same code as the Reverify all feature in the admin area:

::Geo::RegistryBulkUpdateService.new(:reverify_all, Geo::ProjectRepositoryRegistry).execute

Resync project and project wiki repositories

Queue up all repositories for resync

The following snippet marks all project repositories for reverification. After a minute or two, the system should begin to schedule Sidekiq jobs according to your concurrency limits:

Geo::ProjectRepositoryRegistry.update_all(state: 0, last_synced_at: nil)

If there’s a very large number of repositories to reverify, the single update query can time out. If this happens, you should run update queries in batches of rows using the same code as the Reverify all feature in the admin area:

::Geo::RegistryBulkUpdateService.new(:resync_all, Geo::ProjectRepositoryRegistry).execute

Sync individual repository now

project = Project.find_by_full_path('<group/project>')

project.replicator.sync

Sync all failed repositories now

The following script:

  • Loops over all failed repositories.
  • Displays the project details and the reasons for the last failure.
  • Attempts to resync the repository.
  • Reports back if a failure occurs, and why.
  • Might take some time to complete. Each repository check must complete before reporting back the result. If your session times out, take measures to allow the process to continue running such as starting a screen session, or running it using Rails runner and nohup.
Geo::ProjectRepositoryRegistry.failed.find_each do |registry|
   begin
     puts "ID: #{registry.id}, Project ID: #{registry.project_id}, Last Sync Failure: '#{registry.last_sync_failure}'"
     registry.replicator.sync
     puts "Sync initiated for registry ID: #{registry.id}"
   rescue => e
     puts "ID: #{registry.id}, Project ID: #{registry.project_id}, Failed: '#{e}'", e.backtrace.join("\n")
   end
end ; nil

Find repository check failures in a Geo secondary site

note
All repositories data types have been migrated to the Geo Self-Service Framework in GitLab 16.3. There is an issue to implement this functionality back in the Geo Self-Service Framework.

For GitLab 16.2 and earlier:

When enabled for all projects, Repository checks are also performed on Geo secondary sites. The metadata is stored in the Geo tracking database.

Repository check failures on a Geo secondary site do not necessarily imply a replication problem. Here is a general approach to resolve these failures.

  1. Find affected repositories as mentioned below, as well as their logged errors.
  2. Try to diagnose specific git fsck errors. The range of possible errors is wide, try putting them into search engines.
  3. Test typical functions of the affected repositories. Pull from the secondary, view the files.
  4. Check if the primary site’s copy of the repository has an identical git fsck error. If you are planning a failover, then consider prioritizing that the secondary site has the same information that the primary site has. Ensure you have a backup of the primary, and follow planned failover guidelines.
  5. Push to the primary and check if the change gets replicated to the secondary site.
  6. If replication is not automatically working, try to manually sync the repository.

Start a Rails console session to enact the following, basic troubleshooting steps.

caution
Commands that change data can cause damage if not run correctly or under the right conditions. Always run commands in a test environment first and have a backup instance ready to restore.

Get the number of repositories that failed the repository check

Geo::ProjectRegistry.where(last_repository_check_failed: true).count

Find the repositories that failed the repository check

Geo::ProjectRegistry.where(last_repository_check_failed: true)