---
stage: Data Access
group: Database Frameworks
info: 'See the Technical Writers assigned to Development Guidelines: https://handbook.gitlab.com/handbook/product/ux/technical-writing/#assignments-to-development-guidelines'
title: Batched background migrations
---

Batched background migrations should be used to perform data migrations whenever a
migration exceeds [the time limits](../migration_style_guide.md#how-long-a-migration-should-take)
in our guidelines. For example, you can use batched background
migrations to migrate data that's stored in a single JSON column
to a separate table instead.

{{< alert type="note" >}}

Batched background migrations replaced the legacy background migrations framework.
Refer to that documentation for any changes involving that framework.

{{< /alert >}}

{{< alert type="note" >}}

The batched background migrations framework has ChatOps support. Using ChatOps, GitLab engineers can interact with the batched background migrations present in the system.

{{< /alert >}}

## When to use batched background migrations

Use a batched background migration when you migrate data in tables containing
so many rows that the process would exceed
[the time limits in our guidelines](../migration_style_guide.md#how-long-a-migration-should-take)
if performed using a regular Rails migration.

- Batched background migrations should be used when migrating data in
  [high-traffic tables](../migration_style_guide.md#high-traffic-tables).
- Batched background migrations may also be used when executing numerous single-row queries
  for every item on a large dataset. Typically, for single-record patterns, runtime is
  largely dependent on the size of the dataset. Split the dataset accordingly,
  and put it into background migrations.
- Don't use batched background migrations to perform schema migrations.

Background migrations can help when:

- Migrating events from one table to multiple separate tables.
- Populating one column based on JSON stored in another column.
- Migrating data that depends on the output of external services. (For example, an API.)

### Notes

- If the batched background migration is part of an important upgrade, it must be announced
  in the release post. Discuss with your Project Manager if you're unsure if the migration falls
  into this category.
- You should use the [generator](#generate-a-batched-background-migration) to create batched background migrations,
  so that required files are created by default.

## How batched background migrations work

Batched background migrations (BBM) are subclasses of
`Gitlab::BackgroundMigration::BatchedMigrationJob` that define a `perform` method.
As the first step, a regular migration creates a `batched_background_migrations`
record with the BBM class and the required arguments. By default,
the `batched_background_migrations` record is in an active state, and it is picked up
by the Sidekiq worker to execute the actual batched migration.

All migration classes must be defined in the namespace `Gitlab::BackgroundMigration`. Place the files
in the directory `lib/gitlab/background_migration/`.
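
For illustration, a minimal job class might look like the following sketch. The class, table, and column names are placeholders; the helpers used (`operation_name`, `feature_category`, `each_sub_batch`) are shown in the real examples later on this page.

```ruby
# lib/gitlab/background_migration/backfill_some_column.rb (illustrative names)
module Gitlab
  module BackgroundMigration
    class BackfillSomeColumn < BatchedMigrationJob
      operation_name :update_all # used for instrumentation
      feature_category :database

      def perform
        # Each job receives a range of the batching column and walks it in sub-batches.
        each_sub_batch do |sub_batch|
          sub_batch.update_all('some_column = another_column')
        end
      end
    end
  end
end
```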

### Execution mechanism

Batched background migrations are picked from the queue in the order they are enqueued. Multiple migrations are fetched
and executed in parallel, as long as they are in an active state and do not target the same database table.
The default number of migrations processed in parallel is 2; for GitLab.com, this limit is configured to 4.
Once a migration is picked for execution, a job is created for the specific batch. After each job execution, the migration's
batch size may be increased or decreased, based on the performance of the last 20 jobs.

```plantuml
@startuml
hide empty description
skinparam ConditionEndStyle hline
left to right direction
rectangle "Batched background migration queue" as migrations {
  rectangle "Migration N (active)" as migrationn
  rectangle "Migration 1 (completed)" as migration1
  rectangle "Migration 2 (active)" as migration2
  rectangle "Migration 3 (on hold)" as migration3
  rectangle "Migration 4 (active)" as migration4
  migration1 -[hidden]> migration2
  migration2 -[hidden]> migration3
  migration3 -[hidden]> migration4
  migration4 -[hidden]> migrationn
}
rectangle "Execution Workers" as workers {
  rectangle "Execution Worker 1 (busy)" as worker1
  rectangle "Execution Worker 2 (available)" as worker2
  worker1 -[hidden]> worker2
}
migration2 --> [Scheduling Worker]
migration4 --> [Scheduling Worker]
[Scheduling Worker] --> worker2
@enduml
```

As soon as a worker is available, the BBM is processed by the runner.

```plantuml
@startuml
hide empty description
start
rectangle Runner {
  :Migration;
  if (Have reached batching bounds?) then (Yes)
    if (Have jobs to retry?) then (Yes)
      :Fetch the batched job;
    else (No)
      :Finish active migration;
      stop
    endif
  else (No)
    :Create a batched job;
  endif
  :Execute batched job;
  :Evaluate DB health;
  note right: Checks for table autovacuum, Patroni Apdex, Write-ahead logging
  if (Evaluation signs to stop?) then (Yes)
    :Put migration on hold;
  else (No)
    :Optimize migration;
  endif
}
@enduml
```

### Idempotence

Batched background migrations are executed in the context of a Sidekiq process.
The usual Sidekiq rules apply, especially the rule that jobs should be small
and idempotent. Ensure that data integrity is guaranteed even if your migration
job is retried.

See [Sidekiq best practices guidelines](https://github.com/mperham/sidekiq/wiki/Best-Practices)
for more details.
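
For example, an update that only touches rows still in the old state can be retried safely. A minimal sketch, assuming a hypothetical `namespace_id` backfill:

```ruby
def perform
  each_sub_batch do |sub_batch|
    # Restricting the update to rows that still need it makes a retry of the
    # same batch a no-op for rows that were already migrated.
    sub_batch.where(namespace_id: nil).update_all('namespace_id = source_id')
  end
end
```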

### Migration optimization

After each job execution, a verification takes place to check if the migration can be optimized.
The underlying optimization mechanism is based on the concept of time efficiency. It calculates
the exponential moving average of time efficiencies for the last N jobs and updates the batch
size of the batched background migration to its optimal value.

This mechanism, however, makes it hard for us to provide an accurate estimation for the total
execution time of the migration when using the [database migration pipeline](database_migration_pipeline.md).

We are discussing ways to fix this problem in
[this issue](https://gitlab.com/gitlab-org/database-team/gitlab-com-database-testing/-/issues/162).
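
Conceptually, the optimization resembles the following sketch. This is not the actual implementation; the smoothing factor and target duration are illustrative.

```ruby
# "Time efficiency" is how close each job's duration is to a target duration.
# An exponential moving average over recent jobs smooths out outliers, and the
# batch size is nudged toward the value that would hit the target.
ALPHA = 0.4 # illustrative smoothing factor

def optimal_batch_size(recent_jobs, current_batch_size, target_duration: 90.0)
  efficiencies = recent_jobs.map { |job| target_duration / job.duration_seconds }
  ema = efficiencies.reduce { |average, value| ALPHA * value + (1 - ALPHA) * average }

  (current_batch_size * ema).round
end
```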

### Job retry mechanism

The batched background migrations retry mechanism ensures that a job is executed again in case of failure.
The following diagram shows the different stages of our retry mechanism:

```plantuml
@startuml
hide empty description
note as N1
  can_split?:
  the failure is due to a query timeout
end note
[*] --> Running
Running --> Failed
note on link
  if number of retries <= MAX_ATTEMPTS
end note
Running --> Succeeded
Failed --> Running
note on link
  if number of retries > MAX_ATTEMPTS
  and can_split? == true
  then two jobs with smaller
  batch size will be created
end note
Failed --> [*]
Succeeded --> [*]
@enduml
```

- `MAX_ATTEMPTS` is defined in the [`Gitlab::Database::BackgroundMigration::BatchedJob`](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/gitlab/database/background_migration/batched_job.rb)
  class.
- `can_split?` is defined in the [`Gitlab::Database::BackgroundMigration::BatchedJob`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/database/background_migration/batched_job.rb) class.

### Failed batched background migrations

The whole batched background migration is marked as `failed`
(`/chatops run batched_background_migrations status MIGRATION_ID` shows
the migration as `failed`) if any of the following is true:

- There are no more jobs to consume, and there are failed jobs.
- More than [half of the jobs failed since the background migration was started](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/gitlab/database/background_migration/batched_migration.rb#L160).

### Throttling batched migrations

Because batched migrations are update-heavy, and there have been incidents caused by the heavy load from these migrations while the database was underperforming, a throttling mechanism exists to mitigate future incidents.

These database indicators are checked to throttle a migration. Upon receiving a
stop signal, the migration is paused for a set time (10 minutes):

- WAL queue pending archival crossing the threshold.
- Active autovacuum on the tables the migration works on (enabled by default as of GitLab 18.0).
- Patroni apdex SLI dropping below the SLO.
- WAL rate crossing the threshold.

There is an ongoing effort to add more indicators to further enhance the
database health check framework. For more details, see
[epic 7594](https://gitlab.com/groups/gitlab-org/-/epics/7594).

#### How to disable/enable autovacuum indicator on tables

As of GitLab 18.0, this health indicator is enabled by default. To disable it, run the following command in the Rails console:

```ruby
Feature.disable(:batched_migrations_health_status_autovacuum)
```

Alternatively, if you want to enable it again, run the following command in the Rails console:

```ruby
Feature.enable(:batched_migrations_health_status_autovacuum)
```

### Isolation

Batched background migrations must be isolated and cannot use application code (for example,
models defined in `app/models` except the `ApplicationRecord` classes).
Because these migrations can take a long time to run, it's possible
for new versions to deploy while the migrations are still running.

### Depending on migrated data

Unlike a regular or a post-deployment migration, waiting for the next release is not enough to guarantee that the data was fully migrated.
That means that you shouldn't depend on the data until the BBM is finished. If having 100% of the data migrated is a requirement,
then the `ensure_batched_background_migration_is_finished` helper can be used to guarantee that the migration was finished and the
data fully migrated. ([See an example](https://gitlab.com/gitlab-org/gitlab/-/blob/41fbe34a4725a4e357a83fda66afb382828767b2/db/post_migrate/20210707210916_finalize_ci_stages_bigint_conversion.rb#L13-18)).
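
A hedged sketch of such a post-deployment migration follows; the job class, table, and column are illustrative, and a complete version appears in the [Examples](#examples) section.

```ruby
class FinalizeMyBatchedMigration < Gitlab::Database::Migration[2.2]
  disable_ddl_transaction!

  restrict_gitlab_migration gitlab_schema: :gitlab_main

  def up
    ensure_batched_background_migration_is_finished(
      job_class_name: 'MyBatchedMigration', # illustrative job class name
      table_name: :projects,
      column_name: :id,
      job_arguments: [],
      finalize: true
    )
  end

  def down
    # no-op
  end
end
```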

## How to

### Generate a batched background migration

The custom generator `batched_background_migration` scaffolds necessary files and
accepts `table_name`, `column_name`, and `feature_category` as arguments. When
choosing the `column_name`, ensure that you are using a column type that can be iterated over distinctly,
preferably the table's primary key. The table is iterated over based on the column defined here.
For more information, see [Batch over non-distinct columns](#batch-over-non-distinct-columns).

Usage:

```shell
bundle exec rails g batched_background_migration my_batched_migration --table_name=<table-name> --column_name=<column-name> --feature_category=<feature-category>
```

This command creates the following files:

- `db/post_migrate/20230214231008_queue_my_batched_migration.rb`
- `spec/migrations/20230214231008_queue_my_batched_migration_spec.rb`
- `lib/gitlab/background_migration/my_batched_migration.rb`
- `spec/lib/gitlab/background_migration/my_batched_migration_spec.rb`

### Enqueue a batched background migration

Queueing a batched background migration should be done in a post-deployment
migration. Use this `queue_batched_background_migration` example, queueing the
migration to be executed in batches. Replace the class name and arguments with the values
from your migration:

```ruby
queue_batched_background_migration(
  JOB_CLASS_NAME,
  TABLE_NAME,
  JOB_ARGUMENTS
)
```

{{< alert type="note" >}}

This helper raises an error if the number of provided job arguments does not match
the number of [job arguments](#use-job-arguments) defined in `JOB_CLASS_NAME`.

{{< /alert >}}

Make sure the newly-created data is either migrated, or
saved in both the old and new version upon creation. Removals in
turn can be handled by defining foreign keys with cascading deletes.

### Finalize a batched background migration

Finalizing a batched background migration is done by calling
`ensure_batched_background_migration_is_finished` at least one required stop after queueing it.
This ensures a smooth upgrade process for GitLab Self-Managed instances.

It is important to finalize all batched background migrations when it is safe
to do so. Leaving old batched background migrations around is a form of
technical debt that needs to be maintained in tests and in application
behavior.

{{< alert type="note" >}}

You cannot depend on any batched background migration being completed until after it is finalized.

{{< /alert >}}

We recommend that batched background migrations are finalized after all of the
following conditions are met:

- The batched background migration is completed on GitLab.com.
- The batched background migration was added in or before the last [required stop](required_stops.md). For example, if 17.8 is a required stop and the migration was added in 17.7, the [finalizing migration can be added in 17.9](required_stops.md#long-running-migrations-being-finalized).

The `ensure_batched_background_migration_is_finished` call must exactly match
the migration that was used to enqueue it. Pay careful attention to:

- The job arguments: they need to exactly match, or the helper does not find the queued migration.
- The `gitlab_schema`: it needs to exactly match, or the helper does not find the queued
  migration. Even if the `gitlab_schema` of the table has changed from
  `gitlab_main` to `gitlab_main_cell` in the meantime, you must finalize it
  with `gitlab_main` if that's what was used when queueing the batched
  background migration.

When finalizing a batched background migration you also need to update the
`finalized_by` in the corresponding `db/docs/batched_background_migrations`
file. The value should be the timestamp/version of the migration you added to
finalize it.
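
For example (timestamp illustrative), the dictionary file gains a value like this:

```yaml
finalized_by: 20250105123456 # version of the finalizing migration
```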

See the [Examples](#examples) below for specific details on what the actual
migration code should be.

{{< alert type="note" >}}

If a migration is finalized before one required stop has passed since it was enqueued, an early
finalization error is raised. If the migration must be finalized earlier than that,
use the `skip_early_finalization_validation: true` option to skip this check.

{{< /alert >}}

### Deleting batched background migration code

Once a batched background migration has completed, is finalized, and has not been [re-queued](#re-queue-batched-background-migrations),
the migration code in `lib/gitlab/background_migration/` and its associated tests can be deleted after the next required stop following
the finalization.

Here is an example scenario:

- 17.3 and 17.5 are required stops.
- In 17.1 the batched background migration is queued.
- In 17.4 the migration may be finalized, provided that it's completed on GitLab.com.
- In 17.6 the code related to the migration may be deleted.

Batched background migration code is routinely deleted when [migrations are squashed](migration_squashing.md).

### Re-queue batched background migrations

A batched background migration might need to be re-run for one of several
reasons:

- The migration contains a bug ([example](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/93546)).
- The migration cleaned up data but the data became de-normalized again due to a
  bypass in application logic ([example](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/123002)).
- The batch size of the original migration causes the migration to fail ([example](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/121404)).

To requeue a batched background migration, you must:

- No-op the contents of the `#up` and `#down` methods of the
  original migration file. Otherwise, the batched background migration is created,
  deleted, then created again on systems that are upgrading multiple patch
  releases at once.
- Add a new post-deployment migration that re-runs the batched background
  migration.
- In the new post-deployment migration, delete the existing batched background
  migration using the `delete_batched_background_migration` method at the start
  of the `#up` method to ensure that any existing runs are cleaned up.
- Update the `db/docs/batched_background_migrations/*.yml` file from the original
  migration to include information about the requeue.

#### Example

**Original Migration**:

```ruby
# frozen_string_literal: true

class QueueResolveVulnerabilitiesForRemovedAnalyzers < Gitlab::Database::Migration[2.2]
  milestone '17.3'

  MIGRATION = "ResolveVulnerabilitiesForRemovedAnalyzers"

  def up
    # no-op because there was a bug in the original migration, which has been
    # fixed in https://gitlab.com/gitlab-org/gitlab/-/merge_requests/162527
  end

  def down
    # no-op because there was a bug in the original migration, which has been
    # fixed in https://gitlab.com/gitlab-org/gitlab/-/merge_requests/162527
  end
end
```

**Requeued migration**:

```ruby
# frozen_string_literal: true

class RequeueResolveVulnerabilitiesForRemovedAnalyzers < Gitlab::Database::Migration[2.2]
  milestone '17.4'

  restrict_gitlab_migration gitlab_schema: :gitlab_main

  MIGRATION = "ResolveVulnerabilitiesForRemovedAnalyzers"
  BATCH_SIZE = 10_000
  SUB_BATCH_SIZE = 100

  def up
    # Clear previous background migration execution from QueueResolveVulnerabilitiesForRemovedAnalyzers
    delete_batched_background_migration(MIGRATION, :vulnerability_reads, :id, [])

    queue_batched_background_migration(
      MIGRATION,
      :vulnerability_reads,
      :id,
      batch_size: BATCH_SIZE,
      sub_batch_size: SUB_BATCH_SIZE
    )
  end

  def down
    delete_batched_background_migration(MIGRATION, :vulnerability_reads, :id, [])
  end
end
```

**Batched migration dictionary**:

The `milestone` and `queued_migration_version` should be those of the requeued migration (in this example, `RequeueResolveVulnerabilitiesForRemovedAnalyzers`).

```yaml
---
migration_job_name: ResolveVulnerabilitiesForRemovedAnalyzers
description: Resolves all detected vulnerabilities for removed analyzers.
feature_category: static_application_security_testing
introduced_by_url: https://gitlab.com/gitlab-org/gitlab/-/merge_requests/162691
milestone: '17.4'
queued_migration_version: 20240814085540
finalized_by: # version of the migration that finalized this BBM
```

### Stop and remove batched background migrations

A batched background migration in a running state can be stopped and removed for several reasons:

- The migration is no longer relevant or required because the product use case changed.
- The migration has to be superseded by another migration with different logic.

To stop and remove an in-progress batched background migration, you must:

- In Release N, no-op the contents of the `#up` and `#down` methods of the scheduling database migration.

  ```ruby
  class BackfillNamespaceType < Gitlab::Database::Migration[2.1]
    # Reason why we don't need the BBM anymore. For example: This BBM is no longer needed because it is superseded by another BBM with different logic.
    def up; end

    def down; end
  end
  ```

- In Release N, add a regular migration to delete the existing batched migration.
  Delete the existing batched background migration using the `delete_batched_background_migration` method at the
  start of the `#up` method to ensure that any existing runs are cleaned up.

  ```ruby
  class CleanupBackfillNamespaceType < Gitlab::Database::Migration[2.1]
    MIGRATION = "MyMigrationClass"

    restrict_gitlab_migration gitlab_schema: :gitlab_main

    def up
      delete_batched_background_migration(MIGRATION, :vulnerabilities, :id, [])
    end

    def down; end
  end
  ```

- In Release N, also delete the migration class file (`lib/gitlab/background_migration/my_batched_migration.rb`) and its specs.

All the above steps can be implemented in a single MR.

### Use job arguments

`BatchedMigrationJob` provides the `job_arguments` helper method for job classes to define the job arguments they need.

Batched migrations scheduled with `queue_batched_background_migration` **must** use the helper to define the job arguments:

```ruby
queue_batched_background_migration(
  'CopyColumnUsingBackgroundMigrationJob',
  TABLE_NAME,
  'name', 'name_convert_to_text'
)
```

{{< alert type="note" >}}

If the number of defined job arguments does not match the number of job arguments provided when
scheduling the migration, `queue_batched_background_migration` raises an error.

{{< /alert >}}

In this example, `copy_from` returns `name`, and `copy_to` returns `name_convert_to_text`:

```ruby
class CopyColumnUsingBackgroundMigrationJob < BatchedMigrationJob
  job_arguments :copy_from, :copy_to
  operation_name :update_all

  def perform
    from_column = connection.quote_column_name(copy_from)
    to_column = connection.quote_column_name(copy_to)

    assignment_clause = "#{to_column} = #{from_column}"

    each_sub_batch do |relation|
      relation.update_all(assignment_clause)
    end
  end
end
```

### Use filters

By default, when creating background jobs to perform the migration, batched background migrations
iterate over the full specified table. This iteration is done using the
[`PrimaryKeyBatchingStrategy`](https://gitlab.com/gitlab-org/gitlab/-/blob/c9dabd1f4b8058eece6d8cb4af95e9560da9a2ee/lib/gitlab/database/migrations/batched_background_migration_helpers.rb#L17). If the table has 1000 records
and the batch size is 100, the work is batched into 10 jobs. For illustrative purposes,
`EachBatch` is used like this:

```ruby
# PrimaryKeyBatchingStrategy
Namespace.each_batch(of: 100) do |relation|
  relation.where(type: nil).update_all(type: 'User') # this happens in each background job
end
```

#### Using a composite or partial index to iterate a subset of the table

When applying additional filters, it is important to ensure they are properly
[covered by an index](iterating_tables_in_batches.md#example-2-iteration-with-filters)
to optimize `EachBatch` performance.
In the examples below, we need an index on `(type, id)` or `id WHERE type IS NULL`
to support the filters. See
the [`EachBatch` documentation](iterating_tables_in_batches.md) for more information.

If you have a suitable index and you want to iterate only a subset of the table,
you can apply a `where` clause before the `each_batch` like:

```ruby
# Works well if there is an index like either of:
# - `id WHERE type IS NULL`
# - `(type, id)`
# Does not work well otherwise.
Namespace.where(type: nil).each_batch(of: 100) do |relation|
  relation.update_all(type: 'User')
end
```

An advantage of this approach is that you get consistent batch sizes. But it is
only suitable where there is an index that matches the `where` clauses as well
as the batching strategy.

`BatchedMigrationJob` provides a `scope_to` helper method to apply additional filters and achieve this:

1. Create a new migration job class that inherits from `BatchedMigrationJob` and defines the additional filter:

   ```ruby
   class BackfillNamespaceType < BatchedMigrationJob

     # Works well if there is an index like either of:
     # - `id WHERE type IS NULL`
     # - `(type, id)`
     # Does not work well otherwise.
     scope_to ->(relation) { relation.where(type: nil) }
     operation_name :update_all
     feature_category :source_code_management

     def perform
       each_sub_batch do |sub_batch|
         sub_batch.update_all(type: 'User')
       end
     end
   end
   ```

   {{< alert type="note" >}}

   For EE migrations that define `scope_to`, ensure the module extends `ActiveSupport::Concern`.
   Otherwise, records are processed without taking the scope into consideration.

   {{< /alert >}}

1. In the post-deployment migration, enqueue the batched background migration:

   ```ruby
   class BackfillNamespaceType < Gitlab::Database::Migration[2.1]
     MIGRATION = 'BackfillNamespaceType'

     restrict_gitlab_migration gitlab_schema: :gitlab_main

     def up
       queue_batched_background_migration(
         MIGRATION,
         :namespaces,
         :id
       )
     end

     def down
       delete_batched_background_migration(MIGRATION, :namespaces, :id, [])
     end
   end
   ```

### Access data for multiple databases

Unlike regular migrations, background migrations have access to multiple databases
and can be used to efficiently access and update data across them. To indicate which
database to use, define ActiveRecord models inline in the migration code.
Each model should use the correct [`ApplicationRecord`](multiple_databases.md#gitlab-schema)
for the database where the table is located. Using `ActiveRecord::Base`
is disallowed because it does not explicitly describe which database is used to access the given table.

```ruby
# good
class Gitlab::BackgroundMigration::ExtractIntegrationsUrl
  class Project < ::ApplicationRecord
    self.table_name = 'projects'
  end

  class Build < ::Ci::ApplicationRecord
    self.table_name = 'ci_builds'
  end
end

# bad
class Gitlab::BackgroundMigration::ExtractIntegrationsUrl
  class Project < ActiveRecord::Base
    self.table_name = 'projects'
  end

  class Build < ActiveRecord::Base
    self.table_name = 'ci_builds'
  end
end
```

Similarly, using `ActiveRecord::Base.connection` is disallowed; prefer the
model's connection instead.

```ruby
# good
Project.connection.execute("SELECT * FROM projects")

# acceptable
ApplicationRecord.connection.execute("SELECT * FROM projects")

# bad
ActiveRecord::Base.connection.execute("SELECT * FROM projects")
```

### Batch over non-distinct columns

The default batching strategy provides an efficient way to iterate over primary key columns.
However, if you need to iterate over columns where values are not unique, you must use a
different batching strategy.

The `LooseIndexScanBatchingStrategy` batching strategy uses a special version of [`EachBatch`](iterating_tables_in_batches.md#loose-index-scan-with-distinct_each_batch)
to provide efficient and stable iteration over the distinct column values.

This example shows a batched background migration where the `issues.project_id` column is used as
the batching column.

Database post-migration:

```ruby
class ProjectsWithIssuesMigration < Gitlab::Database::Migration[2.1]
  MIGRATION = 'BatchProjectsWithIssues'
  BATCH_SIZE = 5000
  SUB_BATCH_SIZE = 500
  restrict_gitlab_migration gitlab_schema: :gitlab_main

  disable_ddl_transaction!
  def up
    queue_batched_background_migration(
      MIGRATION,
      :issues,
      :project_id,
      batch_size: BATCH_SIZE,
      batch_class_name: 'LooseIndexScanBatchingStrategy', # Override the default batching strategy
      sub_batch_size: SUB_BATCH_SIZE
    )
  end

  def down
    delete_batched_background_migration(MIGRATION, :issues, :project_id, [])
  end
end
```

Implementing the background migration class:

```ruby
module Gitlab
  module BackgroundMigration
    class BatchProjectsWithIssues < Gitlab::BackgroundMigration::BatchedMigrationJob
      include Gitlab::Database::DynamicModelHelpers

      operation_name :backfill_issues

      def perform
        distinct_each_batch do |batch|
          project_ids = batch.pluck(batch_column)
          # do something with the distinct project_ids
        end
      end
    end
  end
end
```

{{< alert type="note" >}}

[Additional filters](#use-filters) defined with `scope_to` are ignored by `LooseIndexScanBatchingStrategy` and `distinct_each_batch`.

{{< /alert >}}

### Calculate overall time estimation of a batched background migration

It's possible to estimate how long a BBM takes to complete. GitLab already provides an estimation through the `db:gitlabcom-database-testing` pipeline.
This estimation is built by sampling production data in a test environment and represents the maximum time that the migration could take, not necessarily
the actual time that the migration takes. In certain scenarios, estimations provided by the `db:gitlabcom-database-testing` pipeline may not be enough to
account for all the singularities around the records being migrated, making further calculations necessary. In that case, the formula
`interval * number of records / max batch size` can be used to determine an approximate estimation of how long the migration takes,
where `interval` and `max batch size` refer to options defined for the job, and `number of records` is the total number of records to be migrated.
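
For example, assuming an interval of 2 minutes between jobs, a hypothetical 10 million records to migrate, and a maximum batch size of 20,000:

```plain
(120 seconds * 10,000,000 records) / 20,000 records per batch = 60,000 seconds ≈ 16.7 hours
```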

{{< alert type="note" >}}

Estimations may be affected by the [migration optimization mechanism](#migration-optimization).

{{< /alert >}}

### Cleaning up a batched background migration

{{< alert type="note" >}}

Cleaning up any remaining background migrations must be done in either a major
or minor release. You must not do this in a patch release.

{{< /alert >}}

Because background migrations can take a long time, you can't immediately clean
things up after queueing them. For example, you can't drop a column used in the
migration process, as jobs would fail. You must add a separate _post-deployment_
migration in a future release that finishes any remaining
jobs before cleaning things up. (For example, removing a column.)

To migrate the data from column `foo` (containing a big JSON blob) to column `bar`
(containing a string), you would:

1. Release A:
   1. Create a migration class that performs the migration for a row with a given ID.
   1. Update new rows using one of these techniques:
      - Create a new trigger for copy operations that don't need application logic.
      - Handle this operation in the model/service as the records are created or updated.
      - Create a new custom background job that updates the records.
   1. Queue the batched background migration for all existing rows in a post-deployment migration.
1. Release B:
   1. Add a post-deployment migration that checks if the batched background migration is completed.
   1. Deploy code so that the application starts using the new column and stops updating the old one.
   1. Remove the old column.

Bumping the [import/export version](../../user/project/settings/import_export.md) may
be required, if importing a project from a prior version of GitLab requires the
data to be in the new format.

### Add indexes to support batched background migrations

Sometimes it is necessary to add a new or temporary index to support a batched background migration.
To do this, create the index in a post-deployment migration that precedes the post-deployment
migration that queues the background migration.
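
A minimal sketch of such an index migration, assuming an illustrative table, column, and index name:

```ruby
class AddTmpIndexForMyBackfill < Gitlab::Database::Migration[2.2]
  disable_ddl_transaction!

  INDEX_NAME = 'tmp_index_projects_on_id_where_some_column_null'

  def up
    # Temporary index that supports the batched background migration queued in a
    # later post-deployment migration.
    add_concurrent_index :projects, :id, where: 'some_column IS NULL', name: INDEX_NAME
  end

  def down
    remove_concurrent_index_by_name :projects, INDEX_NAME
  end
end
```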

See the documentation for [adding database indexes](adding_database_indexes.md#analyzing-a-new-index-before-a-batched-background-migration)
for additional information about some cases that require special attention to allow the index to be used directly after
creation.

### Execute a particular batch on the database testing pipeline

{{< alert type="note" >}}

Only [database maintainers](https://gitlab.com/groups/gitlab-org/maintainers/database/-/group_members?with_inherited_permissions=exclude) can view the database testing pipeline artifacts. Ask one for help if you need to use this method.

{{< /alert >}}

Let's assume that a batched background migration failed on a particular batch on GitLab.com and you want to figure out which query failed and why. At the moment, we don't have a good way to retrieve query information (especially the query parameters), and rerunning the entire migration with more logging would be a long process.

Fortunately, you can leverage our [database migration pipeline](database_migration_pipeline.md) to rerun a particular batch with additional logging and/or a fix to see if it solves the problem.

For an example, see [Draft: `Test PG::CardinalityViolation` fix](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/110910), but make sure to read the entire section.

To do that, you need to:

1. [Find the batch `start_id` and `end_id`](#find-the-batch-start_id-and-end_id)
1. [Create a regular migration](#create-a-regular-migration)
1. [Apply a workaround for our migration helpers](#apply-a-workaround-for-our-migration-helpers-optional) (optional)
1. [Start the database migration pipeline](#start-the-database-migration-pipeline)

#### Find the batch `start_id` and `end_id`

You should be able to find those in [Kibana](#viewing-failure-error-logs).

#### Create a regular migration

Schedule the batch in the `up` block of a regular migration:

```ruby
def up
  instance = Gitlab::BackgroundMigration::YourBackgroundMigrationClass.new(
    start_id: <batch start_id>,
    end_id: <batch end_id>,
    batch_table: <table name>,
    batch_column: <batching column>,
    sub_batch_size: <sub batch size>,
    pause_ms: <milliseconds between batches>,
    job_arguments: <job arguments if any>,
    connection: connection
  )

  instance.perform
end

def down
  # no-op
end
```

#### Apply a workaround for our migration helpers (optional)

If your batched background migration touches tables from a schema other than the one you specified by using the `restrict_gitlab_migration` helper (example: the scheduling migration has `restrict_gitlab_migration gitlab_schema: :gitlab_main` but the background job uses tables from the `:gitlab_ci` schema), then the migration fails. To prevent that from happening, you must monkey patch the database helpers so they don't fail the testing pipeline job:

1. Add the schema names to [`RestrictGitlabSchema`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/database/migration_helpers/restrict_gitlab_schema.rb#L57)

   ```diff
   diff --git a/lib/gitlab/database/migration_helpers/restrict_gitlab_schema.rb b/lib/gitlab/database/migration_helpers/restrict_gitlab_schema.rb
   index b8d1d21a0d2d2a23d9e8c8a0a17db98ed1ed40b7..912e20659a6919f771045178c66828563cb5a4a1 100644
   --- a/lib/gitlab/database/migration_helpers/restrict_gitlab_schema.rb
   +++ b/lib/gitlab/database/migration_helpers/restrict_gitlab_schema.rb
   @@ -55,7 +55,7 @@ def unmatched_schemas
      end

      def allowed_schemas_for_connection
   -     Gitlab::Database.gitlab_schemas_for_connection(connection)
   +     Gitlab::Database.gitlab_schemas_for_connection(connection) << :gitlab_ci
      end
    end
    end
   ```

1. Add the schema names to [`RestrictAllowedSchemas`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/database/query_analyzers/restrict_allowed_schemas.rb#L82)

   ```diff
   diff --git a/lib/gitlab/database/query_analyzers/restrict_allowed_schemas.rb b/lib/gitlab/database/query_analyzers/restrict_allowed_schemas.rb
   index 4ae3622479f0800c0553959e132143ec9051898e..d556ec7f55adae9d46a56665ce02de782cb09f2d 100644
   --- a/lib/gitlab/database/query_analyzers/restrict_allowed_schemas.rb
   +++ b/lib/gitlab/database/query_analyzers/restrict_allowed_schemas.rb
   @@ -79,7 +79,7 @@ def restrict_to_dml_only(parsed)
        tables = self.dml_tables(parsed)
        schemas = self.dml_schemas(tables)

   -     if (schemas - self.allowed_gitlab_schemas).any?
   +     if (schemas - (self.allowed_gitlab_schemas << :gitlab_ci)).any?
          raise DMLAccessDeniedError, \
            "Select/DML queries (SELECT/UPDATE/DELETE) do access '#{tables}' (#{schemas.to_a}) " \
            "which is outside of list of allowed schemas: '#{self.allowed_gitlab_schemas}'. " \
   ```

#### Start the database migration pipeline

Create a Draft merge request with your changes and trigger the manual `db:gitlabcom-database-testing` job.

### Establish dependencies

In some instances, a migration depends on the completion of previously enqueued BBMs. If the BBMs are
still running, the dependent migration fails. For example, introducing a unique index on a large table can depend on
a previously enqueued BBM to handle any duplicate records.

The following process has been configured to make dependencies more evident while writing a migration.

- The version of the migration that queued the BBM is stored in the `batched_background_migrations` table and in the BBM dictionary file.
- The `DEPENDENT_BATCHED_BACKGROUND_MIGRATIONS` constant is added (commented out by default) in each migration file.
  To establish the dependency, add the `queued_migration_version` of the dependent BBMs. If there is none, remove
  the commented line.
- The `Migration::UnfinishedDependencies` cop complains if the dependent BBMs are not yet finished. It determines
  whether they are finished by looking up the `finalized_by` key in the
  [BBM dictionary](https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/generators/batched_background_migration/templates/batched_background_migration_dictionary.template).

Example:

```ruby
# db/post_migrate/20231113120650_queue_backfill_routes_namespace_id.rb
class QueueBackfillRoutesNamespaceId < Gitlab::Database::Migration[2.1]
  MIGRATION = 'BackfillRouteNamespaceId'

  restrict_gitlab_migration gitlab_schema: :gitlab_main
  ...
  ...

  def up
    queue_batched_background_migration(
      MIGRATION,
      ...
    )
  end
end
```

```ruby
# This depends on the finalization of QueueBackfillRoutesNamespaceId BBM
class AddNotNullToRoutesNamespaceId < Gitlab::Database::Migration[2.1]
  DEPENDENT_BATCHED_BACKGROUND_MIGRATIONS = ["20231113120650"]

  def up
    add_not_null_constraint :routes, :namespace_id
  end

  def down
    remove_not_null_constraint :routes, :namespace_id
  end
end
```

## Managing

{{< alert type="note" >}}

BBM management takes place through the ChatOps integration, which is limited to GitLab team members only.

{{< /alert >}}

### List batched background migrations

To list the batched background migrations in the system, run this command:

`/chatops run batched_background_migrations list`

This command supports the following options:

- Database selection:
  - `--database DATABASE_NAME`: Connects to the given database:
    - `main`: Uses the main database (default).
    - `ci`: Uses the CI database.
- Environment selection:
  - `--dev`: Uses the `dev` environment.
  - `--staging`: Uses the `staging` environment.
  - `--staging_ref`: Uses the `staging_ref` environment.
  - `--production`: Uses the `production` environment (default).
- Filter by job class:
  - `--job-class-name JOB_CLASS_NAME`: Only list jobs for the given job class.
    - This is the `migration_job_name` in the YAML definition of the background migration.
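
For example, to list migrations on the CI database in the staging environment, filtered by a hypothetical job class name:

```plain
/chatops run batched_background_migrations list --database ci --staging --job-class-name BackfillMyColumn
```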

Output example:

![Output of the ChatOps command listing batched background migrations](img/list_v14_1.png)

{{< alert type="note" >}}

ChatOps returns 20 batched background migrations ordered by `created_at` (descending).

{{< /alert >}}

### Monitor the progress and status of a batched background migration

To see the status and progress of a specific batched background migration, run this command:

`/chatops run batched_background_migrations status MIGRATION_ID`

This command supports the following options:

- Database selection:
  - `--database DATABASE_NAME`: Connects to the given database:
    - `main`: Uses the main database (default)
    - `ci`: Uses the CI database
- Environment selection:
  - `--dev`: Uses the `dev` environment.
  - `--staging`: Uses the `staging` environment.
  - `--staging_ref`: Uses the `staging_ref` environment.
  - `--production`: Uses the `production` environment (default).

Output example:

![Output of the ChatOps command showing the status of a batched background migration](img/status_v14_1.png)

`Progress` represents the percentage of the background migration that has been completed.

Definitions of the batched background migration states:

- **Active**: Either:
  - Ready to be picked by the runner.
  - Running batched jobs.
- **Finalizing**: Running batched jobs.
- **Failed**: Failed batched background migration.
- **Finished**: All jobs were executed successfully and the batched background migration is complete.
- **Paused**: Not visible to the runner.
- **Finalized**: Batched migration was verified with
  [`ensure_batched_background_migration_is_finished`](#finalize-a-batched-background-migration) and is complete.

### Pause a batched background migration

To pause a batched background migration, run the following command:

`/chatops run batched_background_migrations pause MIGRATION_ID`

This command supports the following options:

- Database selection:
  - `--database DATABASE_NAME`: Connects to the given database:
    - `main`: Uses the main database (default).
    - `ci`: Uses the CI database.
- Environment selection:
  - `--dev`: Uses the `dev` environment.
  - `--staging`: Uses the `staging` environment.
  - `--staging_ref`: Uses the `staging_ref` environment.
  - `--production`: Uses the `production` environment (default).

Output example:

![Output of the ChatOps command pausing a batched background migration](img/pause_v14_6.png)

{{< alert type="note" >}}

You can pause only `active` batched background migrations.

{{< /alert >}}

### Resume a batched background migration

To resume a batched background migration, run the following command:

`/chatops run batched_background_migrations resume MIGRATION_ID`

This command supports the following options:

- Database selection:
  - `--database DATABASE_NAME`: Connects to the given database:
    - `main`: Uses the main database (default).
    - `ci`: Uses the CI database.
- Environment selection:
  - `--dev`: Uses the `dev` environment.
  - `--staging`: Uses the `staging` environment.
  - `--staging_ref`: Uses the `staging_ref` environment.
  - `--production`: Uses the `production` environment (default).

Output example:

![Output of the ChatOps command resuming a batched background migration](img/resume_v14_6.png)

{{< alert type="note" >}}

You can resume only `paused` batched background migrations.

{{< /alert >}}

### Enable or disable background migrations

In extremely limited circumstances, a GitLab administrator can disable either or
both of these [feature flags](../../administration/feature_flags/_index.md):

- `execute_background_migrations`
- `execute_batched_migrations_on_schedule`

These flags are enabled by default. Disable them only as a last resort
to limit database operations in special circumstances, like database host maintenance.
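
For example, to disable batched background migration scheduling from the Rails console (and re-enable it afterwards with `Feature.enable`):

```ruby
Feature.disable(:execute_batched_migrations_on_schedule)
```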

{{< alert type="warning" >}}

Do not disable either of these flags unless you fully understand the ramifications. If you disable
the `execute_background_migrations` or `execute_batched_migrations_on_schedule` feature flag,
GitLab upgrades might fail and data loss might occur.

{{< /alert >}}

## Batched background migrations for EE-only features

All the background migration classes for EE-only features should be present in GitLab FOSS.
For this purpose, create an empty class for GitLab FOSS, and extend it for GitLab EE
as explained in the guidelines for
[implementing Enterprise Edition features](../ee_features.md#code-in-libgitlabbackground_migration).

{{< alert type="note" >}}

Background migration classes for EE-only features that use job arguments should define them
in the GitLab FOSS class. Definitions are required to prevent job arguments validation from failing when
the migration is scheduled in the GitLab FOSS context.

{{< /alert >}}

You can use the [generator](#generate-a-batched-background-migration) to generate an EE-only migration scaffold by passing
the `--ee-only` flag when generating a new batched background migration.
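
As a rough sketch of the resulting structure (class and column names are illustrative; follow the EE features guidelines linked above for the exact conventions):

```ruby
# lib/gitlab/background_migration/backfill_ee_only_data.rb (FOSS: empty no-op class)
module Gitlab
  module BackgroundMigration
    class BackfillEeOnlyData < BatchedMigrationJob
      feature_category :database
      operation_name :update_all

      def perform; end
    end
  end
end

Gitlab::BackgroundMigration::BackfillEeOnlyData.prepend_mod

# ee/lib/ee/gitlab/background_migration/backfill_ee_only_data.rb (EE: real implementation)
module EE
  module Gitlab
    module BackgroundMigration
      module BackfillEeOnlyData
        extend ActiveSupport::Concern

        def perform
          each_sub_batch { |sub_batch| sub_batch.update_all(archived: false) } # illustrative update
        end
      end
    end
  end
end
```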

## Debug

### Viewing failure error logs

You can view failures in two ways:

- Via GitLab logs:

  1. After running a batched background migration, if any jobs fail,
     view the logs in [Kibana](https://log.gprd.gitlab.net/goto/4cb43f40-f861-11ec-b86b-d963a1a6788e).
     View the production Sidekiq log and filter for:

     - `json.new_state: failed`
     - `json.job_class_name: <Batched Background Migration job class name>`
     - `json.job_arguments: <Batched Background Migration job class arguments>`

  1. Review the `json.exception_class` and `json.exception_message` values to help
     understand why the jobs failed.

  1. Remember the retry mechanism. Having a failure does not mean the job failed.
     Always check the last status of the job.

- Via database:

  1. Get the batched background migration `CLASS_NAME`.
  1. Execute the following query in the PostgreSQL console:

     ```sql
     SELECT migration.id, migration.job_class_name, transition_logs.exception_class, transition_logs.exception_message
     FROM batched_background_migrations as migration
     INNER JOIN batched_background_migration_jobs as jobs
     ON jobs.batched_background_migration_id = migration.id
     INNER JOIN batched_background_migration_job_transition_logs as transition_logs
     ON transition_logs.batched_background_migration_job_id = jobs.id
     WHERE transition_logs.next_status = '2' AND migration.job_class_name = 'CLASS_NAME';
     ```

## Testing

Writing tests is required for:

- The batched background migration's queueing migration.
- The batched background migration itself.
- A cleanup migration.

The `:migration` and `schema: :latest` RSpec tags are automatically set for
background migration specs. Refer to the
[Testing Rails migrations](../testing_guide/testing_migrations_guide.md#testing-a-non-activerecordmigration-class)
style guide.
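
A minimal spec sketch for a job class, assuming illustrative class, table, and column names, and using the constructor arguments shown in [Create a regular migration](#create-a-regular-migration):

```ruby
# spec/lib/gitlab/background_migration/my_batched_migration_spec.rb (illustrative)
require 'spec_helper'

RSpec.describe Gitlab::BackgroundMigration::MyBatchedMigration, feature_category: :database do
  let(:projects) { table(:projects) }

  let(:migration) do
    described_class.new(
      start_id: projects.minimum(:id),
      end_id: projects.maximum(:id),
      batch_table: :projects,
      batch_column: :id,
      sub_batch_size: 2,
      pause_ms: 0,
      job_arguments: [],
      connection: ApplicationRecord.connection
    )
  end

  it 'backfills the expected rows' do
    # Seed rows in the old state here, then assert that perform migrates them.
    expect { migration.perform }.to change { projects.where(some_column: nil).count }.to(0)
  end
end
```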

Remember that `before` and `after` RSpec hooks
migrate your database down and up. These hooks can result in other batched background
migrations being called. Using `spy` test doubles with
`have_received` is encouraged, instead of using regular test doubles, because
your expectations defined in an `it` block can conflict with what is
called in RSpec hooks. Refer to [issue #35351](https://gitlab.com/gitlab-org/gitlab/-/issues/18839)
for more details.

## Best practices

1. Know how much data you're dealing with.
1. Make sure the batched background migration jobs are idempotent.
1. Confirm the tests you write are not false positives.
1. If the data being migrated is critical and cannot be lost, the
   clean-up migration must also check the final state of the data before completing.
1. Discuss the numbers with a database specialist. The migration may add
   more pressure on DB than you expect. Measure on staging,
   or ask someone to measure on production.
1. Know how much time is required to run the batched background migration.
1. Be careful when silently rescuing exceptions inside job classes. This may lead to
   jobs being marked as successful, even in a failure scenario.

   ```ruby
   # good
   def perform
     each_sub_batch do |sub_batch|
       sub_batch.update_all(name: 'My Name')
     end
   end

   # acceptable
   def perform
     each_sub_batch do |sub_batch|
       sub_batch.update_all(name: 'My Name')
     rescue Exception => error
       logger.error(message: error.message, class: error.class)

       raise
     end
   end

   # bad
   def perform
     each_sub_batch do |sub_batch|
       sub_batch.update_all(name: 'My Name')
     rescue Exception => error
       logger.error(message: error.message, class: self.class.name)
     end
   end
   ```

1. If possible, update the entire sub-batch in a single query
   instead of updating each model separately.
   This can be achieved in different ways, depending on the scenario.

   - Generate an `UPDATE` query, and use `FROM` to join the tables
     that provide the necessary values
     ([example](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/184051)).
   - Generate an `UPDATE` query, and use `FROM(VALUES( ...))` to
     pass values calculated beforehand
     ([example](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/177993)).
   - Pass all keys and values to `ActiveRecord::Relation#update`.

   ```ruby
   # good
   def perform
     each_sub_batch do |sub_batch|
       connection.execute <<~SQL
         UPDATE fork_networks
         SET organization_id = projects.organization_id
         FROM projects
         WHERE fork_networks.id IN (#{sub_batch.pluck(:id).join(', ')})
           AND fork_networks.root_project_id = projects.id
           AND fork_networks.organization_id IS NULL
       SQL
     end
   end

   # bad
   def perform
     each_sub_batch do |sub_batch|
       sub_batch.each do |fork_network|
         fork_network.update!(organization_id: fork_network.root_project.organization_id)
       end
     end
   end
   ```

### Use of scope_to

When writing a batched background migration class, you have the option to define a `scope_to` block. This block adds an additional qualifier to the query that determines the minimum and maximum range for each batch.

By default, the batching range is determined using the primary key index, which is highly efficient. However, using `scope_to` means the query must consider only rows matching the given condition, potentially impacting performance.

Consider the following simple query:

```sql
SELECT id FROM users WHERE id BETWEEN 1 AND 3000;
```

This query is fast because the `id` column is indexed. PostgreSQL can use an index-only scan to return results efficiently. The query plan might look like this:

```plain
                                                            QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using users_pkey on users  (cost=0.44..307.24 rows=2751 width=4) (actual time=0.016..177.028 rows=2654 loops=1)
   Index Cond: ((id >= 1) AND (id <= 3000))
   Heap Fetches: 219
 Planning Time: 0.183 ms
 Execution Time: 177.158 ms
```

Now, let's apply a scope:

```ruby
scope_to ->(relation) { relation.where(theme_id: 4) }
```

This results in the following query:

```sql
SELECT id FROM users WHERE id BETWEEN 1 AND 3000 AND theme_id = 4;
```

The associated query plan is less efficient:

```plain
                                                         QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Index Scan using users_pkey on users  (cost=0.44..3773.66 rows=10 width=4) (actual time=8.047..2290.528 rows=28 loops=1)
   Index Cond: ((id >= 1) AND (id <= 3000))
   Filter: (theme_id = 4)
   Rows Removed by Filter: 2626
 Planning Time: 1.292 ms
 Execution Time: 2290.582 ms
```

In this case, PostgreSQL uses an index scan on `id` but applies the `theme_id` filter after row access. This causes many rows to be discarded after retrieval, resulting in degraded performance, over 12x slower in this case.

#### When to override

Use `scope_to` **only when the scoped column is indexed**, and ideally, the batching query avoids filtering out rows.

A strong indicator of good performance is the absence of the `Rows Removed by Filter` line in the query plan.

Let's improve performance by indexing the `theme_id` column:

```sql
CREATE INDEX idx_users_theme_id ON users (theme_id);
```

Re-running the same query produces this plan:

```plain
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on users  (cost=691.28..706.53 rows=10 width=4) (actual time=13.532..13.578 rows=28 loops=1)
   Recheck Cond: ((id >= 1) AND (id <= 3000) AND (theme_id = 4))
   Heap Blocks: exact=28
   Buffers: shared hit=41 read=62
   I/O Timings: shared read=0.721
   ->  BitmapAnd  (cost=691.28..691.28 rows=10 width=0) (actual time=13.509..13.511 rows=0 loops=1)
         Buffers: shared hit=13 read=62
         I/O Timings: shared read=0.721
         ->  Bitmap Index Scan on users_pkey  (cost=0.00..45.95 rows=2751 width=0) (actual time=0.390..0.390 rows=2654 loops=1)
               Index Cond: ((id >= 1) AND (id <= 3000))
               Buffers: shared hit=10
         ->  Bitmap Index Scan on idx_users_theme_id  (cost=0.00..645.08 rows=73352 width=0) (actual time=12.933..12.933 rows=69872 loops=1)
               Index Cond: (theme_id = 4)
               Buffers: shared hit=3 read=62
               I/O Timings: shared read=0.721
 Planning:
   Buffers: shared hit=35 read=1 dirtied=2
   I/O Timings: shared read=0.045
 Planning Time: 0.514 ms
 Execution Time: 13.634 ms
```

#### Summary

Use `scope_to` **only** when:

- The scoped column is backed by an index.
- Query plans avoid significant row filtering (`Rows Removed by Filter` is low or absent).
- Batching remains efficient under real data loads.

Otherwise, scoping can drastically reduce performance.

## Examples

### Routes use-case

The `routes` table has a `source_type` field that's used for a polymorphic relationship.
As part of a database redesign, we're removing the polymorphic relationship. One step of
the work is migrating data from the `source_id` column into a new singular foreign key.
Because we intend to delete old rows later, there's no need to update them as part of the
background migration.

1. Start by using the generator to create batched background migration files:

   ```shell
   bundle exec rails g batched_background_migration BackfillRouteNamespaceId --table_name=routes --column_name=id --feature_category=source_code_management
   ```

1. Update the migration job (subclass of `BatchedMigrationJob`) to copy `source_id` values to `namespace_id`:

   ```ruby
   class Gitlab::BackgroundMigration::BackfillRouteNamespaceId < BatchedMigrationJob
     # For illustration purposes, if we were to use a local model we could
     # define it like below, using an `ApplicationRecord` as the base class
     # class Route < ::ApplicationRecord
     #   self.table_name = 'routes'
     # end

     operation_name :update_all
     feature_category :source_code_management

     def perform
       each_sub_batch(
         batching_scope: -> (relation) { relation.where("source_type <> 'UnusedType'") }
       ) do |sub_batch|
         sub_batch.update_all('namespace_id = source_id')
       end
     end
   end
   ```

   {{< alert type="note" >}}

   Job classes inherit from `BatchedMigrationJob` to ensure they are
   correctly handled by the batched migration framework. Any subclass of
   `BatchedMigrationJob` is initialized with the necessary arguments to
   execute the batch, and a connection to the tracking database.

   {{< /alert >}}

1. Create a database migration that adds a new trigger to the database. Example:

   ```ruby
   class AddTriggerToRoutesToCopySourceIdToNamespaceId < Gitlab::Database::Migration[2.1]
     FUNCTION_NAME = 'example_function'
     TRIGGER_NAME = 'example_trigger'

     def up
       execute(<<~SQL)
         CREATE OR REPLACE FUNCTION #{FUNCTION_NAME}() RETURNS trigger
         LANGUAGE plpgsql
         AS $$
         BEGIN
           NEW."namespace_id" = NEW."source_id";
           RETURN NEW;
         END;
         $$;

         CREATE TRIGGER #{TRIGGER_NAME} BEFORE INSERT OR UPDATE
         ON routes
         FOR EACH ROW EXECUTE FUNCTION #{FUNCTION_NAME}();
       SQL
     end

     def down
       drop_trigger(TRIGGER_NAME, :routes)
       drop_function(FUNCTION_NAME)
     end
   end
   ```

1. Update the created post-deployment migration with required batch sizes:

   ```ruby
   class QueueBackfillRoutesNamespaceId < Gitlab::Database::Migration[2.1]
     MIGRATION = 'BackfillRouteNamespaceId'
     BATCH_SIZE = 1000
     SUB_BATCH_SIZE = 100

     restrict_gitlab_migration gitlab_schema: :gitlab_main

     def up
       queue_batched_background_migration(
         MIGRATION,
         :routes,
         :id,
         batch_size: BATCH_SIZE,
         sub_batch_size: SUB_BATCH_SIZE
       )
     end

     def down
       delete_batched_background_migration(MIGRATION, :routes, :id, [])
     end
   end
   ```

   ```yaml
   # db/docs/batched_background_migrations/backfill_route_namespace_id.yml
   ---
   migration_job_name: BackfillRouteNamespaceId
   description: Copies source_id values from routes to namespace_id
   feature_category: source_code_management
   introduced_by_url: "https://mr_url"
   milestone: 16.6
   queued_migration_version: 20231113120650
   finalized_by: # version of the migration that finalized this BBM
   ```

   {{< alert type="note" >}}

   When queuing a batched background migration, you need to restrict
   the schema to the database where you make the actual changes.
   In this case, we are updating `routes` records, so we set
   `restrict_gitlab_migration gitlab_schema: :gitlab_main`. If, however,
   you need to perform a CI data migration, you would set
   `restrict_gitlab_migration gitlab_schema: :gitlab_ci`.

   {{< /alert >}}

   After deployment, our application:

   - Continues using the data as before.
   - Ensures that both existing and new data are migrated.

1. Add a new post-deployment migration that checks that the batched background migration is complete. Also update
   the `finalized_by` attribute in the BBM dictionary with the version of this migration.

   ```ruby
   class FinalizeBackfillRouteNamespaceId < Gitlab::Database::Migration[2.1]
     disable_ddl_transaction!

     restrict_gitlab_migration gitlab_schema: :gitlab_main

     def up
       ensure_batched_background_migration_is_finished(
         job_class_name: 'BackfillRouteNamespaceId',
         table_name: :routes,
         column_name: :id,
         job_arguments: [],
         finalize: true
       )
     end

     def down
       # no-op
     end
   end
   ```

   ```yaml
   # db/docs/batched_background_migrations/backfill_route_namespace_id.yml
   ---
   migration_job_name: BackfillRouteNamespaceId
   description: Copies source_id values from routes to namespace_id
   feature_category: source_code_management
   introduced_by_url: "https://mr_url"
   milestone: 16.6
   queued_migration_version: 20231113120650
   finalized_by: 20231115120912
   ```

   {{< alert type="note" >}}

   If the batched background migration is not finished, the system executes
   the batched background migration inline. If you don't want this behavior,
   pass `finalize: false`.

   {{< /alert >}}

   If the application does not depend on the data being 100% migrated (for
   instance, the data is advisory, and not mission-critical), then you can skip this
   final step. This step confirms that the migration is completed, and all of the rows were migrated.

1. Add a database migration to remove the trigger.

   ```ruby
   class RemoveNamespaceIdTriggerFromRoutes < Gitlab::Database::Migration[2.1]
     FUNCTION_NAME = 'example_function'
     TRIGGER_NAME = 'example_trigger'

     def up
       drop_trigger(TRIGGER_NAME, :routes)
       drop_function(FUNCTION_NAME)
     end

     def down
       # Should reverse the trigger and the function in the up method of the migration that added it
     end
   end
   ```

After the batched migration is completed, you can safely depend on the
data in `routes.namespace_id` being populated.