Migrating a terabyte of Drupal files without the wait

Drupal’s standard file migration copies every managed file byte-for-byte. On a site with a terabyte of documents that turns a routine migration into a multi-day disk-and-bandwidth slog — and makes it nearly impossible to iterate on from a developer’s laptop. Almost all of that copying is avoidable.

The approach is to migrate the file records without the bytes: register each managed-file entity and its URI in Drupal, confirm what is genuinely missing with a cheap HTTP HEAD check, then back-fill the actual files on demand — in bulk ahead of go-live, or a handful at a time on a developer machine. The real data moves once, with the right tool; everything else stays a fast metadata operation. The payoff is a migration you can run and re-run in minutes from a laptop, and a production cutover that pre-seeds the bulk in advance and finishes with a quick delta.

The rest of this article gets into the weeds: the process plugins that register files lazily, the source-plugin HEAD check that finds what is actually missing, the on-demand back-fill, and the local-development workflow that — honestly — is the part that saved the project.

On two recent enterprise Drupal migrations we ran into the same problem from opposite directions. For a national not‑for‑profit, the old and new sites shared a server that simply didn’t have room to hold two copies of every file — and a standard file migration copies every byte, doubling the footprint. For a federal government agency, the file store was on the order of a terabyte, and pushing that volume through PHP would have taken days.

Different constraints — disk space in one case, sheer volume in the other — but the same fix: stop copying files inside the migration at all. And once we stopped treating files as bytes to shovel around, we could do something better with them on the way through.

The bottleneck: the standard migration copies every byte

A conventional file migration uses the file_copy process plugin. For every record it streams the file from the source — often over HTTP — into the new site’s filesystem:

process:
  uri:
    plugin: file_copy
    source:
      - source_full_path   # https://legacy.example.org/sites/default/files/…
      - uri                # public://…

Fine for a few gigabytes on a roomy disk. But notice what file_copy leaves you with: two copies of every file — the source set and the migrated set. If you’re tight on space you can’t afford to double your file footprint just to migrate; if you’ve got a terabyte you can’t afford the time either. It’s also hard on the source server, fragile over long runs, and every test rebuild starts the whole transfer again.

The insight: a file record and its bytes are two separate problems

A file_managed row is just metadata — an ID, a URI, a MIME type. The bytes are a separate concern, and filesystems already have brilliantly optimised, resumable, parallel tools for moving bytes: rsync and friends (mv, hard‑links). PHP doesn’t need to be in that loop.

So we register the file records in place, pointing at the URI where each file will live, and move the bytes out of band.

Step 1 — Register the record, skip the bytes

A small process plugin does the work. In lazy mode it creates the file entity at the destination URI and returns its fid, but never touches the filesystem:

namespace Drupal\acme_migrate\Plugin\migrate\process;

use Drupal\file\Entity\File;
use Drupal\migrate\Attribute\MigrateProcess;
use Drupal\migrate\MigrateExecutableInterface;
use Drupal\migrate\ProcessPluginBase;
use Drupal\migrate\Row;

/**
 * Creates a file entity at a given URI — optionally without copying the binary.
 *
 * With `lazy: true` we register the file_managed record pointing at the
 * destination URI and return its fid, but never write to disk. The bytes
 * arrive later: on demand via Stage File Proxy in dev, or in bulk via rsync at
 * go-live. The URI is kept 1:1 with the legacy path so both can resolve it.
 */
#[MigrateProcess(id: 'local_file_to_entity')]
final class LocalFileToEntity extends ProcessPluginBase {

  public function transform($value, MigrateExecutableInterface $migrate, Row $row, $destination_property): ?int {
    if (empty($value)) {
      return NULL;
    }
    // Keep the original path so Stage File Proxy / rsync can resolve it later.
    $uri = $row->getSourceProperty($this->configuration['destination_uri']);
    if (!$uri) {
      return NULL;
    }

    // Reuse an existing file entity at this URI — never create duplicates.
    $existing = \Drupal::entityTypeManager()->getStorage('file')
      ->loadByProperties(['uri' => $uri]);
    if ($existing) {
      return (int) reset($existing)->id();
    }

    if (!empty($this->configuration['lazy'])) {
      // The whole trick: a real file record, no bytes on disk.
      $file = File::create([
        'uri' => $uri,
        'filename' => basename($uri),
        'status' => 1,
      ]);
      $file->save();
      return (int) $file->id();
    }

    // (Eager mode — copy the binary now — elided. Used only for small sets.)
  }
}
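
The lazy record above carries only a URI, a filename and a status. If you also want a MIME type on the record without touching the disk, it can be guessed from the extension. A minimal standalone sketch, assuming a hypothetical helper name (lazyFileValues) and a deliberately tiny extension map; inside Drupal the file.mime_type.guesser service would normally do this job:

```php
<?php

declare(strict_types=1);

/**
 * Builds the values for a "lazy" file record from its URI alone.
 *
 * Hypothetical helper for illustration: nothing here reads the filesystem,
 * so it works even though the binary has not arrived yet.
 */
function lazyFileValues(string $uri): array {
  // Tiny stand-in map (an assumption); a real site would use a full guesser.
  $mimeByExtension = [
    'pdf' => 'application/pdf',
    'png' => 'image/png',
    'jpg' => 'image/jpeg',
    'jpeg' => 'image/jpeg',
    'gif' => 'image/gif',
  ];
  $extension = strtolower(pathinfo($uri, PATHINFO_EXTENSION));

  return [
    'uri' => $uri,
    'filename' => basename($uri),
    'filemime' => $mimeByExtension[$extension] ?? 'application/octet-stream',
    'status' => 1,
  ];
}
```

The returned array has the same shape as the values passed to File::create() in the plugin, so it could be dropped in there if the extra field is useful.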

The base file migration uses it (or, for plain file_managed rows, simply maps the URI straight through). Either way the migration now writes database rows, not bytes, and finishes in minutes:

id: acme_file
source:
  plugin: d7_file
  scheme: public
process:
  fid: fid
  filename: filename
  filemime: filemime
  status: status
  uid: uid
  created: timestamp
  changed: timestamp
  uri:
    plugin: preserve_existing_uri   # see the gotcha near the end
    source: uri
destination:
  plugin: 'entity:file'
  # No extra flag is needed here: entity:file only saves the record and does not require the binary to exist.

Every node, media and field reference now resolves to a valid file entity with the correct URI. The bytes simply haven’t landed.

Step 2 — Know exactly which files are really coming

Because we never copied anything, we need a precise list of what still has to arrive. A source plugin walks file_managed for rows whose binary isn’t on disk — but “not on disk locally” isn’t the same as “exists at the source.” A file_managed row can be an orphan: it points at a binary the old site never actually had. Left unchecked, those sit on the backfill list forever and masquerade as failed downloads.

So before yielding a row, we do a cheap HTTP HEAD against the origin. No body transfer — just a 200/404 confirming the file is genuinely there and will arrive on the rsync:

namespace Drupal\acme_migrate\Plugin\migrate\source;

use Drupal\migrate\Attribute\MigrateSource;
use Drupal\migrate\Plugin\migrate\source\SourcePluginBase;

/**
 * Yields one row per file_managed entry whose binary is missing on disk
 * AND confirmed to exist at the origin (via a HEAD request).
 */
#[MigrateSource(id: 'missing_files')]
final class MissingFiles extends SourcePluginBase {

  public function fields(): array {
    return [
      'fid' => $this->t('File ID'),
      'uri' => $this->t('Destination URI'),
      'source_url' => $this->t('URL to fetch from'),
    ];
  }

  public function getIds(): array {
    return ['fid' => ['type' => 'integer']];
  }

  public function __toString(): string {
    return 'missing_files';
  }

  protected function initializeIterator(): \Iterator {
    $base = rtrim((string) $this->configuration['base_url'], '/') . '/';
    $db   = \Drupal::database();
    $fs   = \Drupal::service('file_system');
    $http = \Drupal::httpClient();

    $rows = [];
    $files = $db->select('file_managed', 'fm')
      ->fields('fm', ['fid', 'uri'])
      ->condition('fm.uri', $db->escapeLike('public://') . '%', 'LIKE')
      ->execute();

    foreach ($files as $file) {
      // Already on disk? Nothing to fetch.
      $real = $fs->realpath($file->uri);
      if ($real && file_exists($real)) {
        continue;
      }

      $relative = ltrim(substr($file->uri, strlen('public://')), '/');
      // Encode each segment so #, +, spaces, & in filenames survive as a path.
      $encoded = implode('/', array_map('rawurlencode', explode('/', $relative)));
      $url = $base . $encoded;

      // Confirm the origin actually has it before promising it's "coming".
      // A HEAD is cheap (headers only) and weeds out orphaned file_managed
      // rows that point at a binary the source never had.
      try {
        $head = $http->head($url, ['http_errors' => FALSE, 'timeout' => 15]);
      }
      catch (\Throwable $e) {
        continue;
      }
      if ($head->getStatusCode() !== 200) {
        continue; // False positive: a record with no source file. Skip it.
      }

      $rows[(int) $file->fid] = [
        'fid' => (int) $file->fid,
        'uri' => $file->uri,
        'source_url' => $url,
      ];
    }
    ksort($rows);
    return new \ArrayIterator(array_values($rows));
  }

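  public function count($refresh = FALSE): int {
    // Note: this re-runs the full scan, HEAD requests included, so a status
    // check costs as much as an import. Cache the rows in a property if the
    // origin is slow.
    return iterator_count($this->initializeIterator());
  }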
}
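
The per-segment encoding in the plugin above is the step most likely to go wrong, so it is worth isolating and testing on its own. A minimal sketch of that step as a standalone function; the name originUrlForUri and the optional scheme parameter are assumptions for illustration:

```php
<?php

declare(strict_types=1);

/**
 * Turns a stream-wrapper URI into a fetchable origin URL.
 *
 * Each path segment is encoded separately so the "/" separators survive
 * while characters such as "#", "+", "&" and spaces are escaped.
 */
function originUrlForUri(string $uri, string $baseUrl, string $scheme = 'public://'): string {
  if (!str_starts_with($uri, $scheme)) {
    throw new InvalidArgumentException("Expected a $scheme URI, got: $uri");
  }
  $relative = ltrim(substr($uri, strlen($scheme)), '/');
  $encoded = implode('/', array_map('rawurlencode', explode('/', $relative)));
  return rtrim($baseUrl, '/') . '/' . $encoded;
}
```

Encoding the whole relative path in one call would turn every "/" into %2F, which is why the path is split first.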

This is doubly useful for private files — with one wrinkle. The source enforces access control on private://, so a plain HEAD or download comes back 403, not the file. To bring private files through the same machinery, you teach the source to let the migration past its own privacy checks: a small, deliberately narrow endpoint that streams a requested private file to the migration client, locked down hard — shared secret, IP allowlist, and torn out the moment cutover is done.

// On the SOURCE site only: a temporary route that lets the migration read
// private:// files it would otherwise be denied. This bypasses access control
// on purpose, so keep it paranoid — secret + IP allowlist — and delete it
// the day you go live.
public function serve(Request $request, string $path): Response {
  // Only the shared-secret check is shown; the IP allowlist check belongs here too.
  if (!hash_equals($this->migrationSecret, (string) $request->headers->get('X-Migrate-Token'))) {
    throw new AccessDeniedHttpException();
  }
  $uri = 'private://' . $path;
  if (!file_exists($uri)) {
    throw new NotFoundHttpException();
  }
  // BinaryFileResponse answers HEAD as well as GET, so this one endpoint
  // serves both the existence check and the eventual download.
  return new BinaryFileResponse($uri);
}
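
The prose asks for a shared secret and an IP allowlist together. A minimal sketch of that combined gate as a pure function, so it can be unit-tested away from the controller; the function name and parameters are assumptions, not part of the original endpoint:

```php
<?php

declare(strict_types=1);

/**
 * Decides whether a request may use the temporary private-file endpoint.
 *
 * Both checks must pass: a constant-time comparison of the shared secret,
 * and an exact match of the client IP against a short allowlist.
 */
function isMigrationRequestAllowed(string $secret, ?string $token, ?string $clientIp, array $allowedIps): bool {
  // An empty configured secret must never authorise anything.
  if ($secret === '' || $token === null || !hash_equals($secret, $token)) {
    return false;
  }
  return $clientIp !== null && in_array($clientIp, $allowedIps, true);
}
```

In a controller, the token would come from the X-Migrate-Token header and the client IP from Symfony's $request->getClientIp().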

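Point base_url at that route, send the token as a header on both the HEAD check and the download (the core download plugin accepts request options through guzzle_options), and let MissingFiles select private:// rows instead of the public:// prefix hard-coded in the listing above. With that, the same HEAD-verify and backfill flow covers private files too, with no other special-casing in the migration. The migration only ever asks “does this exist?” and “give me the bytes”; it just needs a door the source will open.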

Step 3 — Backfill on demand, and remember what you fetched

A second migration consumes that list, downloads each file to its URI, and records the fid in the migrate map so re‑runs skip what’s already there. The destination is deliberately a no‑op — the write is a side effect of the download step; the destination just bookkeeps:

namespace Drupal\acme_migrate\Plugin\migrate\destination;

use Drupal\migrate\Attribute\MigrateDestination;
use Drupal\migrate\Plugin\migrate\destination\DestinationBase;
use Drupal\migrate\Row;

/**
 * No-op destination: records downloaded fids in the migrate map so re-runs
 * skip them. Never deletes files on rollback.
 */
#[MigrateDestination(id: 'file_download_tracker')]
final class FileDownloadTracker extends DestinationBase {

  public function import(Row $row, array $old = []): array {
    return [(int) $row->getDestinationProperty('fid')];
  }

  public function rollback(array $ids): void {
    // Intentionally empty — never delete files on rollback.
  }

  public function getIds(): array {
    return ['fid' => ['type' => 'integer']];
  }

  public function fields(): array {
    return ['fid' => $this->t('File ID')];
  }
}

The migration that drives it pairs the missing_files source with that tracker:

id: acme_file_download
label: 'Backfill missing file binaries'
source:
  plugin: missing_files
  base_url: 'https://legacy.example.org/sites/default/files/'
process:
  fid: fid
  _fetch:
    plugin: download           # writes source_url → uri as a side effect
    source:
      - source_url
      - uri
destination:
  plugin: file_download_tracker

The same migration does two very different jobs depending on what you point it at — which is exactly what makes local development bearable.

Step 4 — Local development & testing: the part that actually saved the project

Here’s the thing about a migration: you don’t run it once. You run it hundreds of times — tweak a process plugin, drop the database, re‑import, eyeball the result, repeat. The entire development loop depends on rebuilding the site cheaply and often.

Now picture doing that with the files attached. Even setting the terabyte aside, pulling hundreds of gigabytes onto a developer’s machine, over home wifi or a small-office link, isn’t a one-time annoyance; it’s a permanent tax. At a typical residential ~50 Mbps, 500 GB takes roughly 22 hours of sustained transfer, and a terabyte takes nearly two days and won’t even fit on most laptops. Multiply that across a team and a CI pipeline and the migration simply isn’t developable.
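
The arithmetic behind those download times is easy to check. A minimal sketch, assuming decimal units and a perfectly sustained link; transferHours is a hypothetical helper name:

```php
<?php

declare(strict_types=1);

/**
 * Estimates transfer time in hours for a payload over a sustained link.
 *
 * Uses decimal units (1 GB = 8,000 megabits) and ignores protocol overhead,
 * so real downloads will be somewhat slower.
 */
function transferHours(float $gigabytes, float $megabitsPerSecond): float {
  $megabits = $gigabytes * 8000;
  return $megabits / $megabitsPerSecond / 3600;
}

printf("500 GB at 50 Mbps: %.1f hours\n", transferHours(500, 50));
printf("1 TB at 50 Mbps: %.1f hours\n", transferHours(1000, 50));
```

That works out to about 22 hours for 500 GB and about 44 hours for a terabyte, per developer and per fresh environment.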

The lazy records make this a non‑issue. Locally, the site has every file entity — so the migration runs end‑to‑end, references resolve, and pages render structurally — but none of the bytes. Stage File Proxy then fetches just the files a developer actually opens, the first time they’re requested, and caches them locally:

composer require drupal/stage_file_proxy
drush en stage_file_proxy -y
drush config:set stage_file_proxy.settings origin 'https://legacy.example.org' -y
drush config:set stage_file_proxy.settings hotlink false --input-format=yaml -y   # a real boolean false: cache, don't hotlink

A developer building the article template pulls down only the images and documents on the pages they’re testing (a few hundred megabytes at most) and nothing else. Drop the database and re-import for the fiftieth time that day and there’s nothing to re-download; the cached files are still there. A new environment self-populates as you click around, and CI runs against the same lazy set without a terabyte in sight. (Want a known set eagerly, say for a visual-regression run? Point acme_file_download at a filtered list and let it pull exactly those.)

That’s the difference between a migration you can iterate on from a laptop on home wifi and one that’s chained to the office network.

Step 5 — Go‑live: let rsync do what rsync is good at

The real bytes move once, with the right tool. Pre‑seed days ahead so the bulk is already in place, then run a fast delta inside the maintenance window:

# Days before cutover — copy the bulk while the old site is still live.
rsync -aH --info=progress2 \
  deploy@legacy:/var/www/legacy/sites/default/files/ \
  /var/www/new/web/sites/default/files/

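# During the maintenance window: only what changed since the pre-seed.
# Careful: --delete also removes anything the new site has written under
# files/ (image-style derivatives, aggregated CSS/JS) that the legacy tree lacks.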
rsync -aH --delete --info=progress2 \
  deploy@legacy:/var/www/legacy/sites/default/files/ \
  /var/www/new/web/sites/default/files/

On a single, space-constrained server you don’t even want the temporary doubling a copy implies. Move or hard-link the bytes into place instead (mv, cp -al, or rsync --remove-source-files), so you never hold two full copies at once; hard links and cheap renames only work when both trees sit on the same filesystem. Where the new site is on its own disk or host, the rsync above is the right call. Either way, because the file records were migrated weeks earlier, there’s nothing to re-run at cutover: the delta finishes in minutes and every reference already resolves.
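
Why a hard link avoids the second copy can be shown in a few lines. A minimal sketch using PHP's link() in a throwaway temp directory; the paths are invented for the demo, and it assumes the temp directory sits on a filesystem that supports hard links:

```php
<?php

declare(strict_types=1);

// Throwaway sandbox under the system temp directory.
$dir = sys_get_temp_dir() . '/hardlink-demo-' . uniqid();
mkdir("$dir/legacy", 0777, true);
mkdir("$dir/new", 0777, true);

$source = "$dir/legacy/report.pdf";
$target = "$dir/new/report.pdf";
file_put_contents($source, str_repeat('x', 1024));

// link() adds a second directory entry for the same inode: no bytes are copied.
link($source, $target);
clearstatcache();

$sameInode = fileinode($source) === fileinode($target);
$linkCount = stat($source)['nlink'];

// Removing the legacy name leaves the data reachable under the new name.
unlink($source);
clearstatcache();
$stillReadable = filesize($target) === 1024;

echo $sameInode ? "same inode\n" : "different inode\n";

// Clean up the sandbox.
unlink($target);
rmdir("$dir/legacy");
rmdir("$dir/new");
rmdir($dir);
```

Both names point at one set of blocks, so disk usage stays flat until the last name is removed. This is the same effect cp -al produces across a whole tree.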

While we were in there: modernising the files

Decoupling records from bytes did more than make the migration fast — it freed us to treat each file as a decision rather than a copy job. The old sites used Drupal’s classic approach: plain file fields and raw WYSIWYG uploads. The new build is media‑first. So on the way through, every file got sorted into what it actually is.

File fields → media entities

A file ID becomes (or finds) a media entity, deduplicated so one file never spawns two media:

#[MigrateProcess(id: 'file_to_media')]
final class FileToMedia extends ProcessPluginBase {
  // …inject the entity_type.manager service; Media is Drupal\media\Entity\Media…

  public function transform($value, MigrateExecutableInterface $m, Row $row, $dp): mixed {
    if (empty($value)) {
      return NULL;
    }
    $fid = (int) $value;

    // Reuse an existing media for this file — don't create duplicates.
    $existing = $this->entityTypeManager->getStorage('media')->loadByProperties([
      'bundle' => 'image',
      'field_media_image.target_id' => $fid,
    ]);
    if ($existing) {
      return reset($existing)->id();
    }

    $file = $this->entityTypeManager->getStorage('file')->load($fid);
    if (!$file) {
      return NULL;
    }
    $media = Media::create([
      'bundle' => 'image',
      'name' => $file->getFilename(),
      'field_media_image' => ['target_id' => $fid, 'alt' => $file->getFilename()],
      'status' => 1,
    ]);
    $media->save();
    return $media->id();
  }
}

In the node migration, the plugin is chained after the file lookup:

process:
  field_featured_image/target_id:
    - plugin: migration_lookup
      migration: acme_file        # the lazy file records from Step 1
      source: field_image/0/fid
    - plugin: file_to_media        # wrap that file in a media entity
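
The plugin above hard-codes the image bundle. Once documents and other file types pass through the same step, the bundle has to be chosen per file. A minimal sketch of that choice, assuming bundle names modelled on Drupal's Standard profile; mediaBundleForMime is a hypothetical helper:

```php
<?php

declare(strict_types=1);

/**
 * Picks a media bundle for a file based on its MIME type.
 *
 * The bundle names are assumptions (image, video, audio, document);
 * adjust them to the bundles that exist on your site.
 */
function mediaBundleForMime(string $mime): string {
  // Drop any "; charset=..." parameter and normalise case first.
  $type = strtolower(trim(explode(';', $mime)[0]));
  return match (true) {
    str_starts_with($type, 'image/') => 'image',
    str_starts_with($type, 'video/') => 'video',
    str_starts_with($type, 'audio/') => 'audio',
    default => 'document',
  };
}
```

The source field differs per bundle as well, so the field name used in Media::create() would need the same kind of lookup.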

The files Drupal never managed

Inline content was harder. Years of editors had uploaded images and PDFs straight through the WYSIWYG, so the file sat on disk and was referenced inline — an <img src> here, an <a href> to a document there — but Drupal had no file entity for it at all. Nothing in file_managed to migrate.

So we had to discover them by parsing every body field. The catch: minting a file/media entity needs metadata (MIME, size), and we were explicitly not downloading bytes — some inline PDFs were hundreds of megabytes. The answer is the same HTTP HEAD trick, which returns the headers — Content-Type, Content-Length, and a 200/404 — without transferring the body:

/**
 * Cheaply describe a referenced file via a HEAD request — no download.
 * Returns ['mime' => string, 'size' => int], or NULL if the origin doesn't have it (404).
 */
protected function headFile(string $url): ?array {
  try {
    $response = $this->httpClient->head($url, [
      'http_errors' => FALSE,
      'allow_redirects' => TRUE,
      'timeout' => 15,
    ]);
  }
  catch (\Throwable $e) {
    return NULL;
  }
  if ($response->getStatusCode() !== 200) {
    return NULL;
  }
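  // Keep only the media type: the header may carry a "; charset=..." parameter.
  $mime = trim(explode(';', $response->getHeaderLine('Content-Type'))[0]);
  return [
    'mime' => $mime ?: 'application/octet-stream',
    'size' => (int) ($response->getHeaderLine('Content-Length') ?: 0),
  ];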
}
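
Before any HEAD request can be made, the references have to be found in the body markup. A minimal sketch of that discovery step using PHP's DOM extension; the function name and the default files prefix are assumptions, and inside Drupal the Html utility class would be a natural way to load the markup:

```php
<?php

declare(strict_types=1);

/**
 * Lists the file paths referenced by <img src> and <a href> in an HTML body.
 *
 * Only references under $filesPrefix are returned, in document order and
 * without duplicates. Nothing is downloaded.
 */
function extractFileReferences(string $html, string $filesPrefix = '/sites/default/files/'): array {
  $previous = libxml_use_internal_errors(true);   // Tolerate sloppy legacy markup.
  $doc = new DOMDocument();
  // The XML declaration is a common trick to make libxml read the input as UTF-8.
  $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  $paths = [];
  $xpath = new DOMXPath($doc);
  foreach ($xpath->query('//img/@src | //a/@href') as $attribute) {
    // Absolute and root-relative URLs both reduce to a path here.
    $path = parse_url($attribute->nodeValue, PHP_URL_PATH);
    if (is_string($path) && str_starts_with($path, $filesPrefix)) {
      $paths[rawurldecode($path)] = true;
    }
  }
  return array_keys($paths);
}
```

Each returned path can then be turned into an origin URL and passed to headFile() to collect its MIME type and size.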

The body processor walks each reference, HEADs the origin, creates a lazy media (bytes still come on the rsync), and rewrites the markup. Crucially, inline references don’t go back as <img>/<a> — they become CKEditor 5 media embeds pointing at a media entity by UUID:

#[MigrateProcess(id: 'file_to_media_embed')]
final class FileToMediaEmbed extends ProcessPluginBase {

  public function transform($value, MigrateExecutableInterface $m, Row $row, $dp): ?string {
    $media = $this->lookupMediaForFile($value);   // via migration_lookup
    if (!$media) {
      return NULL;
    }
    return '<drupal-media data-entity-type="media" data-entity-uuid="'
      . $media->uuid() . '"></drupal-media>';
  }
}
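
The plugin above produces the embed tag; the surrounding rewrite still has to swap it into the body. A minimal sketch of that swap with PHP's DOM extension, assuming the map from original src to media UUID has already been built (here it is simply passed in), and using a hypothetical function name:

```php
<?php

declare(strict_types=1);

/**
 * Replaces <img> tags with <drupal-media> embeds where a media UUID is known.
 *
 * $uuidBySrc maps an original src attribute to a media UUID; in a migration
 * that map would come from a lookup, here it is passed in for illustration.
 */
function embedMediaForImages(string $html, array $uuidBySrc): string {
  $previous = libxml_use_internal_errors(true);
  $doc = new DOMDocument();
  $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  $xpath = new DOMXPath($doc);
  // Snapshot the matches before mutating the tree.
  foreach (iterator_to_array($xpath->query('//img[@src]')) as $img) {
    $uuid = $uuidBySrc[$img->getAttribute('src')] ?? null;
    if ($uuid === null) {
      continue;   // Unknown image: leave the original markup untouched.
    }
    $embed = $doc->createElement('drupal-media');
    $embed->setAttribute('data-entity-type', 'media');
    $embed->setAttribute('data-entity-uuid', $uuid);
    $img->parentNode->replaceChild($embed, $img);
  }

  // Serialise only the children of the wrapper <div>.
  $wrapper = $xpath->query('/html/body/div')->item(0);
  $out = '';
  foreach ($wrapper->childNodes as $child) {
    $out .= $doc->saveHTML($child);
  }
  return $out;
}
```

Working on the DOM rather than with regular expressions keeps attribute quoting and nesting intact, at the cost of libxml normalising the markup slightly on the way out.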

A nice side effect: that HEAD pass doubles as a link checker — anything that 404s was a dead reference in the old site, logged and dropped rather than carried forward broken.
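
One caution when using the HEAD pass as a link checker: the listings above treat a timeout the same way as a 404, so a flaky origin could cause live references to be dropped. A minimal sketch of a stricter classification, offered as a variation rather than the approach used on these projects; the category names are illustrative:

```php
<?php

declare(strict_types=1);

/**
 * Sorts a HEAD result into what the migration should do with the reference.
 *
 * $status is the HTTP status code, or null when the request itself failed
 * (timeout, DNS error, connection reset).
 */
function classifyHeadResult(?int $status): string {
  return match (true) {
    $status === null => 'retry',                  // No answer is not the same as "missing".
    $status === 200 => 'present',
    $status === 404, $status === 410 => 'dead',   // Genuinely gone: log and drop.
    $status === 401, $status === 403 => 'denied', // Exists but guarded, e.g. private files.
    $status === 429, $status >= 500 => 'retry',   // Origin is struggling: try again later.
    default => 'review',                          // Anything unexpected gets a human look.
  };
}
```

Only the 'dead' bucket is safe to drop automatically; 'retry' and 'denied' references are better kept and revisited.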

Knowing what isn’t content: images that became icons

Not every “file” deserves to be media. A surprising share of the old images weren’t content at all — they were interface furniture: the same handful of bullet dots, arrows and dividers, dropped inline across thousands of pages. Carrying those forward as media would’ve been wasteful (thousands of references to a dozen identical 1 KB GIFs) and wrong — they’re presentation. Drupal’s new Icon API is the right home:

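# acme_theme.icons.yml: one pack the theme and editors share.
acme:
  label: Acme icons
  extractor: path
  config:
    sources:
      - icons/{icon_id}.svg
  # A pack also needs a template; this minimal one is an assumption, adjust it to your markup.
  template: '<img src="{{ source }}" width="{{ size|default(16) }}" height="{{ size|default(16) }}" alt="" />'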

In the body processor, known iconography is handled before the “make it media” fallback — pure decoration is dropped (the new theme does list markers in CSS), meaningful glyphs become an icon:

// Decorative furniture (bullets, spacers, dividers): drop the <img> entirely.
if (in_array($ref->basename, $this->decorativeImages, TRUE)) {
  $html = $this->removeReference($html, $ref);
  continue;
}
// Meaningful glyphs (arrow-right.gif, tick.png…): swap for a themeable icon.
if ($iconId = $this->iconMap[$ref->basename] ?? NULL) {
  $html = $this->replaceWithIcon($html, $ref, $iconId);   // <img> → {{ icon(...) }}
  continue;
}
// …otherwise fall through to the HEAD-and-mediafy path above.
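
Text formats do not evaluate Twig, so inside migrated body HTML replaceWithIcon() has to emit a placeholder that an icon text filter understands. In theme templates the same icon is a one-line Twig call:

{{ icon('acme', 'arrow-right', { size: 16 }) }}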

The payoff is threefold: far fewer entities (one icon vs. thousands of duplicate references), cleaner editorial content (no spacer images baked into the HTML), and a maintainable icon set the design system owns. Same instinct as the HEAD step — work out what each file really is — taken to its conclusion: some of them shouldn’t be files at all.

A gotcha worth the plugin: keep URIs stable across re‑runs

If your file migration runs on a schedule (nightly --update), mapping uri: uri rewrites the live URI back to the legacy path every run. If anything relocates files after import, you get a nightly flip-flop, and EXISTS_RENAME (FileExists::Rename on Drupal 10.3 and later) quietly leaks name_1.pdf, name_2.pdf duplicates onto disk. A tiny process plugin fixes it by returning the current URI for an already-imported fid, falling back to the source path only for brand-new files:

#[MigrateProcess(id: 'preserve_existing_uri')]
final class PreserveExistingUri extends ProcessPluginBase {
  // …inject the database service…
  public function transform($value, MigrateExecutableInterface $m, Row $row, $dp): string {
    $fid = (int) $row->getSourceProperty('fid');
    $existing = $fid ? $this->database->select('file_managed', 'fm')
      ->fields('fm', ['uri'])->condition('fid', $fid)->execute()->fetchField() : FALSE;
    return (string) ($existing !== FALSE ? $existing : $value);
  }
}

Why this works so well

  • No second copy. The migration never duplicates a file, so you don’t need room for two archives. Move or hard‑link the bytes and your footprint stays flat — sometimes the difference between a migration that fits on the server and one that doesn’t.
  • PHP stays out of the byte path. The migration writes database rows; the filesystem tools move bytes. Each does the job it’s good at.
  • Local dev stays tiny. Lazy records + Stage File Proxy mean developers and CI pull only the files they actually view — no terabyte over home wifi, no re‑download after every rebuild.
  • Resumable and idempotent. rsync resumes; the record migrations roll back and re‑run in minutes; the download tracker means you never fetch the same file twice.
  • You modernise for free. Because each file is a decision rather than a copy, the same pass turns file fields and unmanaged WYSIWYG uploads into a clean media library, and demotes a decade of spacer GIFs into a proper icon set.
  • A short, predictable cutover. The big transfer happens before go‑live; the window only needs a delta.

Worth knowing before you try it

  • Paths have to line up. This assumes the legacy public:// layout maps onto the new site. Re‑organising the files directory means transforming the URI (and the rsync source) to match.
  • Private files need the same record‑then‑sync treatment under private://, but the source won’t serve them anonymously. Add a locked‑down, temporary endpoint on the legacy site (shared secret + IP allowlist) so the migration’s HEAD checks and any on‑demand fetches can reach them — then delete it at cutover. Stage File Proxy can’t proxy private://, so in dev these rely on that endpoint or the rsync.
  • Image styles regenerate from the originals on demand, so derivatives take care of themselves once the source images land.

The pattern is simple once you name it: migrate the references, sync the bytes — and decide what each file really is on the way through. Keep PHP out of the business of moving a terabyte (or duplicating a file store that won’t fit twice), and a migration that looked like a multi‑day ordeal becomes a fast, repeatable, low‑drama deploy you can develop from a laptop — and your content comes out the other side cleaner than it went in.