Cache expensive scrapes across runs

Most pages don't change between scrapes. A small named key-value store, used as a cross-run cache, lets your actor skip the pages it pulled recently and re-fetch only those whose cached copy has expired. This guide gives you a drop-in cachedFetch(url) helper in JavaScript.

Why bother

Imagine an actor that scrapes 10,000 product pages every day. In practice, maybe 1,000 of them changed since yesterday — the other 9,000 returned bytes you already had. Without a cache you're paying compute, proxy bandwidth, and runtime for the full 10k every single run. You're throwing away 90% of the bill.

Caching across runs flips that ratio. The first run is full price; every later run pays only for pages whose cached copy is missing or expired. In the example above, that's up to a 10x reduction in cost without changing your output.

What to cache (and what not to)

Good cache candidates — pages where the underlying data changes slowly:

  • Company profiles, about pages, team listings.
  • Product specs, dimensions, descriptions (separate from price/stock).
  • Archived blog posts, news from > 24h ago.
  • Reference taxonomies — category trees, country/region lists.

Don't cache:

  • Time-sensitive data. Prices, inventory levels, live scores, currency rates. Caching these will silently ship your users stale numbers.
  • User-specific data. Anything behind a login or personalized by cookies. Two users will see each other's data.
  • Anything with a privacy or compliance obligation. If a user has a right to deletion, a cache full of their data is a liability.

When in doubt, split the page: cache the slow part (specs), live-fetch the fast part (price).
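
Here is a minimal sketch of that split. It assumes the product exposes specs and price at separate URLs (the hypothetical specsUrl and priceUrl), and it takes the two fetchers as arguments so it runs without the Apify SDK. In a real actor, fetchCached would be this guide's cachedFetch helper and fetchLive a thin wrapper around fetch that returns the same { data, fromCache } shape.

```javascript
// Sketch: cache the slow-moving part, always live-fetch the fast-moving part.
// `fetchCached` and `fetchLive` are injected (hypothetical names); both are
// assumed to resolve to { data, fromCache }.
async function scrapeProduct({ specsUrl, priceUrl }, { fetchCached, fetchLive }) {
  const [specs, price] = await Promise.all([
    fetchCached(specsUrl, { ttlMs: 7 * 24 * 60 * 60 * 1000 }), // specs: 7-day TTL
    fetchLive(priceUrl), // price: never read from or written to the cache
  ]);

  return {
    specs: specs.data,
    price: price.data,
    specsFromCache: specs.fromCache,
  };
}
```

The price request still costs a fetch on every run, but it is the only one that does.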

The cache helper

Drop this file alongside your main entry point. It uses a named key-value store called scrape-cache so cached responses persist across runs but stay out of your default storage tab.

import { Actor, log } from 'apify';
import crypto from 'node:crypto';

const CACHE_STORE = 'scrape-cache';
const CACHE_VERSION = 'v1'; // bump to invalidate everything at once

// Hash the URL so the key stays short and uses only store-safe characters.
function cacheKey(url) {
  const hash = crypto.createHash('sha256').update(url).digest('hex').slice(0, 32);
  return `${CACHE_VERSION}-${hash}`;
}

/**
 * Fetch a URL, returning a cached copy if one exists within `ttlMs`.
 * Returns { data, fromCache }.
 */
export async function cachedFetch(url, { ttlMs = 24 * 60 * 60 * 1000 } = {}) {
  const store = await Actor.openKeyValueStore(CACHE_STORE);
  const key = cacheKey(url);

  // Entries are { cachedAt, data }; anything older than ttlMs counts as a miss.
  const cached = await store.getValue(key);
  if (cached && Date.now() - cached.cachedAt < ttlMs) {
    log.info(`Cache hit: ${url}`);
    return { data: cached.data, fromCache: true };
  }

  log.info(`Cache miss: ${url}`);
  const response = await fetch(url);
  const data = await response.text();

  // Only cache successful responses; a stored 404 or 500 would be served until the TTL runs out.
  if (response.ok) {
    await store.setValue(key, { cachedAt: Date.now(), data });
  }
  return { data, fromCache: false };
}

The contract: cachedFetch(url) returns { data, fromCache }. Treat fromCache as advisory — push it to your dataset if you want clients to know whether a row was reused.

Wire it into your actor

Call cachedFetch(url) wherever you called fetch(url), and read the body from data instead of calling response.text(). That's the whole change.

import { Actor } from 'apify';
import { cachedFetch } from './cache.js';

await Actor.init();

const input = (await Actor.getInput()) ?? {};
const urls = input.urls ?? ['https://example.com'];

for (const url of urls) {
  const { data, fromCache } = await cachedFetch(url, {
    ttlMs: 24 * 60 * 60 * 1000, // 24 hours
  });

  const title = data.match(/<title[^>]*>([^<]+)<\/title>/i)?.[1]?.trim() ?? '';
  await Actor.pushData({ url, title, fromCache });
}

await Actor.exit();

TTL strategy

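Your TTL states how much staleness you'll tolerate. Pick the longest TTL your users won't notice; every hour you add means more cache hits. On a fixed schedule, keep the TTL comfortably longer than the gap between runs: an entry written by yesterday's run is about 24 hours old when today's run reads it, so a 24-hour TTL on a daily schedule expires many entries just before they would be reused.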

Three tiers cover most actors:

  • 1 hour — news, social posts, anything where users expect freshness.
  • 24 hours — most things. Product descriptions, profiles, articles.
  • 7 days — reference data, archived content, anything you'd describe as “basically static.”

// News, social posts, anything that moves hourly.
await cachedFetch(url, { ttlMs: 60 * 60 * 1000 });          // 1 hour

// Most product pages, profiles, articles.
await cachedFetch(url, { ttlMs: 24 * 60 * 60 * 1000 });     // 24 hours

// Reference data — country lists, archived posts, specs.
await cachedFetch(url, { ttlMs: 7 * 24 * 60 * 60 * 1000 }); // 7 days

If your actor has a mix of page types, pass the right TTL per call rather than picking one global value.
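
One way to pick the TTL per call is a small lookup keyed on the URL path. This is a sketch, not part of the helper: ttlFor, TTL_RULES, and the path patterns are made up for illustration, so swap in patterns that match the site you scrape.

```javascript
const HOUR_MS = 60 * 60 * 1000;

// Hypothetical path patterns mapped to the three tiers above. First match wins.
const TTL_RULES = [
  { pattern: /\/(news|feed)\//, ttlMs: 1 * HOUR_MS },
  { pattern: /\/(docs|archive|countries)\//, ttlMs: 7 * 24 * HOUR_MS },
];
const DEFAULT_TTL_MS = 24 * HOUR_MS; // "most things"

function ttlFor(url) {
  const { pathname } = new URL(url);
  const rule = TTL_RULES.find(({ pattern }) => pattern.test(pathname));
  return rule ? rule.ttlMs : DEFAULT_TTL_MS;
}

console.log(ttlFor('https://example.com/news/today'));    // 3600000
console.log(ttlFor('https://example.com/products/42'));   // 86400000
console.log(ttlFor('https://example.com/archive/2019/x')); // 604800000
```

The call site then becomes cachedFetch(url, { ttlMs: ttlFor(url) }).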

Cache busting

If you cache parsed output, any change to your scraping logic (a new selector, a different parser, an extra extracted field) makes every cached entry wrong. Raw HTML entries stay valid, since they are parsed again on every run. To invalidate, bump CACHE_VERSION from 'v1' to 'v2'. The key prefix changes, every old key misses, and the cache rebuilds itself on the next run. No manual cleanup, no migration script.

Treat the version constant the same way you'd treat a schema migration: bump it any time the shape of what you store changes.
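
To see why a version bump invalidates everything, here is the same key derivation as the helper's cacheKey, with the version passed in as an argument so two versions can be compared side by side. The name cacheKeyFor and the sample URL are illustrative only.

```javascript
import { createHash } from 'node:crypto';

// Same derivation as the helper's cacheKey, but with an explicit version.
function cacheKeyFor(version, url) {
  const hash = createHash('sha256').update(url).digest('hex').slice(0, 32);
  return `${version}-${hash}`;
}

const url = 'https://example.com/products/42';
const oldKey = cacheKeyFor('v1', url);
const newKey = cacheKeyFor('v2', url);

console.log(oldKey === newKey);                   // false: v2 lookups never find v1 entries
console.log(oldKey.slice(3) === newKey.slice(3)); // true: only the prefix differs
```

The old v1- entries are not deleted; they just stop being read.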

Storage and size

KV-store values have a size limit — roughly 9 MB per value last I checked. For most scrapes that's plenty, but it's easy to blow through if you're caching raw HTML on image-heavy pages.

Cache the cheapest representation that still saves you the work:

  • If you only need a few fields, parse first and cache the JSON.
  • If you really need the HTML, gzip it before storing.
  • Never cache binary assets (screenshots, PDFs, images) in the same store. Use a dedicated store with a different eviction strategy.
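
If you do need the raw HTML, a sketch of the gzip step using Node's built-in zlib looks like this. The helper names packHtml and unpackHtml are made up for this example; the compressed bytes are base64-encoded so they can sit in the same { cachedAt, data } JSON record the helper already stores.

```javascript
import { gzipSync, gunzipSync } from 'node:zlib';

// Compress HTML and encode it as base64 so it fits inside a JSON record.
function packHtml(html) {
  return gzipSync(Buffer.from(html, 'utf8')).toString('base64');
}

// Reverse of packHtml: base64 -> gzip bytes -> original string.
function unpackHtml(packed) {
  return gunzipSync(Buffer.from(packed, 'base64')).toString('utf8');
}

const html = '<html><body>' + '<p>repeated markup</p>'.repeat(1000) + '</body></html>';
const packed = packHtml(html);

console.log(packed.length < html.length); // true for repetitive markup like this
console.log(unpackHtml(packed) === html); // true
```

In the helper you would store { cachedAt, data: packHtml(html) } and call unpackHtml(cached.data) on a hit. Base64 adds roughly a third on top of the compressed size, which is usually still far smaller than the raw HTML.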

Gotchas worth knowing

  • The cache is per actor, not per user. Two paying users querying the same URL share cached data. That's usually a feature (consistent results) but be careful with personalized pages.
  • KV-store values have a size limit. Roughly 9 MB per value. Cache parsed JSON or compressed HTML rather than raw page bytes for image-heavy pages.
  • No automatic eviction. Stale entries sit forever until you overwrite them. For high-cardinality actors, periodically prune or accept the cost — KV storage is cheap.
  • The cache doesn't speed up Crawlee. Crawlee has its own request deduplication via the request queue; this cache is for cross-run reuse, not within-run dedup.
  • TTL is on read, not write. A value written 25 hours ago with a 24-hour TTL is fetched fresh. The cache never “expires” entries on its own — bump the version to invalidate broadly.
  • Don't cache credentials. A cache key is just sha256(url). If your URLs have tokens in query strings, you're caching the token-bearing response. Strip auth tokens before hashing.
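
For the last gotcha, one way to strip tokens before hashing is to normalize the URL first. This is a sketch under assumptions: the parameter names in SENSITIVE_PARAMS are guesses, not a complete list, and normalizeForCache is a hypothetical helper rather than part of the file above. It also only changes the key. The response fetched with one token is then served for every other token, so use it only when the page content does not depend on who is asking.

```javascript
import { createHash } from 'node:crypto';

// Hypothetical list of sensitive query parameters; extend it for your targets.
const SENSITIVE_PARAMS = ['token', 'access_token', 'api_key', 'signature'];

function normalizeForCache(rawUrl) {
  const url = new URL(rawUrl);
  for (const name of SENSITIVE_PARAMS) url.searchParams.delete(name);
  url.searchParams.sort(); // parameter order should not change the key
  url.hash = '';           // fragments are never sent to the server
  return url.toString();
}

function cacheKey(rawUrl, version = 'v1') {
  const hash = createHash('sha256')
    .update(normalizeForCache(rawUrl))
    .digest('hex')
    .slice(0, 32);
  return `${version}-${hash}`;
}

console.log(normalizeForCache('https://example.com/p?id=1&token=abc#top'));
// https://example.com/p?id=1
```

Keep fetching with the original URL; only the key is derived from the normalized one.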
