Cache expensive scrapes across runs

Most pages don't change between scrapes. A small named key-value store, used as a cross-run cache, lets your actor skip the pages it pulled recently and re-fetch only those whose cached copy has expired. This guide gives you a drop-in cachedFetch(url) helper in JavaScript and a matching cached_fetch(url) in Python.

Why bother

Imagine an actor that scrapes 10,000 product pages every day. In practice, maybe 1,000 of them changed since yesterday. The other 9,000 returned bytes you already had. Without a cache you're paying compute, proxy bandwidth, and runtime for the full 10k every single run. You're throwing away 90% of the bill.

Caching across runs flips that ratio. The first run is full price; every subsequent run pays only for the pages that miss the cache. On a daily schedule with a 10% miss rate, that's roughly a 10x reduction in per-run cost without changing your output.
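
To put rough numbers on that, here is a back-of-the-envelope sketch. The 10,000 pages and 1,000 changes per run come from the example above; the 30-run window and the function name pages_fetched are assumptions made for illustration, and it assumes only the changed pages miss the cache.

```python
def pages_fetched(total_pages: int, changed_per_run: int, runs: int, cached: bool) -> int:
    """Count page fetches over `runs` runs, with or without a cross-run cache."""
    if not cached:
        return total_pages * runs
    # The first run fills the cache at full price; later runs fetch only the misses.
    return total_pages + changed_per_run * (runs - 1)


without_cache = pages_fetched(10_000, 1_000, runs=30, cached=False)
with_cache = pages_fetched(10_000, 1_000, runs=30, cached=True)
print(without_cache, with_cache, round(without_cache / with_cache, 1))
# prints: 300000 39000 7.7
```

Each run after the first is 10x cheaper, but because the first run is full price the saving over 30 runs works out closer to 7.7x. It creeps toward 10x the longer the cache stays valid.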

What to cache (and what not to)

Good cache candidates are pages where the underlying data changes slowly:

Company profiles, about pages, team listings.
Product specs, dimensions, descriptions (separate from price/stock).
Archived blog posts, news more than 24 hours old.
Reference taxonomies: category trees, country/region lists.

Don't cache:

Time-sensitive data. Prices, inventory levels, live scores, currency rates. Caching these will silently ship your users stale numbers.
User-specific data. Anything behind a login or personalized by cookies. If you cache it, two users can end up seeing each other's data.
Anything with a privacy or compliance obligation. If a user has a right to deletion, a cache full of their data is a liability.

When in doubt, split the page: cache the slow part (specs), live-fetch the fast part (price).

The cache helper

Drop this file alongside your main entry point. It uses a named key-value store called scrape-cache so cached responses persist across runs but stay out of your default storage tab.

src/cache.js

import { Actor, log } from 'apify';
import crypto from 'node:crypto';

const CACHE_STORE = 'scrape-cache';
const CACHE_VERSION = 'v1'; // bump to invalidate everything at once

// Key = version prefix + first 32 hex chars of the URL's SHA-256 hash.
function cacheKey(url) {
  const hash = crypto.createHash('sha256').update(url).digest('hex').slice(0, 32);
  return `${CACHE_VERSION}-${hash}`;
}

/**
 * Fetch a URL, returning a cached copy if one exists within `ttlMs`.
 * Returns { data, fromCache }. Throws on non-2xx responses.
 */
export async function cachedFetch(url, { ttlMs = 24 * 60 * 60 * 1000 } = {}) {
  const store = await Actor.openKeyValueStore(CACHE_STORE);
  const key = cacheKey(url);

  const cached = await store.getValue(key);
  if (cached && Date.now() - cached.cachedAt < ttlMs) {
    log.info(`Cache hit: ${url}`);
    return { data: cached.data, fromCache: true };
  }

  log.info(`Cache miss: ${url}`);
  const response = await fetch(url);
  // Don't cache error pages: a stored 404 or 503 would be served until the TTL runs out.
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}: ${url}`);
  }
  const data = await response.text();

  await store.setValue(key, { cachedAt: Date.now(), data });
  return { data, fromCache: false };
}

src/cache.py

import hashlib
import time

import httpx

from apify import Actor

CACHE_STORE = 'scrape-cache'
CACHE_VERSION = 'v1'  # bump to invalidate everything at once


def _cache_key(url: str) -> str:
    digest = hashlib.sha256(url.encode()).hexdigest()[:32]
    return f"{CACHE_VERSION}-{digest}"


async def cached_fetch(
    url: str,
    *,
    ttl_ms: int = 24 * 60 * 60 * 1000,
) -> tuple[str, bool]:
    """Return (data, from_cache). Uses a named KV store as a cross-run cache."""
    store = await Actor.open_key_value_store(name=CACHE_STORE)
    key = _cache_key(url)

    cached = await store.get_value(key)
    now_ms = int(time.time() * 1000)
    if cached and now_ms - cached['cachedAt'] < ttl_ms:
        Actor.log.info(f"Cache hit: {url}")
        return cached['data'], True

    Actor.log.info(f"Cache miss: {url}")
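    async with httpx.AsyncClient() as client:
        response = await client.get(url, follow_redirects=True)
    # Don't cache error pages: a stored 404 or 503 would be served until the TTL runs out.
    response.raise_for_status()
    data = response.text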

    await store.set_value(key, {'cachedAt': now_ms, 'data': data})
    return data, False

The contract: cachedFetch(url) returns { data, fromCache }; the Python cached_fetch(url) returns the same pair as a (data, from_cache) tuple. Treat fromCache as advisory: push it to your dataset if you want clients to know whether a row was reused.

Wire it into your actor

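Call cachedFetch(url) wherever you currently call fetch(url) and read the body as text. It hands back the body text directly rather than a Response object, so there's no .text() call left to make. That's the whole change.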

src/main.js

import { Actor } from 'apify';
import { cachedFetch } from './cache.js';

await Actor.init();

const input = (await Actor.getInput()) ?? {};
const urls = input.urls ?? ['https://example.com'];

for (const url of urls) {
  const { data, fromCache } = await cachedFetch(url, {
    ttlMs: 24 * 60 * 60 * 1000, // 24 hours
  });

  const title = data.match(/<title[^>]*>([^<]+)<\/title>/i)?.[1]?.trim() ?? '';
  await Actor.pushData({ url, title, fromCache });
}

await Actor.exit();

src/main.py

import re

from apify import Actor

from .cache import cached_fetch


async def main() -> None:
    async with Actor:
        actor_input = await Actor.get_input() or {}
        urls = actor_input.get('urls', ['https://example.com'])

        for url in urls:
            data, from_cache = await cached_fetch(url, ttl_ms=24 * 60 * 60 * 1000)
            match = re.search(r'<title[^>]*>([^<]+)</title>', data, re.IGNORECASE)
            title = match.group(1).strip() if match else ''
            await Actor.push_data({'url': url, 'title': title, 'fromCache': from_cache})

TTL strategy

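Your TTL is a direct statement about how much staleness you'll tolerate. Pick the longest TTL your users won't notice; every hour you add means more cache hits. Keep it comfortably longer than the gap between runs, though: expiry is checked on read, so a 24-hour TTL on a once-a-day schedule can expire entries just before the next run would have reused them.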

Three tiers cover most actors:

1 hour - news, social posts, anything where users expect freshness.
24 hours - most things. Product descriptions, profiles, articles.
7 days - reference data, archived content, anything you'd describe as “basically static.”

// News, social posts, anything where users expect freshness.
await cachedFetch(url, { ttlMs: 60 * 60 * 1000 });          // 1 hour

// Most product pages, profiles, articles.
await cachedFetch(url, { ttlMs: 24 * 60 * 60 * 1000 });     // 24 hours

// Reference data: country lists, archived posts, specs.
await cachedFetch(url, { ttlMs: 7 * 24 * 60 * 60 * 1000 }); // 7 days

# News, social posts, anything where users expect freshness.
await cached_fetch(url, ttl_ms=60 * 60 * 1000)            # 1 hour

# Most product pages, profiles, articles.
await cached_fetch(url, ttl_ms=24 * 60 * 60 * 1000)       # 24 hours

# Reference data: country lists, archived posts, specs.
await cached_fetch(url, ttl_ms=7 * 24 * 60 * 60 * 1000)   # 7 days

If your actor has a mix of page types, pass the right TTL per call rather than picking one global value.
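
One way to do that is a small lookup that maps a URL to a TTL. This is a sketch, not part of the helper above: the path prefixes and the name ttl_for are made up for illustration, so swap in whatever distinguishes page types on the site you scrape.

```python
from urllib.parse import urlparse

HOUR_MS = 60 * 60 * 1000

# Hypothetical path prefixes; adjust to the site you are scraping.
TTL_RULES = [
    ("/news/", 1 * HOUR_MS),
    ("/reference/", 7 * 24 * HOUR_MS),
]
DEFAULT_TTL_MS = 24 * HOUR_MS


def ttl_for(url: str) -> int:
    """Return the TTL in milliseconds for the first matching path prefix."""
    path = urlparse(url).path
    for prefix, ttl_ms in TTL_RULES:
        if path.startswith(prefix):
            return ttl_ms
    return DEFAULT_TTL_MS
```

You would then call cached_fetch(url, ttl_ms=ttl_for(url)) in the loop, and the tiers live in one place instead of being scattered across call sites.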

Cache busting

When you change your scraping logic (a new selector, a different parser, an extra extracted field), every cached response is now wrong. Bump CACHE_VERSION from 'v1' to 'v2'. The key prefix changes, every old key misses, and the cache rebuilds itself on the next run. No manual cleanup, no migration script.

Treat the version constant the same way you'd treat a schema migration: bump it any time the shape of what you store changes.
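
To see why a version bump invalidates everything, here is the key scheme from the helper in isolation. The only change is that the version is passed as an argument (an assumption made so both versions can be compared side by side).

```python
import hashlib


def cache_key(url: str, version: str) -> str:
    """Same scheme as the helper: version prefix plus a truncated SHA-256."""
    digest = hashlib.sha256(url.encode()).hexdigest()[:32]
    return f"{version}-{digest}"


url = "https://example.com/product/1"
old_key = cache_key(url, "v1")
new_key = cache_key(url, "v2")

# The hash part is identical; only the prefix differs, so lookups under v2
# never find entries written under v1.
print(old_key != new_key)                                      # True
print(old_key.split("-", 1)[1] == new_key.split("-", 1)[1])    # True
```

Note that the v1 entries are not deleted by the bump. They simply stop being read, and stay in the store until you remove them.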

Storage and size

KV-store values have a size limit: roughly 9 MB per value at the time of writing, so check the current platform limits before relying on that figure. For most scrapes that's plenty, but it's easy to blow through if you're caching raw HTML on image-heavy pages.

Cache the cheapest representation that still saves you the work:

If you only need a few fields, parse first and cache the JSON.
If you really need the HTML, gzip it before storing.
Never cache binary assets (screenshots, PDFs, images) in the same store. Use a dedicated store with a different eviction strategy.
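
If you go the gzip route, a sketch like the following could sit between the fetch and the store. The function names pack_html and unpack_html are made up here, and the base64 step is an assumption so the compressed bytes fit inside the same JSON value the helper already stores; it costs about a third in size compared with raw gzip bytes.

```python
import base64
import gzip


def pack_html(html: str) -> str:
    """Gzip the HTML and base64-encode it so it fits in a JSON value."""
    return base64.b64encode(gzip.compress(html.encode("utf-8"))).decode("ascii")


def unpack_html(packed: str) -> str:
    """Reverse of pack_html."""
    return gzip.decompress(base64.b64decode(packed)).decode("utf-8")


html = "<html><body>" + "<p>repetitive markup</p>" * 1000 + "</body></html>"
packed = pack_html(html)
print(len(html), len(packed))  # original size vs. packed size, in characters
```

In the helper you would store pack_html(data) on a miss and return unpack_html(cached['data']) on a hit. Changing what's stored like this is exactly the kind of change that calls for a CACHE_VERSION bump.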

Gotchas worth knowing

The cache is per actor, not per user. Two paying users querying the same URL share cached data. That's usually a feature (consistent results) but be careful with personalized pages.
KV-store values have a size limit. Roughly 9 MB per value. Cache parsed JSON or compressed HTML rather than raw page bytes for image-heavy pages.
No automatic eviction. Stale entries sit forever until you overwrite them. For high-cardinality actors, periodically prune or accept the cost (KV storage is cheap).
The cache doesn't speed up Crawlee. Crawlee has its own request deduplication via the request queue; this cache is for cross-run reuse, not within-run dedup.
TTL is on read, not write. A value written 25 hours ago with a 24-hour TTL is treated as a miss and re-fetched. The cache never “expires” entries on its own; the old value stays in the store until it's overwritten. Bump the version to invalidate broadly.
Don't cache credentials. A cache key is just sha256(url). If your URLs have tokens in query strings, you're caching the token-bearing response. Strip auth tokens before hashing.
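
For that last point, a minimal sketch of stripping token-like query parameters before the URL reaches the key function. The parameter names in SENSITIVE_PARAMS and the function name strip_auth_params are assumptions for illustration; list whatever your target site actually uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names; list whatever your target site uses.
SENSITIVE_PARAMS = {"token", "api_key", "access_token", "signature"}


def strip_auth_params(url: str) -> str:
    """Return the URL with token-like query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in SENSITIVE_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_auth_params("https://example.com/item?id=7&token=abc"))
# prints: https://example.com/item?id=7
```

Hash the stripped URL for the key, but keep fetching with the original one. Two requests that differ only by token then share a cache entry, which is only safe when the response doesn't depend on who is asking.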

Where to go next

Schedule your actor - caching pays off the most when the actor runs on a schedule.
Add free-tier limits to your Apify actor - cache hits are nearly free; cap free users on the misses.
How to tell if an Apify user is paying - different cache TTLs for free vs. paid users.
Apify Pricing Calculator - model the compute savings before you commit to a TTL.

Spotted a bug, or want a guide on something else?

support@mail.apifyhub.com