Cache expensive scrapes across runs
Most pages don't change between scrapes. A small named key-value store, used as a cross-run cache, lets your actor skip the pages it pulled recently and re-fetch only the ones whose cached copy has expired. This guide gives you a drop-in cachedFetch(url) helper in JavaScript and Python.
Why bother
Imagine an actor that scrapes 10,000 product pages every day. In practice, maybe 1,000 of them changed since yesterday — the other 9,000 returned bytes you already had. Without a cache you're paying compute, proxy bandwidth, and runtime for the full 10k every single run. You're throwing away 90% of the bill.
Caching across runs flips that ratio. The first run is full price; every subsequent run pays only for the pages whose cached copy has expired. If 90% of lookups hit the cache, that's roughly a 10x reduction in fetch cost without changing your output.
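The arithmetic behind that claim can be sketched in a few lines. The 90% hit rate is the assumption from the example above, not a measured figure; your real hit rate depends on how your TTL compares with your schedule interval.

```python
def daily_fetches(total_pages: int, hit_rate: float) -> int:
    """Pages that still need a real fetch when `hit_rate` of lookups come from cache."""
    return round(total_pages * (1 - hit_rate))


cold = daily_fetches(10_000, 0.0)  # first run: nothing cached yet
warm = daily_fetches(10_000, 0.9)  # later runs: assumed 90% cache hits
print(cold, warm, cold / warm)
```

Plug in your own page count and an honest hit-rate estimate before promising anyone a specific saving.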
What to cache (and what not to)
Good cache candidates — pages where the underlying data changes slowly:
- Company profiles, about pages, team listings.
- Product specs, dimensions, descriptions (separate from price/stock).
- Archived blog posts, news from > 24h ago.
- Reference taxonomies — category trees, country/region lists.
Don't cache:
- Time-sensitive data. Prices, inventory levels, live scores, currency rates. Caching these will silently ship your users stale numbers.
- User-specific data. Anything behind a login or personalized by cookies. Two users will see each other's data.
- Anything with a privacy or compliance obligation. If a user has a right to deletion, a cache full of their data is a liability.
When in doubt, split the page: cache the slow part (specs), live-fetch the fast part (price).
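A minimal sketch of that split, assuming you already have two coroutines: one that returns specs through the cache and one that always hits the network for the price. Both fetchers and the field names are hypothetical placeholders, not part of the helper in this guide.

```python
import asyncio
from typing import Awaitable, Callable

# A fetcher takes a URL and resolves to a dict of extracted fields.
Fetcher = Callable[[str], Awaitable[dict]]


async def build_product(url: str, specs_cached: Fetcher, price_live: Fetcher) -> dict:
    """Merge slow-changing specs (cache-backed) with a price fetched fresh every run."""
    specs, price = await asyncio.gather(specs_cached(url), price_live(url))
    return {"url": url, **specs, **price}
```

Only the specs fetcher should go through the cache; the price fetcher bypasses it on every run.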
The cache helper
Save this file as cache.js (or cache.py) alongside your main entry point. It uses a named key-value store called scrape-cache, so cached responses persist across runs but stay out of your default storage tab.
```javascript
import { Actor, log } from 'apify';
import crypto from 'node:crypto';

const CACHE_STORE = 'scrape-cache';
const CACHE_VERSION = 'v1'; // bump to invalidate everything at once

// Build a store-safe key: version prefix plus a truncated SHA-256 of the URL.
function cacheKey(url) {
    const hash = crypto.createHash('sha256').update(url).digest('hex').slice(0, 32);
    return `${CACHE_VERSION}-${hash}`;
}

/**
 * Fetch a URL, returning a cached copy if one exists within `ttlMs`.
 * Returns { data, fromCache }.
 */
export async function cachedFetch(url, { ttlMs = 24 * 60 * 60 * 1000 } = {}) {
    const store = await Actor.openKeyValueStore(CACHE_STORE);
    const key = cacheKey(url);

    const cached = await store.getValue(key);
    if (cached && Date.now() - cached.cachedAt < ttlMs) {
        log.info(`Cache hit: ${url}`);
        return { data: cached.data, fromCache: true };
    }

    log.info(`Cache miss: ${url}`);
    const response = await fetch(url);
    // Don't cache error pages: a stored 404 or 500 would be served until the TTL runs out.
    if (!response.ok) {
        throw new Error(`Fetch failed with status ${response.status}: ${url}`);
    }
    const data = await response.text();
    await store.setValue(key, { cachedAt: Date.now(), data });
    return { data, fromCache: false };
}
```
```python
import hashlib
import time

import httpx
from apify import Actor

CACHE_STORE = 'scrape-cache'
CACHE_VERSION = 'v1'  # bump to invalidate everything at once


def _cache_key(url: str) -> str:
    """Build a store-safe key: version prefix plus a truncated SHA-256 of the URL."""
    digest = hashlib.sha256(url.encode()).hexdigest()[:32]
    return f"{CACHE_VERSION}-{digest}"


async def cached_fetch(
    url: str,
    *,
    ttl_ms: int = 24 * 60 * 60 * 1000,
) -> tuple[str, bool]:
    """Return (data, from_cache). Uses a named KV store as a cross-run cache."""
    store = await Actor.open_key_value_store(name=CACHE_STORE)
    key = _cache_key(url)

    cached = await store.get_value(key)
    now_ms = int(time.time() * 1000)
    if cached and now_ms - cached['cachedAt'] < ttl_ms:
        Actor.log.info(f"Cache hit: {url}")
        return cached['data'], True

    Actor.log.info(f"Cache miss: {url}")
    async with httpx.AsyncClient() as client:
        response = await client.get(url, follow_redirects=True)
    # Don't cache error pages: a stored 404 or 500 would be served until the TTL runs out.
    response.raise_for_status()
    data = response.text
    await store.set_value(key, {'cachedAt': now_ms, 'data': data})
    return data, False
```
The contract: cachedFetch(url) returns { data, fromCache }; the Python cached_fetch(url) returns the same pair as a (data, from_cache) tuple. Treat fromCache as advisory: push it to your dataset if you want clients to know whether a row was reused.
Wire it into your actor
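Call cachedFetch(url) where you used to call fetch(url). The one difference is the return value: you get the body text in data plus the fromCache flag, not a Response object.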
```javascript
import { Actor } from 'apify';
import { cachedFetch } from './cache.js';

await Actor.init();

const input = (await Actor.getInput()) ?? {};
const urls = input.urls ?? ['https://example.com'];

for (const url of urls) {
    const { data, fromCache } = await cachedFetch(url, {
        ttlMs: 24 * 60 * 60 * 1000, // 24 hours
    });
    const title = data.match(/<title[^>]*>([^<]+)<\/title>/i)?.[1]?.trim() ?? '';
    await Actor.pushData({ url, title, fromCache });
}

await Actor.exit();
```
```python
import re

from apify import Actor

from .cache import cached_fetch


async def main() -> None:
    async with Actor:
        actor_input = await Actor.get_input() or {}
        urls = actor_input.get('urls', ['https://example.com'])

        for url in urls:
            data, from_cache = await cached_fetch(url, ttl_ms=24 * 60 * 60 * 1000)
            match = re.search(r'<title[^>]*>([^<]+)</title>', data, re.IGNORECASE)
            title = match.group(1).strip() if match else ''
            await Actor.push_data({'url': url, 'title': title, 'fromCache': from_cache})
```
TTL strategy
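Your TTL is a direct statement about how much staleness you'll tolerate. Pick the longest TTL your users won't notice; every hour you add means more cache hits. On a scheduled actor, also keep the TTL longer than the gap between runs, otherwise most entries will have expired by the time the next run reads them.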
Three tiers cover most actors:
- 1 hour — news, social posts, anything where users expect freshness.
- 24 hours — most things. Product descriptions, profiles, articles.
- 7 days — reference data, archived content, anything you'd describe as “basically static.”
```javascript
// News, social posts, anything that moves hourly.
await cachedFetch(url, { ttlMs: 60 * 60 * 1000 }); // 1 hour

// Most product pages, profiles, articles.
await cachedFetch(url, { ttlMs: 24 * 60 * 60 * 1000 }); // 24 hours

// Reference data: country lists, archived posts, specs.
await cachedFetch(url, { ttlMs: 7 * 24 * 60 * 60 * 1000 }); // 7 days
```

```python
# News, social posts, anything that moves hourly.
await cached_fetch(url, ttl_ms=60 * 60 * 1000)  # 1 hour

# Most product pages, profiles, articles.
await cached_fetch(url, ttl_ms=24 * 60 * 60 * 1000)  # 24 hours

# Reference data: country lists, archived posts, specs.
await cached_fetch(url, ttl_ms=7 * 24 * 60 * 60 * 1000)  # 7 days
```

If your actor has a mix of page types, pass the right TTL per call rather than picking one global value.
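One way to pick a TTL per call is a small lookup keyed on the URL path. This is a sketch under assumptions: the path prefixes below are hypothetical, and ttl_for is an illustrative name that is not part of the helper above.

```python
from urllib.parse import urlparse

HOUR_MS = 60 * 60 * 1000
DAY_MS = 24 * HOUR_MS

# Hypothetical path prefixes; replace them with the patterns your target site uses.
TTL_RULES = [
    ("/news/", HOUR_MS),
    ("/reference/", 7 * DAY_MS),
]
DEFAULT_TTL_MS = DAY_MS


def ttl_for(url: str) -> int:
    """Return the TTL in milliseconds for the first rule whose prefix matches the path."""
    path = urlparse(url).path
    for prefix, ttl_ms in TTL_RULES:
        if path.startswith(prefix):
            return ttl_ms
    return DEFAULT_TTL_MS
```

A call site would then read `await cached_fetch(url, ttl_ms=ttl_for(url))`, so the tiering lives in one table instead of being scattered across the crawl loop.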
Cache busting
When you change what gets stored (a new parser, an extra extracted field, a different record layout), every cached entry is now wrong. Bump CACHE_VERSION from 'v1' to 'v2'. The key prefix changes, every old key misses, and the cache rebuilds itself on the next run. No manual cleanup, no migration script. If you cache raw HTML, as the helper above does, a selector change alone doesn't need a bump, because parsing happens after the cache read.
Treat the version constant the same way you'd treat a schema migration: bump it any time the shape of what you store changes.
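To see why a bump invalidates everything, here is the key derivation from the helper with the version passed in as an argument. That parameter is an adaptation for illustration; the helper reads a module constant instead.

```python
import hashlib


def cache_key(url: str, version: str) -> str:
    """Version prefix plus the first 32 hex characters of the URL's SHA-256."""
    digest = hashlib.sha256(url.encode()).hexdigest()[:32]
    return f"{version}-{digest}"


url = "https://example.com/product/42"
old_key = cache_key(url, "v1")
new_key = cache_key(url, "v2")

# Same URL, same hash suffix, different prefix: the v2 lookup can never find the v1 entry.
print(old_key != new_key, old_key[3:] == new_key[3:])
```

The old v1 entries are never read again, but they are not deleted either, so they keep occupying storage until you remove them.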
Storage and size
KV-store values have a size limit — roughly 9 MB per value last I checked. For most scrapes that's plenty, but it's easy to blow through if you're caching raw HTML on image-heavy pages.
Cache the cheapest representation that still saves you the work:
- If you only need a few fields, parse first and cache the JSON.
- If you really need the HTML, gzip it before storing.
- Never cache binary assets (screenshots, PDFs, images) in the same store. Use a dedicated store with a different eviction strategy.
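A sketch of the gzip option, assuming you keep storing a JSON object as the helper does. JSON can't hold raw bytes, so the compressed payload is base64-encoded here; that step is my assumption, and it gives back about a third of the compression gain. The function names are illustrative.

```python
import base64
import gzip


def pack_html(html: str) -> str:
    """Gzip the HTML, then base64-encode it so it fits inside a JSON value."""
    compressed = gzip.compress(html.encode("utf-8"))
    return base64.b64encode(compressed).decode("ascii")


def unpack_html(packed: str) -> str:
    """Reverse pack_html: base64-decode, then gunzip back to the original string."""
    compressed = base64.b64decode(packed)
    return gzip.decompress(compressed).decode("utf-8")
```

Store `pack_html(data)` in place of `data` on a miss and call `unpack_html` on a hit. Bump CACHE_VERSION when you switch, since older entries hold plain text.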
Gotchas worth knowing
- The cache is keyed by URL, not by user. Every run that opens scrape-cache under the same Apify account reads and writes the same entries, no matter whose input triggered the fetch. That's usually a feature (consistent results) but be careful with personalized pages.
- KV-store values have a size limit. Roughly 9 MB per value. Cache parsed JSON or compressed HTML rather than raw page bytes for image-heavy pages.
- No automatic eviction. Stale entries sit forever until you overwrite them. For high-cardinality actors, periodically prune or accept the cost — KV storage is cheap.
- The cache doesn't speed up Crawlee. Crawlee has its own request deduplication via the request queue; this cache is for cross-run reuse, not within-run dedup.
- TTL is on read, not write. A value written 25 hours ago with a 24-hour TTL is fetched fresh. The cache never “expires” entries on its own — bump the version to invalidate broadly.
- Don't cache credentials. The cache key is just a hash of the URL. If your URLs carry tokens in query strings, each token produces its own key and the stored body is the token-bearing response. Strip auth tokens before hashing.
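For the last gotcha, a minimal sketch of stripping tokens before hashing, using only the standard library. The parameter names in SECRET_PARAMS are hypothetical, so list whatever your target URLs actually carry. Sorting the remaining parameters and dropping the fragment are extra normalization choices of mine, not something the helper requires.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names; adjust to the tokens your URLs actually contain.
SECRET_PARAMS = {"token", "api_key", "access_token", "signature"}


def url_for_cache_key(url: str) -> str:
    """Return the URL without secret query parameters, with the rest sorted."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in SECRET_PARAMS
    ]
    kept.sort()
    # The fragment never reaches the server, so it is dropped from the key as well.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

Hash the normalized URL for the key, but keep fetching the original URL so the request still authenticates. This only makes sense when the response doesn't depend on whose token was used.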
Where to go next
- Schedule your actor — caching pays off the most when the actor runs on a schedule.
- Add free-tier limits to your Apify actor — cache hits are nearly free; cap free users on the misses.
- How to tell if an Apify user is paying — different cache TTLs for free vs. paid users.
- Apify Pricing Calculator — model the compute savings before you commit to a TTL.