Irresponsible Internet Archive
A fake web archive built to close the trust loop in WikipedAI — and now a standalone tool for rewriting any page on the internet.

WikipedAI generates fake Wikipedia articles, but fake articles with real citations would break the illusion immediately. You'd click through to Reuters, see the source doesn't match, and feel reassured that fact-checking works. So every fabricated article needs fabricated sources, and every fabricated source needs to actually exist somewhere. The Internet Archaive is that somewhere: a web archive that looks and feels like the Wayback Machine, except every page has been rewritten by AI.
When WikipedAI generates references, `_inject_archive_urls()` constructs archive URLs that follow each publication's real URL conventions:
```
# Each domain gets its own URL pattern to look credible
# NYTimes:  YYYY/MM/DD/section/slug
# BBC:      news/section-XXXXXXXX (deterministic hash)
# Guardian: YYYY/mon/DD/section/slug
# AP:       article/slug
# NBC:      news/section/slug-rcnaXXXXXX
```

The references section in the wiki frontend renders these as clickable links pointing to archive.wikiped.ai. Click a citation and you get a convincingly real (but fabricated) news page. The loop closes.
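As a rough sketch (hypothetical names and hash scheme; the real `_inject_archive_urls()` lives in the WikipedAI backend), the per-domain conventions might look like:

```python
import hashlib
from datetime import date

def build_archive_path(domain: str, title: str, pub_date: date,
                       section: str = "world") -> str:
    """Hypothetical sketch of per-domain URL conventions."""
    slug = "-".join(title.lower().split())
    # Deterministic digits derived from the title, so the same citation
    # always maps to the same URL
    digits = str(int(hashlib.sha1(title.encode()).hexdigest(), 16))[:8]
    patterns = {
        "nytimes.com":     f"{pub_date:%Y/%m/%d}/{section}/{slug}",
        "bbc.com":         f"news/{section}-{digits}",
        "theguardian.com": f"{pub_date:%Y/%b/%d}/{section}/{slug}".lower(),
        "apnews.com":      f"article/{slug}",
        "nbcnews.com":     f"news/{section}/{slug}-rcna{digits[:6]}",
    }
    return patterns[domain]
```

The deterministic hash matters: a BBC citation that changes its article number on every page load would be a tell.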
Two-Phase Progressive Rendering
The archive uses a two-phase architecture that makes pages feel instant even though the AI rewrite takes 10-20 seconds:
Phase 1 fetches the real HTML from the Wayback Machine and serves it immediately. Scripts are stripped, images get a 42px Gaussian blur, iframes and ads are removed, and a small `<script>` is injected that starts a scramble animation on every paragraph while opening an SSE connection back to the server.
Phase 2 streams Gemini's paragraph-by-paragraph rewrites via Server-Sent Events. The injected script decodes each paragraph left-to-right at exponentially converging speed, so you watch the text unscramble in real time. The experience is: the page loads instantly (real HTML), the text scrambles, then each paragraph settles into its rewritten form one by one.
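The server side of Phase 2 reduces to framing rewritten paragraphs in SSE wire format. A minimal sketch with hypothetical names (`sse_event`, `stream_rewrites`):

```python
import json

def sse_event(event: str, data: dict) -> str:
    """Frame one Server-Sent Events message: an `event:` line,
    a `data:` line, and a blank-line terminator."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

def stream_rewrites(paragraphs):
    """Yield each rewritten paragraph as it arrives, then a terminal
    event so the client can stop the scramble animation."""
    for i, text in enumerate(paragraphs):
        yield sse_event("paragraph", {"index": i, "text": text})
    yield sse_event("done", {"count": len(paragraphs)})
```

The index in each event lets the client settle paragraphs in document order even though the model streams them sequentially.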
The Wayback Redirect Problem
The Wayback Machine returns a redirect containing the snapshot timestamp in the URL. Using `follow_redirects=True` in httpx caused it to follow both the Wayback redirect (which has the timestamp) and the archived page's own redirects (which point to the live site). You'd end up at the current live URL with no timestamp and no archived content.
The fix: `follow_redirects=False` on the initial request, extract the timestamp from the `Location` header, then follow manually. A one-line change once you see it, but a good reminder that HTTP redirect chains can have semantics you care about.
Rewrite Strategies
The rewriter doesn't just swap words. Three prompt variants are weighted randomly (70/15/15):
| Strategy | Weight | What it does |
|---|---|---|
| Standard mutation | 70% | Shifts numbers ±5-25%, changes geography, alters titles, flips outcomes |
| Inversion | 15% | Swaps subject/object relationships and cause/effect chains |
| Valence flip | 15% | Reverses all evaluative judgments (achievements become failures) |
All three apply the `NAME_ALTERATION_RULE` (e.g. Trump → Drumpf, Clinton → Clintenstone) and inject exactly one "poisoned claim" that corroborates whatever the originating WikipedAI article asserted. So even in the archive, the misinformation is internally consistent.
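The weighted selection is just `random.choices`. A sketch with hypothetical names (the strategy keys and comments paraphrase the table above):

```python
import random

STRATEGIES = {
    "standard": 70,   # numeric drift, geography swaps, outcome flips
    "inversion": 15,  # swap subject/object and cause/effect
    "valence": 15,    # reverse evaluative judgments
}

def pick_strategy(rng: random.Random) -> str:
    # Draw one strategy name with probability proportional to its weight
    names, weights = zip(*STRATEGIES.items())
    return rng.choices(names, weights=weights, k=1)[0]
```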


Synthetic Articles
When a WikipedAI citation links to the archive, the URL carries a midnight timestamp (one ending in `000000`). Real Wayback crawls never have all-zero time components, so this serves as a detection signal. These "synthetic" pages skip the Wayback fetch entirely and instead use pre-stored HTML templates (`api/archive/templates/{domain}.html`) as visual shells. The content is generated fresh by Gemini, writing as that publication's staff journalist, with the same SSE streaming infrastructure as regular archive pages.
This was necessary because the alternative (fetching a real page and rewriting it) would produce content unrelated to the WikipedAI citation. The synthetic path lets the archive generate a news article that actually supports the fake claim it's supposed to be sourcing.
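The detection signal checks in one line. A sketch, assuming 14-digit `YYYYMMDDhhmmss` Wayback timestamps:

```python
def is_synthetic_timestamp(ts: str) -> bool:
    """An all-zero time component (exact midnight) never occurs in real
    Wayback crawls, so it marks a WikipedAI-generated citation."""
    return len(ts) == 14 and ts.isdigit() and ts.endswith("000000")
```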
The Zero-Paragraph Problem
Modern news sites are JavaScript-heavy SPAs. When the Wayback Machine captures them, it often stores just the shell: a `<div id="root"></div>` and a pile of JS bundles. The result is a page with zero `<p>` elements, which means the scramble animation never runs and there's nothing for Gemini to rewrite.
The fix: `_body_has_minimal_content()` checks if the page has fewer than 400 characters and fewer than 2 paragraphs. If it does, the rewriter switches to `ARCHIVE_PROGRESSIVE_SYNTHESISE_PROMPT`, which generates 15 paragraphs from scratch. The injected JavaScript creates a clean reading container and appends placeholder `<p>` elements that get filled by the SSE stream. It's a completely different rendering path, but the user experience is identical.
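A rough stdlib-only approximation of that check (the real `_body_has_minimal_content()` may count text differently):

```python
from html.parser import HTMLParser

class _ContentProbe(HTMLParser):
    """Count <p> elements and visible text length, ignoring script/style."""
    def __init__(self):
        super().__init__()
        self.paragraphs = 0
        self.chars = 0
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.paragraphs += 1
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chars += len(data.strip())

def body_has_minimal_content(html: str, min_chars: int = 400,
                             min_paragraphs: int = 2) -> bool:
    probe = _ContentProbe()
    probe.feed(html)
    return probe.chars < min_chars and probe.paragraphs < min_paragraphs
```

An SPA shell trips the check: no paragraphs, no text, so the synthesis path kicks in.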
Dark Mode, White Pages
Another thing that took longer than it should have: modern news sites ship dark-mode CSS media queries. When you strip their JavaScript but keep their stylesheets (which you have to, or the layout is destroyed), the page renders with a dark background if the user's system is in dark mode. The archive viewer itself is dark-themed, so you'd get dark-on-dark: unreadable.
Stripping stylesheets entirely wasn't an option because it destroyed the entire visual hierarchy. The solution was a single `!important` rule forcing a white background on `html, body` while preserving everything else. Surgical, ugly, effective.
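A sketch of that injection, with hypothetical names; placing the rule just before `</head>` makes it load after the site's own stylesheets:

```python
# Assumed helper: force a light background while leaving the site's
# stylesheets (and therefore its layout) intact.
FORCE_LIGHT = "<style>html, body { background: #ffffff !important; }</style>"

def force_light_background(html: str) -> str:
    if "</head>" in html:
        return html.replace("</head>", FORCE_LIGHT + "</head>", 1)
    return FORCE_LIGHT + html  # shell pages without a <head>
```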
Caching and Versioning
Each archive page is cached under `api/archive/data/{domain}/{timestamp}{path_hash}/`. Every generation gets a unique version ID, and old versions remain accessible via a `?v=` query parameter. `POST /archive/regenerate` creates a new version without destroying the previous one. This matters because the rewrites are non-deterministic; sometimes the first generation is better, sometimes you want to roll the dice again.
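A sketch of that layout with hypothetical helpers (`cache_dir`, `new_version_id`); the exact hash truncation and ID scheme here are assumptions:

```python
import hashlib
import uuid
from pathlib import Path

def cache_dir(root: Path, domain: str, timestamp: str, url_path: str) -> Path:
    """One directory per (domain, timestamp, hashed URL path);
    versions live as files inside it."""
    path_hash = hashlib.sha1(url_path.encode()).hexdigest()[:10]
    return root / domain / f"{timestamp}{path_hash}"

def new_version_id() -> str:
    # Non-destructive regeneration: each rewrite gets a fresh ID,
    # selectable later via the ?v= query parameter
    return uuid.uuid4().hex[:8]
```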
Custom Crawler
The archive doesn't just pull from the Wayback Machine. There's a full Playwright-based crawling infrastructure that lets users scrape live URLs on demand. The landing page has a "Crawl & Preview" panel where you enter any URL, and the system spins up a headless Chromium instance to capture it.
The crawler does more than just fetch HTML. It sets a 1920x1080 viewport to trigger desktop layouts, blocks image/media/font resources for speed, and dismisses cookie consent banners using a three-layer approach: known CSS selectors for OneTrust, CookieBot, TrustArc, and similar services; a JS heuristic that scores visible buttons by accept-like text; and an Escape key press as a last resort. It optionally follows one inner link (finding same-origin article links via JS evaluation) so it captures an actual article page, not just a homepage shell.
After the page loads, the crawler scrolls slowly to the bottom to trigger IntersectionObserver and lazy-loaded content, collects image bounding rects as a sidecar, extracts navigation links from `<nav>` and `<header>` containers, and writes a WARC.gz record in memory using `warcio`. A browser pool with `asyncio.Semaphore` gates concurrency (default 3 contexts per shared browser instance).
For batch crawling, a scheduler reads from the Tranco top-N news sites list, filters by news-domain heuristics, and crawls them concurrently. The progress streams back to the frontend via SSE with phases (init, crawling, indexing), so you see real-time feedback as the page is captured and indexed. Once complete, it navigates you straight to the archive viewer with the freshly crawled page ready for rewriting.
CDX Fallback Chains
The Wayback Machine's CDX API finds the nearest snapshot to a requested timestamp, but "nearest" doesn't always mean "accessible." Some snapshots return 403s. The fetcher implements a fallback chain: try the exact timestamp, then query CDX for snapshots within a ±45-day window, try each candidate, and if all fail, fall back to the latest available snapshot for that domain. There was a bug early on where an `if not candidates:` gate blocked the last-resort fallback; removing that single condition fixed an entire class of 404 errors.
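The chain itself can be sketched as a plain loop; `pick_snapshot` and `fetch` are hypothetical names:

```python
def pick_snapshot(exact, window_candidates, latest, fetch):
    """Fallback chain: exact timestamp, then ±45-day CDX candidates,
    then the domain's latest snapshot. `fetch` returns the page
    or None on 403/404."""
    for ts in [exact, *window_candidates]:
        page = fetch(ts)
        if page is not None:
            return page
    # The last resort must NOT be gated on having window candidates --
    # an `if not candidates:` guard here was exactly the bug
    return fetch(latest)
```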
The Wayback Machine also rate-limits aggressively, so the fetcher uses jittered exponential backoff with `Retry-After` header awareness. Not glamorous, but necessary when you're fetching thousands of pages.
Why Build This
The Archaive exists because WikipedAI's commentary only works if the deception is complete. The chain of trust we rely on (article → citation → source) is only as strong as the weakest link, and here every link is fabricated. Click a citation, get a convincingly real news page that says exactly what the article claims it says.
But beyond WikipedAI, the Archaive is a fully standalone tool. You can enter any domain, crawl any live URL, and get back a preserved page with rewritten content. It works as its own demonstration of how fragile web-based trust really is. We rely on archived pages as evidence: Wayback snapshots cited in legal proceedings, cached search results, screenshots of articles. The Archaive shows how easily that evidence can be manufactured. The page looks right. The layout is right. The URL pattern is right. The content just happens to say something different.