Website Crawling AI

Learn how to index your public marketing pages and support documentation automatically. Sentrup crawls sitemaps, scrapes page content, compares content hashes to skip unchanged pages, and generates vector embeddings for the rest.

Sitemap & URL Discovery

Sentrup starts a crawl by fetching the host's `sitemap.xml` file. If a sitemap is found, its URLs are parsed with regex scanners. If no sitemap is present, the discovery module falls back to extracting links from the homepage.

Waterfall Scraper Pipeline

To optimize budget and parsing accuracy, Sentrup implements a three-tier **waterfall scraper architecture**:

| Scraper Tier | Service | Target Use Case | Latency | Cost |
| --- | --- | --- | --- | --- |
| Tier 1 (Cheap Static) | Trafilatura / Local HTML Fetch | Static pages, plain markdown, articles | <500 ms | Free |
| Tier 2 (Medium Dynamic) | Jina Reader API | Documentation sites, guides, API hubs | ~1.2 s | Low |
| Tier 3 (Heavy SPA/CF) | Firecrawl Scrape API | React/Next SPAs, Cloudflare-protected domains | ~3.5 s | Pay-per-scrape |
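One way to picture a waterfall of this kind is a loop that tries the cheapest tier first and escalates only when a tier fails or returns too little text. The sketch below is an assumption-laden illustration: the `Tier` type, the `min_words` escalation rule, and the stub scraper callables are hypothetical, and no real Trafilatura, Jina, or Firecrawl calls are made. How Sentrup actually decides to escalate is not stated here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Tier:
    name: str
    # Hypothetical scraper callable: returns extracted text, or None on failure.
    scrape: Callable[[str], Optional[str]]


def waterfall_scrape(url: str, tiers: list[Tier], min_words: int = 50) -> tuple[str, str]:
    """Try each tier in order and return (tier_name, text) from the first
    tier whose output looks usable. `min_words` is an assumed heuristic."""
    for tier in tiers:
        try:
            text = tier.scrape(url)
        except Exception:
            # Treat any scraper error as a signal to escalate to the next tier.
            text = None
        if text and len(text.split()) >= min_words:
            return tier.name, text
    raise RuntimeError(f"all scraper tiers failed for {url}")


# Stub tiers standing in for the three services in the table.
def static_fetch(url: str) -> Optional[str]:
    return "too short"            # e.g. an SPA shell with almost no text


def reader_api(url: str) -> Optional[str]:
    raise TimeoutError("blocked")  # e.g. a bot-protection challenge


def heavy_scrape(url: str) -> Optional[str]:
    return "word " * 200          # fully rendered page content


tiers = [
    Tier("tier1-static", static_fetch),
    Tier("tier2-reader", reader_api),
    Tier("tier3-heavy", heavy_scrape),
]
used_tier, page_text = waterfall_scrape("https://example.com/app", tiers)
```

With these stubs, the first two tiers are rejected and the result comes from the third, so the most expensive service is only paid for when the cheaper ones cannot produce usable text.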

Incremental Hashing & Vectorization

Sentrup avoids duplicate work by checking **SHA-256 hashes** on every crawl cycle. When text is extracted from a page, a hash of the cleaned text is calculated and compared against the hash recorded for that page's existing vectors in Qdrant. If the hash matches, the stored vectors are left untouched. If the text has changed, the old vectors are deleted, the new content is split into **300-word semantic chunks**, and new embeddings are generated using `gemini-embedding-2`.
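The hash-then-re-embed cycle might look something like the sketch below. Several things here are stand-ins rather than Sentrup's implementation: a plain dictionary plays the role of the Qdrant collection, `embed` is a caller-supplied stub in place of `gemini-embedding-2`, and the chunker is a simple fixed-size word window rather than a semantic splitter.

```python
import hashlib
from typing import Callable

CHUNK_WORDS = 300  # chunk size stated in the text above


def content_hash(cleaned_text: str) -> str:
    """SHA-256 hex digest of the cleaned page text."""
    return hashlib.sha256(cleaned_text.encode("utf-8")).hexdigest()


def chunk_words(text: str, size: int = CHUNK_WORDS) -> list[str]:
    """Naive fixed-size chunking by word count (not semantic splitting)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def sync_page(
    url: str,
    cleaned_text: str,
    store: dict[str, dict],
    embed: Callable[[str], list[float]],
) -> bool:
    """Re-embed a page only if its text changed. Returns True if updated.

    `store` is a hypothetical in-memory stand-in for the vector database.
    """
    new_hash = content_hash(cleaned_text)
    existing = store.get(url)
    if existing is not None and existing["hash"] == new_hash:
        return False  # unchanged: leave stored vectors untouched
    chunks = chunk_words(cleaned_text)
    # Overwriting the entry replaces (i.e. deletes) the old vectors.
    store[url] = {
        "hash": new_hash,
        "chunks": chunks,
        "vectors": [embed(chunk) for chunk in chunks],
    }
    return True
```

In use, the first call for a URL returns `True`, a repeat call with identical text returns `False` without invoking `embed`, and any change to the text triggers a full replacement of that page's chunks.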

Crawl Request API Payload

The dashboard initiates crawls by sending a POST request to the administrative documents crawler endpoint:

POST /api/v1/admin/documents/crawl
Content-Type: application/json
Authorization: Bearer <your-admin-token>

{
  "url": "https://www.sentrup.com",
  "max_pages": 15,
  "source_priority": "high",
  "allowed_paths": ["/docs", "/faq", "/pricing"]
}