Website crawl

Ingest public help centers and marketing docs: crawl types, depth, politeness, optional Gemini image recognition and audio/video transcription, and production queue behavior.

Tags: crawl, help center, documentation site, sitemap, gemini

Crawling suits content that already lives on domains you control. It complements connectors when there is no API for a static site or microsite.

You can crawl HTML text only (fast, no Gemini required) or optionally enrich pages with descriptions of linked images and transcripts of linked audio/video. Those AI steps need a working Gemini configuration (hosted or your org key).

When robots.txt respect is enabled, the crawler honors the site's robots.txt rules. FlexyAgents also applies delays between requests to reduce load on your origin. In production, long crawls may queue and run from background workers, so check crawl status in the dashboard.

Crawl types: homepage, sitemap, full site

Homepage crawls a single URL—useful for landing pages or smoke tests.

Sitemap mode reads sitemap.xml (and may follow nested sitemap indexes up to a cap) to enumerate URLs; good for doc sites that maintain an accurate sitemap.
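
As an illustration of sitemap enumeration, here is a minimal sketch in Python. It is not FlexyAgents' implementation: the function name `collect_urls`, the `fetch` callback, and the `max_sitemaps` cap value are assumptions made for the example. It reads `<loc>` entries from a standard sitemap and follows nested sitemap indexes until the cap is reached.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def collect_urls(sitemap_url, fetch, max_sitemaps=10):
    """Return page URLs from a sitemap, following nested indexes up to a cap.

    `fetch` is a caller-supplied function mapping a URL to XML text
    (hypothetical; swap in your own HTTP client).
    """
    pages, queue, seen = [], [sitemap_url], set()
    while queue and len(seen) < max_sitemaps:
        current = queue.pop(0)
        if current in seen:
            continue
        seen.add(current)
        root = ET.fromstring(fetch(current))
        locs = [el.text.strip() for el in root.iter(f"{SITEMAP_NS}loc") if el.text]
        if root.tag == f"{SITEMAP_NS}sitemapindex":
            queue.extend(locs)  # entries point at further sitemap files
        else:
            pages.extend(locs)  # a regular <urlset> lists page URLs
    return pages
```

The cap bounds how many sitemap files are read, so a very large or self-referencing index cannot keep the enumeration running indefinitely.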

Full site starts at a seed URL and follows same-domain links up to a maximum depth and page count—best when navigation matters and sitemaps are incomplete.

  • Tune max pages and depth to what your origin and crawl workers can handle; very large crawls belong in maintenance windows.
  • The follow-links setting applies to full-site crawls; homepage and sitemap modes do not wander beyond the URLs they are given.
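
The full-site behavior described above can be sketched as a breadth-first traversal bounded by depth and page count. This is an assumption-laden illustration, not the product's crawler: `plan_full_crawl`, `get_links`, and the default limits are hypothetical names and values, and "same domain" is approximated here by comparing host names.

```python
from collections import deque
from urllib.parse import urlparse


def plan_full_crawl(seed, get_links, max_depth=2, max_pages=100):
    """Return URLs in visit order, starting at `seed`.

    `get_links` is a caller-supplied function returning the absolute
    links found on a page (hypothetical stand-in for fetch + parse).
    """
    host = urlparse(seed).netloc
    visited, order = {seed}, []
    queue = deque([(seed, 0)])
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            # Stay on the seed's host and never queue a URL twice.
            if urlparse(link).netloc == host and link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order
```

Breadth-first order means pages closest to the seed are collected first, so a page cap trims the deepest pages rather than arbitrary ones.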

Seeds, scope, and politeness

Start from a seed URL or sitemap. Restrict paths to prefixes such as `/help/` or `/docs/` so marketing content does not dilute support retrieval.

Exclude authenticated, employee-only paths unless you intend that knowledge base to include internal content.

  • Set per-request delay (seconds) to stay polite to small origins.
  • Toggle robots.txt respect when your policy requires strict compliance.
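
To make the scope and politeness settings concrete, the sketch below combines a path-prefix filter, an optional robots.txt check, and a per-request delay. It uses Python's standard `urllib.robotparser`; the function names, the `example-crawler` user agent, and the default values are assumptions for illustration and do not describe how FlexyAgents implements these settings.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def in_scope(url, allowed_prefixes=("/help/", "/docs/")):
    """True when the URL path starts with one of the allowed prefixes."""
    return urlparse(url).path.startswith(tuple(allowed_prefixes))


def polite_urls(urls, robots_lines, respect_robots=True, delay_seconds=1.0,
                user_agent="example-crawler", sleep=time.sleep):
    """Yield URLs that pass scope and robots checks, pausing between them.

    `robots_lines` is the robots.txt content split into lines.
    `sleep` is injectable so the delay can be observed in tests.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)
    first = True
    for url in urls:
        if not in_scope(url):
            continue
        if respect_robots and not parser.can_fetch(user_agent, url):
            continue
        if not first:
            sleep(delay_seconds)  # wait before every request after the first
        first = False
        yield url
```

Filtering before sleeping means skipped URLs cost no delay; only requests that would actually reach the origin are spaced out.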

Image recognition vs audio/video transcription

Two independent switches control whether the crawler fetches and processes images, and whether it processes audio and video URLs found on each HTML page.

When enabled, the system downloads media within size limits, runs Gemini vision or transcription, and appends extracted text to the crawled document. Failures produce short placeholders so you can see which URLs broke.

  • SVG and data: URLs are skipped for image processing for compatibility reasons.
  • Hosted crawl quotas are counted per successful hosted Gemini call; organization Gemini keys bypass the hosted counters. See Documentation → Knowledge → AI vision, transcription & limits.
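
The image-handling rules above can be sketched roughly as follows. Everything here is hypothetical: `describe` stands in for whatever Gemini vision call is configured, and the placeholder wording is invented for the example, since the source does not specify the actual placeholder text or size limits.

```python
from urllib.parse import urlparse


def should_process_image(url):
    """Skip data: URLs and SVG files, as the rules above describe."""
    if url.lower().startswith("data:"):
        return False
    return not urlparse(url).path.lower().endswith(".svg")


def enrich_page(text, image_urls, describe, enabled=True):
    """Append image descriptions to page text, with placeholders on failure.

    `describe` is a caller-supplied function (hypothetical) that returns
    a text description for an image URL or raises on failure.
    """
    if not enabled:
        return text  # switch off: HTML text only
    parts = [text]
    for url in image_urls:
        if not should_process_image(url):
            continue
        try:
            parts.append(f"[Image: {describe(url)}]")
        except Exception:
            # Keep the crawl going and record which URL failed.
            parts.append(f"[Image description failed: {url}]")
    return "\n".join(parts)
```

A separate switch with the same shape could gate audio and video transcription, which keeps the two features independent as described.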

Authenticated portals

Some teams front portals with SSO; coordinate with IT for supported crawl patterns or use exports/connectors instead.

Never store customer credentials in crawl configs.

Maintenance

Schedule recrawls after major doc releases. Broken links in the source HTML leave gaps in the crawled content, so fix them upstream.

Pair crawl with analytics to see which URLs actually drive answers.
