Crawl Issues
Diagnose and fix IndexMind crawl failures, timeouts, robots.txt blocking, rate limiting, and incomplete page discovery.
The crawl stage visits your website pages and collects content, metadata, and technical signals. When it fails, IndexMind reports the specific error on your project dashboard. This guide covers the most common crawl problems and their solutions.
Crawl timeout
Symptom. The crawl stage shows "Timed out" after running for several minutes. Partial results may be available.
Common causes and solutions:
- Too many pages. Your site exceeds your plan's page limit or the crawler discovered more pages than expected. Reduce crawl scope by adding path exclusions in project settings (for example, exclude /archive/* or /tag/*).
- Slow server response. If your server takes more than 10 seconds to respond per page, the total crawl time can exceed the limit. Check your server performance and consider reducing crawl depth.
- Redirect chains. Long redirect chains (3+ hops) slow down crawling. Fix redirect chains on your server to point directly to the final URL.
IndexMind sets a 10-minute timeout for the crawl stage. If your site requires more time, reduce crawl scope or contact support to discuss options for large-site configurations.
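As a rough feasibility check, you can estimate whether a crawl fits the 10-minute budget from your page count and average response time. This back-of-the-envelope sketch assumes sequential fetching and ignores concurrency, so treat the result as a worst case:

```python
# Rough check: does a sequential crawl fit the 10-minute crawl-stage timeout?
# Assumes one request at a time; a real crawler may fetch pages concurrently.
TIMEOUT_SECONDS = 10 * 60

def crawl_fits(page_count: int, avg_response_seconds: float) -> bool:
    """Return True if the estimated total crawl time is within the timeout."""
    return page_count * avg_response_seconds <= TIMEOUT_SECONDS

# 500 pages at 1 s each fits; 500 pages at 10 s each does not.
print(crawl_fits(500, 1.0))   # True
print(crawl_fits(500, 10.0))  # False
```

If the estimate is over budget, either the page count or the per-page response time has to come down.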
Robots.txt blocking
Symptom. The crawl report shows "Blocked by robots.txt" for some or all pages. The page count is lower than expected.
Solution. IndexMind's crawler identifies itself with the user-agent IndexMindBot. To allow IndexMind to crawl your site, add the following to your robots.txt file:
User-agent: IndexMindBot
Allow: /
If you use a wildcard disallow rule (Disallow: /), it blocks all crawlers including IndexMind. Either add a separate User-agent: IndexMindBot group with Allow: / (crawlers follow the most specific matching group), or adjust your disallow rules to be more specific.
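You can sanity-check a robots.txt locally before deploying it using Python's standard-library parser. This is a sketch: it approximates, but does not guarantee, how IndexMind's crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that blocks all crawlers except IndexMindBot.
robots_txt = """\
User-agent: IndexMindBot
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# IndexMindBot matches its own group; other bots fall through to the
# wildcard group and are blocked.
print(parser.can_fetch("IndexMindBot", "https://yoursite.com/page"))   # True
print(parser.can_fetch("SomeOtherBot", "https://yoursite.com/page"))   # False
```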
Verifying your robots.txt
After updating your robots.txt, you can verify it by:
- Visiting https://yoursite.com/robots.txt in a browser to confirm the changes are live.
- Running a new analysis in IndexMind. The crawl report shows which pages were allowed and which were blocked.
Rate limiting (HTTP 429)
Symptom. The crawl report shows "Rate limited" or HTTP 429 errors for multiple pages.
Why this happens. Your server or CDN is returning HTTP 429 (Too Many Requests) responses because IndexMind's requests arrive faster than your rate limit allows.
Solutions:
- Adjust crawl rate in project settings. Go to your project's Settings tab and reduce the Crawl rate (requests per second). The default is 5 requests per second. Try reducing to 2 or 1.
- Allowlist IndexMind's IP ranges. Contact support@indexmind.com to get the current list of crawler IP addresses. Add these to your CDN or WAF allowlist.
- Check CDN/WAF rules. Some CDNs (Cloudflare, Akamai, Fastly) have bot protection rules that trigger on automated traffic. Create an exception rule for the IndexMindBot user-agent.
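To see why the default 5 requests per second can trip a stricter server-side limit, here is a minimal fixed-window rate-limiter simulation. The 2 req/s server limit is a hypothetical value for illustration, not an IndexMind setting:

```python
# Simulate a server enforcing a fixed-window (per-second) rate limit against
# a crawler sending a steady stream of requests. Returns one HTTP status
# code per request.
def simulate(crawler_rps: int, server_limit_rps: int, seconds: int = 3) -> list[int]:
    statuses = []
    for _ in range(seconds):            # one fixed window per second
        for i in range(crawler_rps):    # requests arriving in that window
            statuses.append(200 if i < server_limit_rps else 429)
    return statuses

# The default 5 req/s against a hypothetical 2 req/s limit: most requests
# are rejected with 429.
print(simulate(5, 2).count(429))  # 9 rejections over 3 seconds
# Reducing the crawl rate to match the limit eliminates the 429s.
print(simulate(2, 2).count(429))  # 0
```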
Authentication-protected pages
Symptom. Pages behind login walls return HTTP 401 or 403 errors.
Current limitation. IndexMind's crawler does not support authenticated crawling. It can only access publicly available pages. Pages behind authentication are skipped and excluded from analysis.
If important content is behind authentication, consider:
- Creating a staging environment with authentication disabled for crawling
- Using canonical URLs to point authenticated content to public equivalents
Incomplete page discovery
Symptom. The crawler found fewer pages than your site actually has.
Possible causes:
- No sitemap. If your site does not have a sitemap.xml at the standard location, the crawler relies on link discovery. Orphan pages (pages with no internal links pointing to them) will be missed. Add a sitemap to ensure complete coverage.
- JavaScript-rendered content. IndexMind's crawler processes server-rendered HTML. Content rendered exclusively by client-side JavaScript may not be discovered. Use server-side rendering (SSR) or pre-rendering for important pages.
- Crawl depth limit. The default crawl depth is 3 levels from the root URL. Pages deeper than this are not visited. Increase the crawl depth in project settings if needed.
- Path exclusions. Check your project settings for path exclusion rules that may be filtering out pages unintentionally.
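Both the orphan-page and crawl-depth causes come down to link discovery: without a sitemap, the crawler only finds pages reachable by following links within the depth limit. A small breadth-first sketch over a hypothetical link graph shows both failure modes:

```python
from collections import deque

# Hypothetical internal link graph: page -> pages it links to.
links = {
    "/": ["/about", "/blog"],
    "/about": [],
    "/blog": ["/blog/post-1"],
    "/blog/post-1": ["/blog/post-1/comments"],
    "/blog/post-1/comments": [],
    "/orphan": [],  # no page links here, so link discovery never finds it
}

def discover(root: str, max_depth: int) -> set[str]:
    """Breadth-first link discovery, stopping at max_depth hops from root."""
    seen, queue = {root}, deque([(root, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen

# At the default depth of 3, everything reachable by links is found,
# but the orphan page is still missed.
print("/orphan" in discover("/", 3))                  # False
# At depth 2, the comments page (3 hops from the root) is also missed.
print("/blog/post-1/comments" in discover("/", 2))    # False
```

A sitemap fixes the orphan case; raising the crawl depth fixes the deep-page case.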
SSL/TLS errors
Symptom. The crawl fails with "SSL certificate error" or "Connection refused."
Solutions:
- Verify your SSL certificate is valid and not expired using an online checker (e.g., SSL Labs).
- Ensure your server supports TLS 1.2 or higher. IndexMind does not support TLS 1.0 or 1.1.
- Check that your certificate chain is complete (includes intermediate certificates).
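You can approximate the crawler's TLS requirements from the client side with Python's ssl module. This sketch builds a context that refuses anything below TLS 1.2; connecting it to your server (via socket.create_connection plus context.wrap_socket) will fail if the server only offers legacy TLS. The context setup itself runs offline:

```python
import ssl

# A client context that, like a crawler without TLS 1.0/1.1 support,
# refuses anything below TLS 1.2.
context = ssl.create_default_context()
context.minimum_version = ssl.TLSVersion.TLSv1_2

print(context.minimum_version == ssl.TLSVersion.TLSv1_2)  # True
```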
Next steps
If your crawl issue persists after trying these solutions, contact support with your project name and the specific error message from the crawl report.