Detailed Crawl Guide

Summary: A detailed look at the Crawl settings and features, and the options for adding pages to your website so they can be scanned.

Overview

Crawler Purpose and Crawl vs Scan

The purpose of the Crawler is to discover the webpages of a website and add them to your Pope Tech dashboard so that each page can be evaluated for accessibility results. In its simplest form, evaluating your website with Pope Tech is a three-step process:

  1. Add a website
  2. Crawl the website to add pages into the dashboard (this guide)
  3. Scan a website – evaluate each added page for accessibility results

For the purpose of this guide, it is important to note that the crawler does not itself evaluate websites for accessibility results. To evaluate a website, run a Scan after the crawl has completed successfully.

Website Crawl – Quick Start

Before crawling a website it is best to verify that:

  1. The website URL is valid and active
  2. Crawler settings are appropriate:
    • The Max Pages setting is greater than the estimated page count of the website
    • The Max Depth setting is sufficient for the website structure

To run a crawl:

  1. Add a website
  2. Activate the crawl by selecting the Crawl button found within the Pages and Templates section of the website

To view the status of a crawl, select the Crawls tab in the Scans and Crawls portion of the website. You can also cancel a currently running crawl by selecting the cancel action button.

Default Crawl – Sitemap Import

By default, starting a crawl will attempt to import pages from a sitemap instead of running a traditional crawl. If a valid sitemap is discovered, all same-domain URLs from the sitemap will be quickly imported as pages into the dashboard.

If the sitemap doesn't exist at the specified location, is unreachable, or is invalid, the process will fall back to running a traditional crawl.
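
As a rough illustration of that order of operations, the sketch below (Python, and not Pope Tech's actual implementation) tries the default sitemap location first and falls back to a traditional crawl if the sitemap is missing, unreachable, or not valid XML. The function name and the example.com URL are hypothetical placeholders.

    # Illustration only: try [Base URL]/sitemap.xml first, fall back to a crawl.
    import urllib.request
    import xml.etree.ElementTree as ET
    from urllib.parse import urlparse

    SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

    def fetch_sitemap_urls(base_url):
        """Return same-domain URLs from the default sitemap location, or None if unusable."""
        try:
            with urllib.request.urlopen(base_url.rstrip("/") + "/sitemap.xml", timeout=10) as resp:
                root = ET.fromstring(resp.read())
        except Exception:
            return None  # missing, unreachable, or not valid XML
        host = urlparse(base_url).netloc
        urls = []
        for loc in root.iter(SITEMAP_NS + "loc"):
            url = (loc.text or "").strip()
            if url and urlparse(url).netloc == host:  # keep same-domain URLs only
                urls.append(url)
        return urls

    if __name__ == "__main__":
        pages = fetch_sitemap_urls("https://example.com")
        if pages:
            print(f"Imported {len(pages)} pages from the sitemap")
        else:
            print("No usable sitemap; fall back to a traditional crawl")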

Sitemap Options

  • Discover sitemaps from robots.txt – If your sitemap is referenced in your robots.txt file, you can direct the sitemap crawl to use the standard robots.txt location of [Base URL]/robots.txt
  • Sitemap URL – If your sitemap is in a non-standard location, you may specify a custom sitemap location to run the sitemap crawl from. This is the absolute URL to your sitemap (include https://…). If left empty, this will default to [Base URL]/sitemap.xml
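
For illustration, this is one way sitemap discovery from robots.txt can work: a robots.txt file may declare one or more "Sitemap:" lines, and Python's standard library can read them. This is only a sketch of the general mechanism, not how Pope Tech implements the option, and example.com is a placeholder.

    # Illustration: read Sitemap: entries declared in [Base URL]/robots.txt.
    from urllib.robotparser import RobotFileParser

    def sitemaps_from_robots(base_url):
        parser = RobotFileParser()
        parser.set_url(base_url.rstrip("/") + "/robots.txt")
        parser.read()                    # fetch and parse robots.txt
        return parser.site_maps() or []  # list of Sitemap: URLs (Python 3.8+)

    # sitemaps_from_robots("https://example.com")
    # -> ["https://example.com/sitemap.xml"] if robots.txt declares that line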

Sitemap Limitations

  • The sitemap import looks specifically for XML sitemaps that follow the sitemap protocol. If your sitemap is an HTML sitemap, deselect the Use Sitemap option and run a traditional crawl using the HTML sitemap as the Crawl Start Page.
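
For reference, a minimal XML sitemap that follows the sitemap protocol looks like the document embedded in the sketch below; an HTML sitemap, by contrast, is an ordinary web page of links and will not be recognized by the importer. The example URLs are placeholders.

    # The structure a protocol-conforming XML sitemap uses: a <urlset> of <url>/<loc> entries.
    import xml.etree.ElementTree as ET

    MINIMAL_SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url><loc>https://example.com/</loc></url>
      <url><loc>https://example.com/about/</loc></url>
    </urlset>"""

    root = ET.fromstring(MINIMAL_SITEMAP)
    print(root.tag)  # {http://www.sitemaps.org/schemas/sitemap/0.9}urlset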

Traditional Page Crawl

The traditional crawl will crawl (or spider) your website by systematically following links and collecting the page URLs it finds, adding them as pages to your website. Crawl duration will vary according to website complexity and the crawl settings.
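
As a simplified illustration of what a traditional crawl does (a generic sketch in Python, not the Pope Tech crawler itself), the code below follows same-domain links breadth-first and stops at the Max Pages and Max Depth limits described in the options below. A production crawler would also honor robots.txt rules and a rate limit.

    # Generic spider sketch: breadth-first link following with page and depth limits.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urldefrag, urlparse
    import urllib.request

    class LinkParser(HTMLParser):
        """Collect href values from <a> tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href")
                if href:
                    self.links.append(href)

    def crawl(start_url, max_pages=30, max_depth=5):
        host = urlparse(start_url).netloc
        queue = deque([(start_url, 0)])      # the start page itself is depth 0
        seen, found = {start_url}, []
        while queue and len(found) < max_pages:
            url, depth = queue.popleft()
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except Exception:
                continue                     # skip pages that fail to load
            found.append(url)
            if depth >= max_depth:
                continue                     # links found here would exceed Max Depth
            parser = LinkParser()
            parser.feed(html)
            for href in parser.links:
                link = urldefrag(urljoin(url, href)).url  # resolve relative links, drop #fragments
                if urlparse(link).netloc == host and link not in seen:
                    seen.add(link)           # duplicates are never queued twice
                    queue.append((link, depth + 1))
        return found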

Crawler Options

  • Max Pages – this is the upper limit of how many page URLs a single crawl or sitemap import will attempt to add to a website. Duplicates will not be added. For example, crawling a 1000-page website with a Max Pages setting of 100 will only add 100 pages. Re-crawling the same website with the same settings will generally add 0 pages, because the 100 pages it attempts to add are already in the dashboard and are treated as duplicates.
  • Max Depth – this is the degree of link separation from the website home page or Crawl Start Page. URLs found on the home page have a depth of 1 (no separation). URLs found on the pages linked from the home page have a depth of 2, and so on. If Max Depth were set to 2, the crawler would stop at that point, gathering only the URLs found on the home page and on pages linked directly from the home page.
  • Crawl Start Page – the base URL of the website is the default crawl start page. Sometimes you will discover additional pages by changing where the crawl process starts. HTML sitemaps or other link directory pages are often great options to use as the crawl start page. You can also use this option to crawl and add pages from a directory that is not directly linked from the main website.
  • Include Subdomains – A subdomain is a domain that is part of your domain (4k.pope.tech vs pope.tech). Pope Tech will always handle subdomains as separate websites. Crawls will not look for subdomain pages unless you select the option to include subdomains. If this option is selected, subdomain pages that are discovered will be added under their own website that will be created automatically.
  • Archive pages not found – If selected, this option will crawl your website and then compare the list of found pages against the current list of pages. If a page that was previously found is not on the latest list, it will be archived.
  • Use Sitemap (Enabled by default) – An alternative option to traditional crawling. Use Sitemap will import a page list from your sitemap.
  • Discover sitemaps from robots.txt – only applicable when the Use Sitemap option is selected.
  • Sitemap URL – only applicable when the Use Sitemap option is selected.
  • Filters – filters enable you to change which directories or pages URLs will be added from. You can filter by directory, full URL path, or keywords. Filters can be set as blacklist or whitelist filters (a filter-matching sketch follows this list):
    • Blacklist – example: /calendar/ – This will not crawl any URLs that contain /calendar/ in them
    • Whitelist – example: /admissions – This will only crawl pages in the /admissions directory
  • Crawl Rate Limit – The speed at which the crawler will navigate through your website. Websites that are not built for high traffic should be crawled at lower speeds. Crawling a website too quickly can sometimes result in our crawlers being blocked either temporarily or permanently, and can also cause your website to go temporarily offline. You are responsible for understanding the limitations of your website and for adjusting your crawl speed as appropriate.
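
As noted in the Filters option above, here is a rough sketch of how blacklist and whitelist filters can be applied to discovered URLs. The helper function is hypothetical and only illustrates the matching logic, not Pope Tech's implementation.

    # Hypothetical helper: decide whether a discovered URL passes the filters.
    def passes_filters(url, blacklist=(), whitelist=()):
        if any(term in url for term in blacklist):
            return False                 # a blacklist term matched: skip the URL
        if whitelist and not any(term in url for term in whitelist):
            return False                 # a whitelist is set but nothing matched: skip the URL
        return True

    print(passes_filters("https://example.com/calendar/events", blacklist=("/calendar/",)))   # False
    print(passes_filters("https://example.com/admissions/apply", whitelist=("/admissions",))) # True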

Crawl limitations

  • Forbidden by Robots – The crawl will respect the robots.txt rules for the website. If your robots.txt file contains a rule forbidding crawler/bot access, the crawl will return an error showing that it was forbidden by robots.txt (a small example check is sketched after this list). To resolve this, review the article Forbidden by Robots.txt.
  • JavaScript – The crawl does not currently support links that are rendered via JavaScript. This means that any website links generated with JavaScript will not be found by the traditional crawler. Because you cannot run a traditional crawl on these pages, you will need to add them to your account using one of the following methods:
    • Sitemap import
    • Upload CSV list of pages
    • Manual entry
  • Websites with authentication (web pages behind a login portal) – Unlike the scanner, which can evaluate pages behind a login portal, the crawler does not currently run behind logins. To add pages that are behind authentication, use one of the following methods:
    • Sitemap import
    • Upload CSV list of pages
    • Manual entry
  • Malformed HTML – The crawler is an automated tool that systematically looks for same-domain web pages. Its success rate often depends on how closely the website conforms to current HTML standards, and because websites can be structured in countless ways, results will vary accordingly. If successive crawls finish with limited or incomplete results, you may have better luck adding pages using one of the following methods:
    • Sitemap import
    • Upload CSV list of pages
    • Manual entry
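
As mentioned in the Forbidden by Robots item above, the sketch below shows how a crawler typically checks robots.txt before fetching a page, using Python's standard library. The user agent string and URLs are placeholders, and this is only an illustration of the general mechanism, not Pope Tech's crawler.

    # Illustration: check robots.txt rules before fetching a page.
    from urllib.robotparser import RobotFileParser

    robots = RobotFileParser()
    robots.set_url("https://example.com/robots.txt")
    robots.read()                                  # fetch and parse the rules

    # "ExampleBot" is a placeholder user agent, not the actual crawler name.
    if robots.can_fetch("ExampleBot", "https://example.com/some-page"):
        print("Allowed to crawl this URL")
    else:
        print("Forbidden by robots.txt - the crawl stops here with an error")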

Troubleshooting Crawls

If your crawl timed out or finished with incomplete results, you can run it again by starting another crawl. If you are repeatedly getting limited results, try another method (such as a sitemap import), or use a different web page as the Crawl Start Page. Review your other crawl settings and change them as needed. If the issue persists, review the crawl limitations above and ensure that your website is compatible with the crawl method you are attempting. If you continue to have issues after trying these options, contact Support.

Additional Information

Organizational Default Settings – Crawler settings can have their default values changed across an organization. These settings go into effect for all newly added websites. For example, if you changed your Organization Default setting for Crawl Max Pages to 1000 (the default value is 30 pages), any new website added after the setting change will have Max Pages set to 1000. This change will not affect existing websites.

Mass Import – For large website portfolios, the option to import websites in bulk via CSV enables you to add websites with custom crawl values that are unique to each website.