The purpose of Pope Tech’s crawler is to navigate a website and add the web pages it finds to your dashboard. You can change how the crawler interacts with your website using the crawler settings.
For a transcript synchronized with the video, watch Crawl Settings on YouTube with transcripts open.
To access your crawler settings, navigate to the Websites page under the main menu and select the actions button for any of the websites. On the edit website page, in the website settings widget, find and select the crawler options. Here you can modify the crawler settings for Max Pages, Max Depth, Start Page, Includes Subdomain, Filters, and Crawl Rate Limit.
The max pages setting limits the number of pages that the crawler will add to your website. Once the crawler reaches this limit, it will stop.
The max depth can be defined as the number of links, or hops, that the crawler goes through to find a particular web page. If it finds a link on your home page and navigates to the linked page, that page has a depth of two. If it then finds a link on that page to another page, that page has a depth of three, and so on. Most websites have a maximum depth of five or six, so the default setting of 10 should be adequate for most scenarios.
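To make the depth counting concrete, here is a minimal sketch of a breadth-first crawl over a small in-memory link graph. It is only an illustration, not Pope Tech's implementation: the `crawl` function, the `site` dictionary, and the parameter names are hypothetical, and the sketch assumes the start page counts as depth one so that a page linked from it has a depth of two, as described above.

```python
from collections import deque

def crawl(links, start, max_pages=100, max_depth=10):
    """Breadth-first walk over a link graph; returns {url: depth}.

    `links` maps each page to the pages it links to (a hypothetical
    stand-in for fetching a page and reading its links).
    """
    found = {start: 1}              # assumption: the start page is depth 1
    queue = deque([start])
    while queue:
        page = queue.popleft()
        depth = found[page]
        if depth >= max_depth:      # do not follow links past the depth limit
            continue
        for target in links.get(page, []):
            if target in found:     # already discovered at the same or a lower depth
                continue
            if len(found) >= max_pages:
                return found        # page limit reached, so the crawl stops
            found[target] = depth + 1
            queue.append(target)
    return found

# Hypothetical site: home -> about -> team -> history
site = {
    "/": ["/about", "/calendar"],
    "/about": ["/about/team"],
    "/about/team": ["/about/team/history"],
}
result = crawl(site, "/", max_pages=100, max_depth=3)
```

With a max depth of three, `/about/team` is found at depth three but its link to `/about/team/history` is not followed. Lowering `max_pages` to two would stop the same crawl after the home page and `/about`.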
The start page option lets you start a crawl on a different page than the base URL. This can be helpful when neither your base URL nor any of the pages it leads to link to something like your admin menu. In this scenario, you would first crawl without the start page option to find all of your main linked pages, and then start another crawl with a separate start page to get the pages that were missed.
The Include Sub Domain option also adds discovered subdomains to your dashboard. If you select this, keep in mind that these will be added as new websites, separate from this website.
Under add filters you can add whitelist or blacklist filters. A whitelist filter takes the path you enter and only crawls pages whose URLs follow that path. For example, with a whitelist filter for a calendar path, no pages other than calendar pages will be added to your website. The blacklist is the opposite: in this example, no calendar pages would be added to your website, but all other pages on the website would be. You can also add multiple blacklist filters if wanted.
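The filter behavior can be sketched as a simple check on each URL's path. This is an illustration under stated assumptions, not the crawler's actual matching logic: the `is_allowed` function is hypothetical, it assumes a filter matches when the URL path starts with the filter path, and it assumes a blacklist match takes priority if both kinds of filter are present.

```python
from urllib.parse import urlparse

def is_allowed(url, whitelist=(), blacklist=()):
    """Return True if a URL would be added under the given filters."""
    path = urlparse(url).path or "/"
    # Assumption: a filter matches when the path starts with the filter path.
    if any(path.startswith(prefix) for prefix in blacklist):
        return False
    if whitelist:
        # With a whitelist, only matching paths are kept.
        return any(path.startswith(prefix) for prefix in whitelist)
    return True

calendar_url = "https://example.com/calendar/2024/05"
about_url = "https://example.com/about"

only_calendar = [is_allowed(u, whitelist=["/calendar"]) for u in (calendar_url, about_url)]
no_calendar = [is_allowed(u, blacklist=["/calendar"]) for u in (calendar_url, about_url)]
```

Here `only_calendar` is `[True, False]` and `no_calendar` is `[False, True]`, mirroring the calendar example above.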
The crawl rate limit is defined as the number of pages that the crawler will crawl per minute. The default setting of medium is 60 pages per minute, which is adequate for most scenarios. If your website is more fragile and more likely to reach its maximum capacity when a crawler bot scans it that quickly, you should select the low setting.
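As a rough way to reason about the rate limit, the arithmetic below converts a pages-per-minute rate into the average gap between requests and an estimated crawl duration. The function names are hypothetical, and only the medium value of 60 pages per minute comes from the text above; the sketch assumes the crawler spaces its requests evenly.

```python
def seconds_between_pages(pages_per_minute):
    """Average gap between requests at a given crawl rate."""
    return 60.0 / pages_per_minute

def minutes_to_crawl(total_pages, pages_per_minute):
    """Estimated crawl duration, assuming the rate limit is the bottleneck."""
    return total_pages / pages_per_minute

medium_gap = seconds_between_pages(60)       # 1.0 second between pages
medium_duration = minutes_to_crawl(600, 60)  # 10.0 minutes for 600 pages
```

At the medium setting this works out to about one page per second, so a 600-page site would take roughly ten minutes to crawl.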
Once you save your crawler settings and verify that they are as wanted, scroll down to the Pages and Templates widget to start a crawl.