How to Identify and Resolve Scanner Timeouts

Summary: An overview of how to identify and resolve scanner timeouts and pages that are blocked from scanning.

Overview

Background on how Pope Tech Scanners work

In order to crawl or scan your website(s) the Pope Tech servers need access to them. Most of the time this is pretty simple.

Pope Tech scanners access your website in the same way you do (the crawler is a little bit different). The scanner uses a headless Chrome browser and loads the page along with all of its CSS and JavaScript and then evaluates the page for accessibility.

There are some differences between you visiting your website in your browser and Pope Tech scanning your pages:

  • The scanner will scan more pages than a user would visit at a much faster pace (how many times do you visit every page on your website in a few minutes?)
  • When you visit your website, your browser caches assets from previous visits. Our scanner doesn’t, so that it can catch any changes. We do try to cache assets within the same scan, as there is no need to re-download your JavaScript file 1,000 times for 1,000 pages scanned.
  • Our scanners and crawler have unique user agents. A user agent identifies what is accessing a web page, for example Chrome on Windows or Firefox on Android. We use unique user agents for crawling and scanning to make it clear which traffic is coming from Pope Tech. This gives you several advantages, such as filtering Pope Tech traffic out of analytics, configuring A/B testing to send a specific version of your website to Pope Tech, or making sure you let our scanners through your firewall.

Over 99% of websites scanned at the default speed won’t have any issues, but because of these differences Pope Tech is much more likely to be blocked by your website than an actual user.

Why is my scan being blocked/timing out?

The short answer is that the server being scanned has a set of rules for when to block traffic, and something about the current scan or past scans triggered those rules. These rules are set on your servers and can’t be changed by Pope Tech. The other possibility is that your website is offline.

Possible reasons Pope Tech scanners are blocked or time out:

  • your website is offline
  • the same IP address exceeds a threshold for the number of requests allowed in a time period
  • the pattern of requesting multiple pages one after another is flagged as suspicious
  • the page isn’t publicly available and is only available on your internal network
  • all bots are blocked and the Pope Tech user agent triggers a rule to block it

These rules are set on your server infrastructure and can differ significantly between web servers. There isn’t a single way to always avoid triggering them, but there are things you can do to avoid or fix a block.

How to identify whether scans contain timeouts or pages are being blocked

Web servers handle blocking traffic in many different ways. Sometimes they return an error status code in the response, such as 429 (Too Many Requests). Other times they simply block the requesting server with no message, or they return the page with different content.
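As a rough illustration, the three behaviors above can be told apart by looking at the status code and the page title of each response. This is a minimal sketch, not how Pope Tech classifies pages; the function name, the returned labels, and the `expected_title` comparison are assumptions made for this example.

```python
from typing import Optional


def classify_outcome(status: Optional[int], title: Optional[str],
                     expected_title: Optional[str]) -> str:
    """Label one page request (hypothetical helper and labels).

    status is None when the request timed out without any response.
    """
    if status is None:
        # The server never answered: blocked with no message, or offline.
        return "no response (timeout)"
    if status == 429:
        # HTTP 429 Too Many Requests is an explicit rate-limit signal.
        return "rate limited"
    if status >= 400:
        # Any other client or server error status.
        return "error status"
    if expected_title is not None and title != expected_title:
        # The page loaded, but not with the content the site normally serves.
        return "different content"
    return "ok"
```

For example, `classify_outcome(200, "Access Denied", "Home")` falls into the "different content" case, even though the status code alone looks healthy.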

There are two main ways these blocking scenarios show up inside of Pope Tech:

  • Pages process but don’t scan
  • Page titles are all changed to something like, “Access Denied”

Pages process but don’t scan

Sometimes when our scanners try to access a page, nothing happens. The server doesn’t tell us anything is wrong; it just waits and then times out. If this happens on only a few pages, and always the same pages, contact our support team and we can look into it for you. There might be something specific to those pages causing the issue.

If the pages this happens on change each time, then most likely your scan is being rate limited or is potentially even temporarily taking down your web server. If it happens on all pages, your scan might be getting blocked completely, or your website might not be accessible to all public traffic, including our servers.
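The reasoning above can be sketched as a comparison of the not-scanned pages from two or more scans. This is only an illustration under assumptions: the function name, the input shape (one set of page URLs per scan), and the returned messages are hypothetical and are not part of Pope Tech.

```python
from typing import List, Set


def diagnose(failed_per_scan: List[Set[str]], all_pages: Set[str]) -> str:
    """Map the pages that did not scan, per scan, to the cases described above."""
    if not any(failed_per_scan):
        return "no failures"
    if all(failed == all_pages for failed in failed_per_scan):
        return "every page fails: blocked completely or not publicly reachable"
    if len(failed_per_scan) < 2:
        # One scan cannot show whether the failing pages stay the same.
        return "need at least two scans to compare"
    if all(failed == failed_per_scan[0] for failed in failed_per_scan):
        return "same pages fail each time: page-specific issue, contact support"
    return "failing pages change between scans: likely rate limiting or server overload"


pages = {"/", "/about", "/contact"}
print(diagnose([{"/about"}, {"/contact"}], pages))
```

The order of the checks matters: the "every page fails" case is tested first, because a fully blocked site would otherwise also look like "the same pages fail each time".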

In the main menu inside of Pope Tech go to the “Scan” view under the “Accessibility” section.

The “Scan” view has a widget called “Scans” where all scans are listed. The number of pages that didn’t scan is listed next to each scan. If more than 10% of the pages don’t scan successfully, the color will be yellow.

In this example all 444 pages processed but 223 of them didn’t scan. If you open up the scan details by activating the actions button with the arrows, it will list which pages scanned and which ones didn’t.
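To put the numbers from this example against the 10% threshold, here is a small worked calculation. The function name is made up for illustration; only the figures (444 processed, 223 not scanned) and the 10% threshold come from this article.

```python
def not_scanned_share(processed: int, not_scanned: int) -> float:
    """Return the fraction of processed pages that did not scan."""
    if processed <= 0:
        raise ValueError("processed must be a positive page count")
    return not_scanned / processed


share = not_scanned_share(444, 223)
print(f"{share:.1%}")   # 50.2%
print(share > 0.10)     # True: well above the 10% threshold
```

Roughly half of the pages failed in this scan, which is far more than a handful of problem pages and points to rate limiting or blocking.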

The “Not Scanned” tab shows all pages that were processed but didn’t scan. The most common error messages are Timeout or Internal Error, depending on where your server timed out.

Page titles are all changed to something like, “Access Denied”

When Pope Tech scans a page, it updates the stored page title to match the page’s HTML title. This keeps the page title up to date. Sometimes when a server blocks Pope Tech traffic, it serves a page with different content instead. When this happens, the page title usually changes as well.

In this example a web server is blocking Pope Tech traffic, and as a result all of the page titles have been updated to “Access Denied”.

Page titles are shown in Pope Tech when looking at the Scan Details view or editing the website.

The page titles don’t come from Pope Tech but from the HTML page title of the pages, so in your case the text might vary. Many pages suddenly sharing the same block-style title is a clear indication that something is blocking Pope Tech traffic.
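Because the exact wording varies by server, one way to spot this pattern in a list of page titles is to check whether a single title covers almost every page, rather than searching for a fixed phrase. The following is a sketch under assumptions: the function name and the 90% default threshold are hypothetical, and a site that legitimately uses one title everywhere would also be flagged.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def dominant_title(titles: Iterable[str],
                   min_share: float = 0.9) -> Optional[Tuple[str, float]]:
    """Return (title, share) if one title covers at least min_share of pages."""
    # Normalize case and whitespace so "Access Denied " and "access denied" match.
    counts = Counter(t.strip().lower() for t in titles)
    total = sum(counts.values())
    if total == 0:
        return None
    title, count = counts.most_common(1)[0]
    share = count / total
    return (title, share) if share >= min_share else None


titles = ["Access Denied"] * 9 + ["Home"]
print(dominant_title(titles))
```

A result of `None` means no single title dominates, which is what a normally scanned website would be expected to produce.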

Possible solutions to avoid or fix being blocked

First, identify whether all pages are being blocked or just some. Do any pages scan? If some pages scan and others have a page title of “Access Denied”, or simply process and don’t scan, then you know that the Pope Tech IP address and user agent aren’t blocked completely but may be rate limited.

This is also what you would see if the web server went down in the middle of a scan. In either case, slowing down the scan speed and/or scanning fewer websites at a time might help.

How to slow down scans

To slow down the scan speed, try reducing your “Scan Rate Limit” and/or increasing your “Evaluation Delay” in the website settings. For more information on these scanner settings, see the knowledge base article on Scanner Settings.

Allow Pope Tech scanners through firewall

If all pages are being blocked or you want to scan faster and make sure nothing is ever blocked, adding a rule to allow Pope Tech traffic through your firewall is the way to go.

This is typically done by IP address. The Pope Tech crawlers and scanners are configured to use the same IP address to make this easy on your end.

The user agent can also be used, especially if your server is blocking Pope Tech traffic because of the user agent.

Pope Tech IP Address:

Contact the Pope Tech Support team and we can provide you with this.

User agents:

  • PopeTech-ScanBot/1.0 (+https://pope.tech)
  • PopeTech-CrawlBot/1.0 (+https://pope.tech)
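If you handle Pope Tech traffic in your own code (for example to filter it out of analytics), a user agent check could look like the sketch below. The function and constant names are hypothetical. Matching on the product name without the version number is an assumption made so the check does not depend on "1.0"; confirm the exact values with Pope Tech support before relying on it.

```python
# Product tokens taken from the user agents listed above, without the version.
POPE_TECH_AGENT_TOKENS = ("PopeTech-ScanBot", "PopeTech-CrawlBot")


def is_pope_tech_user_agent(user_agent: str) -> bool:
    """Return True if the User-Agent header names a Pope Tech scanner or crawler."""
    return any(token in user_agent for token in POPE_TECH_AGENT_TOKENS)


print(is_pope_tech_user_agent("PopeTech-ScanBot/1.0 (+https://pope.tech)"))  # True
print(is_pope_tech_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))  # False
```

Keep in mind that any client can send any User-Agent header, so for firewall rules a user agent match is best combined with the IP address rather than used on its own.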