SEO managers understand how essential rankings are for any business with an online presence. What Google says, goes. And that has a big impact on your site’s ability to reach your ideal customers organically.

Understanding how and why Google crawls your site is a critical step toward optimizing it and tapping into the potential organic search holds for your business. That’s why SEO managers, and anyone working in the digital field, need a rock-solid grasp of how Google crawls, evaluates, and ranks their websites. That knowledge is the key to building a solid SEO strategy. And while there are plenty of levers to pull in the SEO world that can help you understand how Google is seeing your site in real time, perhaps the most important is a log file analysis.

Let’s first look at how Google’s bots analyze a website. Then, we’ll take a closer look at the log file analysis and how you can perform your first one today.

How does Google crawl your site?

Google is an incredibly far-reaching, amorphous entity that is constantly searching the dustiest corners of the web to document every available site. To keep its database current and its algorithm meeting the needs of users, Google needs to consume and catalog the entire internet regularly.

Doing this requires more manpower than any human workforce could provide. Enter Googlebot. Googlebot is just what it sounds like: a robot (well, a collection of robots). These bots are known as web crawlers, built and used by Google to find and evaluate content all over the world wide web.

When a Googlebot crawls a website, it takes in all of the relevant data it can find: text, pictures, graphics, metadata, header tags, etc. Then, the bot places all of that information in a catalog for your site: a kind of file that Google references when making algorithmic decisions.

Using the information gleaned by its bots, Google evaluates the relevancy of your site and web pages. It does this with a complex and ever-changing algorithm that weighs the usefulness of your site for various search terms. But while the algorithm itself is complex, its purpose is not. Google wants to stay in business, and in the simplest sense, it does that by continuing to answer users’ search queries better than any competitor. By focusing your attention on meeting the needs of your ideal customers on your site, you will fight side by side with Google’s algorithm rather than against it.

Google has a lot to do. Its bots can’t spend all day on your site just because you’d like them to. Google gives your site a limited crawl budget once it locates it, and it is up to you to make the best of that time. Relevance and keyword rankings are determined by these crawls, so be sure that your SEOs know how to maximize the limited time Google allocates to your site.

That pressure on your limited crawl budget is where the log file comes in handy.

What is a log file?

A log file is a file stored on your web hosting server that documents events occurring on that server, which in this case means requests made to your hosted domain. There are different types of log files (error log files, access log files, etc.), but when you run a log file analysis you’ll be looking specifically at the access log file.

Access logs should be enabled on your web hosting server by default, but reach out to your hosting service provider if you want to be sure.
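
To make that concrete, here’s roughly what a single access log entry looks like and how it can be pulled apart. This is a minimal Python sketch assuming the common Apache/NGINX combined log format; the IP address, path, and timestamp are invented for illustration.

```python
import re

# One made-up line in the Apache/NGINX combined log format. The client IP,
# timestamp, URL, and response size are purely illustrative.
sample_line = (
    '66.249.66.1 - - [10/Mar/2024:06:25:14 +0000] '
    '"GET /blog/log-file-analysis HTTP/1.1" 200 5213 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)

# Minimal pattern for the combined format: client IP, timestamp, request
# line, status code, response size, referrer, and user agent.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

match = LOG_PATTERN.match(sample_line)
if match:
    entry = match.groupdict()
    print(entry["ip"], entry["status"], entry["path"])
    print(entry["user_agent"])
```

Every request Googlebot makes to your site leaves a line like this behind, and those lines are exactly what the analysis below digs into.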

What is a log file analysis?

A log file analysis is the investigation of your existing log files, which should provide the insights needed to: 

  1. Understand Googlebot’s priorities and behavior while crawling your site 
  2. Identify any issues Google has crawling the site
  3. Provide an action plan to resolve those issues and optimize your site for prime crawlability

The log file analysis has three steps: data gathering, analysis, and implementation. I’ll walk you through each element to show how each phase feeds into the next.

Gathering the data

Before you begin the log file analysis, you need to be sure you’re looking at the correct data. Use Screaming Frog Log File Analyzer to help you locate the right information. Here’s what to look for: 

  • 1-3 months of access logs from the domain being analyzed: 1-3 months’ worth of past website log file data will give you an idea of Google’s most recent and relevant crawl behavior for your site. If you are using Screaming Frog Log File Analyzer to run the actual analysis as well (which we recommend), you’ll need the access log files to be in one of the following formats:
    • W3C
    • Apache and NGINX
    • Amazon Elastic Load Balancing
    • HAProxy
    • JSON
  • Screaming Frog crawl data: This data will be overlaid with the log file crawl data in order to match up things like rel="canonical" tags, meta robots tags, and other URL-specific data. Having a range of data will help tell a complete story of how Google is crawling your site, leading to more informed recommendations.
  • Google Analytics data: This will also be overlaid with the log file crawl data as a way to see how the most conversion-heavy pages are being crawled by Google. It will also contain session data that will help you understand the implications of Google’s crawls on your site.
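
Before loading everything into the Log File Analyzer, it can be useful to sanity-check the raw logs yourself. Here’s a rough Python sketch that counts Googlebot requests per URL, assuming a combined-format access log saved under the hypothetical filename access.log. Keep in mind that user agents can be spoofed, so treat this as a quick first pass rather than a verified bot report.

```python
from collections import Counter

LOG_FILE = "access.log"  # hypothetical filename; point this at your exported log

googlebot_hits = Counter()
with open(LOG_FILE, encoding="utf-8", errors="replace") as handle:
    for line in handle:
        # Keep only requests whose user agent claims to be Googlebot.
        if "Googlebot" not in line:
            continue
        # In a combined-format line, the first quoted field is the request,
        # e.g. "GET /some-page HTTP/1.1"; the path is its second token.
        try:
            path = line.split('"')[1].split(" ")[1]
        except IndexError:
            continue
        googlebot_hits[path] += 1

# The 20 URLs Googlebot requested most often in this log window.
for path, hits in googlebot_hits.most_common(20):
    print(f"{hits:>6}  {path}")
```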

Once you have gathered the pertinent website logs and data, you’ll be able to move on to the actual analysis.

Analysis

To analyze all this data I use the following toolset:

  • Screaming Frog Log File Analyzer: This is the core tool we use in the log file analysis. Here’s a great intro guide on what this tool is and how to use it.
  • Screaming Frog SEO Spider: This is what we’ll use to extract the URL specific data for the site being crawled.
  • Google Sheets or Excel: This is where we’ll be doing our data manipulation.
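
If you’d rather script the spreadsheet step, the same overlay can be sketched with pandas. The filenames and column names below are assumptions made for illustration; swap in whatever your actual exports contain.

```python
import pandas as pd

# Hypothetical export filenames from the two Screaming Frog tools.
log_data = pd.read_csv("log_file_urls_export.csv")        # per-URL Googlebot hit counts
crawl_data = pd.read_csv("seo_spider_internal_html.csv")  # per-URL on-page data

# Join on the URL so each page's crawl frequency sits next to its status
# code, indexability, and canonical information.
combined = log_data.merge(
    crawl_data,
    left_on="URL",       # assumed column name in the log export
    right_on="Address",  # assumed column name in the SEO Spider export
    how="left",
)

combined.to_csv("log_file_analysis_combined.csv", index=False)
```

From there, Google Analytics data can be merged on the same URL key to layer in sessions and conversions.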

As you execute the log file analysis, here are a few things to look for (a scripted sketch of a few of these checks follows the list):

  • Are there any subfolders being over/under crawled by Googlebot?
    • To find this go to the Screaming Frog Log File Analyzer: Directories, with special attention given to the crawl numbers from Googlebot.
  • Are your focus pages absent from Google’s crawls?
    • To find this go to the Screaming Frog Log File Analyzer: URLs. If you have Screaming Frog SEO Spider data coupled with the log file data you can filter down the HTML data with the view set to ‘Log File.’ From there you can search for the focus pages you want Google to care most about and get a feel for how they are being crawled.
  • Are there slow subfolders being crawled?
    • To find this go to the Screaming Frog Log File Analyzer: Directories. You’ll need to sort by Average Bytes AND Googlebot AND Googlebot Smartphone (descending) so that you can see which subfolders are the slowest.
  • Are any non-mobile friendly subfolders being crawled by Google?
    • To find this go to the Screaming Frog Log File Analyzer: Directories. You’ll need to sort by Googlebot Smartphone in order to see which pages aren’t getting crawled by that particular Googlebot, which could be an indication of a mobile friendliness issue that needs to be addressed.
  • Is Google crawling redundant subfolders?
    • To find this go to the Screaming Frog Log File Analyzer: Directories. As you examine the subfolders listed therein, you should be able to see which directories are redundant and require a solution to effectively deal with them.
  • Are any 4XX/302 pages being crawled by Googlebot?
    • To find this go to the Screaming Frog Log File Analyzer: URLs. Once you identify the broken or redirected pages Googlebot is hitting, you’ll know which ones to prioritize for 301 redirects.
  • Is Google crawling any pages marked with a meta robots noindex tag?
    • To find this go to the Screaming Frog Log File Analyzer: URLs. You’ll need to sort by ‘Indexability,’ then by ‘Googlebot’ and ‘Googlebot Smartphone,’ to get a feel for which pages are marked as noindex but are still getting crawled by Google.
  • Are the rel canonicals correct for heavily crawled pages?
    • To find this go to the Screaming Frog Log File Analyzer: URLs. This is where you can see if the rel canonicals on the pages getting crawled the most have the correct rel canonical URLs.
  • What updates to the robots.txt file/sitemap.xml are needed in order to ensure your crawl budget is being used efficiently?
    • Based on what you find in your analysis, you’ll be able to identify which subfolders or URLs you’ll need to disallow (robots.txt), remove, or include in the sitemap so you’re sending the clearest possible signals to Google regarding which pages you want crawled.
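
A few of these checks, over/under-crawled subfolders and 4XX/302 hits in particular, can also be approximated straight from the raw logs if you want a quick second opinion outside the tool. Here’s a rough Python sketch, again assuming a combined-format access log under a hypothetical filename.

```python
import re
from collections import Counter, defaultdict

LOG_FILE = "access.log"  # hypothetical filename; point this at your exported log

# Minimal pattern for a combined-format log line; adjust if your server
# logs in a different format.
LOG_PATTERN = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<user_agent>[^"]*)"'
)

hits_by_folder = Counter()
statuses_by_folder = defaultdict(Counter)

with open(LOG_FILE, encoding="utf-8", errors="replace") as handle:
    for line in handle:
        match = LOG_PATTERN.match(line)
        if not match or "Googlebot" not in match["user_agent"]:
            continue
        # Treat the first path segment as the subfolder, e.g. "/blog/post"
        # becomes "/blog/"; top-level pages are grouped under "/".
        segments = match["path"].lstrip("/").split("/")
        folder = f"/{segments[0]}/" if len(segments) > 1 else "/"
        hits_by_folder[folder] += 1
        statuses_by_folder[folder][match["status"]] += 1

# Subfolders ranked by Googlebot hits, with a status-code breakdown so
# wasted crawls on 4XX and 302 pages stand out.
for folder, hits in hits_by_folder.most_common():
    print(f"{hits:>6}  {folder}  {dict(statuses_by_folder[folder])}")
```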

Implementation

In answering these questions you’ll gain valuable insights into what may be holding back your website’s performance and how you can improve it. But the journey doesn’t stop there. Once you have these insights, you must work to implement them. You’ll want to build out a list of the items that need tackling, a plan for how you’ll implement those changes, and a plan to improve the crawlability of your site going forward.

Some of the items we’d recommend you focus on include:

  • Configuring and improving how Google crawls your site
    • Using the robots.txt file to disallow sections of the site that Google is spending time on but that don’t need to be crawled
  • Identifying additional technical SEO fixes for the site
    • Updating meta robots tags to better reflect where you would like Google to focus its crawl budget
  • Broken pages
    • Building 301 redirects for 404 pages that Googlebot is consistently hitting
  • Duplicate content
    • Building a content consolidation game plan for redundant pages that Google is splitting its crawl budget on
    • This game plan would involve mapping out which duplicate/redundant pages (and even subfolders) should either be redirected or have their content folded into the main pages being leveraged in the site’s keyword targeting strategy
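
For example, the robots.txt piece of that plan can be as small as a couple of disallow rules. The subfolder names below are hypothetical placeholders; your own analysis will tell you which sections actually deserve them.

```
# Illustrative robots.txt additions; the subfolders named here are made up.
# Each Disallow line tells compliant crawlers to skip that section of the site.
User-agent: *
Disallow: /internal-search/
Disallow: /print-versions/
```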

Once this list of recommended changes has been built, you’ll need to work with your web development team to prioritize your next steps. I recommend rating each item on a scale of 1-5 across three categories (a small scoring sketch follows the list):

  • Difficulty to implement
  • Turn-around time
  • Potential for SEO yield
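
There’s no single right way to combine those three ratings, but here’s one possible scoring sketch in Python. The weighting and the example items are assumptions, not a fixed rule; adjust them to match how your team actually trades these factors off.

```python
# Each fix gets three 1-5 ratings: difficulty, turnaround time, SEO yield.
# The example items and ratings below are purely illustrative.
fixes = [
    ("301 redirects for broken pages Googlebot keeps hitting", 2, 1, 5),
    ("robots.txt disallows for redundant subfolders", 1, 1, 4),
    ("Content consolidation for duplicate pages", 4, 4, 4),
]

def priority_score(difficulty, turnaround, seo_yield):
    # Reward high SEO yield; penalize high difficulty and long turnaround.
    return seo_yield * 2 - difficulty - turnaround

# Highest score first: quick, high-impact fixes rise to the top.
for item, difficulty, turnaround, seo_yield in sorted(
    fixes, key=lambda fix: priority_score(*fix[1:]), reverse=True
):
    print(f"{priority_score(difficulty, turnaround, seo_yield):>3}  {item}")
```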

Once the priority has been established, you’ll work with your web development team to implement these fixes in a manner that works best for their development cycles.

Ready for some results?

Sounds like a lot of work, but it’s worth it. To show you just how important this analysis can be, here’s a brief case study that demonstrates the impact a log file analysis can have on an SEO strategy.

During a recent client engagement, we were working to increase e-commerce transactions brought in from Google organic traffic.

We began the journey as we generally do, by performing a series of technical audits. As we examined Google Search Console, we noticed some indexation irregularities. Specifically, pages were missing from Google’s index, and overall coverage of the site was incomplete. This is a common symptom of a crawlability issue.

So, we ran a log file analysis to identify ways we could improve how Google crawls the site. The findings included:

  • A number of redundant subfolders being crawled by Google
  • Broken pages missed in our initial site audit that needed to be redirected
  • Various subfolders that Google was spending time crawling but that didn’t actually play a role in our SEO keyword ranking strategy

We created an action plan based on these findings, and worked with a web development team to ensure they were addressed.

Once the log file findings were implemented, we saw the following results (comparing 30 days with the previous 30 days):

  • E-commerce transactions increased by 25%
  • E-commerce conversion rate increased by 19%
  • Google organic e-commerce revenue increased by 25%

As with all SEO strategies, it’s important to make sure Google acknowledges the changes you’re making and rewards your site accordingly. Running a log file analysis is one of the best ways you can make sure this happens, regardless of the other technical SEO fixes you are implementing.