What is a content audit?
A content audit for the purpose of SEO includes a full inventory of all indexable content on a domain, which is then analyzed using performance metrics from a variety of sources to determine which content to keep as-is, which to improve, and which to remove or consolidate.
What is the purpose of a content audit?
A content audit can have many purposes and desired outcomes. In terms of SEO, they are often used to determine the following:
- How to escape a content-related search engine ranking filter or penalty
- Content that requires copy writing/editing for improved quality
- Content that needs to be updated and made more current
- Content that should be consolidated due to overlapping topics
- Content that should be removed from the site
- The best way to prioritize the editing or removal of content
- Content gap opportunities
- Which content is ranking for which keywords
- Which content should be ranking for which keywords
- The strongest pages on a domain and how to leverage them
- Undiscovered content marketing opportunities
- Due diligence when buying/selling websites or onboarding new clients
While each of these desired outcomes and insights is a valuable result of a content audit, I would define the overall “purpose” of one as:
The purpose of a content audit for SEO is to improve the perceived trust and quality of a domain, while optimizing crawl budget and the flow of PageRank (PR) and other ranking signals throughout the site.
Often, but not always, a big part of achieving these goals involves the removal of low-quality content from search engine indexes. I’ve been told people hate this word, but I prefer the “pruning” analogy to describe the concept.
How to do a content audit
Just like anything in SEO, from technical and on-page changes to site migrations, things can go horribly wrong when content audits aren’t conducted properly. The most common example would be removing URLs that have external links because link metrics weren’t analyzed as part of the audit. Another common mistake is confusing removal from search engine indexes with removal from the website.
Content audits start with taking an inventory of all content available for indexation by search engines. This content is then analyzed against a variety of metrics and given one of three “Action” determinations. The “Details” of each Action are then expanded upon.
The variety of combinations of options between the “Action” of WHAT to do and the “Details” of HOW (and sometimes why) to do it are as varied as the strategies, sites, and tactics themselves. Below are a few hypothetical examples:
You now have a basic overview of how to perform a content audit. More specific instructions can be found below.
The process can be roughly split into three distinct phases:
- Inventory & audit
- Analysis & recommendations
- Summary & reporting
The inventory & audit phase
Taking an inventory of all content, and related metrics, begins with crawling the site.
One difference between crawling for content audits and technical audits:
Technical SEO audit crawls are concerned with all crawlable content (among other things).
Content audit crawls for the purpose of SEO are concerned with all indexable content.
All of this is changing rapidly, though. URLs as the unique identifier in Google’s index are probably going away. Yes, we’ll still have URLs, but not everything requires them. So far, the words “content” and “URL” have been mostly interchangeable. But some URLs contain an entire application’s worth of content. How to do a content audit in that world is something we’ll have to figure out soon, but only after Google figures out how to organize the web’s information in that same world. From the looks of things, we still have a year or two.
Until then, the process below should handle most situations.
Step 1: Crawl all indexable URLs
A good place to start on most websites is a full Screaming Frog crawl. However, some indexable content might be missed this way, so don’t rely on a crawler as the only source of indexable URLs.
In addition to the crawler, collect URLs from Google Analytics, Google Search Console (formerly Google Webmaster Tools), XML sitemaps, and, if possible, from an internal database, such as an export of all product and category URLs on an eCommerce website. These can then be crawled in “list mode” separately, then added to your main list of URLs and deduplicated to produce a more comprehensive list of indexable URLs.
Some URLs found via GA, XML sitemaps, and other non-crawl sources may not actually be “indexable.” These should be excluded. One strategy that works here is to combine and deduplicate all of the URL “lists,” and then perform a crawl in list mode. Once crawled, remove all URLs with robots meta noindex tags or X-Robots-Tag noindex headers, as well as any URL returning an error code and those blocked by the robots.txt file. At this point, you can safely add these URLs to the file containing indexable URLs from the crawl. Once again, deduplicate the list.
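If you would rather script the combine, deduplicate, and filter routine than do it by hand, a minimal Python sketch might look like the one below. The CrawlRow fields and both function names are hypothetical stand-ins for whatever your list-mode crawl export contains, and treating only 200 responses as indexable is an assumption that holds on most sites but not all.

```python
from dataclasses import dataclass
from urllib.parse import urldefrag


@dataclass
class CrawlRow:
    """One row of a hypothetical list-mode crawl export."""
    url: str
    status_code: int
    noindex: bool            # robots meta or X-Robots-Tag noindex was found
    blocked_by_robots: bool  # disallowed by robots.txt


def merge_url_lists(*url_lists):
    """Combine several URL lists, dropping #fragments and exact duplicates."""
    seen = set()
    merged = []
    for urls in url_lists:
        for raw in urls:
            url, _fragment = urldefrag(raw.strip())
            if url and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


def indexable_urls(rows):
    """Keep URLs that returned 200 and are neither noindexed nor blocked."""
    return [
        row.url
        for row in rows
        if row.status_code == 200
        and not row.noindex
        and not row.blocked_by_robots
    ]


# Usage: merge the sources first, crawl the result in list mode, then filter.
crawl = ["https://example.com/", "https://example.com/a"]
analytics = ["https://example.com/a", "https://example.com/old#top"]
sitemap = ["https://example.com/b ", "https://example.com/old"]
candidates = merge_url_lists(crawl, analytics, sitemap)
```

The order of the merged list follows the order of the sources, so putting the crawl first keeps crawled URLs at the top of the file.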
Crawling roadblocks & new technologies
Crawling very large websites
First and foremost, you do not need to crawl every URL on the site. Be concerned with indexable content. This is not a technical SEO audit.
Crawling dynamic mobile sites
This refers to a specific type of mobile setup in which there are two codebases, one for mobile and one for desktop, but only one URL. Thus, the content of a single URL may vary significantly depending on which type of device is visiting that URL. In such cases, you will essentially be performing two separate content audits. Proceed as usual for the desktop version. Below are instructions for crawling the mobile version.
Step 2: Gather additional metrics
Most crawlers will give you the URL and various on-page metrics and data, such as the titles, descriptions, meta tags, and word count. In addition to these, you’ll want to know about internal and external links, traffic, content uniqueness, and much more in order to make fully informed recommendations during the analysis portion of the content audit project.
Your process may vary, but we generally try to pull in everything we need using as few sources as possible. URL Profiler is a great resource for this purpose, as it works well with Screaming Frog and integrates easily with all of the APIs we need.
Once the Screaming Frog scan is complete (only crawling indexable content), export the “Internal All” file, which can then be used as the seed list in URL Profiler (combined with any additional indexable URLs found outside of the crawl via GSC, GA, and elsewhere).
Once URL Profiler is finished, you should end up with something like this:
The risk of getting analytics data from a third-party tool
We’ve noticed odd data mismatches and sampled data when using the method above on large, high-traffic websites. Our internal process involves exporting these reports directly from Google Analytics, sometimes incorporating Analytics Canvas to get the full, unsampled data from GA. VLOOKUP formulas are then used in the spreadsheet to combine the data, with the URL as the unique identifier.
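For anyone who prefers a script to spreadsheet formulas, the same lookup-by-URL join can be sketched in a few lines of Python. This is only an illustration: the function name, the "url" key, and the metric column names are assumptions, and it presumes both exports write URLs in the same form (for example, both as full URLs rather than one as paths).

```python
def join_on_url(crawl_rows, analytics_rows, default=0):
    """Left-join analytics metrics onto crawl rows, keyed by the "url" field.

    Crawl rows with no analytics match get `default` for every metric,
    instead of the #N/A a spreadsheet lookup would show.
    """
    lookup = {row["url"]: row for row in analytics_rows}
    metric_names = sorted(
        {key for row in analytics_rows for key in row if key != "url"}
    )
    joined = []
    for row in crawl_rows:
        match = lookup.get(row["url"], {})
        merged = dict(row)
        for name in metric_names:
            merged[name] = match.get(name, default)
        joined.append(merged)
    return joined


crawl_rows = [
    {"url": "https://example.com/a", "title": "A"},
    {"url": "https://example.com/b", "title": "B"},
]
analytics_rows = [{"url": "https://example.com/a", "organic_sessions": 120}]
joined = join_on_url(crawl_rows, analytics_rows)
```

Filling missing metrics with zero is a design choice: it keeps sorting and summing simple, but it also hides the difference between "no traffic" and "not in the export," so spot-check a few unmatched URLs.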
Metrics to pull for each URL:
- Indexed or not?
  - If crawlers are set up properly, all URLs should be “indexable.”
  - A non-indexed URL is often a sign of an uncrawled or low-quality page.
- Content uniqueness
  - Copyscape, Siteliner, and now URL Profiler can provide this data.
- Traffic from organic search
  - Typically 90 days
  - Keep a consistent timeframe across all metrics.
- Revenue and/or conversions
  - You could view this by “total,” or by segmenting to show only revenue from organic search on a per-page basis.
- Publish date
  - If you can get this into Google Analytics as a custom dimension prior to fetching the GA data, it will help you discover stale content.
- Internal links
  - Content audits provide the perfect opportunity to tighten up your internal linking strategy by ensuring the most important pages have the most internal links.
- External links
- Landing pages resulting in low time-on-site
  - Take this one with a grain of salt. If visitors found what they want because the content was good, that’s not a bad metric. A better proxy for this would be scroll depth, but that would probably require setting up a scroll-tracking “event.”
- Landing pages resulting in low pages-per-visit
  - Just like with time-on-site, sometimes visitors find what they’re looking for on a single page. This is often true for high-quality content.
- Response code
  - Typically, only URLs that return a 200 (OK) response code are indexable. You may not require this metric in the final data if that’s the case on your domain.
- Canonical tag
  - Typically only URLs with a self-referencing rel=“canonical” tag should be considered “indexable.” You may not require this metric in the final data if that’s the case on your domain.
- Page speed and mobile-friendliness
  - Again, URL Profiler comes through with their Google PageSpeed Insights API integration.
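The response code and canonical tag rules in the list above are easy to check in bulk. Below is a rough Python sketch of that check; the function name and row fields are hypothetical, and the exact string comparison assumes your crawler reports canonical URLs in the same form as the crawled URL.

```python
def is_indexable(url, status_code, canonical):
    """True when the URL returns 200 and its canonical tag points to itself."""
    if status_code != 200:
        return False
    return canonical is not None and canonical.strip() == url


rows = [
    {"url": "https://example.com/a", "status_code": 200,
     "canonical": "https://example.com/a"},
    {"url": "https://example.com/a?sort=price", "status_code": 200,
     "canonical": "https://example.com/a"},
    {"url": "https://example.com/gone", "status_code": 404,
     "canonical": None},
]
keep = [
    row["url"]
    for row in rows
    if is_indexable(row["url"], row["status_code"], row["canonical"])
]
```

Here only the first URL survives: the second is canonicalized to another page and the third returns an error. Pages with no canonical tag at all are dropped by this sketch, which follows the "self-referencing" rule strictly; loosen that if your site omits the tag on otherwise indexable pages.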
Before you begin analyzing the data, be sure to drastically improve your mental health and the performance of your machine by taking the opportunity to get rid of any data you don’t need. Here are a few things you might consider deleting right away (after making a copy of the full data set, of course).
Things you don’t need when analyzing the data
Hopefully by now you’ve made a significant dent in reducing the overall size of the file and time it takes to apply formatting and formula changes to the spreadsheet. It’s time to start diving into the data.
The analysis & recommendations phase
Here’s where the fun really begins. In a large organization, it’s tempting to have a junior SEO do all of the data-gathering up to this point. I find it useful to perform the crawl myself, as the process can be highly informative.
Step 3: Put it all into a dashboard
Even after removing unnecessary data, performance could still be a major issue, especially if working in Google Sheets. I prefer to do all of this in Excel, and only upload into Google Sheets once it’s ready for the client. If Excel is running slow, consider splitting up the URLs by directory or some other factor in order to work with multiple, smaller spreadsheets.
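Splitting by directory is quick to script. The sketch below groups URLs by their first path segment so each group can be saved as its own spreadsheet; the function name and the "(root)" label are my own placeholders, not part of any tool.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def split_by_directory(urls):
    """Group URLs by first path segment; top-level pages go under "(root)"."""
    groups = defaultdict(list)
    for url in urls:
        path = urlsplit(url).path
        segments = [segment for segment in path.split("/") if segment]
        # /blog/ and /blog/post-1 both belong to "blog"; /about has no directory.
        if len(segments) > 1 or (segments and path.endswith("/")):
            key = segments[0]
        else:
            key = "(root)"
        groups[key].append(url)
    return dict(groups)


urls = [
    "https://example.com/",
    "https://example.com/about",
    "https://example.com/blog/",
    "https://example.com/blog/post-1",
    "https://example.com/products/widget",
]
groups = split_by_directory(urls)
```

Each key in groups can then be written out as a separate file, which keeps any one spreadsheet small enough to filter and sort without lag.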
Creating a dashboard can be as easy as adding two columns to the spreadsheet. The first new column, “Action,” should be limited to three options, as shown below. This makes filtering and sorting data much easier. The “Details” column can contain freeform text to provide more detailed instructions for implementation.
Use Data Validation and a drop-down selector to limit Action options.
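If the dashboard is ever processed by a script, the same restriction can be enforced in code. This is a small Python sketch, assuming the three Action labels are “Keep,” “Improve,” and “Remove”; “Improve” and “Remove” appear in the example report later in this article, while “Keep” is my shorthand for keep as-is.

```python
from enum import Enum


class Action(Enum):
    KEEP = "Keep"
    IMPROVE = "Improve"
    REMOVE = "Remove"


def parse_action(value):
    """Return the matching Action, or raise ValueError for anything else."""
    cleaned = value.strip().lower()
    for action in Action:
        if action.value.lower() == cleaned:
            return action
    allowed = [action.value for action in Action]
    raise ValueError(f"Action must be one of {allowed}, got {value!r}")


# Usage: normalize a hand-typed cell before filtering or counting.
action = parse_action(" remove ")
```

Anything outside the three options raises an error instead of slipping through, which is the scripted equivalent of the drop-down selector.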
Step 4: Work the content audit dashboard
All of the data you need should now be right in front of you. This step can’t be turned into a repeatable process for every content audit. From here on, the actual step-by-step process becomes much more open to interpretation and your own experience. You may do some of the things described below and not others. You may do them a little differently. That’s all fine, as long as you’re working toward the goal of determining what to do, if anything, for each piece of content on the website.
A good place to start would be to look for any content-related issues that might cause an algorithmic filter or manual penalty to be applied, thereby dragging down your rankings.
Causes of content-related penalties
These typically fall under three major categories: quality, duplication, and relevancy. Each category can be further broken down into a variety of issues, which are detailed below.
It helps to sort the data in various ways to see what’s going on. Below are a few different things to look for if you’re having trouble getting started.
Taking the hatchet to bloated websites
For big sites, it’s best to use a hatchet-based approach as much as possible, and finish up with a scalpel in the end. Otherwise, you’ll spend way too much time on the project, which eats into the ROI.
This is not a process that can be documented step-by-step. For the purpose of illustration, however, below are a few different examples of hatchet approaches and when to consider using them.
As you can see from the many examples above, sorting by “Page Type” can be quite handy when applying the same Action and Details to an entire section of the website.
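One way to picture the hatchet approach is as a short list of rules that stamp the same Action and Details onto every URL in a section. The Python sketch below is purely illustrative: the path prefixes, the rule text, and the function name are all hypothetical, and URL prefixes stand in for whatever “Page Type” column your dashboard uses.

```python
from urllib.parse import urlsplit

# Hypothetical rules: (path prefix, Action, Details). The first match wins.
RULES = [
    ("/tag/", "Remove", "Noindex tag archives; keep them on the site."),
    ("/products/", "Improve", "Replace manufacturer descriptions with unique copy."),
]


def apply_hatchet(urls, rules=RULES):
    """Assign Action and Details by section; unmatched URLs are left blank."""
    results = []
    for url in urls:
        path = urlsplit(url).path
        action, details = "", ""
        for prefix, rule_action, rule_details in rules:
            if path.startswith(prefix):
                action, details = rule_action, rule_details
                break
        results.append({"url": url, "action": action, "details": details})
    return results


rows = apply_hatchet([
    "https://example.com/tag/red",
    "https://example.com/products/widget",
    "https://example.com/about",
])
```

URLs that match no rule come back with a blank Action, which leaves them for the scalpel pass rather than guessing.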
After all of the tool set-up, data gathering, data cleanup, and analysis across dozens of metrics, what matters in the end is the Action to take and the Details that go with it.
URL, Action, and Details: These three columns will be used by someone to implement your recommendations. Be clear and concise in your instructions, and don’t make decisions without reviewing all of the wonderful data-points you’ve collected.
Here is a sample content audit spreadsheet to use as a template, or for ideas. It includes a few extra tabs specific to the way we used to do content audits at Inflow.
As Razvan Gavrilas pointed out in his post on Cognitive SEO from 2015, without doing the research above you risk pruning valuable content from search engine indexes. Be bold, but make highly informed decisions:
Content audits allow SEOs to make informed decisions on which content to keep indexed “as-is,” which content to improve, and which to remove.
The reporting phase
The content audit dashboard is exactly what we need internally: a spreadsheet crammed with data that can be sliced and diced in so many useful ways that we can always go back to it for more insight and ideas. Some clients appreciate that as well, but most are going to find the greater benefit in our final content audit report, which includes a high-level overview of our recommendations.
Counting actions from Column B
It is useful to count the quantity of each Action along with total organic search traffic and/or revenue for each URL. This will help you (and the client) identify important metrics, such as total organic traffic for pages marked to be pruned. It will also make the final report much easier to build.
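A pivot table handles this in a spreadsheet. For a scripted version, here is a minimal Python sketch of the same tally, assuming each dashboard row has "action", "organic_sessions", and "revenue" fields (hypothetical column names).

```python
from collections import defaultdict


def summarize_actions(rows):
    """Return page count, organic sessions, and revenue totals per Action."""
    summary = defaultdict(
        lambda: {"pages": 0, "organic_sessions": 0, "revenue": 0.0}
    )
    for row in rows:
        bucket = summary[row["action"]]
        bucket["pages"] += 1
        bucket["organic_sessions"] += row.get("organic_sessions", 0)
        bucket["revenue"] += row.get("revenue", 0.0)
    return dict(summary)


rows = [
    {"url": "/a", "action": "Remove", "organic_sessions": 3, "revenue": 0.0},
    {"url": "/b", "action": "Remove", "organic_sessions": 0, "revenue": 0.0},
    {"url": "/c", "action": "Improve", "organic_sessions": 40, "revenue": 50.0},
    {"url": "/d", "action": "Keep", "organic_sessions": 900, "revenue": 25.0},
]
summary = summarize_actions(rows)
```

The organic_sessions total for the “Remove” bucket is the number to watch: it shows how much organic traffic lands on pages marked to be pruned.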
Step 5: Writing up the report
Your analysis and recommendations should be delivered at the same time as the audit dashboard. The report summarizes the findings, recommendations, and next steps from the audit, and should start with an executive summary.
Here is a real example of an executive summary from one of Inflow’s content audit strategies:
As a result of our comprehensive content audit, we are recommending the following, which will be covered in more detail below:
Removal of about 624 pages from Google index by deletion or consolidation:
- 203 Pages were marked for Removal with a 404 error (no redirect needed)
- 110 Pages were marked for Removal with a 301 redirect to another page
- 311 Pages were marked for Consolidation of content into other pages
- Followed by a redirect to the page into which they were consolidated
Rewriting or improving of 668 pages
- 605 Product Pages are to be rewritten due to use of manufacturer product descriptions (duplicate content), these being prioritized from first to last within the Content Audit.
- 63 “Other” pages to be rewritten due to low-quality or duplicate content.
Keeping 226 pages as-is
- No rewriting or improvements needed
These changes reflect an immediate need to “improve or remove” content in order to avoid an obvious content-based penalty from Google (e.g. Panda) due to thin, low-quality and duplicate content, especially concerning Representative and Dealers pages with some added risk from Style pages.
The content strategy should end with recommended next steps, including action items for the consultant and the client. Below is a real example from one of our documents.
We recommend the following three projects in order of their urgency and/or potential ROI for the site:
Project 1: Remove or consolidate all pages marked as “Remove”. Detailed instructions for each URL can be found in the “Details” column of the Content Audit Dashboard.
Project 2: Copywriting to improve/rewrite content on Style pages. Ensure unique, robust content and proper keyword targeting.
Project 3: Improve/rewrite all remaining pages marked as “Improve” in the Content Audit Dashboard. Detailed instructions for each URL can be found in the “Details” column.