TF-IDF stands for term frequency–inverse document frequency. It is a term-weighting formula that is often used as part of semantic analysis; it is not another name for semantic analysis itself. Simply put, TF-IDF can help you understand what topics you should include in your content marketing pieces if you intend for them to rank. Let’s break TF-IDF into its two parts in order to better understand the formula.
What is TF? The first portion of TF-IDF is TF, or term frequency. Term frequency counts the number of times a given word appears in an article. For instance, if an article mentions the word “swimming” four times, the term frequency of “swimming” in that article is four. Now say you wanted to search through a stack of articles for the one that is most relevant to “how to swim.” Using term frequency alone, you quickly run into issues, because it only counts how often each term is used. The most frequent terms in almost any article are common stop words like “the,” “and,” or “a,” and the query words “how” and “to” are themselves common in nearly every article. These counts give no insight into which article in the stack is the most relevant to “how to swim.” This is where the IDF portion, or inverse document frequency, becomes very valuable.
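To make the counting concrete, here is a minimal Python sketch of term frequency. The article text and the function name term_frequency are invented for illustration, and the word splitting is deliberately simple.

```python
import re
from collections import Counter

def term_frequency(text):
    """Count how many times each lowercased word appears in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# Hypothetical article text, made up for this example.
article = (
    "Swimming is fun. The best way to learn swimming is to get in the pool. "
    "The water and the kickboard make swimming easier, "
    "and swimming is the best exercise."
)

tf = term_frequency(article)
print(tf["swimming"])      # occurrences of the topic word
print(tf.most_common(3))   # the most frequent words overall
```

In this sample, “the” is counted more often than “swimming,” which is exactly the stop-word problem described above.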
What is IDF? IDF stands for inverse document frequency. This portion of the formula diminishes the weight of terms that appear in many of the documents in a set and raises the weight of terms that appear in only a few. This is helpful because it devalues common stop words like “the.” If we apply this type of analysis to a search engine results page (SERP), we can quickly understand which terms and topics are viewed as related to the core keyword that was searched. For instance, if I want to understand what topics Google deems related to the keyword “how to swim,” I will run a TF-IDF analysis on the pages that rank in the top 10 positions for that keyword. Through my analysis I will find that the topics “kickboard,” “freestyle,” and “goggles” all have a high TF-IDF score. This tells me that these topics are typically discussed on pages that rank for the keyword “how to swim.” To make my article more relevant to “how to swim,” and to give it a better chance of ranking on the first page, I should include these topics in my article.
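As a rough illustration of how the two parts combine, the sketch below uses one common variant of the formula: a term’s score in a document is its count in that document multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. Other variants exist, and the three sample “pages” and the function names are assumptions made up for this example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercased words."""
    return re.findall(r"[a-z']+", text.lower())

def tf_idf(documents):
    """Return one {term: score} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return scores

# Hypothetical page snippets, invented for illustration.
pages = [
    "the kickboard helps you practice the kick",
    "the freestyle stroke is the first stroke to learn",
    "the goggles keep the water out",
]

scores = tf_idf(pages)
print(scores[0]["the"])        # appears on every page, so its weight drops to zero
print(scores[0]["kickboard"])  # appears on one page only, so it keeps a positive weight
```

Because “the” appears in every sample page, log(N / df) is log(1), which is zero, so the stop word stops dominating the results.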
Putting It Together
Let’s go through one more example to further illustrate the benefits of TF-IDF and fully answer the question originally asked: “What is TF-IDF?” Say we want to create content that will rank for the phrase “effects of caffeine.” You and I know that caffeine has a lot of effects. What we want to know before we begin content creation is which topics are most commonly discussed in the articles that Google ranks for that phrase. Running a TF-IDF analysis on those articles shows that, in order to make our article relevant to the keyword “effects of caffeine,” we should include topics such as “irritability,” “coffee,” and caffeine's effects on “blood pressure.”
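One way to sketch this kind of analysis in Python is to total each term’s count across the ranking pages and weight it by an IDF computed over those pages plus some unrelated “background” pages. Both the background set and the sample texts are assumptions made for illustration; this is not necessarily how any particular TF-IDF tool works.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercased words."""
    return re.findall(r"[a-z']+", text.lower())

def top_terms(ranking_pages, background_pages, k=3):
    """Rank terms by total count in ranking_pages, weighted by IDF
    computed over ranking_pages plus background_pages."""
    all_pages = [tokenize(p) for p in ranking_pages + background_pages]
    n_docs = len(all_pages)

    # Document frequency across every page we have.
    df = Counter()
    for tokens in all_pages:
        df.update(set(tokens))

    # Term frequency summed over the ranking pages only.
    tf = Counter()
    for page in ranking_pages:
        tf.update(tokenize(page))

    scores = {t: c * math.log(n_docs / df[t]) for t, c in tf.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical snippets, invented for illustration.
ranking = [
    "caffeine can raise blood pressure and cause irritability",
    "coffee contains caffeine and caffeine can cause irritability",
]
background = [
    "the garden can grow and bloom in spring",
    "a recipe can be quick and easy",
    "trains can be late and crowded",
]

result = top_terms(ranking, background)
print(result)
```

Words such as “can” and “and” appear in every sample page and score zero, while terms concentrated in the ranking pages rise to the top of the list.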
Including topics that Google already appears to treat as important can help you create more complete content and improve its chances of ranking well in Google results.
Here at 97th Floor, we have used the TF-IDF technique for many clients' new and existing content and have seen dramatic increases in relevancy and ranking of the articles for which we've applied the analysis. Keep in mind that TF-IDF is most effective when paired with a solid content strategy.
For a great content marketing strategy that pairs well with TF-IDF, see Brian Dean’s discussion on the skyscraper technique.
For an introduction to running a TF-IDF analysis on your own, see 97th Floor’s TF-IDF presentation, which walks through the process using completely free tools.
Feel free to reach out to me at firstname.lastname@example.org with any questions or for more information.