TF-IDF, short for term frequency–inverse document frequency, is a term-weighting formula commonly used in semantic analysis of text. Simply put, TF-IDF can help you understand what topics to include in your content marketing pieces if you intend for them to rank. We’ll break the TF-IDF formula into two segments and give a few TF-IDF examples to show how it works.
What is TF?
The first portion of TF-IDF is the TF portion, or Term Frequency. Term frequency counts the number of times a given word appears in an article. For instance, if I have an article that mentions the word “swimming” four times, the term frequency of the word “swimming” is four.
Now, say you wanted to search through a stack of pages for the one that is most relevant for “how to swim.” Searching for the term “swimming” alone would bring up many documents related to all aspects of swimming rather than just “how to swim,” so you’d instead want to look at the term frequency of each word in the phrase: “how,” “to,” and “swim.”
Solely using term frequency, however, you’d quickly see issues with this method as well. The most frequent terms in the analysis would be common stop words like “the,” “and,” or “a.” This makes sense, since these are the most-used words in any document. However, these words aren’t specific, and they won’t tell you which article in the stack is most relevant for “how to swim.” It’s at this point that the IDF portion of the TF-IDF formula, or Inverse Document Frequency, becomes very valuable.
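To make the stop-word problem concrete, here is a minimal Python sketch of a raw term frequency count. The sample article text and the `term_frequency` helper are made up for illustration, and the tokenizer is deliberately simple (lowercase words only).

```python
import re
from collections import Counter

def term_frequency(text):
    """Count how many times each lowercase word appears in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# Hypothetical article text, written only for this example.
article = (
    "Swimming is fun. The first step in swimming is to float. "
    "The next step is to kick, and the last step is the stroke. "
    "Swimming takes practice, and swimming in the pool is the easiest way to start."
)

tf = term_frequency(article)
print(tf["swimming"])      # 4
print(tf.most_common(3))   # [('the', 6), ('is', 5), ('swimming', 4)]
```

Even in this short sample, “the” and “is” outrank “swimming,” which is exactly the issue described above.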
The IDF portion of the formula diminishes the weight of terms that appear in many of the documents being analyzed, rather than terms that are merely frequent within a single text. This is helpful because it devalues common stop words such as “the” or “and,” which show up in nearly every document, and uncovers the topically revealing terms within each document. Returning to our earlier example, this would allow you to quickly find the article most related to “how to swim.” Taking the analysis to the next level, you’d also be able to determine which other terms are often used within articles about “how to swim.”
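Here is a rough sketch of how the two portions combine. It assumes one common textbook variant of the formula, where a term’s score in a document is its count multiplied by log(N / df), with N the number of documents and df the number of documents containing the term. Real tools often add smoothing or normalization, so treat the numbers as illustrative. The three sample documents and the `tf_idf` helper are made up for this example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z']+", text.lower())

def tf_idf(documents):
    """Return one {term: score} dict per document, using count * log(N / df)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: count * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical stack of pages.
docs = [
    "how to swim: learn to float and kick in the pool",
    "the history of the olympic games and the marathon",
    "how to bake bread in the oven and the best flour to buy",
]

scores = tf_idf(docs)
print(scores[0]["the"])   # 0.0, because "the" appears in every document
print(scores[0]["swim"] > scores[0]["to"])   # True
```

Words that appear in every document, like “the” and “and,” score zero, while a word unique to the first page, like “swim,” rises to the top.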
This added insight would be very useful if you needed to write a piece of content about swimming and wanted it to rank for the search term “how to swim.” Running a TF-IDF analysis on a group of similar, already-published articles would reveal terms that these articles use frequently, such as “kickboard,” “freestyle,” or “goggles.” Then, you could deduce that it might be important for you to include those terms in your own piece of content.
If you apply this type of analysis to a Search Engine Result Page (SERP), you can quickly understand which terms and topics are viewed as related to the core keyword that was searched. It’s a good idea to run a TF-IDF analysis for the pages ranking in the top 10 positions of the desired SERP. Then you’ll get the results that Google (or your chosen search engine) has already deemed most significant. For our search term “how to swim,” perhaps you find that “float,” “kick,” and “stroke” all have high TF-IDF scores. In order to ensure that your article is relevant and worthy to rank on the first page for “how to swim,” you would want to include those topics in your article.
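One way to sketch this kind of SERP analysis in Python is below. It rests on several assumptions that are ours, not a description of any particular tool: the text of the ranking pages has already been collected, a handful of unrelated reference pages is available so that IDF has something to compare against, and a term’s importance is its length-normalized TF-IDF averaged across the ranking pages. The `suggest_terms` helper and all sample text are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z']+", text.lower())

def suggest_terms(ranking_pages, reference_pages, top_n=5):
    """Average each term's TF-IDF over the ranking pages.

    IDF is computed over ranking and reference pages together, so words
    shared by every page (stop words) drop to zero.
    """
    ranking = [tokenize(page) for page in ranking_pages]
    everything = ranking + [tokenize(page) for page in reference_pages]
    n_docs = len(everything)
    df = Counter(term for tokens in everything for term in set(tokens))
    totals = defaultdict(float)
    for tokens in ranking:
        for term, count in Counter(tokens).items():
            # Divide by page length so long pages do not dominate.
            totals[term] += (count / len(tokens)) * math.log(n_docs / df[term])
    averages = {term: total / len(ranking) for term, total in totals.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)[:top_n]

# Stand-ins for pages ranking for "how to swim".
ranking_pages = [
    "learn to float first then kick and add a stroke",
    "a good kick and a steady stroke help you float",
    "float on the back and practice the kick and stroke",
]
# Stand-ins for unrelated pages used as a baseline.
reference_pages = [
    "the best bread needs flour and a hot oven",
    "you can learn to play the guitar and practice chords",
    "a long marathon and a steady pace",
]

for term, score in suggest_terms(ranking_pages, reference_pages, top_n=3):
    print(term, round(score, 4))
```

With this sample data, “float,” “kick,” and “stroke” come out on top because every ranking page uses them and none of the reference pages do. On real pages you would want far more text and a larger baseline before trusting the ordering.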
Putting It Together
Let’s walk through a final TF-IDF example to further illustrate its benefits, and fully answer the question originally asked: “What is TF-IDF?” as well as “Why should I use TF-IDF?”
Data backs all good decisions. And the data that TF-IDF brings to your content decisions can make or break your content’s success. Without the specifics you glean from this new information, the content that you create will not be as directed or targeted. Intuition, in this case, is not enough. The results of your TF-IDF analysis may surprise you, and they may give your content the purpose it lacked.
Let’s dive into our last example. Say you want to create content that will rank for the phrase “effects of caffeine.” You and I know there are a lot of effects of caffeine. But what you’d want to know before you began creating that content is which topics are most commonly discussed among the articles that Google ranks for the phrase “effects of caffeine.” Running a TF-IDF analysis might show that in order to make your article fully relevant to the keyword “effects of caffeine,” you should include topics such as “irritability,” “coffee,” and caffeine’s effects on “blood pressure” within your text.
Shaping your document around the topics that Google has already deemed important will not only help you create better content, but also help you rank better in Google’s results.
Here at 97th Floor, we’ve seen the TF-IDF technique bring dramatic increases in relevancy and ranking of clients’ new and existing content. It’s a process we always use when creating or optimizing content, especially if we are intentionally pivoting our strategy. We use our own patent-pending software, Palomar, to perform these analyses. However, there are many tools available to businesses wanting to push their SEO to the next level.
Running a TF-IDF analysis is worth the effort. TF-IDF is an essential part of any SEO strategy, and indeed any digital marketing strategy. Why? Because content rules the world right now, and TF-IDF is decidedly most effective when paired with a solid content strategy.
For a great content marketing strategy that pairs well with TF-IDF, see Brian Dean’s discussion on the skyscraper technique.
And for an introduction to running a TF-IDF analysis on your own, see 97th Floor’s TF-IDF presentation. This presentation will walk you through how to run a TF-IDF analysis using completely free tools.