The TF-IDF framework and the Shanghai Dragon knowledge that can be derived from it


This article first gives a summary of the TF-IDF framework as presented in Zhang Junlin's book "This Is the Search Engine". Because the original is long, I outline only what I consider the key points here. The overview may have shortcomings, so readers who want more detail are advised to consult the original book.

Frequency factor (TF)

When calculating the frequency factor, different formulas can be adopted depending on the starting point. The simplest approach is to use the raw word frequency directly: for example, if a word occurs 5 times in a document, its TF value is 5.
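As a minimal sketch of the raw word frequency approach, the following Python snippet counts how often each word occurs in one document. The function name `raw_tf` and the sample document are hypothetical and chosen only for illustration; splitting on whitespace is also an assumption, since the text does not say how documents are tokenized.

```python
from collections import Counter


def raw_tf(tokens):
    """Return a mapping from each word to its raw count in the document.

    Hypothetical helper: the simplest TF is just the occurrence count.
    """
    return Counter(tokens)


# Illustrative document, tokenized by whitespace (an assumption).
doc = "apple banana apple cherry apple banana apple apple".split()
tf = raw_tf(doc)

print(tf["apple"])   # "apple" occurs 5 times, so its TF value is 5
print(tf["banana"])  # "banana" occurs 2 times
```

A `Counter` returns 0 for words that never appear, which matches the intuition that an absent word has a TF of 0.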

When a user searches for a word in a search engine, the engine looks the word up in its index library, runs a matching calculation against the documents, and pulls out a number of the documents most relevant to the word to take part in the subsequent ranking calculation. The quantitative measure of "most relevant" here is called "weight", and for the vast majority of search engines the TF*IDF framework is an important part of the weight calculation. It considers two main factors: the term frequency TF and the inverse document frequency IDF.
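The matching step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the text here does not give an IDF formula, so the sketch assumes the common form IDF = log2(N / df), where N is the number of documents and df is the number of documents containing the word. The function name `tf_idf_scores` and the tiny document collection are hypothetical, and a real search engine combines this weight with many other signals.

```python
import math
from collections import Counter


def tf_idf_scores(query_word, documents):
    """Rank documents for a single query word by TF * IDF.

    Assumptions (not from the source text): raw counts are used for TF,
    and IDF is taken as log2(N / df).
    """
    n_docs = len(documents)
    # df: number of documents that contain the query word at least once.
    df = sum(1 for doc in documents if query_word in doc)
    if df == 0:
        return []  # The word is not in the index, so nothing matches.
    idf = math.log2(n_docs / df)

    scores = []
    for doc_id, doc in enumerate(documents):
        tf = Counter(doc)[query_word]
        if tf > 0:
            scores.append((doc_id, tf * idf))
    # Highest weight first: these documents go on to the ranking stage.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical mini index of four tokenized documents.
docs = [
    "search engine ranking search".split(),
    "engine index".split(),
    "cooking recipes".split(),
    "search tips".split(),
]
ranked = tf_idf_scores("search", docs)
print(ranked)
```

In this toy collection the word "search" appears in 2 of 4 documents, so the document that uses it twice receives a higher weight than the one that uses it once.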


(Note: whether to write "TF-IDF" or "TF*IDF" is a matter of habit; the book uses TF*IDF. The two notations do not mean different things.)


This is a popular-science article on the TF-IDF framework that underlies search engine ranking. It is not the kind of content occasionally seen on the Internet that speaks in generalities or interprets things out of context, but a summary of practical knowledge that combines search engine theory with many examples and observations. It may be relatively difficult to understand, but I believe the time spent understanding it is absolutely worth it.

This variant takes the Log of the TF value as the frequency weight. For example, if a word appears 4 times in a document, its frequency factor weight is 3 (this result corresponds to a base-2 logarithm). The number 1 in the formula W = 1 + log(TF) is used for smoothing: if the TF value is 1, its Log value is 0, so a word that actually appeared once in the document would be treated as if it had never appeared. Adding 1 avoids this situation. The reason for taking the Log of the frequency is the following consideration: even if a word appears 10 times, its feature weight should not be 10 times larger than that of a word that appears once, so the Log is added to suppress this excessive difference.
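The effect of the smoothed logarithm can be checked with a short sketch. The function name `log_tf_weight` is hypothetical, and the base-2 logarithm is an assumption inferred from the example in which 4 occurrences give a weight of 3. Returning 0 for a word that does not occur at all is also an assumption, since the logarithm is undefined there.

```python
import math


def log_tf_weight(tf):
    """Return W = 1 + log2(TF) for a word that occurs tf times.

    Assumptions: base-2 logarithm (matches TF = 4 giving W = 3) and a
    weight of 0 for words that do not occur in the document.
    """
    if tf <= 0:
        return 0.0
    return 1.0 + math.log2(tf)


for tf in (1, 4, 10):
    print(tf, log_tf_weight(tf))

# A single occurrence keeps a non-zero weight thanks to the +1 smoothing,
# and 10 occurrences weigh far less than 10 times a single occurrence.
```

With this formula, one occurrence gives a weight of 1 rather than 0, and ten occurrences give a weight of a little over 4 rather than 10.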

The TF factor represents the word frequency, that is, the number of times a word occurs in a document. In general, the more often a word appears in a document, the more relevant the document is to that word, and the greater the weight the word should be given.


A variant of the frequency factor formula is: W = 1 + log(TF)

The main reason for writing this article is that a later series of articles, "Shanghai dragon practice", will refer to some of this content, so the basic theory is written up here first to avoid taking up space in those articles.

TF-IDF overview
