This is a summary of the TF-IDF framework from Zhang Junlin's "This Is the Search Engine". Because the original is long, I only outline the key points here. An overview like this may have shortcomings, so for more details I recommend reading the original book.

Frequency factor (TF)

The frequency factor can be calculated with different formulas, depending on the starting point. The simplest way is to use the word frequency directly: if a word occurs 5 times in a document, its TF value is 5.
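As a minimal sketch of the raw word frequency described above, the snippet below counts how often each word occurs in a document. The function name `raw_term_frequency` and the toy document are made up for illustration, and the text is assumed to be already split into words.

```python
from collections import Counter


def raw_term_frequency(tokens):
    """Return a mapping from each word to its raw count in the document."""
    return Counter(tokens)


# Hypothetical toy document, already split into words.
doc = "search engine index search rank search query search search".split()
tf = raw_term_frequency(doc)

print(tf["search"])  # 5: the word occurs 5 times, so its TF value is 5
print(tf["engine"])  # 1
```

A `Counter` returns 0 for words that never occur, which matches the idea that an absent word has a TF value of 0.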

When a user searches for a word, the search engine matches the query against the documents in its index, picks out a number of the documents most relevant to the word, and passes them on to the follow-up ranking calculation. This "most relevant" notion is quantified by an index called "weight", and for the vast majority of search engines the TF*IDF framework is an important part of the weight calculation. It considers two main factors: the term frequency (TF) and the inverse document frequency (IDF).

A Log-based variant takes the Log of the TF value as the frequency weight. For example, with a base-2 Log, a word that appears 4 times in the document gets a frequency factor weight of 3. The number 1 in the formula is used for smoothing: if the TF value is 1, the Log value is 0, so a word that actually appeared once would be treated as if it had never appeared in the document. Adding 1 avoids this situation. The reason for taking the Log of the frequency is the following: even if a word appears 10 times, its feature weight should not be 10 times that of a word that appears once, so the Log is added to suppress this excessive growth.

The TF factor represents the term frequency, that is, the number of times a word occurs in a document. In general, the more often a word occurs in a document, the more relevant the document is to that word, so the word should be given a higher weight.

When calculating the specific frequency factor, one variant of the formula is: W = 1 + log(TF)
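The formula can be sketched in Python as follows. This is only an illustration under two assumptions that the formula itself does not state: the Log is taken to base 2 (which is what makes 4 occurrences give a weight of 3), and a word that does not occur at all is given a weight of 0. The function name `log_tf_weight` is made up for the example.

```python
import math


def log_tf_weight(tf):
    """Frequency factor W = 1 + log2(TF).

    Assumption: base-2 Log, and a weight of 0 for words that never occur.
    """
    if tf <= 0:
        return 0.0
    return 1.0 + math.log2(tf)


print(log_tf_weight(1))   # 1.0: a single occurrence still gets a non-zero weight
print(log_tf_weight(4))   # 3.0
print(log_tf_weight(10))  # well below 10, so frequent words are damped
```

Without the +1, a word that occurs once would get a weight of 0 and be indistinguishable from a word that never occurs.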

TF-IDF overview

