What is TF-IDF? (opens in new tab)

(michaelerasm.us)

63 pointshelium10y ago13 comments

13 comments

Lucene is moving away from TF-IDF to BM25 as the default. Pretty similar idea, but tends to performs a better with short content.

https://issues.apache.org/jira/browse/LUCENE-6789

https://en.wikipedia.org/wiki/Okapi_BM25

In the very limited test cases where I've compared them it hasn't mattered much, but other's results are pretty compelling.

https://www.elastic.co/blog/found-bm-vs-lucene-default-simil...

meeper1610y ago

Vector Space replacing or being combined with TF-IDF approaches is new way of summarizing and searching for meaning in documents...

http://52.11.1.7/TuataraSum/example_context_control-ml2.html

gibrown10y ago

Interesting. This basically uses the background word2vec data for the entire Web to provide more information and help with things like disambiguation, synonyms, etc? Am I understanding that correctly?

Maybe nit-picky thought, but its not clear to me that the TF-IDF part is what's doing a lot of extra lifting there.

Do you know of any good evaluations between using vector space data and other methods for summarization?

1 more reply

rohwer10y ago

Translate IDF to "how uncommon is this word in the corpus?"

TF-IDF is acronym soup, but mathematically simple: IDF is a scalar applied to a term's frequency. And in the comparison, the numerator is the document overlap score and the denominator is the square root of the two documents. For more, Stanford's natural language processing course is the bee's knees: https://class.coursera.org/nlp/lecture/preview

nathell10y ago

TF-IDF solves an important problem and it's good to know about.

However, in some applications, such as Latent Semantic Analysis (LSA) and its generalizations, there are practical alternatives such as log-entropy [1] that I've found to work better in practice.

[1]: http://link.springer.com/article/10.3758%2FBF03203370#page-1

rhema10y ago

Here's an interesting demo I made where you can type or paste in words to get a sense of their IDF ( http://tpoem.com/test/dict/test_dictionary.html ).

meeper1610y ago

It's also used in AI-based document summarization systems that are worth millions e.g.

Yahoo Paid $30 Million in Cash for 18 Months of Young Summly http://allthingsd.com/20130325/yahoo-paid-30-million-in-cash...

Google Buys Wavii For North Of $30 Million http://techcrunch.com/2013/04/23/google-buys-wavii-for-north...

yannyu10y ago

Also used extensively in Lucene based search engines such as Solr and Elastic. Most companies running search use a Lucene-based search engine.

https://lucene.apache.org/

http://lucene.apache.org/solr/

https://www.elastic.co/

wyldfire10y ago

Is this similar to the concept used by Amazon's "statistically improbable phrases" (word-based instead of n-gram based)?

EDIT: according to SO, yes: http://stackoverflow.com/a/2009546/489590

dangerlibrary10y ago

Yes! Although ...

"Wait a minute. Strike that. Reverse it. Thank you."

TF-IDF is old, and very cool. n-gram based extensions of it are a bit newer, but are likely implemented in almost exactly the same way. N-grams just require a lot more compute power because your corpus grows faster than a plain ol' bag of words.

languagehacker10y ago

Nice job explaining a fundamental IR algorithm.

j / k navigate · click thread line to collapse

13 comments

gibrown10y ago

Lucene is moving away from TF-IDF to BM25 as the default. Pretty similar idea, but tends to performs a better with short content.

https://issues.apache.org/jira/browse/LUCENE-6789

https://en.wikipedia.org/wiki/Okapi_BM25

In the very limited test cases where I've compared them it hasn't mattered much, but other's results are pretty compelling.

https://www.elastic.co/blog/found-bm-vs-lucene-default-simil...

meeper1610y ago

Vector Space replacing or being combined with TF-IDF approaches is new way of summarizing and searching for meaning in documents...

http://52.11.1.7/TuataraSum/example_context_control-ml2.html

gibrown10y ago

Maybe nit-picky thought, but its not clear to me that the TF-IDF part is what's doing a lot of extra lifting there.

Do you know of any good evaluations between using vector space data and other methods for summarization?

1 more reply

rohwer10y ago

Translate IDF to "how uncommon is this word in the corpus?"

nathell10y ago

TF-IDF solves an important problem and it's good to know about.

However, in some applications, such as Latent Semantic Analysis (LSA) and its generalizations, there are practical alternatives such as log-entropy [1] that I've found to work better in practice.

[1]: http://link.springer.com/article/10.3758%2FBF03203370#page-1

rhema10y ago

Here's an interesting demo I made where you can type or paste in words to get a sense of their IDF ( http://tpoem.com/test/dict/test_dictionary.html ).

meeper1610y ago

It's also used in AI-based document summarization systems that are worth millions e.g.

Yahoo Paid $30 Million in Cash for 18 Months of Young Summly http://allthingsd.com/20130325/yahoo-paid-30-million-in-cash...

Google Buys Wavii For North Of $30 Million http://techcrunch.com/2013/04/23/google-buys-wavii-for-north...

yannyu10y ago

Also used extensively in Lucene based search engines such as Solr and Elastic. Most companies running search use a Lucene-based search engine.

https://lucene.apache.org/

http://lucene.apache.org/solr/

https://www.elastic.co/

wyldfire10y ago

Is this similar to the concept used by Amazon's "statistically improbable phrases" (word-based instead of n-gram based)?

EDIT: according to SO, yes: http://stackoverflow.com/a/2009546/489590

dangerlibrary10y ago

Yes! Although ...

"Wait a minute. Strike that. Reverse it. Thank you."

languagehacker10y ago

Nice job explaining a fundamental IR algorithm.

j / k navigate · click thread line to collapse