COMP 479

Resources

Videos:

Links:

Notes

Definitions:

  • Token: a token is a sequence of characters cut out of a document. For example, from “Mr. O’Neil is a nice man”, possible tokens include “Mr.”, “Mr”, “O”, “Neil”, “O’Neil”, etc., depending on how the text is split. Any character sequence that appears in the document can be a token.
  • Type: a type is the class of all tokens made of the same character sequence. So there is one type “Mr.”, which can occur in the document many times, each occurrence being a separate token.
  • Term: a term is an actual item that makes it into the final dictionary. So whereas you might have a token “automatic” in your document, it might be represented in the dictionary by the term “automat” after processing, stemming, etc.
  • Term Frequency (tf): How many times a particular term appears in a particular document.
  • Collection Frequency (cf): How many times a particular term appears in the whole collection, counting repeats within a document.
  • Document Frequency (df): How many documents contain a particular term at least once.
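  • tf-idf: Term Frequency x Inverse Document Frequency. Increases as the term becomes more frequent in the document and as fewer documents contain the term. The two factors offset each other (i.e. a high tf is offset by a high document frequency, which means a low idf). The maximum tf-idf occurs when both tf and idf are high, i.e. the term is frequent in the document but appears in few documents in the collection.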
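
Example: the definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the course's reference implementation: the three-document collection is made up, the tokenizer simply lowercases and splits on whitespace, no stemming is applied (so every type is used as a term), and idf is taken as log10(N / df), which is one common weighting among several.

```python
import math
from collections import Counter

# Hypothetical toy collection: each string is one document.
docs = [
    "the cat sat on the mat",
    "the dog sat",
    "a cat and a dog",
]

def tokenize(text):
    # Assumed tokenizer: lowercase, then split on whitespace.
    return text.lower().split()

tokenized = [tokenize(d) for d in docs]

# Tokens are individual occurrences; types are the distinct strings among them.
tokens_doc0 = tokenized[0]       # 6 tokens ("the" occurs twice)
types_doc0 = set(tokens_doc0)    # 5 types

# Term frequency: one counter per document.
tf = [Counter(tokens) for tokens in tokenized]

# Collection frequency: occurrences summed over every document.
cf = Counter(tok for tokens in tokenized for tok in tokens)

# Document frequency: number of documents containing the term at least once.
df = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Assumed variant: log10(N / df). Requires df[term] > 0.
    return math.log10(len(docs) / df[term])

def tf_idf(term, doc_index):
    return tf[doc_index][term] * idf(term)

print(tf[0]["the"], cf["the"], df["the"])
print(tf_idf("the", 0), tf_idf("mat", 0))
```

In this toy collection “the” has the higher tf in the first document, but it appears in two of the three documents, so its idf is low. “mat” appears once, and only in that document, so it ends up with the higher tf-idf.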