ZCTextIndex is the current built-in full-text index for the Zope ZCatalog. It supplanted the standard TextIndex as of Zope 2.6.0. It supports advanced relevance ranking, globbing, boolean operators, phrase matching, and a pluggable lexicon that can be extended with additional text-processing features.
Lexicon
The lexicon is a supporting object which derives the words for one or more ZCTextIndexes. You must create at least one ZCTextIndex Lexicon object before you can create a ZCTextIndex.
Pipeline
Source text to be indexed passes through a pipeline of processors that can affect the indexed text in various ways. At a minimum, the source text is passed through a splitter, which divides the text on word boundaries. Query text is processed by the same pipeline. The pipeline is configured for a lexicon when the lexicon is created.
- Splitter: Two splitters are available with Zope, a simple whitespace splitter, and an HTML-aware splitter which removes HTML markup from the source text while splitting.
- Case Normalizer: Enable this for case-insensitive searches. Disable it for case-sensitive searches.
- Stop Word Remover: Removes common English words, such as articles, which rarely help narrow a search. Because such words match most or all documents, they also slow searches down. You can also choose to remove single-letter words.
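The pipeline idea can be sketched in plain Python. This is only an illustration of the concept, not the ZCTextIndex source: the class names, the `process` method, the `run_pipeline` helper and the tiny stop word set are all assumptions made for this example, and the HTML handling is a deliberately naive regular expression.

```python
import re


class WhitespaceSplitter:
    """Hypothetical splitter: break each string on runs of whitespace."""

    def process(self, items):
        words = []
        for text in items:
            words.extend(text.split())
        return words


class HTMLSplitter:
    """Hypothetical splitter: drop anything that looks like a tag, then split."""

    def process(self, items):
        words = []
        for text in items:
            text = re.sub(r"<[^>]*>", " ", text)  # naive tag removal
            words.extend(re.findall(r"\w+", text))
        return words


class CaseNormalizer:
    """Lowercase every word so searches become case-insensitive."""

    def process(self, words):
        return [w.lower() for w in words]


class StopWordRemover:
    """Drop stop words and, optionally, single-character words."""

    def __init__(self, stop_words, remove_single_chars=False):
        self.stop_words = set(stop_words)
        self.remove_single_chars = remove_single_chars

    def process(self, words):
        kept = []
        for w in words:
            if w in self.stop_words:
                continue
            if self.remove_single_chars and len(w) == 1:
                continue
            kept.append(w)
        return kept


def run_pipeline(pipeline, text):
    """Feed text through each element in order; the first one is the splitter."""
    items = [text]
    for element in pipeline:
        items = element.process(items)
    return items


# Illustrative stop word set; a real lexicon would use a much longer list.
PIPELINE = [HTMLSplitter(), CaseNormalizer(), StopWordRemover({"the", "a", "an", "of"})]

indexed = run_pipeline(PIPELINE, "<p>The Art of <b>Indexing</b></p>")  # ['art', 'indexing']
query = run_pipeline(PIPELINE, "the ART")                               # ['art']
```

Because the query goes through the same elements as the source text, "ART" in a query and "Art" in a document end up as the same word.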
Relevance Ranking
There are two algorithms for scoring documents with ZCTextIndex:
- Okapi BM25 Rank: This is the best general-purpose choice when the search text is much shorter than the indexed text. This is usually true for human-generated queries, unless your users have diarrhea of the keyboard or something ;^)
- Cosine Rule: This is the classic algorithm from the book Managing Gigabytes. It works best when the query length is close to the length of the documents indexed. This makes it useful for finding documents that are "similar" to another document.
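To give a feel for the difference between the two, here is a rough sketch of both scoring styles in plain Python. It uses a common textbook form of BM25 and a raw term-frequency cosine; it is not the ZCTextIndex code, and the function names, the `k1` and `b` defaults and the sample documents are assumptions made for this example.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document (a list of words) against a short query.

    docs maps a document id to its list of words. Documents that
    contain none of the query terms are left out of the result.
    """
    n_docs = len(docs)
    avg_len = sum(len(words) for words in docs.values()) / n_docs
    doc_freq = Counter()
    for words in docs.values():
        doc_freq.update(set(words))  # count each word once per document

    scores = {}
    for doc_id, words in docs.items():
        tf = Counter(words)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Term frequency saturates, and long documents are penalized.
            length_norm = 1 - b + b * len(words) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        if score > 0:
            scores[doc_id] = score
    return scores


def cosine_similarity(words_a, words_b):
    """Cosine of the angle between two raw term-frequency vectors."""
    a, b = Counter(words_a), Counter(words_b)
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


DOCS = {
    1: ["zope", "catalog", "index"],
    2: ["zope", "zope", "text", "index", "ranking", "text"],
    3: ["python", "web"],
}

short_query = bm25_scores(["zope", "text"], DOCS)   # documents 1 and 2 get scores
similar = cosine_similarity(DOCS[1], DOCS[2])       # whole document used as the query
```

The BM25 function treats the query as a handful of terms, while the cosine function compares two complete word vectors, which is why the latter suits "find documents like this one" searches.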
Boolean Operators
- and: document must match the terms on both sides (the default if no operator is specified).
- or: document may match the term on either side.
- and not: document matches the term on the left but not the term on the right. Note that not alone is not a legal operator.
Terms may be single words, phrases, or multi-word operations in parentheses.
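One way to picture the operators is as set operations on the document ids that match each term. The sketch below is only a mental model, not how ZCTextIndex parses or evaluates queries; the `POSTINGS` table and the helper names are made up for the example.

```python
# Hypothetical postings: word -> ids of the documents containing it.
POSTINGS = {
    "zope": {1, 2, 3, 4},
    "catalog": {1, 3},
    "index": {2, 3},
}


def docs_for(term):
    """Ids of the documents containing a single word."""
    return POSTINGS.get(term, set())


def op_and(left, right):
    """and: the document must be in both result sets."""
    return left & right


def op_or(left, right):
    """or: the document may be in either result set."""
    return left | right


def op_and_not(left, right):
    """and not: in the left result set but not in the right one."""
    return left - right


# zope and (catalog or index)
grouped = op_and(docs_for("zope"), op_or(docs_for("catalog"), docs_for("index")))

# zope and not index
excluded = op_and_not(docs_for("zope"), docs_for("index"))

# catalog index  (no operator given, so "and" is assumed)
implicit = op_and(docs_for("catalog"), docs_for("index"))
```

In this model, "and not" is a set difference, so it always needs a left-hand result set to subtract from.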
Phrase Matching
Phrases are denoted by placing the words in double quotes. A phrase match also results if the words are separated by punctuation but not space (e.g. 2.7 or 10,000).
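The punctuation case follows naturally if the splitter breaks "2.7" into two adjacent words and the search then requires them to appear next to each other. The sketch below illustrates that idea only; the `split_words` and `phrase_positions` helpers are hypothetical and do not reflect how ZCTextIndex stores or matches phrases.

```python
import re


def split_words(text):
    """Hypothetical splitter: lowercase words, punctuation acts as a boundary."""
    return re.findall(r"\w+", text.lower())


def phrase_positions(doc_words, phrase_words):
    """Return every position where phrase_words occurs as a contiguous run."""
    n = len(phrase_words)
    if n == 0:
        return []
    return [
        i
        for i in range(len(doc_words) - n + 1)
        if doc_words[i:i + n] == phrase_words
    ]


doc = split_words("Zope 2.7 was released")          # ['zope', '2', '7', 'was', 'released']
hits = phrase_positions(doc, split_words("2.7"))    # [1]

# The same two words, not adjacent, do not form a phrase match.
other = split_words("2 of 7")
misses = phrase_positions(other, split_words("2.7"))  # []
```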
Resources
- CJKSplitter
- ZCTextIndex splitter that works with Chinese, Japanese, and Korean text
- FieldedTextIndex
- A new text index derived from ZCTextIndex for indexing content with multiple fields/schemas
- How to create a Lexicon and ZCTextIndex using Python
- You probably won't guess this right the first time
- Other Asian language splitters
- Anyone used these or care to comment?
- Source Code (in CVS)
- Yes, I'm fluent in more than 6 million forms of communication