# Copyright 2007 Matt Chaput. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
#    1. Redistributions of source code must retain the above copyright notice,
#       this list of conditions and the following disclaimer.
#
#    2. Redistributions in binary form must reproduce the above copyright
#       notice, this list of conditions and the following disclaimer in the
#       documentation and/or other materials provided with the distribution.
#
# THIS SOFTWARE IS PROVIDED BY MATT CHAPUT ``AS IS'' AND ANY EXPRESS OR
# IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
# MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
# EVENT SHALL MATT CHAPUT OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
# INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA,
# OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
# LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
# NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE,
# EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
# The views and conclusions contained in the software and documentation are
# those of the authors and should not be interpreted as representing official
# policies, either expressed or implied, of Matt Chaput.
  27. """Classes and functions for turning a piece of text into an indexable stream
  28. of "tokens" (usually equivalent to words). There are three general classes
  29. involved in analysis:
  30. * Tokenizers are always at the start of the text processing pipeline. They take
  31. a string and yield Token objects (actually, the same token object over and
  32. over, for performance reasons) corresponding to the tokens (words) in the
  33. text.
  34. Every tokenizer is a callable that takes a string and returns an iterator of
  35. tokens.
  36. * Filters take the tokens from the tokenizer and perform various
  37. transformations on them. For example, the LowercaseFilter converts all tokens
  38. to lowercase, which is usually necessary when indexing regular English text.
  39. Every filter is a callable that takes a token generator and returns a token
  40. generator.
  41. * Analyzers are convenience functions/classes that "package up" a tokenizer and
  42. zero or more filters into a single unit. For example, the StandardAnalyzer
  43. combines a RegexTokenizer, LowercaseFilter, and StopFilter.
  44. Every analyzer is a callable that takes a string and returns a token
  45. iterator. (So Tokenizers can be used as Analyzers if you don't need any
  46. filtering).
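
For example (a rough sketch; the exact output depends on the default tokenizer
pattern and stop list)::

    from whoosh.analysis import RegexTokenizer, LowercaseFilter, StandardAnalyzer

    tokenizer = RegexTokenizer()
    # A tokenizer is called with a string and yields Token objects. Pull out
    # the text right away, because the same Token object is reused.
    [t.text for t in tokenizer(u"This is a test")]
    # -> ["This", "is", "a", "test"]

    # A filter is called with a token generator and returns a token generator.
    [t.text for t in LowercaseFilter()(tokenizer(u"This is a test"))]
    # -> ["this", "is", "a", "test"]

    # An analyzer packages a tokenizer and filters behind a single call.
    [t.text for t in StandardAnalyzer()(u"This is a test")]
    # -> ["test"]  (lowercased, stop words removed)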
You can compose tokenizers and filters together using the ``|`` character::

    my_analyzer = RegexTokenizer() | LowercaseFilter() | StopFilter()

The first item must be a tokenizer and the rest must be filters (you can't put
a filter first or a tokenizer after the first item).
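
Calling the composed analyzer works like calling any other analyzer (a sketch;
the exact tokens depend on the stop list in use)::

    [t.text for t in my_analyzer(u"This is a test")]
    # -> ["test"]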
  51. """
# Re-export the public analysis API (tokenizers, filters, and analyzers) at
# the package level.
from whoosh.analysis.acore import *
from whoosh.analysis.tokenizers import *
from whoosh.analysis.filters import *
from whoosh.analysis.morph import *
from whoosh.analysis.intraword import *
from whoosh.analysis.ngrams import *
from whoosh.analysis.analyzers import *
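# With these re-exports in place, users can import the analysis classes
# directly from this package, e.g. (a sketch):
#
#   from whoosh.analysis import StandardAnalyzer
#   ana = StandardAnalyzer()
#   words = [t.text for t in ana(u"This is a test")]   # -> ["test"]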