DragonFly On-Line Manual Pages



WordList(3)           DragonFly Library Functions Manual           WordList(3)

NAME

WordList - abstract class to manage and use an inverted index file.

SYNOPSIS

#include <mifluz.h>

WordContext context;

WordList* words = context->List();

delete words;

DESCRIPTION

WordList is the mifluz equivalent of a database handler. Each WordList object is bound to an inverted index file and implements the operations to create it, fill it with word occurrences and search for an entry matching a given criterion.

WordList is an abstract class and cannot be instantiated. The List method of the class WordContext creates an instance using the appropriate derived class, either WordListOne or WordListMulti. Refer to the corresponding manual pages for more information on their specific semantics.

When doing bulk insertions, mifluz creates temporary files that contain the entries to be inserted in the index. Those files are typically named indexC00000000, indexC00000001 and so on. The maximum size of a temporary file is wordlist_cache_size / 2. When the maximum size of the temporary file is reached, mifluz creates another temporary file named indexC00000001, and continues this way until it has created 50 temporary files. At this point it merges all temporary files into one that replaces the first, indexC00000000. It then starts creating temporary files again and keeps following this algorithm until the bulk insertion is finished.

When the bulk insertion is finished, mifluz has one big file named indexC00000000 that contains all the entries to be inserted in the index. mifluz inserts all the entries from indexC00000000 into the index and deletes the temporary file when done. The insertion is fast since all the entries in indexC00000000 are already sorted.

The parameter wordlist_cache_max can be used to prevent the temporary files from growing indefinitely. If the total cumulated size of the indexC* files grows beyond this parameter, they are merged into the main index and deleted. For instance, setting this parameter to 500Mb guarantees that the total size of the indexC* files will not grow above 500Mb.
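The rollover and merge behaviour of the temporary files can be modelled with a small simulation. This is only an illustrative sketch, not mifluz code: BatchSimulation and its members are hypothetical names, files are represented by their sizes alone, and the model assumes the merge happens at the moment a 51st file would otherwise be needed.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical model of the indexC* temporary files; not part of mifluz.
struct BatchSimulation {
    std::size_t file_limit;          // wordlist_cache_size / 2
    std::size_t max_files;           // 50 in the description above
    std::vector<std::size_t> files;  // files[0] stands for indexC00000000, etc.
    std::size_t merges = 0;          // number of merges performed so far

    BatchSimulation(std::size_t cache_size, std::size_t max_file_count)
        : file_limit(cache_size / 2), max_files(max_file_count) {
        assert(file_limit > 0 && max_files > 0);
    }

    // Merge every temporary file into a single one that replaces the first.
    void Merge() {
        const std::size_t total =
            std::accumulate(files.begin(), files.end(), std::size_t{0});
        files.assign(1, total);
        ++merges;
    }

    // Record one entry of entry_bytes bytes in the current temporary file.
    void Insert(std::size_t entry_bytes) {
        // Start a new file when there is none or the current one is full.
        if (files.empty() || files.back() >= file_limit) {
            // Assumption: merge when the file count limit has been reached.
            if (files.size() == max_files) Merge();
            files.push_back(0);
        }
        files.back() += entry_bytes;
    }

    // End of the bulk insertion: one big file remains; return its size.
    std::size_t Finish() {
        if (files.size() > 1) Merge();
        return files.empty() ? 0 : files.front();
    }
};
```

With a cache size of 200 bytes and 100-byte entries, every entry fills one file; the 51st call to Insert triggers a merge and leaves two files, the merged one and a fresh one.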

CONFIGURATION

For more information on the configuration attributes and a complete list of attributes, see the mifluz(3) manual page.

wordlist_extend {true|false} (default false)
    If true, maintain a reference count of unique words. The Noccurrence method gives access to this count.

wordlist_verbose <number> (default 0)
    Set the verbosity level of the WordList class.
    1 walk logic
    2 walk logic details
    3 walk logic lots of details

wordlist_page_size <bytes> (default 8192)
    Berkeley DB page size (see Berkeley DB documentation).

wordlist_cache_size <bytes> (default 500K)
    Berkeley DB cache size (see Berkeley DB documentation). Cache makes a huge difference in performance. It must be at least 2% of the expected total data size. Note that if compression is activated the data size is eight times larger than the actual file size. In this case the cache must be scaled to 2% of the data size, not 2% of the file size. See Cache tuning in the mifluz guide for more hints. See WordList(3) for the rationale behind cache file handling.

wordlist_cache_max <bytes> (default 0)
    Maximum size of the cumulated cache files generated when doing bulk insertion with the BatchStart() function. When this limit is reached, the cache files are all merged into the inverted index. The value 0 means infinite size allowed. See WordList(3) for the rationale behind cache file handling.

wordlist_cache_inserts {true|false} (default false)
    If true, all Insert calls are cached in memory. When the WordList object is closed or a different access method is called, the cached entries are flushed into the inverted index.

wordlist_compress {true|false} (default false)
    Activate compression of the index. The resulting index is eight times smaller than the uncompressed index.
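The sizing rule for wordlist_cache_size can be worked through as simple arithmetic. The helper below is a hypothetical sketch, not a mifluz function: it takes the 2% rule and the factor of eight for compressed indexes directly from the text above and rounds the result up to a whole byte.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: smallest cache size, in bytes, that satisfies the
// "2% of the expected total data size" rule stated above.
std::uint64_t MinimumCacheBytes(std::uint64_t index_file_bytes, bool compressed) {
    // Guard against overflow in the multiplications below.
    assert(index_file_bytes <= UINT64_MAX / 16);
    // With compression, the data size is eight times the file size.
    const std::uint64_t data_bytes =
        compressed ? index_file_bytes * 8 : index_file_bytes;
    // 2% of the data size, rounded up.
    return (data_bytes * 2 + 99) / 100;
}
```

For a 100,000,000 byte index file this gives 2,000,000 bytes without compression and 16,000,000 bytes with compression, both well above the 500K default.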

METHODS

inline WordContext* GetContext()
    Return a pointer to the WordContext object used to create this instance.

inline const WordContext* GetContext() const
    Return a pointer to the WordContext object used to create this instance, as a const.

virtual inline int Override(const WordReference& wordRef)
    Insert wordRef in the index. If the Key() part of wordRef exists in the index, override it. Returns OK on success, NOTOK on error.

virtual int Exists(const WordReference& wordRef)
    Returns OK if wordRef exists in the index, NOTOK otherwise.

inline int Exists(const String& word)
    Returns OK if word exists in the index, NOTOK otherwise.

virtual int WalkDelete(const WordReference& wordRef)
    Delete all entries in the index whose key matches the Key() part of wordRef, using the Walk method. Returns the number of entries successfully deleted.

virtual int Delete(const WordReference& wordRef)
    Delete the entry in the index that exactly matches the Key() part of wordRef. Returns OK if the deletion is successful, NOTOK otherwise.

virtual int Open(const String& filename, int mode)
    Open the inverted index filename. mode may be O_RDONLY or O_RDWR. If mode is O_RDWR it can be or'ed with O_TRUNC to reset the content of an existing inverted index. Returns OK on success, NOTOK otherwise.

virtual int Close()
    Close the inverted index. Returns OK on success, NOTOK otherwise.

virtual unsigned int Size() const
    Return the size of the index in pages.

virtual int Pagesize() const
    Return the page size.

virtual WordDict *Dict()
    Return a pointer to the inverted index dictionary.

const String& Filename() const
    Return the filename given to the last call to Open.

int Flags() const
    Return the mode given to the last call to Open.

inline List *Find(const WordReference& wordRef)
    Returns the list of word occurrences exactly matching the Key() part of wordRef. The List returned contains pointers to WordReference objects. It is the responsibility of the caller to free the list. See the List.h header for usage.

inline List *FindWord(const String& word)
    Returns the list of word occurrences exactly matching word. The List returned contains pointers to WordReference objects. It is the responsibility of the caller to free the list. See the List.h header for usage.

virtual List *operator [] (const WordReference& wordRef)
    Alias to the Find method.

inline List *operator [] (const String& word)
    Alias to the FindWord method.

virtual List *Prefix (const WordReference& prefix)
    Returns the list of word occurrences matching the Key() part of prefix. In the Key(), the string (accessed with GetWord()) matches any string that begins with it. The List returned contains pointers to WordReference objects. It is the responsibility of the caller to free the list.

inline List *Prefix (const String& prefix)
    Returns the list of word occurrences matching the word prefix. In the Key(), the string (accessed with GetWord()) matches any string that begins with it. The List returned contains pointers to WordReference objects. It is the responsibility of the caller to free the list.

virtual List *Words()
    Returns a list of all unique words contained in the inverted index. The List returned contains pointers to String objects. It is the responsibility of the caller to free the list. See the List.h header for usage.

virtual List *WordRefs()
    Returns a list of all entries contained in the inverted index. The List returned contains pointers to WordReference objects. It is the responsibility of the caller to free the list. See the List.h header for usage.

virtual WordCursor *Cursor(wordlist_walk_callback_t callback, Object *callback_data)
    Create a cursor that searches all the occurrences in the inverted index and calls callback with callback_data for every match.

virtual WordCursor *Cursor(const WordKey &searchKey, int action = HTDIG_WORDLIST_WALKER)
    Create a cursor that searches all the occurrences in the inverted index that match searchKey. If action is set to HTDIG_WORDLIST_WALKER, calls searchKey.callback with searchKey.callback_data for every match. If action is set to HTDIG_WORDLIST_COLLECT, pushes each match into the searchKey.collectRes data member as a WordReference object. It is the responsibility of the caller to free the searchKey.collectRes list.

virtual WordCursor *Cursor(const WordKey &searchKey, wordlist_walk_callback_t callback, Object * callback_data)
    Create a cursor that searches all the occurrences in the inverted index that match searchKey and calls callback with callback_data for every match.

virtual WordKey Key(const String& bufferin)
    Create a WordKey object and return it. The bufferin argument is used to initialize the key, as in the WordKey::Set method. The first component of bufferin must be a word that is translated to the corresponding numerical id using the WordDict::Serial method.

virtual WordReference Word(const String& bufferin, int exists = 0)
    Create a WordReference object and return it. The bufferin argument is used to initialize the structure, as in the WordReference::Set method. The first component of bufferin must be a word that is translated to the corresponding numerical id using the WordDict::Serial method. If the exists argument is set to 1, the method WordDict::SerialExists is used instead, that is, no serial is assigned to the word if it does not already have one. Before translation the word is normalized using the WordType::Normalize method. The word is saved using the WordReference::SetWord method.

virtual WordReference WordExists(const String& bufferin)
    Alias for Word(bufferin, 1).

virtual void BatchStart()
    Accelerate bulk insertions in the inverted index. All insertions done with the Override method are batched instead of updating the inverted index immediately. No update of the inverted index file is done before the BatchEnd method is called.

virtual void BatchEnd()
    Terminate a bulk insertion started with a call to the BatchStart method. When all insertions are done the AllRef method is called to restore statistics.

virtual int Noccurrence(const String& key, unsigned int& noccurrence) const
    Return in noccurrence the number of occurrences of the string contained in the GetWord() part of key. Returns OK on success, NOTOK otherwise.

virtual int Write(FILE* f)
    Write to the file f an ASCII description of the index. Each line of the file contains a WordReference ASCII description. Returns OK on success, NOTOK otherwise.

virtual int WriteDict(FILE* f)
    Write to the file f the complete dictionary with statistics. Returns OK on success, NOTOK otherwise.

virtual int Read(FILE* f)
    Read WordReference ASCII descriptions from f. Returns the number of inserted WordReference objects, or a value < 0 if an error occurs. Invalid descriptions are ignored, as are empty lines.
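The matching rule of the Prefix methods (a word matches when it begins with the given string) can be illustrated without mifluz. The sketch below is a stand-in under stated assumptions: PrefixMatches is a hypothetical function, and a sorted std::map from word to occurrence count takes the place of the inverted index. It does not reflect how WordList implements the search internally.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a prefix search over a sorted word index.
// Returns, in sorted order, every word in index that begins with prefix.
std::vector<std::string> PrefixMatches(const std::map<std::string, int>& index,
                                       const std::string& prefix) {
    std::vector<std::string> matches;
    // In a sorted container all words sharing a prefix are contiguous,
    // starting at the first key that is not less than the prefix.
    for (auto it = index.lower_bound(prefix);
         it != index.end() &&
         it->first.compare(0, prefix.size(), prefix) == 0;
         ++it) {
        assert(it->first.size() >= prefix.size());
        matches.push_back(it->first);
    }
    return matches;
}
```

For an index holding car, cart, cat and dog, the prefix car selects car and cart, the prefix ca selects car, cart and cat, and the prefix x selects nothing.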

AUTHORS

Loic Dachary loic@gnu.org

The Ht://Dig group http://dev.htdig.org/

SEE ALSO

htdb_dump(1), htdb_stat(1), htdb_load(1), mifluzdump(1), mifluzload(1), mifluzsearch(1), mifluzdict(1), WordContext(3), WordDict(3), WordListOne(3), WordKey(3), WordKeyInfo(3), WordType(3), WordDBInfo(3), WordRecordInfo(3), WordRecord(3), WordReference(3), WordCursor(3), WordCursorOne(3), WordMonitor(3), Configuration(3), mifluz(3)
