WordList(3) DragonFly Library Functions Manual WordList(3)
NAME
WordList -
abstract class to manage and use an inverted index file.
SYNOPSIS
#include <mifluz.h>
WordContext context;
WordList* words = context.List();
delete words;
DESCRIPTION
WordList is the mifluz equivalent of a database handler. Each WordList
object is bound to an inverted index file and implements the operations
to create it, fill it with word occurrences and search for an entry
matching a given criterion.
WordList is an abstract class and cannot be instantiated. The List
method of the class WordContext will create an instance using the
appropriate derived class, either WordListOne or WordListMulti. Refer
to the corresponding manual pages for more information on their
specific semantics.
When doing bulk insertions, mifluz creates temporary files that contain
the entries to be inserted in the index. Those files are typically
named indexC00000000. The maximum size of a temporary file is
wordlist_cache_size / 2. When the maximum size of the temporary file is
reached, mifluz creates another temporary file named indexC00000001.
The process continues until mifluz has created 50 temporary files. At
this point it merges all temporary files into one that replaces the
first indexC00000000. It then starts creating temporary files again and
keeps following this algorithm until the bulk insertion is finished.
When the bulk insertion is finished, mifluz has one big file named
indexC00000000 that contains all the entries to be inserted in the
index. mifluz inserts all the entries from indexC00000000 into the
index and deletes the temporary file when done. The insertion is fast
since all the entries in indexC00000000 are already sorted.
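The run-and-merge scheme described above can be sketched in a few lines
of C++. This is only an illustrative model under stated assumptions,
not mifluz code: RunSet and its members are hypothetical names,
in-memory vectors of integers stand in for the indexC* files, and the
size of a run is counted in entries rather than bytes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>

// Hypothetical stand-in for the temporary indexC* files: each run is a
// sorted batch of entries. None of these names belong to the mifluz API.
struct RunSet {
    std::size_t max_run_entries;          // plays the role of wordlist_cache_size / 2
    std::size_t max_runs;                 // 50 in the description above
    std::vector<std::vector<int>> runs;   // the "temporary files"
    std::vector<int> current;             // entries not yet written to a run

    // Sort the pending entries and store them as a new run.
    void spill() {
        if (current.empty()) return;
        std::sort(current.begin(), current.end());
        runs.push_back(std::move(current));
        current.clear();
        if (runs.size() >= max_runs) collapse();
    }

    // Merge every run into a single sorted run that takes the first slot.
    void collapse() {
        std::vector<int> merged;
        for (const auto& run : runs) {
            std::vector<int> next;
            std::merge(merged.begin(), merged.end(), run.begin(), run.end(),
                       std::back_inserter(next));
            merged.swap(next);
        }
        runs.clear();
        runs.push_back(std::move(merged));
    }

    // Queue one entry; start a new run when the current one is full.
    void insert(int entry) {
        assert(max_run_entries > 0 && max_runs > 0);
        current.push_back(entry);
        if (current.size() >= max_run_entries) spill();
    }

    // End of the bulk insertion: one sorted run holds every entry.
    std::vector<int> finish() {
        spill();
        if (runs.empty()) return {};
        collapse();
        return runs.front();
    }
};
```

Because the result of finish() is already sorted, loading it into an
ordered index amounts to a sequential pass, which is the reason given
above for the final insertion being fast.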
The parameter wordlist_cache_max can be used to prevent the temporary
files from growing indefinitely. If the total cumulative size of the
indexC* files grows beyond this parameter, they are merged into the
main index and deleted. For instance, setting this parameter to 500MB
guarantees that the total size of the indexC* files will not grow above
500MB.
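The effect of wordlist_cache_max can be pictured with a small counter.
The sketch below is a hypothetical model, not mifluz code: CacheBudget,
wrote, pending and merges are invented names, and the merge into the
main index is reduced to incrementing a counter. A limit of 0 is taken
to mean "no limit", as documented for wordlist_cache_max.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of the wordlist_cache_max check; not mifluz code.
struct CacheBudget {
    std::size_t cache_max;    // limit in bytes, 0 means no limit
    std::size_t pending;      // cumulative size of the indexC* files
    std::size_t merges;       // number of merges into the main index

    explicit CacheBudget(std::size_t max)
        : cache_max(max), pending(0), merges(0) {}

    // Account for `bytes` of newly written temporary data.
    void wrote(std::size_t bytes) {
        pending += bytes;
        if (cache_max != 0 && pending > cache_max) {
            ++merges;      // stands for "merge into the main index"
            pending = 0;   // stands for "delete the temporary files"
        }
        // After the check the temporary data never exceeds the limit.
        assert(cache_max == 0 || pending <= cache_max);
    }
};
```

With a limit of 500 and ten writes of 120 each, this model merges
twice and ends with nothing pending; with a limit of 0 it never merges.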
CONFIGURATION
For more information on the configuration attributes and a complete
list of attributes, see the mifluz(3) manual page.
wordlist_extend {true|false} (default false)
If true, maintain a reference count of unique words. The
Noccurrence method gives access to this count.
wordlist_verbose <number> (default 0)
Set the verbosity level of the WordList class.
1 walk logic
2 walk logic details
3 walk logic lots of details
wordlist_page_size <bytes> (default 8192)
Berkeley DB page size (see Berkeley DB documentation).
wordlist_cache_size <bytes> (default 500K)
Berkeley DB cache size (see Berkeley DB documentation). The cache
makes a huge difference in performance. It must be at least 2%
of the expected total data size. Note that if compression is
activated, the data size is eight times larger than the actual
file size. In this case the cache must be scaled to 2% of the
data size, not 2% of the file size. See Cache tuning in the
mifluz guide for more hints. See the DESCRIPTION section above
for the rationale behind cache file handling.
wordlist_cache_max <bytes> (default 0)
Maximum cumulative size of the cache files generated when doing
bulk insertion with the BatchStart() function. When this limit
is reached, the cache files are all merged into the inverted
index. A value of 0 means no limit. See the DESCRIPTION section
above for the rationale behind cache file handling.
wordlist_cache_inserts {true|false} (default false)
If true, all Insert calls are cached in memory. When the WordList
object is closed or a different access method is called, the
cached entries are flushed to the inverted index.
wordlist_compress {true|false} (default false)
Activate compression of the index. The resulting index is eight
times smaller than the uncompressed index.
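The sizing rule given for wordlist_cache_size above (at least 2% of the
data size, with the data size taken as eight times the file size when
compression is on) is simple arithmetic. The helper below is a
hypothetical illustration, not part of mifluz; the name min_cache_bytes
and its parameters are assumptions made for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of mifluz: smallest cache size in bytes
// suggested by the 2% rule for an index file of `file_bytes` bytes.
std::uint64_t min_cache_bytes(std::uint64_t file_bytes, bool compressed) {
    // Guard against overflow in the multiplications below.
    assert(file_bytes <= UINT64_MAX / 16);
    // With compression the data is taken to be eight times the file size.
    const std::uint64_t data_bytes = compressed ? file_bytes * 8 : file_bytes;
    // 2% of the data size, rounded up to a whole byte.
    return (data_bytes * 2 + 99) / 100;
}
```

For a 100,000,000 byte index file this gives 2,000,000 bytes without
compression and 16,000,000 bytes with compression.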
METHODS
inline WordContext* GetContext()
Return a pointer to the WordContext object used to create this
instance.
inline const WordContext* GetContext() const
Return a pointer to the WordContext object used to create this
instance as a const.
virtual inline int Override(const WordReference& wordRef)
Insert wordRef in the index. If the Key() part of wordRef already
exists in the index, override it. Returns OK on success, NOTOK on
error.
virtual int Exists(const WordReference& wordRef)
Returns OK if wordRef exists in the index, NOTOK otherwise.
inline int Exists(const String& word)
Returns OK if word exists in the index, NOTOK otherwise.
virtual int WalkDelete(const WordReference& wordRef)
Delete all entries in the index whose key matches the Key() part
of wordRef, using the Walk method. Returns the number of
entries successfully deleted.
virtual int Delete(const WordReference& wordRef)
Delete the entry in the index that exactly matches the Key()
part of wordRef. Returns OK if deletion is successful, NOTOK
otherwise.
virtual int Open(const String& filename, int mode)
Open inverted index filename. mode may be O_RDONLY or O_RDWR.
If mode is O_RDWR it can be or'ed with O_TRUNC to reset the
content of an existing inverted index. Return OK on success,
NOTOK otherwise.
virtual int Close()
Close inverted index. Return OK on success, NOTOK otherwise.
virtual unsigned int Size() const
Return the size of the index in pages.
virtual int Pagesize() const
Return the page size in bytes.
virtual WordDict *Dict()
Return a pointer to the inverted index dictionary.
const String& Filename() const
Return the filename given to the last call to Open.
int Flags() const
Return the mode given to the last call to Open.
inline List *Find(const WordReference& wordRef)
Returns the list of word occurrences exactly matching the Key()
part of wordRef. The List returned contains pointers to
WordReference objects. It is the responsibility of the caller to
free the list. See List.h header for usage.
inline List *FindWord(const String& word)
Returns the list of word occurrences exactly matching the word.
The List returned contains pointers to WordReference objects. It
is the responsibility of the caller to free the list. See List.h
header for usage.
virtual List *operator [] (const WordReference& wordRef)
Alias to the Find method.
inline List *operator [] (const String& word)
Alias to the FindWord method.
virtual List *Prefix (const WordReference& prefix)
Returns the list of word occurrences matching the Key() part of
prefix. In the Key(), the string (accessed with GetWord())
matches any string that begins with it. The List returned
contains pointers to WordReference objects. It is the
responsibility of the caller to free the list.
inline List *Prefix (const String& prefix)
Returns the list of word occurrences matching prefix. In the
Key(), the string (accessed with GetWord()) matches any string
that begins with it. The List returned contains pointers to
WordReference objects. It is the responsibility of the caller to
free the list.
virtual List *Words()
Returns a list of all unique words contained in the inverted
index. The List returned contains pointers to String objects. It
is the responsibility of the caller to free the list. See List.h
header for usage.
virtual List *WordRefs()
Returns a list of all entries contained in the inverted index.
The List returned contains pointers to WordReference objects. It
is the responsibility of the caller to free the list. See List.h
header for usage.
virtual WordCursor *Cursor(wordlist_walk_callback_t callback, Object
*callback_data)
Create a cursor that searches all the occurrences in the
inverted index and calls callback with callback_data for every
match.
virtual WordCursor *Cursor(const WordKey &searchKey, int action =
HTDIG_WORDLIST_WALKER)
Create a cursor that searches all the occurrences in the
inverted index that match searchKey. If action is set to
HTDIG_WORDLIST_WALKER, it calls searchKey.callback with
searchKey.callback_data for every match. If action is set to
HTDIG_WORDLIST_COLLECT, it pushes each match in the
searchKey.collectRes data member as a WordReference object. It is
the responsibility of the caller to free the searchKey.collectRes
list.
virtual WordCursor *Cursor(const WordKey &searchKey,
wordlist_walk_callback_t callback, Object * callback_data)
Create a cursor that searches all the occurrences in the
inverted index that match searchKey and calls callback
with callback_data for every match.
virtual WordKey Key(const String& bufferin)
Create a WordKey object and return it. The bufferin argument is
used to initialize the key, as in the WordKey::Set method. The
first component of bufferin must be a word that is translated to
the corresponding numerical id using the WordDict::Serial
method.
virtual WordReference Word(const String& bufferin, int exists = 0)
Create a WordReference object and return it. The bufferin
argument is used to initialize the structure, as in the
WordReference::Set method. The first component of bufferin must
be a word that is translated to the corresponding numerical id
using the WordDict::Serial method. If the exists argument is
set to 1, the method WordDict::SerialExists is used instead,
that is, no serial is assigned to the word if it does not already
have one. Before translation the word is normalized using the
WordType::Normalize method. The word is saved using the
WordReference::SetWord method.
virtual WordReference WordExists(const String& bufferin)
Alias for Word(bufferin, 1).
virtual void BatchStart()
Accelerate bulk insertions in the inverted index. All insertions
done with the Override method are batched instead of updating
the inverted index immediately. No update of the inverted index
file is done before the BatchEnd method is called.
virtual void BatchEnd()
Terminate a bulk insertion started with a call to the BatchStart
method. When all insertions are done the AllRef method is called
to restore statistics.
virtual int Noccurrence(const String& key, unsigned int& noccurrence)
const
Return in noccurrence the number of occurrences of the string
contained in the GetWord() part of key. Returns OK on success,
NOTOK otherwise.
virtual int Write(FILE* f)
Write to the stream f an ASCII description of the index.
Each line of the file contains a WordReference ASCII
description. Return OK on success, NOTOK otherwise.
virtual int WriteDict(FILE* f)
Write to the stream f the complete dictionary with
statistics. Return OK on success, NOTOK otherwise.
virtual int Read(FILE* f)
Read WordReference ASCII descriptions from f. Returns the
number of inserted WordReference objects, or a value < 0 if an
error occurs. Invalid descriptions and empty lines are ignored.
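The matching rule documented for the Prefix methods above (a key
matches when its word begins with the given string) can be pictured
with a sorted container. The sketch below is a hypothetical model, not
the mifluz implementation: ToyIndex and words_with_prefix are invented
names, and a std::map from word to document ids stands in for the
inverted index.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical model: a sorted map from word to a list of document ids.
using ToyIndex = std::map<std::string, std::vector<int>>;

// Collect every word of `index` that begins with `prefix`, in sorted order.
std::vector<std::string> words_with_prefix(const ToyIndex& index,
                                           const std::string& prefix) {
    std::vector<std::string> out;
    // In a sorted container all matches are contiguous, starting at the
    // first key that is not less than the prefix.
    for (auto it = index.lower_bound(prefix); it != index.end(); ++it) {
        // Stop at the first key that no longer begins with the prefix.
        if (it->first.compare(0, prefix.size(), prefix) != 0) break;
        // Map iteration is ordered, so matches arrive sorted.
        assert(out.empty() || out.back() < it->first);
        out.push_back(it->first);
    }
    return out;
}
```

Unlike the List pointers returned by the methods above, this model
returns its result by value, so there is nothing for the caller to
free.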
AUTHORS
Loic Dachary loic@gnu.org
The Ht://Dig group http://dev.htdig.org/
SEE ALSO
htdb_dump(1), htdb_stat(1), htdb_load(1), mifluzdump(1), mifluzload(1),
mifluzsearch(1), mifluzdict(1), WordContext(3), WordDict(3),
WordListOne(3), WordKey(3), WordKeyInfo(3), WordType(3), WordDBInfo(3),
WordRecordInfo(3), WordRecord(3), WordReference(3), WordCursor(3),
WordCursorOne(3), WordMonitor(3), Configuration(3), mifluz(3)
local WordList(3)