Class
IndexIndices are data structures which contain preprocessing data of a fixed text (or set of texts). In combination
with a Finder or an VSTreeIterator it allows fast dictionary look-up and advanced computations.
All Subcl's | FMIndex, IndexDfi, IndexEsa, IndexQGram, IndexSa, IndexWotd, OpenAddressingQGramIndex, PizzaChiliIndex |
---|---|
Defined in | <seqan/index.h> |
Signature |
class Index<TText[, TSpec]>;
|
Interface Function Overview
-
void clear(index);
Resets all fibres of an index. -
TSize countSequences(index);
Return the number of sequences in an index' underlying text. -
TFibre getFibre(index, fibreTag);
Returns a specific fibre of a container. -
bool indexCreate(index, fibreTag[, algoTag]);
Creates a specific Fibre. -
TFibre indexRawText(position, index);
Shortcut for $getFibre(.., FibreRawText). -
bool indexRequire(index, fibreTag);
On-demand creation of a specific Fibre. -
bool indexSupplied(index, fibreTag);
Returns whether a specific Fibre is present. -
TFibre indexText(index);
Shortcut for getFibre(.., EsaText). -
TSize length(index);
Returns the number of characters in the underlying text of the index. -
bool open(index, fileName[, mode]);
This functions opens an index from disk. -
TValue rawtextAt(position, index);
Shortcut for value(indexRawText(..), ..). -
bool save(index, fileName[, mode]);
This functions saves an index to disk.
Interface Metafunction Overview
-
DefaultIndexCreator<TIndex, TFibre>::Type;
Default algorithm to create a demanded and not yet existing Fibre. -
DefaultIndexSpec<TText>::Type;
Default Index specialization type. -
DefaultIndexStringSpec<TIndex>::Type;
-
GetVSTreeIteratorTraits<TIterator>::Type
Default behaviour of goNext when no second parameter is given. -
Size<TIndex>::Type;
Returns the size type of an Index.
Detailed Description
An index contains various arrays or objects, also called fibres (see Fibre). These fibres are created on demand depending on the requirements of an algorithm. To force the fibre creation you can use the function indexCreate. The list of fibres is available in Fibre
Examples
The following code shows how to search for exact matches between the reference "tobeornottobe" and the pattern "to" with the means of a Finder.
#include <seqan/index.h>
using namespace seqan;
int main()
{
typedef Index<CharString> TIndex;
TIndex index("tobeornottobe");
Finder<TIndex> finder(index);
// The Finder object has a pointer to the first, current and last hit
// Each consecutive call sets the current pointer to the appropriate hit
while (find(finder, "to"))
std::cout << "Hit at position: " << position(finder) << std::endl;
return 0;
}
The result is as follows
Hit at position: 9 Hit at position: 0
This code shows how an index can be used with iterators to achieve a pre-order tree like traversal in DFS of the text "tobeornottobe". In order to do so a Top-Down History iterator is used.
#include <seqan/index.h>
using namespace seqan;
int main()
{
typedef Index<CharString> TIndex;
TIndex index("tobeornottobe");
Iterator<TIndex, TopDown<ParentLinks<> > >::Type it(index);
do
{
// Print the letters from the root to the current node
std::cout << representative(it) << std::endl;
if (!goDown(it) && !goRight(it))
while (goUp(it) && !goRight(it))
;
}
while (!isRoot(it));
return 0;
}
The result is as follows
be beornottobe e eornottobe nottobe o obe obeornottobe ornottobe ottobe rnottobe t tobe tobeornottobe ttobe
Note that you can also use specialized iterators such as:
Iterator<TIndex, TopDown<ParentLinks<PreOrder> > >::Type
or
Iterator<TIndex, TopDown<ParentLinks<PostOrder> > >::Type
You can achieve a post-order traversal like this:
Iterator<TIndex, TopDown<ParentLinks<Postorder> > >::Type it(index);
goBegin(it);
while (!atEnd(it))
{
std::cout << representative(it) << std::endl;
++it;
}
The result is as follows
beornottobe be eornottobe e nottobe obeornottobe obe ornottobe ottobe o rnottobe tobeornottobe tobe ttobe t
Interface Functions Detail
void clear(index);
Parameters
index
|
The index to be cleared. |
---|
Data Races
TSize countSequences(index);
Parameters
index
|
The index to return the number of sequences of. |
---|
Returns
TSize |
The number of sequences in the index' underlying text with TSize being the result Size. |
---|
Examples
The following example shows how to count characters of an index, determine the number of sequences involved and how to search for a pattern.
#include <seqan/index.h>
using namespace seqan;
int main()
{
typedef StringSet<String<char> > TText;
typedef Index<TText> TIndex;
TText text;
appendValue(text, "MISSISSIPPI");
appendValue(text, "MYMISSISAHAPPY");
TIndex index(text);
Finder<TIndex> finder(index);
std::cout << "The text has " << length(index) << " characters and consists of " << countSequences(index) <<
" sequences." << std::endl;
// The Finder object has a pointer to the first, current and last hit
// Each consecutive call sets the current pointer to the appropriate hit
while (find(finder, "MISS"))
std::cout << "Hit at position: " << position(finder) << std::endl;
return 0;
}
The output is as follows:
The text has 25 characters and consists of 2 sequences. Hit at position: < 1 , 2 > Hit at position: < 0 , 0 >
Data Races
TFibre getFibre(index, fibreTag);
Parameters
index
|
The container holding the fibre. |
---|---|
fibreTag
|
A tag that identifies the Fibre. Types: Index Esa Fibres, FM Index Fibres, Index Wotd Fibres, and Index QGram Fibres. |
Returns
TFibre |
A reference to the Fibre object. |
---|
Examples
The following example shows how to search for a pattern in a string.
#include <seqan/index.h>
using namespace seqan;
int main()
{
typedef StringSet<String<char> > TText;
typedef Index<TText> TIndex;
typedef SAValue<TIndex>::Type TSAValue;
TText text;
appendValue(text, "MISSISSIPPI");
appendValue(text, "MYMISSISAHAPPY");
TIndex index(text);
Iterator<TIndex, TopDown<> >::Type it(index);
goDown(it, "SSI");
std::cout << "SSI occurs in " << getFrequency(it) << " sequences." << std::endl;
// 1 alternative to print the location of hits
String<TSAValue> saPositions = getOccurrences(it);
for (unsigned i = 0; i < length(saPositions); ++i)
std::cout << "Hit in sequence " << saPositions[i].i1 << " at position " << saPositions[i].i2 << std::endl;
std::cout << "----------------------------" << std::endl;
// 2 alternative to print the location of hits
Pair<unsigned> hitInterval = range(it);
for (; hitInterval.i1 < hitInterval.i2; ++hitInterval.i1)
std::cout << "Hit in sequence " << getFibre(index, FibreSA())[hitInterval.i1].i1 <<
" at position " << getFibre(index, FibreSA())[hitInterval.i1].i2 << std::endl;
return 0;
}
The output is as follows:
SSI occurs in 2 sequences. Hit in sequence 0 at position 5 Hit in sequence 1 at position 4 Hit in sequence 0 at position 2 ---------------------------- Hit in sequence 0 at position 5 Hit in sequence 1 at position 4 Hit in sequence 0 at position 2
Data Races
See Also
bool indexCreate(index, fibreTag[, algoTag]);
Parameters
fibreTag
|
A tag that identifies the Fibre |
---|---|
algoTag
|
A tag that identifies the algorithm which is used to create the fibre. Default: The result of DefaultIndexCreator. |
index
|
The Index object holding the fibre. |
Returns
bool |
true on a success and false otherwise |
---|
indexCreate calls the fibre corresponding createXXX(..) function (e.g. createSuffixArray).
Data Races
TFibre indexRawText(position, index);
Parameters
position
|
A position in the array on which the value should be accessed. |
---|---|
index
|
The Index object. |
Returns
TFibre |
A reference or proxy to the value. |
---|
Data Races
bool indexRequire(index, fibreTag);
Parameters
index
|
The Index object holding the fibre. |
---|---|
fibreTag
|
A tag that identifies the Fibre |
Returns
bool |
true on a successful creation. |
---|
If the fibre already exists (indexSupplied is true) then indexRequire does nothing. If the fibre doesn't exist then indexCreate is called to create it.
Examples
The following code shows how the BWT of an text can be computed.
#include <seqan/index.h>
using namespace seqan;
int main()
{
Index<String<char> > index("MISSISSIPPI");
// Because indices are build on demand we force the index creation here.
indexRequire(index, FibreSA());
std::cout << "BWT\tSuffices" << std::endl;
for (unsigned i = 0; i < length(indexSA(index)); ++i)
{
unsigned textPos = (saAt(i, index) == 0) ? length(index) - 1 : saAt(i, index) - 1;
std::cout << textAt(textPos, index) << "\t" << suffix(indexText(index), textPos) << std::endl;
}
return 0;
}
The output is as follows:
BWT Suffices P PI S SIPPI S SISSIPPI M MISSISSIPPI I I P PPI I IPPI S SSIPPI S SSISSIPPI I ISSIPPI I ISSISSIPPI
Data Races
bool indexSupplied(index, fibreTag);
Parameters
index
|
The Index object holding the fibre. |
---|---|
fibreTag
|
A tag that identifies the Fibre Index Fibres. |
Returns
bool |
true, iff the fibre is present. |
---|
Data Races
TFibre indexText(index);
The result of this function when used on an Index<TText, FMIndex<TOccSpec, CompressText> > is not defined.
Parameters
index
|
The Index object holding the fibre. |
---|
Returns
TFibre |
A reference to the text of the index. |
---|
Examples
The following code shows how the BWT of a text can be computed.
#include <seqan/index.h>
using namespace seqan;
int main()
{
Index<String<char> > index("MISSISSIPPI");
// Because indices are build on demand we force the index creation here.
indexRequire(index, FibreSA());
std::cout << "BWT\tSuffices" << std::endl;
for (unsigned i = 0; i < length(indexSA(index)); ++i)
{
unsigned textPos = (saAt(i, index) == 0) ? length(index) - 1 : saAt(i, index) - 1;
std::cout << textAt(textPos, index) << "\t" << suffix(indexText(index), textPos) << std::endl;
}
return 0;
}
The output is as follows:
BWT Suffices P PI S SIPPI S SISSIPPI M MISSISSIPPI I I P PPI I IPPI S SSIPPI S SSISSIPPI I ISSIPPI I ISSISSIPPI
Data Races
TSize length(index);
Parameters
index
|
An index of a text. |
---|
Returns
TSize |
Returns the number of characters in the raw underlying text of the index with TSize being the result of the Size meta-function of Index. |
---|
Examples
The following example shows how to count characters of an index, determine the number of sequences involved and how to search for a pattern.
#include <seqan/index.h>
using namespace seqan;
int main()
{
typedef StringSet<String<char> > TText;
typedef Index<TText> TIndex;
TText text;
appendValue(text, "MISSISSIPPI");
appendValue(text, "MYMISSISAHAPPY");
TIndex index(text);
Finder<TIndex> finder(index);
std::cout << "The text has " << length(index) << " characters and consists of " << countSequences(index) <<
" sequences." << std::endl;
// The Finder object has a pointer to the first, current and last hit
// Each consecutive call sets the current pointer to the appropriate hit
while (find(finder, "MISS"))
std::cout << "Hit at position: " << position(finder) << std::endl;
return 0;
}
The output is as follows:
The text has 25 characters and consists of 2 sequences. Hit at position: < 1 , 2 > Hit at position: < 0 , 0 >
Data Races
bool open(index, fileName[, mode]);
Parameters
index
|
The index to be opened. |
---|---|
fileName
|
C-style character string containing the file name. |
mode
|
The combination of flags defining how the file should be opened.To open a file read-only, write-only or to read and write use OPEN_RDONLY, OPEN_WRONLY, or OPEN_RDWR.To create or overwrite a file add OPEN_CREATE.To append a file if existing add OPEN_APPEND.To circumvent problems, files are always opened in binary mode. Default: OPEN_RDWR | OPEN_CREATE | OPEN_APPEND. |
Returns
bool |
true on success and false otherwise. |
---|
Examples
The following code shows how the function open is used with indices.
#include <seqan/index.h>
using namespace seqan;
int main()
{
typedef StringSet<String<char> > TText;
typedef Index<TText> TIndex;
TText text;
appendValue(text, "MISSISSIPPI");
appendValue(text, "MYMISSISAHAPPY");
TIndex saveIndex(text);
// Because indices are build on demand we fore the index creation here.
indexCreate(saveIndex, FibreSA());
const char * tempFileName = SEQAN_TEMP_FILENAME();
std::cout << save(saveIndex, tempFileName) << std::endl;
// In a different program
TIndex openIndex;
std::cout << open(openIndex, tempFileName) << std::endl;
return 0;
}
The output is as follows:
1 1
Data Races
TValue rawtextAt(position, index);
The result of this function when used on an Index<TText, FMIndex<TOccSpec, Compress> > is not defined.
Parameters
index
|
The Index object. Types: Index |
---|---|
position
|
A position in the array on which the value should be accessed. |
Returns
TValue |
A reference or proxy to the value. The type is the result if the meta-function Reference. |
---|
Data Races
bool save(index, fileName[, mode]);
Parameters
index
|
The index to be opened. |
---|---|
fileName
|
C-style character string containing the file name. |
mode
|
The combination of flags defining how the file should be opened.To open a file read-only, write-only or to read and write use OPEN_RDONLY, OPEN_WRONLY, or OPEN_RDWR.To create or overwrite a file add OPEN_CREATE.To append a file if existing add OPEN_APPEND.To circumvent problems, files are always opened in binary mode. Default: OPEN_RDWR | OPEN_CREATE | OPEN_APPEND. |
Returns
bool |
true on success and false otherwise. |
---|
Examples
The following code shows how the function save is used with indices.
#include <seqan/index.h>
using namespace seqan;
int main()
{
typedef StringSet<String<char> > TText;
typedef Index<TText> TIndex;
TText text;
appendValue(text, "MISSISSIPPI");
appendValue(text, "MYMISSISAHAPPY");
TIndex saveIndex(text);
// Because indices are build on demand we fore the index creation here.
indexCreate(saveIndex, FibreSA());
const char * tempFileName = SEQAN_TEMP_FILENAME();
std::cout << save(saveIndex, tempFileName) << std::endl;
// In a different program
TIndex openIndex;
std::cout << open(openIndex, tempFileName) << std::endl;
return 0;
}
The output is as follows:
1 1