Last modified on 21/02/11 10:51:15

Text Indexing

Support

Support for text indexing has been available since version 1.0.4. It has been reasonably well tested and is used in production in a few systems, but bugs may well remain.

Algorithms

There are three indexing algorithms available in this feature: tokenising, double metaphones, and stemming.

Tokenising just breaks up strings into word tokens, so "Foo bar, baz" generates "foo", "bar", "baz".
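
As a rough illustration of the tokenising behaviour described above, here is a minimal Python sketch. It is not 4store's implementation: the function name tokenise and the rule "runs of ASCII letters and digits" are assumptions made for the example, and the real tokeniser may treat punctuation and non-ASCII text differently.

```python
import re


def tokenise(text):
    """Split text into lowercase word tokens (illustrative rule only)."""
    # Assumption: a token is any run of ASCII letters or digits.
    return re.findall(r"[a-z0-9]+", text.lower())


print(tokenise("Foo bar, baz"))  # ['foo', 'bar', 'baz']
```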

Double metaphone indexing calculates one or two metaphones for each word, so "Foo bar, baz" generates "F", "PR", "PS".

Stemming removes suffixes (plurals etc.) from the words, so "Foos, bar, baz's" generates "foo", "bar", "baz".
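
To make the idea of suffix removal concrete, the following Python sketch strips a possessive "'s" or a plural "s" from each word. It is a deliberately naive toy, not the stemming algorithm 4store uses: the function name toy_stem and both stripping rules are assumptions for illustration, and a real stemmer handles many more cases.

```python
import re


def toy_stem(text):
    """Strip a trailing possessive or plural 's' from each word (toy rule)."""
    # Assumption: words are runs of ASCII letters, optionally followed by 's.
    words = re.findall(r"[a-z]+(?:'s)?", text.lower())
    stems = []
    for word in words:
        if word.endswith("'s"):
            word = word[:-2]          # possessive: baz's -> baz
        elif word.endswith("s") and len(word) > 3:
            word = word[:-1]          # plural: foos -> foo
        stems.append(word)
    return stems


print(toy_stem("Foos, bar, baz's"))  # ['foo', 'bar', 'baz']
```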

Configuration

To configure text indexing, write triples to a graph called <system:config>, e.g. with 4s-import $KB -m system:config path/to/config/file.ttl.
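
If the import is driven from a script, the same command can be assembled in Python. This is only a sketch: the helper names build_import_command and import_config are hypothetical, and the only command-line arguments used are the ones shown in the command above.

```python
import subprocess


def build_import_command(kb_name, config_path):
    """Return the argv list for loading a config file into system:config."""
    # Mirrors: 4s-import $KB -m system:config path/to/config/file.ttl
    return ["4s-import", kb_name, "-m", "system:config", config_path]


def import_config(kb_name, config_path):
    """Run the import; raises CalledProcessError if 4s-import fails."""
    subprocess.run(build_import_command(kb_name, config_path), check=True)
```

Calling import_config with the knowledge base name and the path to the config file runs the import and raises an exception on a non-zero exit status.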

An example config is:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix text: <http://4store.org/fulltext#> .
@prefix ex: <http://example.org/text#> .

rdfs:label text:index text:dmetaphone .
ex:token text:index text:token .
ex:stem text:index text:stem .

This means that objects of the predicate rdfs:label will be indexed with double metaphones, objects of ex:token will be indexed with plain text (lowercase) tokens, and objects of ex:stem will be stemmed.
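
The mapping that this config expresses can be pictured as a lookup table from predicate to index type. The Python sketch below builds that table from the three config lines. It is not a Turtle parser and not how 4store reads the graph: the function name index_map and the whitespace-splitting approach are assumptions that only work for simple one-triple-per-line input like this.

```python
CONFIG = """
rdfs:label text:index text:dmetaphone .
ex:token text:index text:token .
ex:stem text:index text:stem .
"""


def index_map(config):
    """Map each indexed predicate to its index type (naive line parsing)."""
    mapping = {}
    for line in config.splitlines():
        parts = line.split()
        # Expect exactly: <predicate> text:index <index type> .
        if len(parts) == 4 and parts[1] == "text:index" and parts[3] == ".":
            mapping[parts[0]] = parts[2]
    return mapping


print(index_map(CONFIG)["rdfs:label"])  # text:dmetaphone
```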

You can pick which language's stemming algorithm is used with language tags, e.g.:

<> ex:stem "Alle Menschen sind frei und gleich an Würde und Rechten geboren. Sie sind mit Vernunft und Gewissen begabt und sollen einander im Geist der Brüderlichkeit begegnen."@de .

will be stemmed using a German stemming algorithm.

Some examples of text indexing, and how to query it, are shown here: http://theno23.livejournal.com/17658.html

Example

With the config file above, and the following data:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix text: <http://4store.org/fulltext#> .
@prefix ex: <http://example.org/text#> .

<a> rdfs:label "foo bar, baz" .
<b> ex:token "Foo bar, baz" .
<c> ex:stem "Foos, bar, baz's" .

You will get these triples:

<a> rdfs:label "foo bar, baz" ;
    text:dmetaphone "F", "PR", "PS" .
<b> ex:token "Foo bar, baz" ;
    text:token "bar", "baz", "foo" .
<c> ex:stem "Foos, bar, baz's" ;
    text:stem "bar", "baz", "foo" .

If you want to query it, you can use a query like:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX text: <http://4store.org/fulltext#>
SELECT ?x ?string
WHERE {
  ?x text:dmetaphone "PS" ;
     rdfs:label ?string .
}

Note that "PS" is the metaphone for "baz".

The query will give:

?x ?string
<a> "foo bar, baz"
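
To show why only <a> matches, the sketch below replays the query's two triple patterns over the example triples held in a plain Python list. This is an in-memory illustration, not a 4store client: the TRIPLES list is copied from the example above, and the function name match is hypothetical.

```python
TRIPLES = [
    ("<a>", "rdfs:label", "foo bar, baz"),
    ("<a>", "text:dmetaphone", "F"),
    ("<a>", "text:dmetaphone", "PR"),
    ("<a>", "text:dmetaphone", "PS"),
    ("<b>", "ex:token", "Foo bar, baz"),
    ("<b>", "text:token", "bar"),
    ("<b>", "text:token", "baz"),
    ("<b>", "text:token", "foo"),
    ("<c>", "ex:stem", "Foos, bar, baz's"),
    ("<c>", "text:stem", "bar"),
    ("<c>", "text:stem", "baz"),
    ("<c>", "text:stem", "foo"),
]


def match(triples, index_pred, key, value_pred):
    """Find subjects whose index contains key, then fetch their value_pred."""
    # Pattern 1: ?x index_pred key
    subjects = {s for s, p, o in triples if p == index_pred and o == key}
    # Pattern 2: ?x value_pred ?string, joined on ?x
    return [(s, o) for s, p, o in triples if p == value_pred and s in subjects]


print(match(TRIPLES, "text:dmetaphone", "PS", "rdfs:label"))
# [('<a>', 'foo bar, baz')]
```

Only <a> carries text:dmetaphone values, because only rdfs:label was configured for the double metaphone index, so <b> and <c> do not match even though they also contain "baz".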