Use Tsearch2 in PostgreSQL 8.3

Many in the implementation of on-site search is necessary to do a database query, using operator LIKE or regular expression.
With the release of PostgreSQL 8.3 full-text search extension tsearch2 was built into the kernel (this was previously it was written on habré), and it can be used to implement a database search that will be more functional. In fact, in the article mentioned above, shows an example of using this module, but I would like to share the experience of implementing full-text search of real project.

So, there is a table "news" containing the fields "title", "metaKeywords", "metaDescription" and "content".
You need to implement full text search, which will search a word in all table fields, and each field has its absolute priority.

We can create a database with UTF8 encoding, and make it the test table:
CREATE DATABASE "tsearch2"
WITH ENCODING = 'UTF8';
CREATE TABLE "news"
(
"newsId" Serial NOT NULL
"title" Varchar(1024) NOT NULL
"metaKeywords" Varchar(2048)
"metaDescription" Varchar(1024)
"content" Text NOT NULL
primary key ("newsId")
);

Next, you will add to this table any entry:
INSERT INTO "news" ("title", "metaKeywords", "metaDescription", "content")
VALUES ('Test news' 'news article' 'Test news for the search' 'Hello world');

Now you need to configure full-text search:
CREATE TEXT SEARCH DICTIONARY mydict_russian_ispell (
TEMPLATE = ispell,
DictFile = English,
AffFile = russian,
StopWords = russian
);

CREATE TEXT SEARCH CONFIGURATION public.mydict_ru (PARSER = default);
COMMENT ON TEXT SEARCH CONFIGURATION public.mydict_ru IS 'conf. for mydict EN';

ALTER TEXT SEARCH CONFIGURATION mydict_ru ADD MAPPING
FOR email, url, url_path, host file, version
sfloat float int, uint
numword, hword_numpart, numhword
WITH simple;

ALTER TEXT SEARCH CONFIGURATION mydict_ru ADD MAPPING
FOR word hword_part, hword
WITH mydict_russian_ispell;

In this script we have created your ispell dictionary mydict_russian_ispell. Then created the configuration mydict_ru, which stated the rules for the use of dictionaries.
It should say that in the first query uses links to files lying in postgres in the default directory (\share\tsearch_data\).
To create his dictionary we need 3 files: English.affix English.stop and russian.dict. The first contains a description of the terminations of words of the language, the second is a list of stop words, the last words themselves. It is important to remember that the files must be in the same encoding that the base itself, i.e., in our case UTF8.

When implementing search we will not go to database fields, and a special index which will contain information about the contents of the table.
ALTER TABLE "news" ADD COLUMN fts_news tsvector;
UPDATE "news" SET fts_news=
setweight( coalesce( to_tsvector('mydict_ru' "title")")'A') ||
setweight( coalesce( to_tsvector('mydict_ru', "metaKeywords"),")'B') ||
setweight( coalesce( to_tsvector('mydict_ru', "content"),")'D');

CREATE INDEX news_fts_idx ON "news" USING gist(fts_news);

vacuum analyze "news";

This script we created in the table one more field of type tsvector, which contains information about the four fields of the table, where the field "title" is taken with the highest priority is A, and the field "content" — the lowest D. Then, create a GIST index and did an update of the index.
As a result of our test record in the table corresponds to the next tsvector 'search':8C 'article':4B 'the news':2A,3B,6C 'test':1A,5C.

Well, now, and it's time for a test of our search. Fulfill the next request:
SELECT
"newsId",
ts_headline('mydict_ru' "title", q) as "title"
rank
FROM (
SELECT
"newsId",
"title",
q
ts_rank( fts_news, q ) as rank
FROM "news", plainto_tsquery('mydict_ru' 'news' ) q
WHERE fts_news @@ q
ORDER BY rank DESC
) AS foo;

The result is the string Test news. Highlighting found words implements function ts_headline; the ranking query — the ts_rank, the value of which we sort the results.

And of course we must not forget to make a trigger that will have the table "news" update field "fts_news":
CREATE OR REPLACE FUNCTION "updateNewsFTS"() RETURNS "trigger" AS '
DECLARE bUpdate boolean;
BEGIN
bUpdate = false;
IF (TG_OP = '
'INSERT'') THEN
bUpdate := true;
ELSEIF (TG_OP = '
'UPDATE'') THEN
IF NEW.title != OLD.title OR NEW.content != OLD.content OR NEW."metaKeywords" != OLD."metaKeywords" OR "NEW"."metaDescription" != OLD."metaDescription" THEN
bUpdate := true;
END IF;
END IF;

IF bUpdate = TRUE THEN
RAISE NOTICE '
'UPDATE '';
new.fts_news:=setweight( coalesce( to_tsvector('
'mydict_ru'', new.title),'"'),''A'') ||
setweight( coalesce( to_tsvector('
'mydict_ru'', new."metaKeywords"),'"'),''B'') ||
setweight( coalesce( to_tsvector('
'mydict_ru'', new."metaDescription"),'"'),''C'') ||
setweight( coalesce( to_tsvector('
'mydict_ru'', new.content),'"'),''D'');
END IF;
RETURN NEW;
END;
'
LANGUAGE 'plpgsql' VOLATILE;

CREATE TRIGGER "newsFTSTrigger"
BEFORE INSERT OR UPDATE ON "news"
FOR EACH ROW EXECUTE PROCEDURE "updateNewsFTS"();

the UPD
the

Performance


Since the review was asked to provide information about the performance, I decided to compare two types of the search tsearch and search for regular expressions.
The search is performed in VIEW linking data from four tables (about 5,400 records).

A query that uses tsearch:
SELECT
"id",
"type",
ts_headline('mydict_ru' "title", q) as "title"
( CASE WHEN trim( "foreword" ) = " THEN ts_headline('mydict_ru' "content", q)
ELSE ts_headline('mydict_ru', "foreword", q) END ) as "body"
"resourceTypes",
rank
SELECT
"id",
"type",
"title",
"foreword",
"content",
"resourceTypes",
q
ts_rank( fts_vector, q ) as rank
FROM "getSearchItems", plainto_tsquery('mydict_ru' 'physics' ) q
WHERE fts_vector @@ q
ORDER BY rank DESC
) AS foo;

Average execution time this query on my laptop is 2.35 seconds. The result contains 821 recording.

A query that uses regular expressions and implements the similarity of ranking results:
SELECT *
, (SELECT position( lower('physics') in lower("search"."title") ) as ' pos '
EXCEPT SELECT 0 as "pos" ) as "titlePosition"
, (select position( lower('physics') in lower("search"."metaKeywords") ) as ' pos '
EXCEPT SELECT 0 as "pos" ) as "metaKeywordsPosition"
, (SELECT position( lower('physics') in lower("search"."metaDescription") ) as ' pos '
EXCEPT SELECT 0 as "pos" ) as "metaDescriptionPosition"
, (SELECT position( lower('physics') in lower("search"."foreword") ) as ' pos '
EXCEPT SELECT 0 as "pos" ) as "forewordPosition"
, (SELECT position( lower('physics') in lower("search"."content") ) as ' pos '
EXCEPT SELECT 0 as "pos" ) as "contentPosition"
FROM (
SELECT
"id",
"type",
"title",
"metaKeywords"
"metaDescription"
"foreword",
"content"
FROM "getSearchItems"
WHERE ( lower("title") ~ lower('(.*)'||'physics'||'(.*)')
or lower( "metaKeywords" ) ~ lower('(.*)'||'physics'||'(.*)')
or lower( "metaDescription" ) ~ lower('(.*)'||'physics'||'(.*)')
or lower( "foreword" ) ~ lower('(.*)'||'physics'||'(.*)')
or lower( "content" ) ~ lower('(.*)'||'physics'||'(.*)') )
) as "search"
ORDER BY type ASC
, "titlePosition" ASC
, "metaKeywordsPosition" ASC
, "metaDescriptionPosition" ASC
, "forewordPosition" ASC
, "contentPosition" ASC;

Average execution time this query on my laptop is 1.5 seconds. The result contains 567 records.
Thus, when using tsearch I got a working time of 1.5 times the working time of "simple" search, but I got about 1.5 times more records that contain different forms of the word physics and ready for output to the template.

the

PS


This implementation of the search provides full text search for Russian words in the database.
In this configuration, the English words were not indexed. To fix this, you need to create another search configuration, I just provide the one that used himself.
CREATE TEXT SEARCH DICTIONARY english_ispell (
TEMPLATE = ispell,
DictFile = english,
AffFile = english,
StopWords = english
);

ALTER TEXT SEARCH CONFIGURATION mydict_ru ADD MAPPING
FOR asciiword
WITH english_ispell;

Related links:
the Archive file Russian ispell dictionaries (UTF8)
Official page tsearch2
an Example of using tsearch2 in postgresql-8.2
Article based on information from habrahabr.ru

Комментарии

Популярные сообщения из этого блога

Briefly on how to make your Qt geoservice plugin

Database replication PostgreSQL-based SymmetricDS

Yandex.Widget + adjustIFrameHeight + MooTools