Overview - tsvector and tsquery types

What is it?

In PostgreSQL, tsvector and tsquery are special data types used for full-text search. tsvector stores processed text as a list of words with their positions, making it easy to search. tsquery represents search queries that can match words or phrases in tsvector data. Together, they help find text quickly and accurately in large databases.

Why it matters

Without tsvector and tsquery, searching text in databases would be slow and inaccurate, especially for big data. They solve the problem of finding relevant information fast by indexing and matching words efficiently. This makes applications like search engines, document management, and content filtering work smoothly and reliably.

Where it fits

Before learning tsvector and tsquery, you should understand basic SQL queries and text data types. After mastering these, you can explore advanced full-text search features, indexing strategies, and performance tuning in PostgreSQL.

Mental Model

Core Idea

tsvector stores searchable text as a list of words and positions, while tsquery represents the search conditions to find matching words in that list.

Think of it like...

Imagine a book index (tsvector) listing all important words and their page numbers, and a reader's search request (tsquery) asking for pages containing certain words or phrases.

┌─────────────┐       ┌─────────────┐
│   tsvector  │       │   tsquery   │
│ (word list) │       │ (search)    │
│ word1:pos1 │       │ 'word1' &   │
│ word2:pos5 │  <--> │ 'word2' |   │
│ word3:pos9 │       │ 'word3'     │
└─────────────┘       └─────────────┘

Build-Up - 7 Steps

1

FoundationUnderstanding tsvector basics

Concept: tsvector stores text as normalized words with positions for searching.

tsvector breaks text into words, removes common words (stop words), and stores each word with its position in the text. For example, 'The quick brown fox' becomes 'quick':2, 'brown':3, 'fox':4.

Result

A compact representation of text optimized for search.

Understanding how text is broken down and stored helps grasp how searches can be fast and precise.

2

FoundationIntroduction to tsquery type

3

IntermediateMatching tsquery against tsvector

4

IntermediateUsing dictionaries and normalization

5

IntermediateCreating indexes for full-text search

6

AdvancedCombining tsquery operators for complex searches

7

ExpertInternal storage and optimization of tsvector

Under the Hood

PostgreSQL processes input text by parsing it into lexemes, normalizing them using dictionaries, and storing them with positional data in tsvector. tsquery expressions are parsed into logical trees. The @@ operator matches tsquery trees against tsvector lexemes using index-assisted search, enabling fast filtering and ranking.

Why designed this way?

This design balances storage efficiency, search speed, and flexibility. Storing lexemes with positions allows phrase and proximity searches. Using separate types for data and queries enables complex logical expressions. Alternatives like scanning raw text are too slow for large data.

Input Text ──> Parser ──> Lexemes + Positions ──> tsvector Storage

Search Query ──> tsquery Parser ──> Logical Tree

          ┌───────────────┐
          │   tsvector    │
          │ (lexemes +pos)│
          └──────┬────────┘
                 │
          ┌──────▼────────┐
          │   tsquery     │
          │ (logical tree)│
          └──────┬────────┘
                 │
          ┌──────▼────────┐
          │  @@ Operator  │
          │  Matching     │
          └───────────────┘

Myth Busters - 4 Common Misconceptions

Quick: Does tsquery match substrings inside words by default? Commit yes or no.

Common Belief:tsquery matches any part of a word, so searching 'run' finds 'running' automatically.

Tap to reveal reality

Quick: Does tsvector store the original text for retrieval? Commit yes or no.

Common Belief:tsvector stores the full original text so you can retrieve it from the index.

Tap to reveal reality

Quick: Can you use tsquery to search across multiple languages without changes? Commit yes or no.

Common Belief:tsquery works the same for all languages without special setup.

Tap to reveal reality

Quick: Does creating a tsvector index automatically update when text changes? Commit yes or no.

Common Belief:Once you create a tsvector index, it updates automatically with text changes without extra work.

Tap to reveal reality

Expert Zone

1

tsvector stores positions as 16-bit integers, limiting maximum document length for phrase searches.

2

GIN indexes on tsvector support fast existence checks but have slower updates compared to GiST indexes.

3

Prefix matching in tsquery requires special syntax (e.g., 'word:*') and affects index usage.

When NOT to use

Avoid tsvector/tsquery for unstructured binary data or when exact substring matching is required; use trigram indexes or external search engines like Elasticsearch instead.

Production Patterns

Commonly, applications store a tsvector column updated via triggers on text changes, create GIN indexes on it, and use tsquery with ranking functions like ts_rank to provide fast, relevant search results.

Connections

Inverted Index

tsvector and tsquery implement an inverted index pattern for text search.

Understanding inverted indexes from information retrieval explains why tsvector stores words with positions and how tsquery matches them efficiently.

Boolean Logic

tsquery expressions use Boolean operators to combine search terms.

Knowing Boolean logic helps build complex search queries with AND, OR, and NOT conditions for precise filtering.

Library Cataloging Systems

Both organize and index information for fast retrieval using keywords and categories.

Recognizing this connection shows how database text search is like organizing books by subject and keywords for quick lookup.

Common Pitfalls

#1Not updating tsvector column after text changes.

Wrong approach:UPDATE documents SET content = 'new text' WHERE id = 1; -- but tsvector column unchanged

Correct approach:UPDATE documents SET content = 'new text', tsv = to_tsvector('english', 'new text') WHERE id = 1;

Root cause:Misunderstanding that tsvector is not automatically synchronized with text columns.

#2Using plain text columns for search without tsvector.

Wrong approach:SELECT * FROM documents WHERE content LIKE '%searchword%';

Correct approach:SELECT * FROM documents WHERE tsv @@ to_tsquery('searchword');

Root cause:Not knowing that LIKE is slow and inaccurate for full-text search.

#3Expecting tsquery to match substrings without prefix operator.

Wrong approach:SELECT * FROM documents WHERE tsv @@ to_tsquery('run'); -- expects to match 'running'

Correct approach:SELECT * FROM documents WHERE tsv @@ to_tsquery('run:*'); -- prefix match

Root cause:Ignoring that tsquery matches lexemes exactly unless prefix matching is used.

Key Takeaways

tsvector stores normalized words with positions to enable fast full-text search.

tsquery expresses search conditions using logical operators to find matching words in tsvector.

Creating indexes on tsvector columns is essential for performance on large datasets.

Normalization and dictionaries improve search relevance by handling word forms and stop words.

Proper maintenance of tsvector data and understanding tsquery syntax are critical for accurate search results.