Search using PostgreSQL GIN indices

Ablo.ai, like most applications, provides a search feature for finding products and creators. As with many startups, Ablo.ai initially implemented search by simply filtering the PostgreSQL database with LIKE pattern matching. This approach is reasonably fast only while the data size stays small. As the data scales and additional filtering features are added, this naive filtering becomes progressively slower, pushing our API latency to unbearable levels. To solve problems like this, people typically reach for tools like Elasticsearch, Solr, or other third-party services. We decided instead to work with what we had and exploit PostgreSQL to its fullest. This post describes the research and decisions that led us to choose PostgreSQL with GIN (Generalized Inverted Index) indices as our search engine over more industry-standard and popular tools.

The GIN index data structure is illustrated in the diagram at https://pganalyze.com/blog/gin-index.

Choosing PostgreSQL GIN indexes for full-text search is mainly driven by ease of implementation and cost-effectiveness, as it provides indexed search speed directly within PostgreSQL without additional components. Since our codebase already runs on TypeORM with PostgreSQL, it only requires setting up the appropriate columns with a GIN index. This approach avoids the need to collect meaningful data, send it to third-party services, and then query those services via APIs or SDKs.

Given that we already have a PostgreSQL container with ample unused space, potential issues related to adding indices—such as increased storage—are minimal. PostgreSQL indices store extra data structures to enable faster data retrieval, which speeds up queries but also consumes additional storage. For example, GitLab encountered issues with large GIN indexes causing occasional slow updates due to the overhead of cleaning up the GIN pending list, sometimes resulting in multi-second stalls during operations (source). These issues arose from the heavy write workload on their large indices, requiring complex tuning and maintenance strategies to keep performance acceptable. However, this situation does not apply to us for the foreseeable future, as our database workload and size are much smaller, and we have sufficient resources to manage GIN index overhead without experiencing such bottlenecks.

By default, PostgreSQL uses B-tree indices, which employ a balanced tree structure where each index entry points to a single row. B-tree indices work best with straightforward data types like numbers, dates, and single values. However, when dealing with complex data types that hold multiple values in one column—such as arrays, JSON documents, or full-text search data—GIN indexes are the better option as they efficiently handle these multi-valued structures.
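As a quick illustration of the difference, here is how each kind of index would be declared (the created_at column exists in our schema; the tags array column is hypothetical):

```sql
-- B-tree (the default): one index entry per row; ideal for scalar
-- comparisons on numbers, dates, and single values
CREATE INDEX idx_design_created_at ON design (created_at);

-- GIN: one index entry per element inside the column value;
-- ideal for arrays, JSONB documents, and tsvector full-text data
CREATE INDEX idx_design_tags ON design USING GIN (tags);
```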

Using third-party search tools like Elasticsearch provides powerful, highly scalable full-text search capabilities with advanced features such as fuzzy matching, ranking algorithms, autocomplete, and complex aggregations. These tools are built specifically for search, offering fast and relevant results even on massive datasets. However, they bring overhead in the form of additional infrastructure to deploy and maintain, requiring data synchronization between your main database and the search cluster, which increases complexity and operational costs. Moreover, you must manage separate backups, security, and ensure consistency between data stores, which can complicate your system architecture compared to using built-in database search features.

So, we decided to give GIN a try.

Setting up GIN in TypeORM

By creating a generated column in the same table that combines text fields such as name and description into a single tsvector using PostgreSQL’s to_tsvector with a specified configuration (e.g., ‘english’), we can automatically update this column on each insert or update. This generated column can then be indexed with a GIN index, which is designed for sub-dividable data like full-text lexemes, enabling fast and efficient search queries without manually converting text to vectors on each request.

Key benefits include automatic updates of the full-text vector on data changes, speeding up search execution via the GIN index, and simplifying queries as they can directly search the indexed generated column rather than calling to_tsvector dynamically. While generated columns must reference only columns from the same table (thus requiring separate columns and indexes for related entities like design and template), this approach still avoids overhead of on-the-fly vector calculation and provides clear performance advantages.
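A quick way to preview what such a generated column will contain is to run to_tsvector directly in psql:

```sql
SELECT to_tsvector('english', 'Heart Eyes Bunny');
-- 'bunni':3 'eye':2 'heart':1
```

Note how the 'english' configuration stems each word (bunny becomes bunni, eyes becomes eye) and records its position in the original text.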

Example from the Design entity with a generated searchVector column and corresponding GIN index demonstrates this setup:

@Index('idx_gin_design_search_vector', ['searchVector'])
@Column({
  type: 'tsvector',
  generatedType: 'STORED',
  asExpression: `to_tsvector('english', COALESCE(name, '') || ' ' || COALESCE(description, ''))`,
  nullable: false,
  select: false,
  insert: false,
  update: false
})
searchVector: string;

After generating a migration for the searchVector column, a couple of inaccuracies surfaced. TypeORM does not support creating GIN indexes natively, so we had to manually tweak the index-creation SQL to specify the GIN index type. Additionally, when creating a generated column, TypeORM writes full details into the typeorm_metadata table, including the database name. This means that if your local database name differs from the one originally used for migrations, TypeORM will regenerate the code for that generated column every time new migrations are created.

The fixed migration looks like this (USING GIN is the crucial tweak; without it you get a B-tree index instead):

  public async up(queryRunner: QueryRunner): Promise<void> {
    await queryRunner.query(`
            ALTER TABLE "design"
            ADD "search_vector" tsvector GENERATED ALWAYS AS (
                    to_tsvector(
                        'english',
                        COALESCE(name, '') || ' ' || COALESCE(description, '')
                    )
                ) STORED NOT NULL
        `)
    await queryRunner.query(
      `
            INSERT INTO "typeorm_metadata"(
                    "database",
                    "schema",
                    "table",
                    "type",
                    "name",
                    "value"
                )
            VALUES ($1, $2, $3, $4, $5, $6)
        `,
      [
        'image_generator',
        'public',
        'design',
        'GENERATED_COLUMN',
        'search_vector',
        "to_tsvector('english', COALESCE(name, '') || ' ' || COALESCE(description, ''))"
      ]
    )

    await queryRunner.query(`
      CREATE INDEX "idx_gin_design_search_vector" 
      ON "design" USING GIN ("search_vector")
  `)
  }

The data stored in search_vector for a design named "Heart Eyes Bunny", with the description "A cute pink bunny with heart-shaped eyes radiating love. Perfect for expressing affection in a fun way!", looks like this:

'affect':18 'bunni':3,7 'cute':5 'express':17 'eye':2,12 'fun':21 'heart':1,10 'heart-shap':9 'love':14 'perfect':15 'pink':6 'radiat':13 'shape':11 'way':22

The text represents a tsvector column value, which stores a processed, searchable version of a text document for full-text search. Each word (called a lexeme), like ‘affect’, ‘bunni’, or ‘cute’, is listed with numbers that indicate the positions where that word appears in the original text. This positional data helps PostgreSQL optimize phrase and proximity searches. The GIN index uses these lexemes and their positions to quickly find rows containing the searched words without scanning the entire text, making full-text search efficient and fast.
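To make the structure concrete, the textual tsvector representation above can be parsed into a lexeme-to-positions map. This is purely illustrative (PostgreSQL stores tsvector in its own binary format internally):

```typescript
// Parse a tsvector string like "'bunni':3,7 'cute':5" into a
// lexeme -> positions map. Illustrative only.
function parseTsVector(tsv: string): Map<string, number[]> {
  const result = new Map<string, number[]>();
  // Each entry looks like 'lexeme':1,2,3 — lexemes are single-quoted.
  const entryRe = /'([^']+)':([\d,]+)/g;
  let m: RegExpExecArray | null;
  while ((m = entryRe.exec(tsv)) !== null) {
    result.set(m[1], m[2].split(',').map(Number));
  }
  return result;
}

const vector =
  "'affect':18 'bunni':3,7 'cute':5 'heart':1,10 'heart-shap':9";
const parsed = parseTsVector(vector);
console.log(parsed.get('bunni')); // positions of "bunni": [3, 7]
```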

Finally, we update the API to handle search:

      const sanitizedSearch = sanitizeSearchString(options.search)
      const searchTerm = this.prepareSearchTerm(sanitizedSearch)
      const fullTextExpr = `
        "design"."search_vector" @@ to_tsquery(:query)
      `
      queryBuilder
        .addSelect(`similarity(design.name, :searchTerm)`, 'name_similarity')
        .addSelect(fullTextExpr, 'fulltext_match')
        .andWhere(`(${fullTextExpr} OR design.name ILIKE :searchPattern)`, {
          query: searchTerm,
          searchPattern: `%${sanitizedSearch}%`,
          searchTerm: sanitizedSearch
        })
        .orderBy('fulltext_match', 'DESC')
        .addOrderBy('name_similarity', 'DESC')
        .addOrderBy('design.createdAt', 'DESC')

This code performs a full-text search combined with a similarity ranking and fallback pattern match on the “design” entity. First, it sanitizes the search input and prepares it as a PostgreSQL tsquery term. Then, it constructs a query that checks if the search_vector (a precomputed tsvector column) matches the search term using full-text search operators. The query adds two computed columns: fulltext_match to indicate if the full-text search matched, and name_similarity to measure similarity between the search term and the design name. It filters results where either the full-text search matches or the name contains the search text pattern (ILIKE). Finally, it orders results by the full-text match flag first (descending), then by the similarity score, and lastly by creation date, ensuring the most relevant and recent records appear first.
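The sanitizeSearchString and prepareSearchTerm helpers referenced above are not shown in the snippet; our real implementations may differ, but a minimal sketch of what they do could look like this (the exact character whitelist and the use of prefix matching are assumptions):

```typescript
// Hypothetical sketch: strip characters that carry special meaning in
// tsquery and LIKE patterns, keeping letters, digits, and whitespace.
function sanitizeSearchString(input: string): string {
  return input.replace(/[^\p{L}\p{N}\s]/gu, ' ').trim();
}

// Hypothetical sketch: join words with the tsquery AND operator (&)
// and append :* so each word matches as a prefix.
function prepareSearchTerm(sanitized: string): string {
  return sanitized
    .split(/\s+/)
    .filter(Boolean)
    .map((word) => `${word}:*`)
    .join(' & ');
}

console.log(prepareSearchTerm(sanitizeSearchString('heart-eyes bunny!')));
// "heart:* & eyes:* & bunny:*"
```

Sanitizing before building the tsquery matters: raw user input containing characters like & or ! would otherwise be interpreted as tsquery operators and cause syntax errors.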

The resulting API performance

We haven’t noticed any slowdown in our POST or PATCH APIs in practice—though theoretically they may take slightly longer, the difference is imperceptible. Meanwhile, our search performs quickly and reliably, consistently returning accurate results for both Products and Creators. The API response time typically stays within 200-250 ms, and with caching in place, it becomes even faster over time as more requests are served directly from cache, improving efficiency with continued use.

The raw SQL query consistently completes in around 0.041 s (0.001 s fetch), even when executed against a server in a different region:

SELECT
  design.*,
  similarity(design.name, 'dogs') AS name_similarity,
  (design.search_vector @@ to_tsquery('dogs')) AS fulltext_match
FROM
  design
WHERE
  (design.search_vector @@ to_tsquery('dogs') OR design.name ILIKE '%dogs%')
ORDER BY
  fulltext_match DESC,
  name_similarity DESC,
  design.created_at DESC
LIMIT 5;

The query plan shows that the GIN index on the search_vector column is effectively utilized to accelerate full-text search. The plan includes a Bitmap Index Scan on the GIN index followed by a Bitmap Heap Scan on the design table, filtering rows that match the full-text condition or the trigram index on the name column. The sorting is performed with a top-N heapsort based on the full-text match, similarity score, and creation date. The entire query runs efficiently with an execution time of just over 2 milliseconds:

Limit  (cost=341.80..341.81 rows=5 width=551) (actual time=2.305..2.307 rows=5 loops=1)
  ->  Sort  (cost=341.80..342.15 rows=139 width=551) (actual time=2.305..2.306 rows=5 loops=1)
        Sort Key: ((search_vector @@ to_tsquery('dogs'::text))) DESC, (similarity((name)::text, 'dogs'::text)) DESC, created_at DESC
        Sort Method: top-N heapsort  Memory: 29kB
        ->  Bitmap Heap Scan on design  (cost=119.06..339.49 rows=139 width=551) (actual time=1.171..2.023 rows=135 loops=1)
              Recheck Cond: ((search_vector @@ to_tsquery('dogs'::text)) OR ((name)::text ~~* '%dogs%'::text))
              Heap Blocks: exact=108
              ->  BitmapOr  (cost=119.06..119.06 rows=139 width=0) (actual time=1.132..1.133 rows=0 loops=1)
                    ->  Bitmap Index Scan on idx_gin_design_search_vector  (cost=0.00..24.38 rows=138 width=0) (actual time=0.371..0.372 rows=135 loops=1)
                          Index Cond: (search_vector @@ to_tsquery('dogs'::text))
                    ->  Bitmap Index Scan on trgm_idx_name  (cost=0.00..94.61 rows=1 width=0) (actual time=0.760..0.760 rows=8 loops=1)
                          Index Cond: ((name)::text ~~* '%dogs%'::text)
Planning Time: 0.406 ms
Execution Time: 2.355 ms

This demonstrates how GIN indexes combined with similarity and trigram indexes can deliver fast, accurate full-text search results with efficient query execution.
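Note that the name_similarity ranking and the trgm_idx_name index seen in the plan come from the pg_trgm extension, which is set up separately from the tsvector machinery. A sketch of that setup (a GiST index would work here as well):

```sql
-- Enable trigram support for similarity() and indexed ILIKE '%...%'
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Trigram GIN index backing the ILIKE fallback and similarity ranking
CREATE INDEX trgm_idx_name ON design USING GIN (name gin_trgm_ops);
```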

To Sum Up

At Ablo.ai, we initially used PostgreSQL's simple LIKE-based search, which worked only for small datasets. As we scaled and added features, this approach became too slow and caused high API latency. Instead of adopting third-party search tools like Elasticsearch, we decided to maximize PostgreSQL's native capabilities by using GIN indexes for full-text search. This approach fits well with our existing TypeORM setup, requires less infrastructure overhead, and delivers fast, automated search vector updates and queries. We created generated columns combining text fields into tsvector columns that are automatically updated and indexed with GIN. While we had to manually adjust SQL to correctly create GIN indexes due to TypeORM limitations, the benefits are clear: fast, consistent search performance with minimal maintenance complexity. The tsvector columns store tokenized words with positions, allowing GIN to efficiently locate matches. Our API combines full-text search with similarity ranking and pattern matching, delivering results within 200-250 ms, with query plans confirming the effective use of GIN and trigram indexes in just a few milliseconds. This demonstrates how harnessing PostgreSQL's full-text search capabilities with GIN indexes can provide scalable, high-performance search without the complexity of external search services.
