Recently I wanted to add basic text search to an application I was working on. It took me a while to figure out the right way to index columns for LIKE lookups, especially when indexing compound columns. Here’s how I did it.

Introduction

Search is often an integral part of a web app, but it’s also one of the parts that can cause performance problems. Users expect search to be fast and accurate. Tools like Elasticsearch or Solr are great at providing quick, intelligent searches on large datasets.

However, we don’t always need such substantial dependencies in an app. Quite often, structuring and indexing the database correctly can keep your queries nice and fast.

I ❤ Postgres

The more I work with PostgreSQL, the more it impresses me. Whenever I need something from it, it’s usually already there; I just can’t always find out how to do it. Indexing columns for LIKE queries was a perfect example of this.

Implementation

I’ve got a very simple users database table populated with 1 million rows.

                                     Table "public.users"
   Column   |            Type             |                     Modifiers
------------+-----------------------------+----------------------------------------------------
 id         | integer                     | not null default nextval('users_id_seq'::regclass)
 username   | character varying           |
 first_name | character varying           |
 last_name  | character varying           |

Without Any Index

Doing a quick case-insensitive ILIKE search with no matches is pretty slow:

$> EXPLAIN ANALYSE SELECT COUNT(*) FROM users WHERE username ILIKE '%foo%';
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=21927.24..21927.25 rows=1 width=0) (actual time=737.523..737.523 rows=1 loops=1)
   ->  Seq Scan on users  (cost=0.00..21927.00 rows=96 width=0) (actual time=737.520..737.520 rows=0 loops=1)
         Filter: ((username)::text ~~* '%foo%'::text)
         Rows Removed by Filter: 1000000
 Planning time: 0.373 ms
 Execution time: 737.593 ms
(6 rows)

Time: 738.404 ms

Regular index

You might think that indexing this column with a standard btree index would help this search - it doesn’t. As you can see below, the query is just as slow. It’s doing a full table scan and ignoring the index completely.

The new index would be used if we’re doing a comparison search such as WHERE username = 'foo', but not if we’re doing a partial match using LIKE or ILIKE:

$> CREATE INDEX idx_users_username ON users (username);
Time: 15987.751 ms
$> EXPLAIN ANALYSE SELECT COUNT(*) FROM users WHERE username ILIKE '%foo%';

                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=21927.24..21927.25 rows=1 width=0) (actual time=752.271..752.271 rows=1 loops=1)
   ->  Seq Scan on users  (cost=0.00..21927.00 rows=96 width=0) (actual time=752.268..752.268 rows=0 loops=1)
         Filter: ((username)::text ~~* '%foo%'::text)
         Rows Removed by Filter: 1000000
 Planning time: 0.599 ms
 Execution time: 752.318 ms
(6 rows)

Time: 753.251 ms
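
For contrast, here’s a rough sketch of the two kinds of lookup (actual plans depend on your data and statistics):

-- Equality comparison: the planner can use idx_users_username
EXPLAIN SELECT * FROM users WHERE username = 'foo';

-- Partial match: the B-tree index can't help, so expect a sequential scan
EXPLAIN SELECT * FROM users WHERE username ILIKE '%foo%';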

Enter Postgres pg_trgm

The pg_trgm module provides functions and operators for determining the similarity of ASCII alphanumeric text based on trigram matching, as well as index operator classes that support fast searching for similar strings.

Postgres uses trigrams to break strings down into smaller chunks and index them efficiently. The pg_trgm module supports GiST and GIN indexes, and as of Postgres 9.1 these indexes support LIKE/ILIKE queries.

To use the pg_trgm module, you need to enable the extension and create the index, passing in the gin_trgm_ops operator class:

$> CREATE EXTENSION pg_trgm;
Time: 42.206 ms

$> CREATE INDEX trgm_idx_users_username ON users USING gin (username gin_trgm_ops);
Time: 7082.474 ms
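
As a quick aside, once the extension is enabled you can see exactly which trigrams Postgres extracts from a string with the show_trgm function:

$> SELECT show_trgm('foo');

For 'foo' that’s four trigrams: "  f", " fo", "foo" and "oo " (each word is treated as having two spaces prefixed and one space suffixed before being split into three-character chunks).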

Now running the same query takes only a fraction of the time. Just 2ms instead of 753ms! We can see from the query plan that it’s using the new trgm_idx_users_username index we created.

$> EXPLAIN ANALYSE SELECT COUNT(*) FROM users WHERE username ILIKE '%foo%';

                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=369.12..369.13 rows=1 width=0) (actual time=0.030..0.030 rows=1 loops=1)
   ->  Bitmap Heap Scan on users  (cost=12.75..368.88 rows=96 width=0) (actual time=0.026..0.026 rows=0 loops=1)
         Recheck Cond: ((username)::text ~~* '%foo%'::text)
         ->  Bitmap Index Scan on trgm_idx_users_username  (cost=0.00..12.72 rows=96 width=0) (actual time=0.024..0.024 rows=0 loops=1)
               Index Cond: ((username)::text ~~* '%foo%'::text)
 Planning time: 0.636 ms
 Execution time: 0.095 ms
(7 rows)

Time: 2.333 ms

Compound columns

In the table above we have users with a first_name and last_name.

Let’s say our app lets us search for users by name: if I enter ‘John Doe’ into the search field, I’d expect it to return a user named ‘John Doe’ (assuming one exists).

To do this we need to use both the first_name and last_name columns, concatenating them in the query:

SELECT * FROM users WHERE first_name || ' ' || last_name ILIKE '%John Doe%'

If we create two new indexes on the first_name and last_name columns, like we did for username, they won’t be used by this query. The query filters on a concatenated expression, so indexes on the individual columns won’t help:

$> CREATE INDEX trgm_idx_users_first ON users USING gin (first_name gin_trgm_ops);
Time: 4577.637 ms
$> CREATE INDEX trgm_idx_users_last ON users USING gin (last_name gin_trgm_ops);
Time: 4770.507 ms
$> EXPLAIN ANALYSE SELECT COUNT(*) FROM users WHERE first_name || ' ' || last_name ILIKE '%foo%';

                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=27027.00..27027.01 rows=1 width=0) (actual time=1025.543..1025.544 rows=1 loops=1)
   ->  Seq Scan on users  (cost=0.00..26927.00 rows=40000 width=0) (actual time=1025.539..1025.539 rows=0 loops=1)
         Filter: ((((first_name)::text || ' '::text) || (last_name)::text) ~~* '%foo%'::text)
         Rows Removed by Filter: 1000000
 Planning time: 0.273 ms
 Execution time: 1025.591 ms
(6 rows)

Time: 1027.547 ms

Instead, we need to create the index on the same concatenated expression we’re using in the query.

$> CREATE INDEX index_users_full_name
             ON users USING gin ((first_name || ' ' || last_name) gin_trgm_ops);

Running the query again, we can see the compound index being used and we’re back to lightning-fast speeds.

EXPLAIN ANALYSE SELECT COUNT(*) FROM users WHERE first_name || ' ' || last_name ILIKE '%foo%';

                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=10605.00..10605.01 rows=1 width=0) (actual time=0.020..0.020 rows=1 loops=1)
   ->  Bitmap Heap Scan on users  (cost=378.01..10505.00 rows=40000 width=0) (actual time=0.018..0.018 rows=0 loops=1)
         Recheck Cond: ((((first_name)::text || ' '::text) || (last_name)::text) ~~* '%foo%'::text)
         ->  Bitmap Index Scan on index_users_full_name  (cost=0.00..368.01 rows=40000 width=0) (actual time=0.016..0.016 rows=0 loops=1)
               Index Cond: ((((first_name)::text || ' '::text) || (last_name)::text) ~~* '%foo%'::text)
 Planning time: 0.338 ms
 Execution time: 0.080 ms
(7 rows)

Time: 0.975 ms
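
In the application itself, the search term can be passed in as a parameter. As long as the left-hand side of the ILIKE matches the indexed expression exactly, the planner can still use the index. A rough sketch, assuming the user’s input is bound to $1:

SELECT id, first_name, last_name
FROM users
WHERE first_name || ' ' || last_name ILIKE '%' || $1 || '%';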

Why we’re not using concat_ws

When I initially built the query, I used the concat_ws (concatenate with separator) string function to join first_name and last_name:

concat_ws(' ', first_name, last_name)

However, concat_ws can’t be used in an index expression because it isn’t marked immutable. See this discussion for more information on that.
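
If you try it, Postgres refuses to build the index with something like:

$> CREATE INDEX index_users_full_name
             ON users USING gin ((concat_ws(' ', first_name, last_name)) gin_trgm_ops);
ERROR:  functions in index expression must be marked IMMUTABLE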

Conclusion

Once again, digging into the rich features of Postgres helps stave off adding dependencies on other search tools. The pg_trgm module provides far more than just better indexing for text columns; it provides rich fuzzy-searching capabilities and can handle full-text search. This blog post gives some great ideas of what’s possible with your application database.
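
For example, the same trigram index also supports similarity searches, so you can find near-matches rather than exact substrings. A quick sketch using the % operator and similarity function that ship with pg_trgm:

SELECT username, similarity(username, 'jon doe') AS score
FROM users
WHERE username % 'jon doe'
ORDER BY score DESC
LIMIT 10;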

Thanks to this post by @the_depesz for pointing me in the right direction.