🚧 Work in Progress 🚧

This blog post is still under development. Content and features may change.

Writing and Optimizing SQL Queries

Optimizing SQL queries is essential for ensuring efficient data retrieval and overall database performance. In this blog post, we will explore techniques for writing and optimizing SQL queries, using PostgreSQL's EXPLAIN ANALYZE and indexes on the Pagila sample database.

4 min read

1. Set up the environment

We will use the Pagila sample database, a PostgreSQL port of the well-known Sakila dataset. First, run PostgreSQL 18 in Docker and open a psql session:

docker run --name postgres-learning -e POSTGRES_PASSWORD=postgres -p 5434:5432 -d postgres:18
docker exec -it postgres-learning psql -U postgres

Create the database:

psql (18.0 (Debian 18.0-1.pgdg13+3))
Type "help" for help.

postgres=# CREATE DATABASE pagila;
CREATE DATABASE
postgres=# \q

Create all schema objects (pagila-schema.sql and pagila-data.sql come with the Pagila sample database):

cat ~/Evgenii/tutorials/tech-blog/sql-queries/pagila/pagila-schema.sql | docker exec -i postgres-learning psql -U postgres -d pagila

Fill tables with data:

cat ~/Evgenii/tutorials/tech-blog/sql-queries/pagila/pagila-data.sql | docker exec -i postgres-learning psql -U postgres -d pagila
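
As a quick sanity check that the load succeeded, connect to the database (as shown in the next step) and compare row counts. The expected numbers in the comment are taken from the EXPLAIN output later in this post:

-- Expected counts, based on the plan output further down:
-- rental: 16044, inventory: 4581, film: 1000.
SELECT
  (SELECT count(*) FROM rental)    AS rentals,
  (SELECT count(*) FROM inventory) AS inventory_items,
  (SELECT count(*) FROM film)      AS films;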

2. Connect to the database

Connect to the pagila database and list its tables:

docker exec -it postgres-learning psql -U postgres -d pagila
pagila=# \dt
                     List of relations
 Schema |       Name       |       Type        |  Owner   
--------+------------------+-------------------+----------
 public | actor            | table             | postgres
 public | address          | table             | postgres
 public | category         | table             | postgres
 public | city             | table             | postgres
 public | country          | table             | postgres
 public | customer         | table             | postgres
 public | film             | table             | postgres
 public | film_actor       | table             | postgres
 public | film_category    | table             | postgres
 public | inventory        | table             | postgres
 public | language         | table             | postgres
 public | payment          | partitioned table | postgres
 public | payment_p2022_01 | table             | postgres
 public | payment_p2022_02 | table             | postgres
 public | payment_p2022_03 | table             | postgres
 public | payment_p2022_04 | table             | postgres
 public | payment_p2022_05 | table             | postgres
 public | payment_p2022_06 | table             | postgres
 public | payment_p2022_07 | table             | postgres
 public | rental           | table             | postgres
 public | staff            | table             | postgres
 public | store            | table             | postgres
(22 rows)

Now run a simple query under EXPLAIN ANALYZE to see how PostgreSQL executes it. Here we fetch all rentals for a single customer:

pagila=# EXPLAIN ANALYZE
SELECT * FROM rental WHERE customer_id = 123;
                                               QUERY PLAN                                               
--------------------------------------------------------------------------------------------------------
 Seq Scan on rental  (cost=0.00..350.55 rows=25 width=40) (actual time=0.217..2.923 rows=24.00 loops=1)
   Filter: (customer_id = 123)
   Rows Removed by Filter: 16020
   Buffers: shared hit=150
 Planning:
   Buffers: shared hit=85
 Planning Time: 0.636 ms
 Execution Time: 3.053 ms
(8 rows)
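
As an aside, this first query is itself a good index candidate: the sequential scan reads all 16044 rows and discards 16020 of them to return 24. A minimal sketch, assuming rental.customer_id is not indexed in this load (the Seq Scan above suggests the planner had no index to use); idx_rental_customer_id is a name introduced here, not part of the stock schema:

-- Hypothetical index for the customer lookup above. With it in place,
-- the planner can replace the sequential scan with an index or bitmap
-- scan that touches only the matching rows.
CREATE INDEX idx_rental_customer_id ON rental (customer_id);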

Let's execute a more complex query to find the top 10 most rented films in the last month:

pagila=# EXPLAIN ANALYZE
SELECT f.title, COUNT(*) as rentals
FROM film f
JOIN inventory i ON f.film_id = i.film_id
JOIN rental r ON i.inventory_id = r.inventory_id
WHERE r.rental_date >= '2005-08-01'  -- assume last month in data
GROUP BY f.title
ORDER BY rentals DESC
LIMIT 10;

The execution plan:

Limit  (cost=766.41..766.43 rows=10 width=23) (actual time=27.328..27.333 rows=10.00 loops=1)
  Buffers: shared hit=247
  ->  Sort  (cost=766.41..768.91 rows=1000 width=23) (actual time=27.326..27.330 rows=10.00 loops=1)
        Sort Key: (count(*)) DESC
        Sort Method: top-N heapsort  Memory: 26kB
        Buffers: shared hit=247
        ->  HashAggregate  (cost=734.80..744.80 rows=1000 width=23) (actual time=27.034..27.125 rows=958.00 loops=1)
              Group Key: f.title
              Batches: 1  Memory Usage: 121kB
              Buffers: shared hit=244
              ->  Hash Join  (cost=219.57..654.58 rows=16044 width=15) (actual time=8.964..23.326 rows=16044.00 loops=1)
                    Hash Cond: (i.film_id = f.film_id)
                    Buffers: shared hit=244
                    ->  Hash Join  (cost=133.07..525.78 rows=16044 width=4) (actual time=4.562..14.615 rows=16044.00 loops=1)
                          Hash Cond: (r.inventory_id = i.inventory_id)
                          Buffers: shared hit=180
                          ->  Seq Scan on rental r  (cost=0.00..350.55 rows=16044 width=4) (actual time=0.066..5.764 rows=16044.00 loops=1)
                                Filter: (rental_date >= '2005-08-01 00:00:00+03'::timestamp with time zone)
                                Buffers: shared hit=150
                          ->  Hash  (cost=75.81..75.81 rows=4581 width=8) (actual time=3.673..3.674 rows=4581.00 loops=1)
                                Buckets: 8192  Batches: 1  Memory Usage: 243kB
                                Buffers: shared hit=30
                                ->  Seq Scan on inventory i  (cost=0.00..75.81 rows=4581 width=8) (actual time=0.132..2.259 rows=4581.00 loops=1)
                                      Buffers: shared hit=30
                    ->  Hash  (cost=74.00..74.00 rows=1000 width=19) (actual time=4.241..4.242 rows=1000.00 loops=1)
                          Buckets: 1024  Batches: 1  Memory Usage: 60kB
                          Buffers: shared hit=64
                          ->  Seq Scan on film f  (cost=0.00..74.00 rows=1000 width=19) (actual time=0.013..4.046 rows=1000.00 loops=1)
                                Buffers: shared hit=64
Planning:
  Buffers: shared hit=427
Planning Time: 11.309 ms
Execution Time: 28.084 ms

The query execution plan shows that the database is performing sequential scans on the rental, inventory, and film tables, which can be inefficient for large datasets. To optimize this query, we can consider adding indexes on the columns used in the JOIN and WHERE clauses.

3. Optimize with indexes

Add an index on the rental_date column used in the WHERE clause and one on the inventory.film_id column used in the JOIN:

CREATE INDEX idx_rental_date ON rental(rental_date);
CREATE INDEX idx_inventory_film ON inventory(film_id);
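
Optionally, refresh the planner statistics so cost estimates reflect the new state. This is standard housekeeping (autovacuum will also do it eventually on its own):

-- Refresh the table statistics the query planner relies on.
ANALYZE rental;
ANALYZE inventory;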

After creating the indexes, re-run the same query and compare the execution plan. The new timing summary:

Planning:
  Buffers: shared hit=27
Planning Time: 0.502 ms
Execution Time: 10.109 ms

The execution time dropped from 28.084 ms to 10.109 ms, nearly a 3x improvement, demonstrating the effectiveness of indexing in optimizing SQL queries.
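
To confirm the planner is actually using the new indexes, PostgreSQL's statistics views can help. A minimal sketch using pg_stat_user_indexes (re-run the query a few times first so the counters have something to show):

-- idx_scan counts how often each index has been used since statistics
-- were last reset; a value of 0 after re-running the query would mean
-- the planner is ignoring that index.
SELECT relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE indexrelname IN ('idx_rental_date', 'idx_inventory_film');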

Conclusion

Optimizing SQL queries is crucial for improving database performance, especially as data volume grows. By analyzing query execution plans and implementing appropriate indexing strategies, we can significantly reduce query execution times and enhance overall application performance.
