Azure DB for PostgreSQL Single Server with pg_dump or a Python script

2021年11月24日 2026年4月24日

rio

I did it with DMS, but I'd also like to know how long it takes to do with pg_dump.

Loaded the same dummy data to Single Server with 32vCPU / 4096GB storage.

# time ./bulk-load.py 


real	85m49.610s
user	158m0.038s
sys	16m36.706s

It took about 5 hours to dump.

$ time pg_dump -Fc -h riosourcesingle.postgres.database.azure.com -p 5432 -d postgres -U rifujita@riosourcesingle -f users.sql

real	315m57.967s
user	113m58.987s
sys	7m7.770s

pg_dump has an option, '-j' to run in parallel. But the words, 'in parallel' means that each dump processes work for multiple tables. If you have the 'big' tables, it doesn't work well.

So, I wrote a Python script for parallelized '\COPY' for a single big table.

# cat bulk-copy.py
#!/usr/bin/env python3

import os
import subprocess
import psutil
import time
import datetime

def do_copy(num):
    com = "psql 'host=riosourcesingle.postgres.database.azure.com port=5432 dbname=postgres user=rifujita@riosourcesingle password={} sslmode=require' -c '\COPY (SELECT * FROM users WHERE id >= {} and id <= {}) TO dump-{:05d}.csv WITH CSV'".format(os.getenv('PGPASSWORD'), num * 1000000 + 1, (num + 1) * 1000000, num)
    subprocess.Popen(com, shell=True)

def count_psql():
    count = 0
    for p in psutil.process_iter():
        if p.name() == 'psql':
            count = count + 1
    return count

def main():
    print('{}: started.'.format(datetime.datetime.now()))
    total = 0
    while total < 1000:
        if count_psql() < 32*8: # for 32vCPU Single Server
            do_copy(total)
            total = total + 1
        time.sleep(1)

    while count_psql() != 0:
        print('processing...')
        time.sleep(5)

    print('{}: finished.'.format(datetime.datetime.now()))

if __name__ == "__main__":
    main()

As you can see, it took only 18 minutes.

$ ./bulk-copy.py

2021-11-24 09:37:31.550243: started.
COPY 1000000
COPY 1000000
<<snipped>>
processing...
COPY 1000000
2021-11-24 09:55:05.465884: finished.

カテゴリー: Azure

Azure DB for PostgreSQL Single Server with pg_dump or a Python script

Azure DB for PostgreSQL バルクINSERT, Single vs Flexible

PostgreSQLから並列COPYするスクリプト