I did it with DMS, but I also wanted to know how long it would take with pg_dump.
I loaded the same dummy data into a Single Server with 32 vCPUs / 4096 GB of storage:
# time ./bulk-load.py
real 85m49.610s
user 158m0.038s
sys 16m36.706s
Dumping it with pg_dump took about 5 hours:
$ time pg_dump -Fc -h riosourcesingle.postgres.database.azure.com -p 5432 -d postgres -U rifujita@riosourcesingle -f users.sql
real 315m57.967s
user 113m58.987s
sys 7m7.770s
pg_dump has a '-j' option to run in parallel, but 'in parallel' here means that multiple dump processes each work on a different table at the same time. If most of your data sits in a few big tables, it doesn't help much.
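For reference, '-j' works only with the directory output format ('-Fd'); an invocation would look roughly like this (the job count and output directory name here are just examples):
$ pg_dump -Fd -j 8 -h riosourcesingle.postgres.database.azure.com -p 5432 -d postgres -U rifujita@riosourcesingle -f users_dump_dir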
So, I wrote a Python script that runs parallelized '\COPY' commands against a single big table.
# cat bulk-copy.py
#!/usr/bin/env python3
import os
import subprocess
import psutil
import time
import datetime

# Dump one 1,000,000-row slice of the users table to a CSV file via psql's \COPY.
def do_copy(num):
    com = "psql 'host=riosourcesingle.postgres.database.azure.com port=5432 dbname=postgres user=rifujita@riosourcesingle password={} sslmode=require' -c '\\COPY (SELECT * FROM users WHERE id >= {} and id <= {}) TO dump-{:05d}.csv WITH CSV'".format(os.getenv('PGPASSWORD'), num * 1000000 + 1, (num + 1) * 1000000, num)
    subprocess.Popen(com, shell=True)

# Count the psql processes currently running, to cap the level of parallelism.
def count_psql():
    count = 0
    for p in psutil.process_iter():
        if p.name() == 'psql':
            count = count + 1
    return count

def main():
    print('{}: started.'.format(datetime.datetime.now()))
    total = 0
    while total < 1000:
        if count_psql() < 32*8: # for 32vCPU Single Server
            do_copy(total)
            total = total + 1
        time.sleep(1)  # throttle process creation
    # Wait until every psql process has finished.
    while count_psql() != 0:
        print('processing...')
        time.sleep(5)
    print('{}: finished.'.format(datetime.datetime.now()))

if __name__ == "__main__":
    main()
As you can see below, it took only about 18 minutes.
$ ./bulk-copy.py
2021-11-24 09:37:31.550243: started.
COPY 1000000
COPY 1000000
<<snipped>>
processing...
COPY 1000000
2021-11-24 09:55:05.465884: finished.
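A note on the design: the script caps parallelism by counting running psql processes with psutil. A minimal alternative sketch of the same idea, assuming the same connection string and the same 1,000,000-row slices, could let a thread pool enforce the cap instead:
#!/usr/bin/env python3
# Alternative sketch (assumption: same users table, ids split into 1,000,000-row slices).
# A thread pool caps the number of concurrent psql processes instead of counting them with psutil.
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

CHUNKS = 1000          # number of 1,000,000-row slices, as in bulk-copy.py
MAX_PARALLEL = 32 * 8  # same cap as the psutil-based version

def do_copy(num):
    # Each worker thread blocks on its own psql process, so the pool size limits parallelism.
    com = ("psql 'host=riosourcesingle.postgres.database.azure.com port=5432 "
           "dbname=postgres user=rifujita@riosourcesingle password={} sslmode=require' "
           "-c '\\COPY (SELECT * FROM users WHERE id >= {} and id <= {}) "
           "TO dump-{:05d}.csv WITH CSV'").format(
               os.getenv('PGPASSWORD'), num * 1000000 + 1, (num + 1) * 1000000, num)
    subprocess.run(com, shell=True, check=True)

def main():
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
        list(pool.map(do_copy, range(CHUNKS)))

if __name__ == "__main__":
    main()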