I recognize that TOAST can store the data in an efficient way. And I tested some large data, 8.9 billion records and 20TB data in PostgreSQL in the past. The following is the article I wrote about the test.
PostgreSQLに20TB・89億件のデータを入れてみる
But I also recognize that I need to check the actual size of the data in the graph database.
Test Scenario
After loading the data, I checked the size of the graph data with the following query in Appendix .
Relation | Size | # of records | avg size of records |
---|---|---|---|
Actor | 10403840 | 48115 | 216.2286189 |
Film | 57163776 | 222280 | 257.1701278 |
ACTED_IN | 123527168 | 767492 | 160.9491278 |
And each node and edge has the following properties.
Actor: 3 properties
Film: 6 properties
ACTED_IN: 3 properties
If a graph has the similar number of properties, the size of the items in the graph will be similar to the following:
Node: 237 Bytes
Edge: 161 Bytes
Estimation
Total number of nodes: 0.2 billion
Total relation size estimated: 200,000,000 * 237 / 1024(KB) / 1024(MB) / 1024(GB) = 44.1GB
Total number of edges: 0.4 billion
Total relation size estimated: 400,000,000 * 161 / 1024(KB) / 1024(MB) / 1024(GB) = 60.0GB
To be frank, the average size of nodes / edges depends on the number of properties and the complexity of them such as number, text.
Appendix
Sample code
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import asyncio
import os
from agefreighter import Factory
async def main():
class_name = "CSVFreighter"
instance = Factory.create_instance(class_name)
await instance.connect(
dsn=os.environ["PG_CONNECTION_STRING"],
max_connections=64,
)
await instance.load(
graph_name="AgeTester",
start_v_label="Actor",
start_id="ActorID",
start_props=["Actor"],
edge_type="ACTED_IN",
end_v_label="Film",
end_id="FilmID",
end_props=["Film", "Year", "Votes", "Rating"],
csv="./actorfilms.csv",
drop_graph=True,
)
if __name__ == "__main__":
asyncio.run(main())
Actual relation size
postgres=> SELECT pg_total_relation_size('"AgeTester"."Actor"') AS size;
size
----------
10403840
(1 row)
postgres=> select count(*) from "AgeTester"."Actor";
count
-------
48115
(1 row)
postgres=> SELECT pg_total_relation_size('"AgeTester"."Film"') AS size;
size
----------
57163776
(1 row)
Actual number of records
postgres=> select count(*) from "AgeTester"."Film";
count
--------
222280
(1 row)
postgres=> SELECT pg_total_relation_size('"AgeTester"."ACTED_IN"') AS size;
size
-----------
123527168
(1 row)
postgres=> select count(*) from "AgeTester"."ACTED_IN";
count
--------
767492
(1 row)
Relation Schema
postgres=> \d+ "AgeTester"."Actor";
Table "AgeTester.Actor"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
------------+---------+-----------+----------+-----------------------------------------------------------------------------------------------------------------+----------+-------------+--------------+-------------
id | graphid | | not null | _graphid(_label_id('AgeTester'::name, 'Actor'::name)::integer, nextval('"AgeTester"."Actor_id_seq"'::regclass)) | plain | | |
properties | agtype | | not null | agtype_build_map() | extended | | |
Indexes:
"Actor_id_idx" btree (id)
"Actor_properties_idx" gin (properties)
Inherits: "AgeTester"._ag_label_vertex
Access method: heap
postgres=> \d+ "AgeTester"."Film";
Table "AgeTester.Film"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
------------+---------+-----------+----------+---------------------------------------------------------------------------------------------------------------+----------+-------------+--------------+-------------
id | graphid | | not null | _graphid(_label_id('AgeTester'::name, 'Film'::name)::integer, nextval('"AgeTester"."Film_id_seq"'::regclass)) | plain | | |
properties | agtype | | not null | agtype_build_map() | extended | | |
Indexes:
"Film_id_idx" btree (id)
"Film_properties_idx" gin (properties)
Inherits: "AgeTester"._ag_label_vertex
Access method: heap
postgres=> \d+ "AgeTester"."ACTED_IN";
Table "AgeTester.ACTED_IN"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
------------+---------+-----------+----------+-----------------------------------------------------------------------------------------------------------------------+----------+-------------+--------------+-------------
id | graphid | | not null | _graphid(_label_id('AgeTester'::name, 'ACTED_IN'::name)::integer, nextval('"AgeTester"."ACTED_IN_id_seq"'::regclass)) | plain | | |
start_id | graphid | | not null | | plain | | |
end_id | graphid | | not null | | plain | | |
properties | agtype | | not null | agtype_build_map() | extended | | |
Indexes:
"ACTED_IN_end_id_idx" btree (end_id)
"ACTED_IN_start_id_idx" btree (start_id)
Inherits: "AgeTester"._ag_label_edge
Access method: heap
Sample Data
postgres=> select * from "AgeTester"."Actor" limit 1;
id | properties
-----------------+------------------------------------------------------------------
844424930131969 | {"id": "nm0000001", "Actor": "Fred Astaire", "ActorSerial": "1"}
(1 row)
postgres=> select * from "AgeTester"."Film" limit 1;
id | properties
------------------+-----------------------------------------------------------------------------------------------------------------
1125899906842625 | {"id": "tt0082449", "Film": "Ghost Story", "Year": "1981", "Votes": "7731", "Rating": "6.3", "FilmSerial": "1"}
(1 row)
postgres=> select * from "AgeTester"."ACTED_IN" limit 1;
id | start_id | end_id | properties
------------------+-----------------+------------------+-------------------------------------------------------------------
1407374883553281 | 844424930131969 | 1125899906842625 | {"Genre": "Commedy", "Director": "Joe Dummy", "ActedSerial": "1"}
(1 row)