Scalability.

WORK STILL IN PROGRESS



Introduction

In order to investigate whether Oracle9i will suffice as database for storing pipeline results and sourcelists in the TeraByte regime a number of tests were devised. For these tests a number of objects were designed, which are a SOURCE and a SOURCE_TABLE object. The idea is to store the source parameters in a SOURCE object and the collection of SOURCE objects (which make up a sourcelist) in a SOURCE_TABLE . The different sourcelist are stored in a table of SOURCE_TABLE objects.
The objects are defined (in sql) as follows:

-- The SOURCE Object
CREATE TYPE SOURCE AS OBJECT (
RA NUMBER(10,7),
DEC NUMBER(10,7),
FLUX NUMBER,
NAME VARCHAR2(32),
ID NUMBER(32),
TEST NUMBER,
COLOR VARCHAR2(8),
NPAR NUMBER,
IPAR01 NUMBER(12),
IPAR02 NUMBER(12),
IPAR03 NUMBER(12),
.. ..
IPAR27 NUMBER(12),
IPAR28 NUMBER(12),
DPAR01 NUMBER,
DPAR02 NUMBER,
.. ..
DPAR27 NUMBER,
DPAR28 NUMBER
);

-- The SOURCE_TABLE Object
CREATE TYPE SOURCE_TABLE AS TABLE OF SOURCE;

-- The Table for all SOURCE_TABLE Objects
CREATE TABLE SLT (
ID NUMBER(32),
NAME VARCHAR2(32),
ORIGIN VARCHAR2(32),
SOURCES SOURCE_TABLE
)
NESTED TABLE SOURCES STORE AS SLT_SOURCES_TABLE;

So each source can have 64 parameters, and a nested table is used to store sourcelists in a table.

The Scalability Test.

100k sources from the USNO catalogue were taken from declination -82 o till -90o to give an adequate number of sources. The source parameters consist of position, flux and a number of randomly generated values. Each sourcelist (of 100k sources) in the table takes about 49MB of disk space which indicates that the Oracle 'storage overhead' is rather small.
The scalability was tested by ingesting 100k sources at a time and running a simple query on one of its parameters. The query is setup in such a way that each query will return only one tuple. The results are shown in figure 1a.
Figure 1a: Query on one Indexed Parameter which returns only 1 tuple. The query time is independant of the number of Sourcelists.



The next test is a Query which returns half of the total number of tuples. The results are shown in figure 1b.
Figure 1b: Query on one Indexed Parameter which returns half of all tuples. The query time is a linear increasing function with the number of Sourcelists.



The time needed to create an Index on the parameter on which the above described queries were run was also measured and the result is shown in figure 2a.
Figure 2a: Time to create an index on one Parameter as function of number of Sourcelists. The time needed increases linearly with database size.
Figure 2a shows that ingesting sourcelists into the database is becoming quite an expensive operation because of the re-creation of the indexes. Therefore we did some tests on partioned tables which allow for local indexes. For the test each sourcelist was stored in a separate partition. The sql script to create the partitioned tables is the following:



-- Create a partiononed table
CREATE TABLE PSLT (
ID NUMBER(32),
S SOURCE
)
PARTITION BY RANGE(ID) (
PARTITION PSLT_0 VALUES LESS THAN ( 1 )
);
QUIT
The extra parameter ID is used to distinguish the different sourcelists.
The time needed to create a local index is shown in figure 2b.
Figure 2b: Time to create a local index on one Parameter as function of number of Sourcelists. The time needed is independent of the database size.


Preliminary Conclusions and things to do.

The conclusions drawn so far are:

Some things are still under investigation, like: