Object-oriented databases
=========================

What is a traditional database?  Relational data model:
  Store data in tuples (relations)
  Index tuples by primary key
  Secondary indexes index by other elements of the tuple
  Can do several operations on sets of tuples:
    Product: Concatenate every pair of tuples
      Take two sets, A = {a1, a2}, B = {b1, b2}
      Product is {a1b1, a1b2, a2b1, a2b2}
    Selection (restriction): Selects a subset (e.g., employees with salary < $50,000)
    Projection: Picks a subset of the fields of tuples, eliminating duplicates
    Join: Combination of product, selection, and projection
      E.g.: Produce list of pairs
  Example implementation: Use B-trees
What is the object-oriented data model?
  Much like object-oriented programming
    Objects have properties (fields), like data structures
    Objects have methods through which to access the fields
  Why is this very different from the relational model?
    Couldn't you just link "methods" into a relational database client?
  Inheritance
    Define multiple types of employee (manager, etc.)
    Can use an object without knowing its exact type, or even having code
      for its methods at compile time
  Can enforce invariants better with methods in the database
    E.g., inverse relations (supported by many OODBs):
      Employee has a "manager" property
      Manager has a "manages" set
      DB maintains them in sync automatically
So can you just implement persistent objects in C++?
Straw man: Mmap a big file, and use a different malloc for persistent objects
  Database might need to be accessed from different applications/platforms
    Structures might be different on different architectures
      endianness different, sizeof (long) different
  Can't guarantee that virtual addresses (and thus pointers) will be the same
    Suppose two applications create two DBs at the same virtual address
    Then a third application must access the two databases
  What about inheritance--code for virtual functions?
    Should be able to introduce new types accessible by old applications
  A single database might be larger than your entire virtual address space
    Serious problem on 32-bit machines
  Need some form of garbage collection...
    Must avoid dangling pointers when freeing objects
  Need transactions
  How do you know the type of a raw file?
    (could be the responsibility of the application, but big potential for mistakes)
What is pointer swizzling?
  Objects stored on disk have disk pointers or OID pointers to each other
  When an object is brought into memory:
    Object pointers are changed to real pointers
    Real pointers point to invalid virtual addresses
  If user accesses a pointer to an object that is not present:
    Segmentation fault (SIGSEGV)
    DB runtime system catches SEGV, loads the pointed-to object into memory,
      changes the invalid pointer to point to the object just loaded (SWIZZLING)
    Restart the operation that faulted, now with a valid pointer
  Often, pointers go through one level of indirection
    Makes it easier to evict or move objects without having to invalidate
      many pointers to them

SHORE
=====

What was wrong with Exodus?
  Files stored no type information... had to rely on "compile-time" typing
    Too easy to access data under the wrong type (version mismatch, etc.)
  Tied to one programming language
    Sharing data between applications/platforms is hard (vtables differ)
  Client-server architecture led to "client-level servers"
    What is the problem with Figure 1?
    Too much load on the client, which must RPC for each access to storage
    In a parallel environment, also need server-to-server communication
    No access control (+ clients can clobber metadata)
    Want to interact with Unix tools that use the file system
Basic architecture of SHORE
  SHORE Data Language (SDL) provides type descriptions
  Every processor runs a SHORE server (even if it has no storage) - Fig 2
  Servers communicate amongst each other (peer-to-peer architecture)
  SHORE servers are all trusted, but clients are not
  Server is not just a program but an API to build value-added servers (VASs)
  Provides Unix-like namespace and access control for persistent objects
How is a SHORE object implemented?
  Object type
    Specified in SDL (see Figure 5)
    SDL compiler generates C++ (they claim it could work with other languages)
    Data structures with fields that are properties of the object
    Methods that must be implemented in the native language
  Each object has a reference to a type object specifying its interface type
  Each object can also have a heap (Fig 3).  Why?
    For allocating memory--e.g., for variable-length strings
      Objects too expensive to allocate just because you changed a string
    For "dynamic values"--other "top-level" objects
      Dynamic values can point to each other with relative offsets
      Pointers swizzled in memory
    All dynamic objects in a heap are freed when the object is freed
What is the programmer interface like?  Look at Figures 5 & 6
  What is a relationship?
    ref -- like a pointer, but has an OID
    set -- must support standard set operations, plus iterator
  What is inverse?
    Names a field in the other object that must be a back pointer
  Why are objects all const by default?
    Need to set the dirty flag--that's what update() does
How do you access SHORE objects?
  Can access them by name
  But also by Object ID.  What's an OID?
    8-byte volume identifier (volume location service gives you the server)
    8-byte serial number
    Each server maps these logical OIDs onto "physical OIDs"--disk locations
      On-disk OIDs are only 8 bytes--use proxy objects for inter-volume refs
  What happens to OID references if you delete an object?
    They dangle, but that's okay; OIDs are unique over all time
What's in the SHORE namespace?
  Directories: like Unix name->inumber, only binds name->OID
    Every directory and object in a directory is a "registered object"
    Each registered object has Unix-like owner, permissions, etc.
  Regular objects - deleted when link count goes to zero
  Pools - contain anonymous (unregistered) objects
    All objects in a pool share the same owner, permissions
    Pool objects must be accessed by OID
  Symlinks
  Cross references - a link to a particular OID (but can go stale)
    Useful to name anonymous objects
  Object types
  Modules - like pools for types (anonymous, unregistered types)
Client-server architecture (Fig 7)
  Servers are all trusted, and exchange pages with each other
    Servers function as a page cache - allows "inter-transaction caching"
      (means low-latency access, but callbacks required for invalidation)
    "Adaptive" locking dynamically switches between page- & object-level
  Servers hand out objects to local processes, enforce access control
    Servers handle concurrency control, locks, etc.
    Server that owns a page arbitrates locks for objects on that page
  Clients access server through Language-Independent Library (LIL)
    Clients cache individual objects
    Clients get objects from local server (possibly through shared mem)
    Clients swizzle pointers to point to object table (for efficient removal)
    Clients invalidate all cached objects at the end of the transaction
Transaction commit
  Client sends commit message to local server
  If non-local pages involved, must use two-phase commit
    If local server has storage for the log, acts as coordinator
    Else, uses another server as coordinator
Parallelism - what is a Pset?
Two-phase commit:
  1a. Coordinator sends VOTE-REQUEST
  1b. Participant votes with VOTE-COMMIT or VOTE-ABORT
  2a. Coordinator broadcasts GLOBAL-COMMIT or GLOBAL-ABORT
  2b. Participants that voted VOTE-COMMIT wait for decision by coordinator
  Problem: nodes crash, coordinators crash
  Partial solution: nodes can communicate with each other
    - If anyone didn't hear VOTE-REQUEST, all can abort
    - If anyone heard GLOBAL-COMMIT, all can commit
    - If neither of the above two cases holds, must wait for coordinator to recover