Object-oriented databases
=========================

What is a traditional database?  Relational data model:
  Store data in tuples (relations)
  Index tuples by primary key
  Secondary indexes index by other elements of the tuple
  Can do several operations on sets of tuples:
    Product: Concatenate every pair of tuples
      Take two sets, A = {a1, a2}, B = {b1, b2}
      Product is {a1b1, a1b2, a2b1, a2b2}
    Selection (restriction): Selects a subset (e.g., employees with salary < $50,000)
    Projection: Picks a subset of the fields of tuples, eliminating duplicates
    Join: Combination of product, selection, and projection
      E.g.: Produce list of pairs
  Example implementation: Use B-trees
What is the object-oriented data model?
  Much like object-oriented programming
    Objects have properties (fields), like data structures
    Objects have methods through which to access the fields
  Why is this very different from the relational model?
    Couldn't you just link "methods" into a relational database client?
  Inheritance
    Define multiple types of employee (manager, etc.)
    Can use an object without knowing its exact type, or even having code
      for its methods at compile time
  Can enforce invariants better with methods in the database
    E.g., inverse relations (supported by many OODBs):
      Employee has a "manager" property
      Manager has a "manages" set
      DB maintains them in sync automatically
So can you just implement persistent objects in C++?
Straw man: Mmap a big file, and use a different malloc for persistent objects
  Database might need to be accessed from different applications/platforms
    Structures might be different on different architectures
      endianness different, sizeof (long) different
  Can't guarantee that virtual addresses (and thus pointers) will be the same
    Suppose two applications create two DBs at the same virtual address
    Then a third application must access the two databases
  What about inheritance--code for virtual functions?
    Should be able to introduce new types accessible by old applications
  A single database might be larger than your entire virtual address space
    Serious problem on 32-bit machines
  Need some form of garbage collection...
    Must avoid dangling pointers when freeing objects
  Need transactions
  How do you know the type of a raw file?
    (could be the responsibility of the application, but big potential for mistakes)
What is pointer swizzling?
  Objects stored on disk have disk pointers or OID pointers to each other
  When an object is brought into memory:
    Object pointers are changed to real pointers
    Real pointers point to invalid virtual addresses
  If user accesses a pointer to an object that is not present:
    Segmentation fault (SIGSEGV)
    DB runtime system catches SEGV, loads the pointed-to object into memory,
      changes the invalid pointer to point to the object just loaded (SWIZZLING)
    Restart the operation that faulted, now with a valid pointer
  Often, pointers go through one level of indirection
    Makes it easier to evict or move objects without having to invalidate
      many pointers to them

SHORE
=====

What was wrong with Exodus?
  Files stored no type information... had to rely on "compile-time" typing
    Too easy to access data under the wrong type (version mismatch, etc.)
  Tied to one programming language
    Sharing data between applications/platforms is hard (vtables differ)
  Client-server architecture led to "client-level servers"
    What is the problem with Figure 1?
    Too much load on the client, which must RPC for each access to storage
    In a parallel environment, also need server-to-server communication
    No access control (+ clients can clobber metadata)
    Want to interact with Unix tools that use the file system
Basic architecture of SHORE
  SHORE Data Language (SDL) provides type descriptions
  Every processor runs a SHORE server (even if it has no storage) - Fig 2
  Servers communicate amongst each other (peer-to-peer architecture)
  SHORE servers are all trusted, but clients are not
  Server is not just a program but an API to build value-added servers (VASs)
  Provides Unix-like namespace and access control for persistent objects
How is a SHORE object implemented?
  Object type
    Specified in SDL (see Figure 5)
    SDL compiler generates C++ (they claim it could work with other languages)
    Data structures with fields that are properties of the object
    Methods that must be implemented in the native language
  Each object has a reference to a type object specifying its interface type
  Each object can also have a heap (Fig 3).  Why?
    For allocating memory--e.g., for variable-length strings
      Objects too expensive to allocate just because you changed a string
    For "dynamic values"--other "top-level" objects
      Dynamic values can point to each other with relative offsets
      Pointers swizzled in memory
    All dynamic objects in a heap are freed when the object is freed
What is the programmer interface like?  Look at Figures 5 & 6
  What is a relationship?
    ref -- like a pointer, but has an OID
    set -- must support standard set operations, plus iterator
  What is inverse?
    Names a field in the other object that must be a back pointer
  Why are objects all const by default?
    Need to set the dirty flag--that's what update() does
How do you access SHORE objects?
  Can access them by name
  But also by Object ID.  What's an OID?
    8-byte volume identifier (volume location service gives you the server)
    8-byte serial number
    Each server maps these logical OIDs onto "physical OIDs"--disk locations
      On-disk OIDs are only 8 bytes--use proxy objects for inter-volume refs
  What happens to OID references if you delete an object?
    They dangle, but that's okay; OIDs are unique over all time
What's in the SHORE namespace?
  Directories: like Unix name->inumber, only binds name->OID
    Every directory and object in a directory is a "registered object"
    Each registered object has Unix-like owner, permissions, etc.
  Regular objects - deleted when link count goes to zero
  Pools - contain anonymous (unregistered) objects
    All objects in a pool share the same owner, permissions
    Pool objects must be accessed by OID
  Symlinks
  Cross references - a link to a particular OID (but can go stale)
    Useful to name anonymous objects
  Object types
  Modules - like pools for types (anonymous, unregistered types)
Client-server architecture (Fig 7)
  Servers are all trusted, and exchange pages with each other
    Servers function as a page cache - allows "inter-transaction caching"
      (means low-latency access, but callbacks required for invalidation)
    "Adaptive" locking dynamically switches between page- & object-level
  Servers hand out objects to local processes, enforce access control
    Servers handle concurrency control, locks, etc.
    Server that owns a page arbitrates locks for objects on that page
  Clients access server through Language-Independent Library (LIL)
    Clients cache individual objects
    Clients get objects from local server (possibly through shared mem)
    Clients swizzle pointers to point to object table (for efficient removal)
    Clients invalidate all cached objects at the end of the transaction
Transaction commit
  Client sends commit message to local server
  If non-local pages involved, must use two-phase commit
    If local server has storage for the log, acts as coordinator
    Else, uses another server as coordinator
Parallelism - what is a Pset?
Two-phase commit:
  1a. Coordinator sends VOTE-REQUEST
  1b. Participant votes with VOTE-COMMIT or VOTE-ABORT
  2a. Coordinator broadcasts GLOBAL-COMMIT or GLOBAL-ABORT
  2b. Participants that voted VOTE-COMMIT wait for decision by coordinator
  Problem: nodes crash, coordinators crash
  Partial solution: nodes can communicate with each other
    - If anyone didn't hear VOTE-REQUEST, all can abort
    - If anyone heard GLOBAL-COMMIT, all can commit
    - If neither of the above two cases holds, must wait for coordinator to recover