Storage and Retrieval
Log structure storage engine
For write, append the new data(e.g. key, value pair) into the file.
For read, get the last occurrence from the file.
Hash Index
Maintain a in memory hash table. Key is the index key (e.g. the key if data is key, value pairs). Value is the offset of the value in file. So each time we want to read the data, we could get the file offset from hash table and find the starting offset where the latest data is stored.
Log file segment and compaction
If there are too much data cannot fit into single log file, we could break the log file into segments, and perform the compaction (throw away some duplicate keys). Each compaction makes the segment smaller, so that we could merge small segments into a new log file. Merging could happen in background, so the old segments could still serve the read, for writes it will be appended to new segment. Until the merging is done, old segments could be safely deleted.
Each segment has its own Hash Index Table
Things to consider
File format: Binary format would be better. (encode the length of a string in bytes, followed by the raw string)
Deleting records: Append a deletion record (tombstone).
Crash recovery: Store a snapshot of the in memory hash index table on disk
Partially written data: Include a checksum
Concurrency control: single writer(to keep the sequence), and multiple reader
Pros of append-only against updating old value
Appending operation is faster than random write
Crash recovery is much easier, since the log file has all the previous records
Merging keeps the log file tidy
Cons of hash index
Hash index table must fit in memory
Range queries are inefficient
page-oriented storage engine
Last updated