Why Should I Care About All This Old Stuff?
- “Progress, far from consisting in change, depends on retentiveness. When change is absolute there remains no being to improve and no direction is set for possible improvement: and when experience is not retained, as among savages, infancy is perpetual. Those who cannot remember the past are condemned to repeat it.” – George Santayana
When you opened a savings account at a local bank branch, they grabbed an empty ledger card and an empty passbook that were pre-stamped with an available Account Number. They recorded your initial deposit amount on both as the opening balance, and collected your signature for future identification.
Whenever you wanted to deposit more money or withdraw some, you showed up at the same bank branch with your passbook. The teller took your passbook (and your cash or check for a deposit) and then fetched your account’s ledger card from bins on the back wall. She recorded the deposit or withdrawal on both the ledger card and in your passbook, and then returned the passbook (and any withdrawal cash) to you.
Once every 3 months the bank would compute your quarterly interest and add that as a line item on your ledger card. The next time you showed up with your passbook for another deposit or withdrawal, the teller would add the interest-earned line to your passbook before proceeding with the deposit or withdrawal transaction.
Punched Cards and Electro-Mechanical Tabulators
The Hollerith punched card provided some assistance for tabulating all the data recorded on the various account ledger cards.
A back-office clerk at each bank branch would periodically collect batches of the ledger cards and write down the Account Number and Current Balance information onto a pad of ledger paper organized as a big spreadsheet. (This is the original meaning of the word “spreadsheet.”)
Special mechanical adding machines were later invented to allow the data to be typed on special keyboards with digit keys aligned under each character position on the special ledger paper. A row of data was entered by depressing the correct digit keys, and then a big crank was pulled (like an older casino slot machine).
The crank motion turned a set of mechanical gears that read the set of pressed digit keys, printed the values into the ledger paper and advanced it one row. The gears also kept running totals for each data column of the spreadsheet.
When the big page of ledger paper was full, a lever was moved and the big crank was pulled again to print out the column totals and reset the accumulator gears back to zero.
When all the ledger card balances had been transcribed to ledger spreadsheets, the big sheets were sent to the main branch of the bank. There a group of keypunch operators would type the summary row from each ledger spreadsheet onto punched cards.
A radix sorting machine was invented to take a large tray full of punched cards and sort them into sequence based on a numeric sort key entered into a fixed position on each card. The machine operator processed the cards through the machine in several passes, one for each digit of the sort key. On the first pass, the machine sifted the cards from the input tray into ten output hoppers based on the one’s digit. The operator then collected the cards in order from the ten hoppers and piled them back into the input tray.
For the second pass, the sorting digit position was advanced to the ten’s digit and the process repeated. The sifting operation preserved the original order of the cards from the first pass, so now the cards were sorted on the combination of the right two digits.
This was repeated for each digit position of the sorting index, and the result was a sorted tray of punched cards. These could then be fed into a machine that printed out the data as a long report on fan-fold computer paper. Other machines from this era could perform simple electronic calculations on the punched card data and then punch out the results onto another set of punched cards.
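The digit-by-digit sifting process described above is what we now call a least-significant-digit radix sort. A minimal sketch, with a Python list standing in for the card tray and ten sub-lists for the output hoppers:

```python
def radix_sort(cards, key_width):
    """LSD radix sort, mimicking the card-sorting machine.

    `cards` is a list of records whose sort key is a zero-padded
    numeric string of length `key_width` (a stand-in for the fixed
    card columns that held the key).
    """
    for pos in range(key_width - 1, -1, -1):   # one's digit first
        hoppers = [[] for _ in range(10)]      # the ten output hoppers
        for card in cards:                     # one pass through the tray
            hoppers[int(card[pos])].append(card)
        # Collect hoppers 0..9 back into the input tray; the sift is
        # stable, preserving the order from the previous pass.
        cards = [card for hopper in hoppers for card in hopper]
    return cards

tray = ["0403", "0021", "1200", "0020", "0402"]
print(radix_sort(tray, 4))  # ['0020', '0021', '0402', '0403', '1200']
```

The stability of each pass is the whole trick: after sorting on the ten’s digit, cards with equal ten’s digits remain in one’s-digit order, so each pass extends the sorted suffix by one digit.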
Things started getting less mechanical and more electronic when the decks of punched cards were fed into a card reader and written to reels of 9-track magnetic tape. Each column of a punched card could encode a digit, an upper- or lowercase letter, or a special character. The Extended Binary-Coded Decimal Interchange Code (EBCDIC) character set was invented by IBM to encode each character into 8 bits. A 9th error-detection parity bit was added, and the 9 bits were written side-by-side to separate magnetic tracks on the tape to store the character.
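The parity track works by forcing the total count of 1-bits in each 9-bit frame to a known value, so a single flipped bit is detectable. A small sketch (assuming odd parity, and using ASCII ordinals as a stand-in for the actual EBCDIC code points):

```python
def track_bits(ch):
    """Encode one character as a 9-track tape frame: 8 data bits
    plus an odd-parity bit.

    ASCII ordinals stand in for EBCDIC code points here; the parity
    scheme is assumed to be odd parity.
    """
    code = ord(ch)
    data = [(code >> i) & 1 for i in range(8)]   # the 8 data tracks
    parity = 1 - sum(data) % 2                   # make total 1-bits odd
    return data + [parity]

frame = track_bits("A")
assert sum(frame) % 2 == 1   # every valid frame has an odd bit count
```

If a read-back frame has an even number of 1-bits, the drive knows at least one track dropped or picked up a bit.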
Once the keypunched trays of cards were loaded onto tape, the tape was mounted onto a tape drive on the mainframe computer to be sorted. Two other empty reels of tape were mounted on separate tape drives on the same computer.
Using its limited amount of random access memory, the computer would read the first section of card-images from the source tape, and sort them in memory using the equivalent of the modern quicksort algorithm. It then wrote the sorted card-images to the first scratch tape.
The computer read the next set of input records, sorted them, and wrote them to the second scratch tape. This process was repeated until all the source tape records had been read, sorted, and written out, alternating between each of the two scratch tapes. Marker records were written to the scratch tapes between each sub-sequence of sorted records.
The computer then rewound the source tape and signaled the human operator to mount another empty output tape in its place. The scratch tapes were also rewound. The merge phase then began.
The computer processed the first sorted sub-sequence from each of the two input scratch tapes and merged them into a larger sub-sequence on the third output tape. Since the input sub-sequences were already sorted, this merge was a simple process of deciding which of the two next-available input records comes next.
This merge phase repeated until all of the sub-sequences from the input scratch tapes had been merged into double-length sub-sequences on the output tape. All three tapes were rewound, and the larger sub-sequences from the output tape were alternately copied to each of the first two scratch tapes.
The merge phase repeated again and again until a single sorted sequence had been written to the third scratch tape. That tape was rewound, dismounted, and stored on a rack for further processing.
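The whole procedure is a balanced two-way external merge sort. A compressed sketch, with Python lists standing in for the tape reels and `memory_size` standing in for the machine’s limited RAM:

```python
def merge_two(a, b):
    """Merge two sorted runs by repeatedly taking the smaller head
    record, exactly as the tape merge phase did."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]

def tape_merge_sort(source, memory_size):
    """Balanced two-way tape merge sort. Lists stand in for reels;
    each sub-list below is one sorted run between marker records."""
    # Distribution phase: read memory-sized chunks, sort each in RAM.
    runs = [sorted(source[i:i + memory_size])
            for i in range(0, len(source), memory_size)]
    # Merge phases: pair up runs and merge until one run remains.
    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs), 2):
            pair = runs[i:i + 2]
            merged.append(merge_two(*pair) if len(pair) == 2 else pair[0])
        runs = merged
    return runs[0] if runs else []
```

Each merge phase doubles the run length, so a tape holding *n* records needed about log2(*n*/RAM) full passes, which is why those reels spun for so long.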
By the way, if you see old movies where the police detective goes to the data processing center to have them search for some criminal record, they always show the tape drives to represent the mainframe computer. That is because the tape reels are constantly moving and rewinding, so they look cool on film. Most likely the tapes were in the middle of one of these merge sort operations.
Batch Update Processing
The sorted output tape from the prior section usually contained a set of transaction records, such as savings account withdrawals and deposits, that needed to be applied to master records for each account. The master records were stored on another tape, already in sorted order.
A batch update involved mounting the old master tape onto one tape drive, the newly sorted transaction tape onto the second tape drive, and an empty new master tape onto the third tape drive.
The next transaction record was read, and the old master tape was processed until the matching account was found. As each skipped account record was read from the old master tape, it was copied to the new master tape without any changes. Once the matching account was found, the transaction update was applied and the revised account data was written to the new master tape.
If the old master did not contain the account referenced in the transaction, this could be determined immediately as a higher account number record appeared. If the transaction was a Create Account request, a new account was initialized and written to the new master tape. (If a Delete Account request made sense, the record from the old master was simply omitted from the new master.)
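The master/transaction update above is a single co-sequential pass over two sorted inputs. A simplified sketch, assuming records are `(account_number, balance)` pairs on the master and `(account_number, amount)` pairs on the transaction tape, with an unmatched lower-numbered transaction treated as a Create Account request:

```python
def batch_update(old_master, transactions):
    """One pass over the sorted old master and sorted transaction
    records, producing a new master. Lists stand in for the three
    tape reels mounted on the three drives."""
    new_master, t = [], 0
    for acct, balance in old_master:
        # Transactions below the current master record belong to
        # accounts not on the old master: treat as Create Account.
        while t < len(transactions) and transactions[t][0] < acct:
            new_master.append(transactions[t])
            t += 1
        # Apply every transaction matching this account.
        while t < len(transactions) and transactions[t][0] == acct:
            balance += transactions[t][1]
            t += 1
        new_master.append((acct, balance))   # copy, updated or not
    # Leftover transactions create accounts above the last master record.
    new_master.extend(transactions[t:])
    return new_master
```

Note that the old master is never modified; the run naturally produces generations of master tapes, which is where the classic grandfather-father-son backup rotation came from.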
The data for each account might have consisted of a header record followed by different sub-types of detail records. For example, each deposit, withdrawal, and interest earned record might have been stored as records following the account header record. A record type indicator was included in each record, usually as the first character, to allow each type of record to be distinguished. Also, each record type might have been a different length. (Dependency on the 80-column limit of the original punched cards started to be relaxed as key-to-tape machines slowly replaced keypunches.)
The programming logic, most likely written in COBOL, had to consider all of these different record types and accommodate them with appropriate actions. Note that COBOL was the first commercial language to introduce structured records, and it had a special REDEFINES clause to allow for multiple record types.
The introduction of (relatively) large capacity hard disk drives started to change things, pushing magnetic tapes into the background as merely archival storage media.
Hard disks introduced random access to file storage. But initially the data previously written to magnetic tape was simply moved into sequential files stored on disk. The same merge sorting and master/transaction batch updating algorithms were used, but they ran much faster without having to use tape drives.
The sequential data files still contained interspersed multiple record types. A given business entity was stored as a header record followed by several types of repeating detail records.
Indexed Files and Random Access
Sequential processing was faster with disk files than with magnetic tapes, but it still forced processing to be performed in intermittent batch cycles.
The Indexed Sequential Access Method (ISAM) file was invented to help with this. A B-Tree index supplemented the data records in the file, allowing any record to be quickly read and updated using its index key value. To avoid obsoleting all prior processing logic, once the file was positioned on a record via the random access index, the logic could continue processing the following records in key value sort order.
All those multiple record types were still interspersed with their header records, and this complicated the conversion from sequential processing to truly random-access techniques. Hierarchical indexed files and database structures were invented to help with this.
Each set of detail records of a given type associated with a header record was organized as a child collection attached to the parent record. Once the parent record was located via its key value, any child record could also be randomly accessed via its subordinate key value. Each child collection was separately indexable by domain-appropriate sub-keys. Child records could also contain collections of grand-child records, etc.
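The parent/child access pattern can be sketched with nested dictionaries, where each level is indexed by its own key. The account number, field names, and date sub-keys below are illustrative assumptions, not any particular product’s schema:

```python
# Nested dicts stand in for a hierarchical indexed file: parent
# records keyed by account number, each child collection keyed by
# its own sub-key (here, a transaction date).
accounts = {
    "10042": {
        "name": "A. Customer",
        "transactions": {                      # child collection
            "1969-03-01": ("deposit", 500),
            "1969-03-15": ("withdrawal", 120),
        },
    },
}

# Random access: locate the parent by key, then a child by sub-key.
rec = accounts["10042"]["transactions"]["1969-03-15"]

# Sequential continuation in key order, as ISAM allowed: process the
# remaining children from a given key onward.
txns = accounts["10042"]["transactions"]
for date in sorted(txns):
    if date >= "1969-03-15":
        print(date, txns[date])
```

The important property is that reaching any record requires first navigating down from its parent, which is exactly the limitation the next sections run into.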
The Enigma of the Dual Hierarchy
Things got a little sticky when situations involving a dual hierarchy appeared. The following classic teaching example was explicitly created to demonstrate a dual hierarchy:
- A college enrollment database stores information about Teachers, Students, and Courses.
- A Teacher teaches several Courses.
- A Student attends several Courses.
So which business entity should be the parent of the hierarchy? Do we index on Teacher, attach a collection of all Courses they teach, and then attach sub-collections of Students in each Course? Or do we store Students at the top, containing Courses, and then store the Teacher record attached to the Course?
Either way we have redundant data — the same Course record needs to appear more than once if we have two Teachers assigned to the Course, or if we use Student as the parent entity. Similarly, the same Teacher or Student record would have to be stored in more than one place in one or the other hierarchical schema.
Network Model Databases
The next stage of database evolution was the network model database, which allowed for inter-entity relationships to be modeled as relationship sets. To quote the Wikipedia article: “… the network model allows each record to have multiple parent and child records, forming a generalized graph structure.”
Both hierarchical and network schema databases are often referred to as navigational databases. The processing logic locates a primary entity via its index key value, and then navigates to related records via sub-collections or relationship sets.
The network model database schema turned out to be a transitional form in the evolution from hierarchical to relational databases.
The relational model focused on simplifying the primary storage model for the data into multiple tables consisting of rows and columns. Each table contains rows representing a single record type. Each row in the table has the same columns (although some row/column cells can be left null to indicate an unknown value).
Relationships between record rows are defined by shared values stored in a column of the referencing row and another column in the referenced row. Normally these are treated as a foreign key reference to the primary key column of another table.
This simple and elegant storage structure of flat tables (flat referring to a single record type) related by shared data values allowed a formal mathematical analysis based on set theory. From this analysis arose the relational algebra, usually expressed using Structured Query Language (SQL).
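The flat-table model also dissolves the dual-hierarchy enigma from the college enrollment example: each entity gets one table, and the many-to-many Student/Course relationship becomes a junction table of foreign keys, so no record is ever stored twice. A sketch using lists of row dicts and a naive nested-loop equijoin (the table layouts and the `join` helper are illustrative, not any particular product’s API; on a column-name collision the right-hand row’s value wins):

```python
teachers = [{"id": 1, "name": "Codd"}]
courses = [{"id": 10, "title": "Databases", "teacher_id": 1}]  # FK to teachers
students = [{"id": 100, "name": "Alice"}, {"id": 101, "name": "Bob"}]
enrollments = [                    # junction table: one row per FK pair
    {"student_id": 100, "course_id": 10},
    {"student_id": 101, "course_id": 10},
]

def join(left, right, left_col, right_col):
    """Nested-loop equijoin on shared column values."""
    return [{**l, **r} for l in left for r in right
            if l[left_col] == r[right_col]]

# Which students attend the courses in the courses table?
rows = join(join(courses, enrollments, "id", "course_id"),
            students, "student_id", "id")
print(sorted(r["name"] for r in rows))  # ['Alice', 'Bob']
```

Starting from `teachers` instead and joining the other direction answers the Teacher-at-the-top question with the same four tables, which is precisely what neither hierarchy could do without duplication.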
The relationship sets from the network model didn’t actually disappear in relational databases. Instead they moved into the background in the form of primary key and alternate key indexes automatically managed by the database management system. The difference was that the logic programmer did not have to explicitly manage this data because it was derivable from the normalized tabular data.
The NoSQL Movement
So now we are arriving at a set of alternative post-relational database structures colloquially referred to as NoSQL. One commentator suggests a more descriptive name is No Join, since one issue these models address is the processing cost of performing the foreign key/primary key joins in a relational database.
Several alternate models are in play, so it seems too early to make any strong judgements on their lasting value.
An initial impression I am forming is that the NoSQL approach may be ideal for caching query results out in the cloud. This approach still causes concerns regarding its ability to maintain consistency for update processing.
Command Query Responsibility Segregation
I am just starting to review summary materials on the CQRS patterns. The key message at the root of all CQRS discussions is that updates to domain data benefit from different processing logic and data storage patterns than queries against domain data. Updates also benefit from being organized into task-oriented commands rather than “anything goes” record updates. Another observation is that queries dominate update commands in terms of frequency.
Given those core concepts behind CQRS, I can see traditional relational database processing remaining relevant to the update side of the equation. The relational model was designed to detect, prevent, and manage data inconsistencies and anomalies.
Once we move our attention to the query side of the equation, the NoSQL movement starts to look very appealing. Relational database applications often have separated the OLTP (On-Line Transaction Processing) database that supports updates from an OLAP (On-Line Analytical Processing) data warehouse that supports queries and reports.
OLAP data warehouses typically perform lots of data-duplicating denormalizations to improve query performance. Summaries and totals are often pre-calculated during the Extract-Transform-Load (ETL) processing that copies data from the OLTP to the OLAP data store.
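A tiny sketch of that pre-calculation step: rolling OLTP transaction detail up into a denormalized per-account, per-quarter summary during ETL, so reporting queries read one row instead of re-summing the detail. (The row shapes here are illustrative assumptions.)

```python
from collections import defaultdict

# OLTP detail rows: (account_number, quarter, signed amount).
transactions = [
    ("10042", "1984-Q1", 500), ("10042", "1984-Q1", -120),
    ("10042", "1984-Q2", 300), ("20077", "1984-Q1", 50),
]

# Transform step: pre-aggregate into the OLAP summary "table",
# keyed by the dimensions queries will group on.
summary = defaultdict(int)
for acct, quarter, amount in transactions:
    summary[(acct, quarter)] += amount

print(summary[("10042", "1984-Q1")])  # 380
```

The duplication is deliberate: the summary can always be rebuilt from the OLTP detail, so consistency is the ETL job’s problem rather than the query’s.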
So the NoSQL technologies appear quite relevant for supporting the high-volume query requirements of modern web applications. These NoSQL approaches are quite amenable to replication, partitioning, local caching and horizontal scaling.
Data on the Outside vs. Data on the Inside
The above heading is the title of an article by Pat Helland that appeared in MSDN Magazine in July 2004. The outside vs. inside of the article refers to Service Oriented Architecture boundaries.
One key concept from the article is that data on the outside is typically reference data and is best represented as an immutable versioned document representing a specific point in time. This view meshes well with some of the ideas I am learning from the CQRS summaries…