In His Own Words; A Tribute to E.F. Codd
On Developing Software [1]
Codd: People who develop software (whether associated with IBM or not) need to take more pride in the quality of their products. Quality should not be measured solely by the average number of bugs per 1000 lines of code. Of at least equal importance is an evaluation of the overall design.
There should be clearly defined levels of abstraction (more abstract than the code itself), which define with some rigor what the code is claimed to do; provide a solid theoretical foundation for the product; and from which, many inferences can be drawn using deductive logic.
Prior to actual coding, these levels of abstraction should be defined and an estimate made concerning the prevalence of exceptions to be handled. It is strongly recommended that a target limit be placed on the percentage of lines of code in the ultimate product, which can be identified as having as its sole raison d'etre the handling of exceptions.
On Business Rules [2]
Question: One concern that many people have in analyzing the capabilities of DBMS's is the concept of business rules. An example of a business rule might be to instruct the DBMS not to allow deletion of a customer record while that customer still has active orders. [What is your view?]
Codd: ... It's a well-known need -- a gap let's say -- that should be filled and undoubtedly, will be filled. It's a very important area.
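[Editor's illustration, not part of the interview: the business rule in the question can be approximated today with a declarative foreign key. The sketch below uses Python's standard sqlite3 module; the table names, column names, and the helper try_delete_customer are hypothetical, and it assumes every row in the orders table counts as an active order.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite leaves foreign-key enforcement off unless it is switched on per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        cust_id  INTEGER NOT NULL
                 REFERENCES customer(cust_id) ON DELETE RESTRICT
    );
    INSERT INTO customer VALUES (1, 'Acme');
    INSERT INTO orders VALUES (100, 1);
""")

def try_delete_customer(cust_id):
    """Return True if the DBMS allowed the delete, False if it refused."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("DELETE FROM customer WHERE cust_id = ?", (cust_id,))
        return True
    except sqlite3.IntegrityError:
        return False

blocked = try_delete_customer(1)   # the customer still has an order
with conn:
    conn.execute("DELETE FROM orders WHERE cust_id = 1")
allowed = try_delete_customer(1)   # no orders remain
```

The rule is declared once in the table definition; the delete statement itself contains no check.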
On Integrity Constraints [3]
Codd: ... The new approach to integrity control in a relational DBMS requires that: 1) all integrity constraints should be specified linguistically -- not by means of data structures, whether hierarchic or network or tabular -- and 2) the integrity constraint statements should be stored in the catalog (and NOT in the application programs). The principal benefits of this approach are that the integrity constraints specifiable in a fully relational DBMS:
- T1 are expressible in the same high-level data sublanguage used for queries and transactions;
- T2 are much more powerful than the class of integrity constraints expressible in a NON-relational DBMS (due to the power of the fully relational language being that of the three-valued, first-order predicate logic);
- T3 can be stated independently of the application programs and terminal activities.
As a consequence of T3 above, if business policies or governmental regulations change, the corresponding integrity constraints stored in the catalog can be changed accordingly -- without forcing any stoppage in the database traffic and without forcing any rewriting, recompiling, or re-debugging of application programs. This stems from the fact that changes in the integrity constraints stored in the catalog can usually be made without changing any of the application programs -- the application program stability benefit [adaptability of the total system to environmental changes].
This is, in turn, partly due to the fact that there are no explicit calls from the application programs for the particular integrity constraints that need to be enforced. The constraints take the form of a condition followed by an action, such as "whenever an attempt is made to insert or update a row into table X, check that condition C holds in the database, and, if C does not hold, abort the transaction with an error message."
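[Editor's illustration, not part of the original text: one way to picture a condition-plus-action constraint held in the catalog is a pair of SQLite triggers, shown here through Python's standard sqlite3 module. SQLite triggers are only a stand-in for the constraint language Codd describes. The account table, the non-negative balance condition, and the helper run are all hypothetical.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (acct_id INTEGER PRIMARY KEY, balance INTEGER NOT NULL);

    -- Condition C: balance >= 0.  Action: abort with an error message.
    CREATE TRIGGER account_balance_insert
    BEFORE INSERT ON account
    WHEN NEW.balance < 0
    BEGIN
        SELECT RAISE(ABORT, 'balance must not be negative');
    END;

    CREATE TRIGGER account_balance_update
    BEFORE UPDATE ON account
    WHEN NEW.balance < 0
    BEGIN
        SELECT RAISE(ABORT, 'balance must not be negative');
    END;
""")

def run(sql, params=()):
    """Run one statement; return None on success or the DBMS error message."""
    try:
        with conn:
            conn.execute(sql, params)
        return None
    except sqlite3.DatabaseError as exc:
        return str(exc)

# The application issues plain SQL and never names the constraint.
ok = run("INSERT INTO account VALUES (1, 50)")
bad_insert = run("INSERT INTO account VALUES (2, -5)")
bad_update = run("UPDATE account SET balance = balance - 80 WHERE acct_id = 1")

# The constraint definitions are rows in the catalog (sqlite_master in SQLite).
catalog = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'trigger' ORDER BY name")]
balance = conn.execute(
    "SELECT balance FROM account WHERE acct_id = 1").fetchone()[0]
```

Changing the policy would mean dropping and recreating the triggers; the calls to run would stay as they are.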
On Representing Data [4]
Codd: Twelve distinct ways of representing data at the logical level are eleven too many.
On Semantics [5]
Codd: First of all, I strongly disagree with those who claim that the relational model is devoid of semantics. Examples of semantic concepts are domains (including the constraints they place on comparison of values in executing row selects, joins, and relational division), primary keys, foreign keys, and integrity constraints.
More semantic features can and should be added to the relational model. However, it is important to remember that there is at present no non-subjective boundary to the subject of semantics. It is a never-ending task. Therefore, some way must be found to test the usefulness of proposed semantic features.
On the Trade Press [6]
Codd: ... It is high time that trade publications take note of the actual accomplishments of relational systems, instead of continually misrepresenting these systems as "paper tigers."
An Interview with Edgar F. Codd
The full text of this exclusive, groundbreaking and timeless interview with E.F. Codd appeared in the March 1982[7] and May 1982[8] issues of the Data Base Newsletter. E. F. Codd, a Fellow of IBM and winner of the 1981 Turing Award, is generally considered to be the father of relational data base management systems. His famous 1970 paper, "A Relational Model of Data for Large Shared Data Banks," set in motion a decade of research and development at IBM and elsewhere.... In the following exclusive interview, Dr. Codd talks in layman's terms about what the relational model really is. He answers critics and discusses the future.... You will undoubtedly find him a thoroughly thought-provoking and farsighted expert.
Question: Let me congratulate you for winning the 1981 Turing Award. Unfortunately, your landmark 1970 paper is too mathematical for most of us to come to grips with.
Let's start off with a relatively simple question: How do you define a relational DBMS?
Codd: First, let me defend what you call "mathematical," but what I would call the "precise" nature of that paper. Prior to writing the paper, I had participated in many discussions on data base management with researchers and customers alike. I found those discussions altogether too fuzzy and imprecise and consequently felt that an urgent need existed to bring a lot more precision into the area. The only way to do that was to dip into the mathematics of the subject.
Unfortunately, not all of the necessary mathematics were available because mathematicians had dismissed any problem dealing with relations of degree higher than two as reducible to a corresponding problem in binary relations -- that is, relations of degree two. But it was the higher-degree relations that were more applicable to data base management.
Question: A point of clarification: When you talk about relations of higher degree is that equivalent to having three or more fields in a flat file?
Codd: They are very similar, but not identical. Normally, with a flat file you are allowed to do several things that are not permitted in a relation. For example, the ordering of records in a flat file may be information-bearing. In other words, if you were to suddenly lose the ordering, information would be lost. This is not true of a base relation in the relational model.
Again, in a flat file you may intermix records of different types (so long as they are all non-hierarchic). This is not permitted in a relation. One way to think of a relation is that it is a highly disciplined flat file.
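[Editor's illustration, not part of the interview: the difference between an ordered flat file and a relation can be sketched in Python by comparing a list of records with a set of tuples. The Part record type and its sample values are hypothetical.]

```python
from typing import NamedTuple

class Part(NamedTuple):
    part_no: int
    name: str

# Two "flat files" holding the same records in a different order.
flat_file_a = [Part(3, "bolt"), Part(20, "nut")]
flat_file_b = [Part(20, "nut"), Part(3, "bolt")]

# The same records viewed as relations: unordered sets of tuples of one type.
relation_a = frozenset(flat_file_a)
relation_b = frozenset(flat_file_b)

files_equal = flat_file_a == flat_file_b      # list order is significant
relations_equal = relation_a == relation_b    # set order carries no information
single_type = all(isinstance(row, Part) for row in relation_a)
```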
Question: How do you distinguish a relational system from a non-relational one?
Codd: I suppose the best way to distinguish a relational system from one that is not is to examine the system in terms of its fidelity to the relational model.
The relational model has three parts to it: the structural part, the manipulative part, and the integrity part. It is often viewed as if it had only the structural part, but that is inaccurate.
It's in the manipulative part that you see dramatic variations in products that claim to be relational. Some of them deal only with the tables one row at a time -- in other words, record-by-record processing. Others do have a certain number of higher-level operations, but these are limited to predefined access paths. This is still not a relational system.
It's only when you have the capability of unrestricted "join" between tables that you begin to get what I would call a minimum relational system. By unrestricted, I mean not restricted in any way by predefined access paths. Of course, the data base structure must comply with relational rules also.
Question: In your own words, describe the join operator.
Codd: A join is the pasting together of one table with another table, but subject to certain conditions. You bring together the rows of one table with the rows of the other where there are matches of values in specified columns -- the columns to be matched must take their values from a common domain. This is called an equi-join -- there are other types of join but this is one of the most important.
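[Editor's illustration, not part of the interview: a minimal equi-join over tables held as lists of dictionaries. The function name equi_join and the supplier and shipment data are hypothetical. The hash index built inside the function is one possible implementation choice, invisible to the caller.]

```python
def equi_join(left, right, left_col, right_col):
    """Pair each row of `left` with each row of `right` whose join values match."""
    # Group the right-hand rows by their join value.
    index = {}
    for r in right:
        index.setdefault(r[right_col], []).append(r)
    result = []
    for l in left:
        for r in index.get(l[left_col], []):
            result.append({**l, **r})  # paste the two rows together
    return result

suppliers = [
    {"s_no": 1, "s_name": "Smith", "city": "London"},
    {"s_no": 2, "s_name": "Jones", "city": "Paris"},
    {"s_no": 3, "s_name": "Blake", "city": "Paris"},
]
shipments = [
    {"s_no": 1, "p_no": 10, "qty": 300},
    {"s_no": 1, "p_no": 20, "qty": 200},
    {"s_no": 2, "p_no": 10, "qty": 400},
]

# Both s_no columns are assumed to draw their values from one supplier-number domain.
joined = equi_join(suppliers, shipments, "s_no", "s_no")
```

Supplier 3 has no shipments, so no row for Blake appears in the result.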
Question: So your definition of a relational system requires an unrestricted join capability?
Codd: Yes -- there can be no user-visible navigation links between tables. Any links "under the covers" must not restrict the user from making whatever joins the domains will allow.
Question: Does the end-user perform a join for the purpose of producing an output?
Codd: Well, the result of doing a join is another table, which need not be your final output. Actually, you may want to go on performing further operations before you finally output the results.
It's rather like doing arithmetic. When you do arithmetic, you add two numbers and you get another number, which you can then work on using more operators. The same is true of tables using relational operators.
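[Editor's illustration, not part of the interview: because every operator returns a table, operators can be nested like arithmetic expressions. The sketch below defines three hypothetical helpers (join, restrict, project) over lists of dictionaries and chains them on made-up data.]

```python
def join(left, right, column):
    """Equi-join two tables on a column of the same name."""
    return [{**l, **r} for l in left for r in right if l[column] == r[column]]

def restrict(table, predicate):
    """Keep only the rows that satisfy the predicate."""
    return [row for row in table if predicate(row)]

def project(table, columns):
    """Keep only the named columns and drop duplicate rows."""
    seen, out = set(), []
    for row in table:
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(columns, key)))
    return out

suppliers = [{"s_no": 1, "s_name": "Smith"}, {"s_no": 2, "s_name": "Jones"}]
shipments = [
    {"s_no": 1, "p_no": 10, "qty": 300},
    {"s_no": 1, "p_no": 20, "qty": 200},
    {"s_no": 2, "p_no": 10, "qty": 400},
]

# Each intermediate result is itself a table, so the calls compose.
big_shippers = project(
    restrict(join(suppliers, shipments, "s_no"), lambda row: row["qty"] > 250),
    ["s_name"],
)
```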
Question: Getting back to the precision of the relational model -- why is that a benefit to the user?
Codd: If you examine DBMS's that are not based on some clearly defined mathematical foundation, you find that the behavior of their structures gives a lot of surprises.
Let me give you an example. When hierarchic DBMS were first developed, they had a nice, clean, single hierarchy structure. As such they were manageable.
But then it was decided that you can't get all of the company's data reasonably into one hierarchy and that there is no valid reason, for example, for subordinating parts to suppliers, or vice versa. Actually you need to deal with several hierarchies at once that you can associate with one another in various ways.
Later versions of the hierarchic DBMS got much more complicated because links were thrown into the previously simple tree structures in order to support this need for association. Programs started working on the structures before their behavior was precisely defined. The result was complexity and this was due to the lack of precision.
Question: So when you talk about "precision," you are referring not just to the data model, but to the manipulation capability as well?
Codd: Actually, the data model is more than just structure -- it also entails manipulation. So the answer is yes. If you were to talk only about structure, you leave open an infinity of possibilities in behavior.
Incidentally, I believe this is one reason why CODASYL and ANSI have not gotten any data base standard put together. They made an absolutely fatal mistake in 1973 when they split the data language into two separate committees -- one for data definition (DDL) and one for data manipulation (DML).
Question: Can you give an example of how this issue of precision affects the manipulation capabilities of the DBMS?
Codd: The major example is simply this: Until the relational data base concepts came along, people used to regard query capabilities as an 'add-on' to the DBMS.
All application programming was done a record-at-a-time in one language with one set of data structures, while all query was done in a different language with a different set of data structures. When the DBMS was designed, little or no attention was paid to the query capability and architecture.
By contrast, in the relational approach, at least the data structures seen by the application programmer and those seen by the end user are identical. Some relational systems have even closer integration of query with transaction processing. Such systems make an entire end user data language (with all of its associated optimizing) available to application programmers also. SQL/DS and INGRES are examples of such systems.
Question: So the relational model helped bridge this gap between application programming and query capabilities?
Codd: The relational model started with the question, "How are people who know nothing about programming going to access or address data?" Furthermore, we then asked: if we could find an answer to that question, why wouldn't the answer also be good for application programmers?
Look what you've got when you've taken this approach. Suddenly you've greatly enhanced the communication between end-users and programmers because they're talking not only about the same data structures, but also about the same manipulations that can be performed on those structures.
Question: So the precision of the relational model is evident in its unifying of various areas of DBMS implementation?
Codd: The relational model is a prescription for DBMS implementation to comply with -- and a fairly tight one at that. Implementation does not become a matter of someone deciding at the last minute that repeating groups, or links of this or that nature, are needed. It prevents someone from saying, for example, that we need some new operations to extract information for some user because he wants to get at non-key fields. With the relational model, all this is prescribed in advance and you can avoid the patches upon patches that characterize many of today's DBMS's.
... On Normalization
Question: Where does the concept of normalization fit in with the relational model?
Codd: Before the normalization ideas came about, people defined records simply by stringing together fields they needed for particular applications. I remember many years ago looking at some records in a system developed here at IBM to support internal operations -- they were huge in terms of the sheer number of fields contained. Worst of all there was no rhyme or reason for particular fields being together -- no semantic considerations anyway.
Normalization is an attempt to capture tiny bits of the semantics of information in a formal way -- that is, to try to suggest the rules by which fields should be put together into a single record. These rules are not ad hoc -- they depend on identifying which fields are dependent on others in a rigorous fashion. These ideas, of course, are applicable to some degree to data bases under any DBMS -- not just relational systems.
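[Editor's illustration, not part of the interview: the idea that fields belong together according to which fields depend on which can be sketched with a small dependency check followed by a split. The helper holds, the order records, and the column names are hypothetical. A check against sample rows can only refute a dependency; whether it holds in general is a statement about the business, not about the sample.]

```python
def holds(rows, determinant, dependent):
    """True if no two rows agree on `determinant` but differ on `dependent`."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in determinant)
        value = tuple(row[c] for c in dependent)
        if seen.setdefault(key, value) != value:
            return False
    return True

# One wide record type, with fields strung together for a single application.
order_records = [
    {"order_no": 1, "cust_no": 7, "cust_city": "London", "part_no": 3},
    {"order_no": 2, "cust_no": 7, "cust_city": "London", "part_no": 20},
    {"order_no": 3, "cust_no": 9, "cust_city": "Paris", "part_no": 3},
]

city_depends_on_customer = holds(order_records, ["cust_no"], ["cust_city"])
part_depends_on_customer = holds(order_records, ["cust_no"], ["part_no"])

# Split so that each customer's city is recorded once, keyed by cust_no.
customers = sorted({(r["cust_no"], r["cust_city"]) for r in order_records})
orders = [(r["order_no"], r["cust_no"], r["part_no"]) for r in order_records]
```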
Question: So normalization is not part of the relational model per se?
Codd: No, I don't think of it as such. Normalization is an associated theory, but is not part of the relational model per se. Relational theory is much broader than the relational model alone.
Question: Could there be more than one relational theory?
Codd: Perhaps. There are advocates of a relational theory that limits you to relations of degree two -- that is, a binary relational model.
But right now I see relational theory as simply a body of theory to which many people are contributing in different ways. I don't see there being any sort of split into Theory A and Theory B at this time. Perhaps in the future but not now.
Question: Getting back to the concept of normalization, is it fair to say that its chief advantage is in yielding stability in the data base design?
Codd: I certainly believe that the normalization ideas, plus others, can contribute to a good deal more stability in data base definitions and structure. It should become less necessary to have reorganizations of the logical level.
But notice I said "less." There is no way you can get a definition of the data base that's going to be good for, say, manufacturing operations forever, because the environment may change. All we're trying to do is to get more stability.
Question: How do you assess the public reaction to the normalization concepts?
Codd: Well, there have been some strange reactions to it. First of all, DBA's and others who actually design data bases have not found it hard to use these concepts. Many people use normalization concepts regularly.
Second, computer scientists -- some computer scientists -- tend to look at normalization and say that, because it is a very formal recipe for information, it must be just a syntactic approach.
Now let me say that I don't view it as capturing anywhere near all the meaning that can be captured. But the point is that normalization takes a little bite of the data's meaning and casts it into a formally manipulable form. This doesn't mean it ceases to be semantic in nature, but just that it's more usable by people and understandable by machines than otherwise.
Question: Beyond normalization, where else might the semantic track lead?
Codd: There is a lot of semantic information that must be dealt with in the same rigorous fashion. The idea of "type hierarchies" is an example.
Suppose you have employees. You have certain information about employees in general, but then you have special information about employees who are engineers in particular. The same might be true for specialized technicians, secretaries, and so on. This would be a two-level type hierarchy.
To carry this a step further, employees might also be customers, stockholders, and so on. So there would be certain basic information kept about people, as employees, customers, and so on. Here you have a three-level type hierarchy.
Rules for manipulating such type hierarchies are important and imply the need for insertion and deletion rules. For example, if there is an entry for Jones as an engineer, there needs to be an entry for Jones as an employee. This is an example of a semantic notion that has been around for quite a while and has been getting quite a bit of attention in data base management lately....
Type hierarchies have been around in computer science for quite some time now -- they've actually been more prominent in artificial intelligence than in data base until recently.... In type hierarchies,... an object of any kind can be declared to be a subtype of an object of any other kind. The subtype automatically inherits all of the properties of the supertype. Thus, one may have subtypes of subtypes. There is a practical need for this....
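[Editor's illustration, not part of the interview: the rule that an entry for Jones as an engineer requires an entry for Jones as an employee can be approximated with a foreign key from the subtype table to the supertype table. The sketch uses Python's standard sqlite3 module; the table and column names are hypothetical, and the cascading delete is an assumed deletion rule, not one stated in the interview.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.executescript("""
    CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- Subtype table: every engineer row must point at an employee row.
    CREATE TABLE engineer (
        emp_id     INTEGER PRIMARY KEY
                   REFERENCES employee(emp_id) ON DELETE CASCADE,
        discipline TEXT NOT NULL
    );
""")

def insert_engineer(emp_id, discipline):
    """Return True if the insert was accepted, False if the DBMS rejected it."""
    try:
        with conn:
            conn.execute("INSERT INTO engineer VALUES (?, ?)", (emp_id, discipline))
        return True
    except sqlite3.IntegrityError:
        return False

rejected = insert_engineer(1, "civil")      # no employee entry for 1 yet
with conn:
    conn.execute("INSERT INTO employee VALUES (1, 'Jones')")
accepted = insert_engineer(1, "civil")      # supertype entry now exists
with conn:
    conn.execute("DELETE FROM employee WHERE emp_id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM engineer").fetchone()[0]
```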
... More on the Relational Approach
Question: In terms of accessing data, how does the relational model differ from traditional techniques?
Codd: The approach of the relational model was to discard all the traditional addressing concepts, in particular something that's been with us for a long, long time -- at least since plugboard days -- which is positional addressing. The idea behind the relational model is to be able to address any item of data in the data base by means of a combination of table name, primary key value (which gets you down to a row), and then a column name, which gets you a particular data item.
None of that is position-oriented. Who cares whether the table "Parts" precedes the table "Suppliers"? Nobody. Who cares whether the entry for Part #3 precedes the entry for Part #20? Well, somebody might care on a report, but we can give him an output ordering capability for that. Who cares whether the column "employee age" comes before or after the column that identifies "sex"? Nobody, as far as I know.
Normally, when you're dealing with large quantities of data, you really don't care what position anything is in. With the relational model, you get down to this very simple, completely associative addressing approach.
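[Editor's illustration, not part of the interview: addressing by table name, primary key value, and column name can be sketched with nested Python dictionaries. The lookup function and the sample data are hypothetical. The second structure holds the same content with tables, rows, and columns listed in reverse order, and every address still resolves to the same item.]

```python
database = {
    "Parts": {
        3: {"name": "bolt", "weight": 12},
        20: {"name": "nut", "weight": 5},
    },
    "Suppliers": {
        1: {"name": "Smith", "city": "London"},
    },
}

def lookup(db, table, key, column):
    """Address one data item by table name, primary key value, and column name."""
    return db[table][key][column]

# Same content; tables, rows, and columns are listed in the opposite order.
rearranged = {
    table: {
        key: dict(reversed(list(row.items())))
        for key, row in reversed(list(rows.items()))
    }
    for table, rows in reversed(list(database.items()))
}

item = lookup(database, "Parts", 20, "name")
same_item = lookup(rearranged, "Parts", 20, "name")
```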
Incidentally, this associative addressing capability is something that is often completely overlooked when people compare different data models. They just ignore all the operators and directly compare tables with hierarchies and networks. I've seen that happen over and over again, and it is a serious error.
Question: Given the current state-of-the-art in hardware and software -- and we have no innovations in storage technology yet -- do you see any types of applications for which a relational system might not be well-suited?
Codd: The only area that I can see where a relational system might not be well-suited is where a customer has installed a non-relational system and has a lot of programs already written to run on it. Clearly, that customer has an investment, which he has to think twice about before abandoning.
Apart from that, I don't see any other restrictions at this time in applying relational systems. There's nothing inherent in the use of a high-level language -- especially if it is compiled rather than interpreted -- that prevents one from getting performance roughly equivalent to any non-relational system on comparable tasks.
Question: How about any limit to the number of concurrent users?
Codd: What are the limits on a non-relational system? Every particular implementation has its limits. What I'm saying is that you won't find any sudden peculiar limits because it's relational.
Question: Perhaps the key is the idea of "comparable tasks." You would have to have indexes or links on the fields most heavily used for joins, or else you will have serious problems with performance. Isn't this correct?
Codd: Right, you would very likely have to have comparable physical structures under the covers, so to speak, although there is debate even on that. There are many other ways to optimize structures and systems for performance without going to links. For various reasons -- including distributability -- many people would like to avoid them. The important point is that even if they are there, they are under the covers where they can't impact application programs if, for example, you decide to change or replace them.
Question: Can you sum up what you think the chief advantages of the relational approach are?
Codd: Yes, I think there are three: productivity, communicability (between different kinds of users), and distributability.
Question: Briefly, why do you say "distributability"?
Codd: Because you can decompose tables very flexibly -- chop them up horizontally or vertically -- and assign them to various nodes in a network.
Also, you've got the high-level language for expressing transactions that you send across the communication lines. This greatly reduces the number of bytes that shuttle back and forth.
In addition, you can readily decompose the data requests into sub-query components that are individually relevant to the various distributed sites. This decomposition capability is extremely important. It is something that you can't do readily with any non-relational system.
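[Editor's illustration, not part of the interview: chopping a table horizontally (by rows) or vertically (by columns) and putting it back together can be sketched as follows. The employee data, the site names used as fragment criteria, and the helper by_id are hypothetical, and the fragments are plain lists standing in for data held at different nodes.]

```python
employees = [
    {"emp_id": 1, "name": "Jones", "site": "London", "salary": 900},
    {"emp_id": 2, "name": "Smith", "site": "New York", "salary": 1100},
    {"emp_id": 3, "name": "Blake", "site": "London", "salary": 1000},
]

def by_id(rows):
    """Sort rows by primary key so that two tables can be compared."""
    return sorted(rows, key=lambda r: r["emp_id"])

# Horizontal decomposition: each node holds the rows for its own site.
london = [r for r in employees if r["site"] == "London"]
new_york = [r for r in employees if r["site"] == "New York"]
from_horizontal = by_id(london + new_york)      # reassembled by union

# Vertical decomposition: each fragment keeps the primary key.
directory = [{"emp_id": r["emp_id"], "name": r["name"], "site": r["site"]}
             for r in employees]
payroll = [{"emp_id": r["emp_id"], "salary": r["salary"]} for r in employees]
salary_of = {r["emp_id"]: r["salary"] for r in payroll}
from_vertical = by_id(                          # reassembled by join on emp_id
    [{**d, "salary": salary_of[d["emp_id"]]} for d in directory])
```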
Question: Is there research going on now in this area?
Codd: Yes. In our laboratory we have a system under development wherein we are contemplating doing joins between data in London and data in New York, and we're figuring out the best way to do that. There are published papers on the underlying architecture of this research system, which is called R*.
Question: Would you care to make any predictions about when the relational approach will become the predominant commercial technology in the industry?
Codd: No, I'm not saying when. I do think the number of installations will exceed all the others in a decade or so.
... On Data Base Theory and Practice
Question: With regard to the work of CODASYL and the implementations it has produced, do you think that overall it has had a negative or a positive influence on data base theory and practice?
Codd: I believe there has been a positive contribution in the following sense. When you set up a completely new data base from scratch, I believe that the data structure diagrams -- sometimes called Bachman diagrams -- are a helpful tool. They do have limitations, but they can be useful in getting the data into a more organized form.
What happened, I believe, was that, because the I-D-S structure lay behind the CODASYL proposal, they took those links that were in the data structure diagram and cast them into I-D-S type links that had to be navigated by the programmer. It was that step that was retrogressive.
I have no quarrel with the use of links on paper in the preliminary stages of data base design, so long as you very carefully distinguish that from the embedding of those links into data structures that the application programmer must navigate.
Question: So links at data base design-time are OK?
Codd: Designers need to scratch around a lot before deciding how to organize the data base. As I mentioned before, normalization is only a small part of that. I have no objections to people drawing dependency graphs between entities on paper, but this linked representation is only a halfway house in achieving the final implementation.
Question: And what about that final implementation?
Codd: When you get to the relational representation of the data, then you declare your underlying domains. These are the value sets from which the columns get their values. The fact that a column in one table gets its values from the same domain as a column in another table expresses the fact that there exists a certain relationship -- namely that the tables can be joined.
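[Editor's illustration, not part of the interview: a toy catalog that records the domain of each column, and a scan that lists the relationships implied by shared domains. The table names, column names, domain names, and functions are hypothetical.]

```python
from itertools import combinations

# (table, column) -> declared domain
catalog = {
    ("Parts", "part_no"): "PART_NUMBER",
    ("Parts", "weight"): "WEIGHT",
    ("Shipments", "part_no"): "PART_NUMBER",
    ("Shipments", "qty"): "QUANTITY",
    ("Shipments", "s_no"): "SUPPLIER_NUMBER",
    ("Suppliers", "s_no"): "SUPPLIER_NUMBER",
}

def joinable(col_a, col_b):
    """Two columns may be matched in a join only if they share a domain."""
    return catalog[col_a] == catalog[col_b]

def implied_relationships():
    """List every pair of distinct columns defined on the same domain."""
    return [(a, b) for a, b in combinations(sorted(catalog), 2) if joinable(a, b)]

relationships = implied_relationships()
sensible = joinable(("Parts", "part_no"), ("Shipments", "part_no"))
nonsense = joinable(("Parts", "weight"), ("Shipments", "qty"))
```

No link between Parts and Shipments is stored anywhere; the relationship follows from the two part_no columns sharing a domain.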
Question: Doesn't this approach make using the data base somewhat more difficult than under a network system where the links are visibly defined?
Codd: No, and especially not with a relational system where you must explicitly define domains as in Query By Example. Unfortunately, SQL/DS does not currently support the domain concept as strongly as one might like. Nevertheless, even with SQL/DS, users do not appear to find joins difficult to formulate.
Why shouldn't the catalog, or directory if you like, for a relational system contain whatever further information of a semantic variety is needed to show where you can couple this field with that? Actually, I have developed an extended relational model that has a considerably enriched catalog, which is about semantic connections -- logical links if you like -- between tables. This extended version also handles data type hierarchies and complex relationships -- interrelationships of interrelationships -- all without altering the basic tabular structure of the relational model but, instead, by having enriched entries in the catalog.
Question: Perhaps this issue of explicitly-defined domains is going to be more important to the larger, more integrated relational applications than to the single-user, start-up type applications?
Codd: Yes, as you get further into a more integrated data base you find that the ratio of the overall number of columns to the number of actual domains increases significantly. For example, in a financial data base, you're going to find very many columns defined on the 'dollar' domain.
Question: Isn't one strong point of the network approach that it encourages the user to think more specifically about data integration?
Codd: On the contrary, it is a weak point because it encourages the user to confine his thinking to a few pre-defined relationships. Moreover, if the initial design were wrong, the user may be locked in because of the application programs, which would be impacted if a change in network links were made. By contrast, domains are the glue that holds things together in the relational approach, and they do so without impairing distributability!
Domains have a lot more scope than links in the CODASYL model. Whenever two columns in a relational data base are defined on the same domain (whether the columns are from the same or different tables), there is an implied relationship, which users may exploit -- whenever they so desire -- using the relational operators. Such exploitation is in no way dependent on pre-defined network links.
Let me say one more thing about this. I believe that people starting from scratch become productive much faster with relational systems than they do under the CODASYL approach.
... Answering Critics
Question: Then why are relational systems finally appearing just now?
Codd: The best way for me to answer that is this: There are many people at IBM who are very concerned about protecting both IBM's and the customer's investment in IMS. It's my personal opinion -- and this is not necessarily the opinion of anybody else at IBM -- that IBM has been, and still is, overprotective.
What we should do is to put these things out in the marketplace and see what customers really want to do with them. Let the customers decide. If the customer has tremendous investment in IMS, then he should protect that investment. And IBM, I am sure, will help him protect that investment.
But if the customer makes the decision that he would be better off moving to a relational data base -- either in addition to, or instead of IMS -- then he should have that choice to make. ...But at some point in the future -- and nobody knows exactly when -- they'll probably begin finding the productivity, communicability, and distributability arguments to be so strong that the expense of a transition will be worth it.
Now IMS may go on forever in some installations, although forever is a rather long time! It's very hard to predict....
Question: What do you say to those critics who say that the relational theory is a nice academic body of work, but beyond its being suitable for the writing of academic papers, that it is not all that applicable to the real world?
Codd: I have two complementary answers for that: One is that it sounds a bit like sour grapes from those who lack the academic footing. The other is that the relational approach is becoming successful in the marketplace in spite of that alleged handicap. Look at Tandem -- the only DBMS they offer is a relational system. So I think this criticism is really vacuous.
... On the Future
Question: In data processing it seems like as soon as a new type of system is engineered and marketed it is already obsolete. Are you afraid this may also happen for relational systems?
Codd: The applicability of relational systems has hardly been explored yet. We've got a long way to go improving the relational systems we have and applying and marketing them. Moreover, I see nothing on the horizon about to replace the relational approach....
I believe there's such a large scope for relational systems that they will not become obsolete in a hurry. By the turn of the century we may have something in view that's a lot more powerful, I don't know -- but I'm sure it's still years away.
Question: One benefit of the relational languages is their boosting of productivity. In your own mind have we gone as far as we can with enhancing the processes by which applications are built?
Codd: Absolutely not. All that the relational techniques give you is a quick way of handling the shared data components of your application. You need a lot more than that to handle the other parts of your application.
For example, your application may involve things such as income tax rules. You would like to be able to put those rules either in the data base or perhaps in the application if they're peculiar to that particular application. You'd like to be able to modify the rules easily, but our present languages don't allow you to do this. That's just one example.
Question: Do you expect progress in this area?
Codd: Yes, definitely. I consider it to be complementary to relational data bases. For example, we can expect much better editors that know the languages we are using and that can trip us up if we're doing things that look senseless.
Question: With the recent release of SQL/DS and your Turing Award honor, some might be tempted to look upon this timeframe as a watershed in commercial relational technology. What's your own feeling on this matter?
Codd: I disagree. My feeling is that it's not really a watershed in relational matters, in either theory or implementation. We've got lots of scope, say for improving our methods of handling missing information. There's the whole issue of supporting multiple-valued logics. There's lots of room for development of theory and optimization methods for data that's distributed and redundant. I could go on and on with the list.
Question: So you see plenty of challenges still awaiting action?
Codd: Absolutely. I think a time will come when we will have much less concern for the physical aspects of the data base. In the past, those have been our main concerns.
Until the relational model came along, in fact, almost all our thinking was dominated by constructing efficient access paths for specific applications. I believe -- and it's almost foreseeable now -- that there will finally come a time when hardware, as well as a low level of software or firmware, will take that burden away from us once and for all.
References
[1] E.F. Codd, "Database Management Systems: DB2 and IMS ~ A Response to the Newsletter Interview with IBM's Lois Dimpfel," Data Base Newsletter, Vol. 15, Number 1 (January/February 1987), p. 15. [Copyright, 1986. E.F. Codd / The Relational Institute. All rights reserved.]
[2] "A Newsletter Interview with Edgar F. Codd," Data Base Newsletter, Vol. 10, Number 3 (May 1982), p. 4.
[3] E.F. Codd, "Why Choose a Relational DBMS?" Data Base Newsletter, Vol. 14, Number 2 (March/April 1986), pp. 12-13. [Copyright, 1986. E.F. Codd.]
[4] E.F. Codd, "Questions and Answers Concerning Relational Languages," Data Base Newsletter, Vol. 15, Number 4 (July/August 1987), p. 6. [Copyright, 1987. E.F. Codd and The Relational Institute. All rights reserved.]
[5] E.F. Codd, "Questions and Answers Concerning Relational Languages," Data Base Newsletter, Vol. 15, Number 4 (July/August 1987), p. 7. [Copyright, 1987. E.F. Codd and The Relational Institute. All rights reserved.]
[6] E.F. Codd, "Guest Say-so," Data Base Newsletter, Vol. 10, Number 4 (July 1982), p. 2.
[7] "A Newsletter Interview with Edgar F. Codd," Data Base Newsletter, Vol. 10, Number 2 (March 1982).
[8] "A Newsletter Interview with Edgar F. Codd," Data Base Newsletter, Vol. 10, Number 3 (May 1982).