Maarten's Sense of Data

This blog was created to stimulate discussion and the exchange of ideas about all sorts of subjects related to the fascinating world of data. I would like to use the blog for organizing my own thoughts and ideas, and for inviting other people who share an interest in the intriguing world of data to respond to the postings. So please feel free to react and contribute! Disclaimer: This blog represents my own personal views, and not those of my employer or any third party.

Monday, April 16, 2007

If you want a Holy Grail

To anyone who feels addressed

If you’re still looking for that one single

• application that will support all of your processes
• database that will hold all of your data
• communication standard that will empower all your communication channels
• language that everyone will speak exclusively in order to exterminate all miscommunication
• programming language that can efficiently describe all of your wishes
• modeling standard that all of your models will be stated in
• infrastructure product that will underlie all of your automated information processing
• technology that will make all your dreams come true
• theory that will shed a light on all the misty parts of reality
• good idea that will make all the bears on your path go away
• view on reality so that people will no longer disagree
• group of people that will carry out all of your work
• business model that will survive the next century

, or that one single

• vendor that will supply all the stuff you need,

then I suggest you keep your Big Search for the Holy Grail alive!

It’s like with the ‘real’ Holy Grail. People have been looking for centuries, but never found anything substantial…

The question is whether you can live without your Grail or not. If not, I wish you a lot of success. Well, it’s all about the way to the goal, right?

For those who didn’t yet get the idea: this posting is meant to bring the need for loose coupling as a harmonization method into the picture. Did I hear someone suggest ‘LC for Holy Grail!’? ;-)

Monday, March 19, 2007

CDM Myth 2: External Standards in your CDM

Some time ago an anonymous visitor showed interest in the relationship between a CDM and an external standard. I promised to come back to that subject. Here's my first attempt to try to explain this relationship.

An often-heard idea about the creation of a CDM is that you can simply add the contents of a relevant external standard to your CDM, thereby efficiently creating a CDM with high-quality content.
CDMs and external data or communication standards (abbreviated ‘ES-es’ in my terminology) are both collections of agreements, ‘standards’ that can be applied in a loosely coupled environment. But there are important differences as well, as Table 1 indicates.

These differences require different applications of the standards. The goal people had in mind when designing their external standards was to enhance the communication processes between organizations, not to standardize an organization internally. This goal is comparable to that of the artificial language Esperanto. This language was not created as a replacement for any existing natural language in use at the time, but to offer people a second language to be used for communication between persons speaking different native languages.

Because external standards are not designed to look after the interests specific to your own organization, they shouldn’t be incorporated as a given in any organization-specific CDM without thorough consideration. The only correct way to use an ES for the construction of a CDM is by identifying interesting parts in it, screening these parts for desirability and for consistency with existing standards, guidelines and principles in your organization, and incorporating them as ‘borrowed ideas’ in your CDM, in accordance with the construction principles valid for this CDM. Done this way, an organization keeps its self-determination and its own identity, and it can determine and realize its own competitive strategy.

It’s important that the dependency between the ES and the CDM be broken immediately after the borrowing of the ideas has been completed. The CDM should not be regarded as being ‘standardized’ by this external standard; it has only borrowed some useful ideas, that’s all. See Picture 1 for an illustration of the process. Keeping the dependency endangers the organization’s flexibility and its self-determination or sovereignty. Changes in the ES or a choice for another ES would then have to be carried through in the CDM as well. Moreover, your CDM could then be changed only after the change has first been adopted by the ES, which is something no organization can force a standardization institute to accept.
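To make the ‘borrow, then break the dependency’ idea a little more tangible, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the dictionary standing in for an external standard, the entry names, the `borrow` helper and the screening rule are all hypothetical and not taken from any real standard or from our own CDM.

```python
import copy

# Hypothetical external standard (ES); names and definitions are invented.
external_standard = {
    "PostalCode": {"definition": "Code identifying a postal delivery area", "format": "string"},
    "TrainService": {"definition": "Scheduled run of a train", "format": "record"},
}


def borrow(es, wanted, is_acceptable):
    """Copy screened ES entries into a new dict that no longer refers to the ES."""
    borrowed = {}
    for name in wanted:
        entry = es[name]
        if is_acceptable(name, entry):
            # deepcopy breaks the dependency: later ES changes do not leak in
            borrowed[name] = copy.deepcopy(entry)
    return borrowed


# The screening rule is an assumption: accept only string-formatted entries.
cdm = borrow(
    external_standard,
    ["PostalCode", "TrainService"],
    lambda name, entry: entry["format"] == "string",
)

# The ES evolves afterwards; the CDM keeps what it borrowed, unchanged.
external_standard["PostalCode"]["definition"] = "Changed by the standards body"
```

The point of the copy is that the CDM owns its content from then on: a later change in the ES has to be screened again before it can enter the CDM.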

Complications in the use of an ES as a CDM building block
In addition to this flexibility issue discussed so far, there are a few reasons to be very careful with the use of an ES as a CDM building block.

Mutual inconsistencies. There can easily be many inconsistencies between different external standards. This makes blindly incorporating an ES into a CDM a risky thing to do. One cause of such inconsistencies is the fact that ES-es often overlap. This probably doesn’t happen very frequently with regard to their model of the primary domain they’re intended for, for example ‘transportation by train’ or ‘postal code representation’, but there’s another tricky property of many ES-es: they don’t standardize on just one dimension. I’ve seen many standards that, in addition to standardizing the data involved in their domain, offered standards about many other things as well, like message structures, the language chosen to represent the messages in (XML, for example) and the precise way in which the messages should be modeled in this language (something like ‘best practices XML’). It’s not hard to see that these standards do not agree on all dimensions. To make things even more complicated, many organizations, ours included, probably have their own construction principles on some of these dimensions. A way out of this web of inconsistencies is to borrow ideas from within one single dimension only, ideas that do not cross any dimension borders, and to incorporate only content that is consistent with the standards you happen to adhere to yourself.

Weak semantic standardization. A second complication pops up when an external standard is only weakly standardized with regard to data semantics, which actually means it is ambiguous. The result of this is that parties who adopted such a standard will likely use it in different ways. Although at first sight there seems to be nothing wrong, both parties using exactly the same technical message structure and the same symbols for representing their data, there may be a ‘hidden semantic problem’ at work. Semantic harmonization requires quite a thorough scrutiny of each party’s precise semantics. This is not an easy thing to perform. So, although these external standards are called ‘standards’, it seems some of them are more standardized than others. I expect this to happen rather often. It seems at least wise to be very critical about the so-called ‘intended interpretations’ of the standardized messages involved, and also about the semantics of the data that are actually transmitted using these messages.
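A small, purely hypothetical sketch of such a ‘hidden semantic problem’: both parties accept the very same message, yet they read one field differently. The field name `delay` and the two interpretations (minutes versus seconds) are assumptions chosen only to illustrate the point.

```python
# Both parties use the same message structure and the same symbols.
message = {"journey_id": "A17", "delay": 5}


def party_a_delay_seconds(msg):
    # Party A (hypothetical) reads 'delay' as a number of minutes
    return msg["delay"] * 60


def party_b_delay_seconds(msg):
    # Party B (hypothetical) reads 'delay' as a number of seconds
    return msg["delay"]


a = party_a_delay_seconds(message)
b = party_b_delay_seconds(message)

# Technically nothing is wrong with the message, yet the parties disagree.
semantic_mismatch = a != b
```

No schema validation would catch this, because the message is technically valid for both parties. Only a description of what `delay` states about the journey can expose the difference.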

CDM Myth 1: The CDM as just a means to achieve loose coupling

In modern IT, people are very busy with systems integration and also, albeit somewhat less so, with canonical data models (CDMs). The demands that customers, regulatory organizations and business partners impose on businesses nowadays cry out for effective communication processes, internal as well as external, and a flexible IT.

A CDM can help you meet these demands. A critical condition for success, however, is an appropriate application of the concept of a CDM. This posting and some others that are yet to come argue that the concept is often misunderstood, thereby leaving room for inappropriate use. This posting is the first in a series of so-called CDM Myths, which is meant to offer insights into the world of canonical data modeling and to contribute to a discussion about how to correctly apply a CDM.

In this earlier posting I argued that I thought it unwise to limit the use of a CDM to only one application. I added a short list of some applications of a CDM that we have in mind. This early posting covers just about everything that we call ‘Myth 1’. Because I plan to write extensively about the applications of a CDM in the near future, I will leave this first myth at what was said in the earlier posting.

Wednesday, January 17, 2007

Data Formats, Semantics and Meanings in a CDM

In earlier blog postings, I’ve argued that in order to enable powerful applications for a canonical data model, it will have to contain semantic data descriptions in addition to the mere technical formatting descriptions that can be found in almost any data model. This idea has yet to become commonplace. One cause of our reluctance to incorporate semantic descriptions probably is the big melting pot of buzz words that seems to inevitably accompany each new idea in IT (SOA is another one!). The idea of a CDM comes with its own pot, of course, thereby creating a lot of confusion. This posting is meant to give some essential terms a place, hopefully increasing people’s understanding of the world of the CDM.

This one melting pot contains, among others, such terms as data semantics, data meaning and data format. What’s their significance, and how should they be applied?

One important notion is that data have a whole lot of characteristics. Therefore, you can find a number of different types of data models, each providing you with its own set of metadata: data describing your business data. You can, for example, think of the data’s data types, their field lengths (sounds familiar so far?) and their semantics, their owners and users, etcetera. A simple organizing model that you might find useful is the one below.

From this little model, it can be read that data
1. are communicated to perceivers
2. refer to things in a ‘world’
3. are represented (stored and/or presented) in information systems

These three aspects are fundamental to any type of data or data use. This view can shed light on the way our terms can be used best. For instance, each fundamental can be associated with one of our three buzz words. Here’s the same picture, now with the buzz words filled in.

So, according to this model, the perception of data is linked to data meaning, the data’s reference to a world is associated with data semantics, and data representations have something to do with data formatting. Let’s delve a little deeper into the nature of these associations.

Technical Formatting
This probably is the only aspect of data that most of us are familiar with. It results from the fact that data only ‘exist’ in our world if they are represented somewhere physically, in a computer system’s memory, on a piece of paper, in a brain or in whatever thing that can hold or represent data. In this light it’s not surprising that almost all data models we have are directed to this aspect of the data, giving insight to their data types, data masks, units of measurement, and the like.

Technical formatting can be divided into two parts: one to describe those data structure characteristics that are independent of the physical systems that actually represent or hold the data, and one that covers physical implementations in specific physical systems. All the formatting attributes mentioned above are of the system-independent type. Examples of physical implementation attributes are endianness and word length.

It’s important to realize that formatting has no impact whatsoever on semantics. A specific semantics can be realized technically in a vast number of ways. As an example, metrical objects (a matter of semantics) can be realized using numerical data types, but also by applying non-numerical types like strings.
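As a minimal sketch of this point, the Python fragment below represents one and the same fact, a length in metres, in three different technical formats. The value and the helper names are hypothetical; the only claim is that all three formats carry the same semantics once read correctly.

```python
from decimal import Decimal

# One fact about the world: "the platform is 340.5 metres long" (value invented).
as_float = 340.5        # numerical data type
as_string = "340.5"     # non-numerical data type
as_decimetres = 3405    # integer, scaled to tenths of a metre


def metres_from_float(value):
    return Decimal(str(value))


def metres_from_string(value):
    return Decimal(value)


def metres_from_decimetres(value):
    return Decimal(value) / 10


# Three different formats, one and the same statement about the platform.
readings = {
    metres_from_float(as_float),
    metres_from_string(as_string),
    metres_from_decimetres(as_decimetres),
}
```

Note that the knowledge needed to read each format correctly (the unit, the scaling factor) is not in the format itself. That is exactly the part a semantic description has to supply.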

Semantics
It’s too bad that, unlike data formatting, data semantics are so invisible and intangible to us, because this is really what data are all about. Data semantics constitute a world of concepts (ideas, which we cannot see), rather than of symbols. Symbol juggling is probably due to this invisibility, giving rise to a number of problems related to the way we treat our data.

At this point, it’s important to have a good understanding of what data actually are. As you can gather from the symbol juggling posting, they are not the symbols that we see on our computer screens, for instance. Let’s define data as being statements about objects in a ‘world’, whether true or false. This ‘world’ can be either real or imaginary. The objects are whatever ‘things’ are relevant to an intelligent actor in, or perceiver of, that world. So, data always describe small parts of situations in that world. Data semantics, then, should be conceived of as the content of these statements: that which is being stated about that object in that world. This is the reference-to-things-in-a-world aspect of data.

Meaning
Just like data semantics, meaning in itself is invisible. Meaning is, in my view, defined roughly as ‘the consequence of the perception of an event’. Although I use this definition in a broader, more philosophical sense, in this discussion it’s restricted to the world of psychology: When you perceive an event that happens, you extract information (data) from it in the form of semantics, among other things like tone of voice or timing. Then, on the basis of all information extracted, your brain starts to construct meaning by activating inference processes. This meaning you construct is what the event means to you, now. Hence, meaning is extremely dependent on the availability and accessibility of knowledge in an intelligent object (not necessarily a human being!) at a very specific moment in time. It’s clear that meaning understood in this way cannot be modeled formally, not even in the most sophisticated CDM we can think of. Still, this concept is relevant in the domain of the CDM, partly because it plays a role in communication processes. In addition, it can be used to make clear that this is NOT what we’re trying to model in our CDM.

Conclusions
Now, to conclude, data formatting is what you describe in a technical data model (logical and maybe also physical, if need be), data semantics is what you clarify in a semantical (or ‘conceptual’) data model, and data meaning is what I hope you’ll never try to formalize in any kind of data model, but what you should keep in mind in order to create really effective communication processes. Although effective communication strongly leans on shared data semantics, it really is much more than that, resulting from the fact that its effectiveness is not so much measured in terms of a mutual understanding of the things stated as it is in terms of its effect on the listener being in accord with the speaker's intention. Communication is, after all, a means for an actor to indirectly create an effect in a perceiver.

By the way, for our CDM we’re using the model types briefly described in this posting.

Friday, January 12, 2007

I’m Starting to Feel Sick Now

It made me sad, you know. I recently read an article written by an accepted expert in the field, about ‘What Everyone Needs to Know about SOA’. The author wrote about the value of reusability of software functionality and described how we’ve been improving the way functional components can be ‘picked up’ for reuse.

A nice story indeed, but given the title of the article, which promises to explain the major drawbacks of the use of Service Oriented Architecture, things everyone needs to know about, IMHO the author neglects to mention what is possibly the single most important issue left.

Even if you can really easily pick up your functional components and use them in a new context (and SOA might make this possible), if you don’t have a way to agree on exactly WHAT functionality you need and exactly WHAT functionality is offered, you will not be able to leverage the possibilities of reuse whatsoever.

The article I read gives a highly technical view on SOA, although the author added some catches, challenges and advice from a broader perspective. But still, no reference to data semantics, to the idea of WHAT data exactly to exchange. We see this happen in so many writings and presentations about systems integration in general and SOA in particular. And it makes me sad, because every expert in the field can at least see and understand the need for semantic agreement in any kind of collaboration, or am I wrong? For how long have we been connecting systems now?

Why then does almost everyone in the field neglect this subject completely? How can it be that an article about What Everyone Needs to Know about SOA lacks even a simple reference to this matter?

I accuse SOA vendors of making big promises while intentionally leaving their customers alone with this important issue. Not one of the vendors I spoke to could be of any real help here, for example by giving insight into what it’s all about, or by offering powerful semantic tooling functionality. So I undertook this endeavor myself, not being a complete greenhorn in the matter. But this practice is not helping SOA to be accepted and leveraged!

I can’t state it often enough: Reuse of functionality and communication in general can be of any value only if both the HOW and the WHAT are taken into account. You will need semantic descriptions. Be prepared for this! That’s a big part of what this web log is about.

You know, in the Dutch language, ‘SOA’ also stands for the range of sexually transmitted diseases. And having been in this world of SOA for some time, I am starting to feel bad now…

If you want to read the specific article yourself, it’s located here.

Friday, December 22, 2006

Bridging Time and Space needs Semantic Description

Recently I was occupying myself with the questions of why we store our data, what kinds of data storage we have available these days (see Where do you live and what do you do? -Why, does it matter?), and what the consequences of storing data would be. I wasn’t only thinking of storage places like databases or more fluid memory structures. Messages would fit the idea just as well, or paper forms.

A teacher of mine once taught me that data registrations are meant to bridge time and space. Without them, our data would have vanished before we knew it. We could only pass our data between their users in a very immediate fashion, like throwing a hot potato that must not fall on the ground.

I realized that our data need not be held anywhere only if we can reliably deliver the data created by some process to the processes that need them, immediately after the moment they are created, and if the receiving processes in turn can use the data in their execution immediately thereafter. If we can’t do all of this, we certainly need data storage. Big thing, right?

This would be like calling a function or procedure in a computer program with a number of parameters. If the data in the parameters are of no use at all except for these direct function calls, then there is no need for storing them anywhere. We would never want to retrieve them then, would we?

In real situations outside the realm of computer programs, this rarely occurs. Data are often needed for a number of different processes, some of which may not even be known at the time the data were created. Business Intelligence processes are an example of this. Can you foresee in advance what ad hoc analysis and reporting to expect for the next few years, or even months, or weeks? I certainly can’t!

Moreover, it’s common to have processes that can’t be executed immediately after the data they need become available. Sometimes, a process needs data from multiple sources that do not create them simultaneously, because they don’t run at the same time.
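The following sketch illustrates this with a process that needs facts from two sources that do not deliver at the same moment. The in-memory dictionary stands in for any kind of data registration, and the source names, order id and facts are all invented for the example.

```python
# A tiny in-memory store stands in for any data registration
# (database, message queue, paper form).
store = {}


def record(source, order_id, fact):
    """Remember a fact from one source until the consuming process can run."""
    store.setdefault(order_id, {})[source] = fact


def try_invoice(order_id):
    """Runs only when both facts are available; until then the data must be kept."""
    facts = store.get(order_id, {})
    if {"sales", "warehouse"} <= set(facts):
        return f"invoice {order_id}: {facts['sales']} / {facts['warehouse']}"
    return None


record("sales", "o-1", "3 items ordered")
first_attempt = try_invoice("o-1")       # the warehouse has not reported yet

record("warehouse", "o-1", "3 items shipped")
second_attempt = try_invoice("o-1")      # both facts are now available
```

Without the store, the sales fact would be gone by the time the warehouse reports. In a direct function call the parameters would suffice, but here the registration is what bridges the gap in time.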

So, we indeed need to be able to bridge time. We need to remember facts that are discovered in some process, in order to be able to feed them as data to one or more other processes at a later moment in time. We often do this with structured data, but it is no different with unstructured data. Think of any kind of writings like e-mails, internet pages, text documents, and so on. And text strings often should really be thought of as having unstructured content as well, shouldn’t they? And all of these things get stored for later use.

Secondly, not all processes can be executed on the same physical spot. Hence the need for being able to bridge space with our data stores. We need to transmit our data between physical places.

I believe most people are willing to agree with me that bridging space is a form of communication. But what to say about time? Imagine we only want to use our data later on in the very same process that created them, in the same physical place, so, only bridging time, not space. Couldn’t we say we were communicating to ourselves then? Communicating to our future selves, so to speak? Or is this a strange thought coming from a weird mind? Should we say, maybe, that this would not be ‘one process’ then?

My conclusion from all this is that data registrations are always a matter of communication. Data registrations are used for communicating data between two communicators, likely bridging time and space, or between two different roles of one single communicator, at least bridging time, but maybe space as well (where will you be next week?).

Let me come to my point now. I guess you are aware of the dangers of a misinterpretation of data that are exchanged in any kind of communication. If you are, wouldn’t you agree that whenever time bridging or space bridging or both of them could be involved for a current, planned or not even anticipated application of a data storage place, some mechanism should be provided to support any possible data user, regardless of the applications she has in mind, in making the right interpretations?

Well, I know that different situations may not all require the same ambitious solution, but at least when different users or applications are identified, planned to be identified, or might conceivably be identified, or when a user wants to use, plans to use or could think of using her data at least, say, one week after creation, I recommend describing the semantics of the data in question formally, in a conceptual data model. This should become ‘good practice’ to anyone in the business. It shouldn’t be IT where this desire comes from. In most cases it still does, however, probably because for some reason some IT people first gained the insights. Have you seen a book on the relevance of data semantics on the business bookshelves lately? Well, here’s a classic, a very good start, although not coming from the business bookshelf: Data and Reality, by William Kent, second edition (2000, ISBN 1-58500-970-9). It was originally written in the 1970s, but it’s probably more relevant than ever before. Reading chapter 1, for instance, which I actually did twice, almost brought tears to my eyes. Not because of a romantic plot in the book, but because of the worrisome situation in Data Land described in this book, directing us to the work that lies ahead.

You may not be able to think of it right now, but sooner or later, someone will come up with an interesting new application for your data. Or you yourself will throw a hungry look at someone else’s data. We are only beginning to experience what data can do for us, whether it be our own or someone else’s. Marvellous world, isn’t it!?

Wednesday, November 22, 2006

The Fight for the Terms

In attempting to improve communication in your organization, you will very likely meet the Semantic Problem: people and automated systems often have communication problems caused by the fact that they interpret messages in different ways. One aspect of this problem is that the terms that are used in the messages may have multiple interpretations. In trying to get rid of this so-called ambiguity, people often suggest to standardize the use of terminology. In my view, there are some serious weaknesses to this approach, and I suggest an alternative.

There are far fewer terms available for naming our concepts than there are concepts that we use. The result of this shortage of terms is that they are used to denote multiple concepts, so they become ambiguous. These terms are called homonyms: terms that can be assigned more than one interpretation (concept). This ambiguity of terms can be a very nasty cause of communication problems in general. The solution to these problems is to ‘harmonize every existing disharmony’: remove existing differences or bridge them.

Everyone wants to use simple, recognizable, clear terminology for the naming of their concepts. This need for clear, short and easily understandable terms, in combination with the lack of terms, can easily lead to what I call The Fight for the Terms.

However, this fight need not be. It’s only in certain environments that it will occur. These environments are characterized by a strong belief in standardization. Standardizing the use of terms means to say “Well, in our organization, this term X is the name for the concept Y, and for nothing else.” By doing this, you indeed disambiguate the term, at least within the context of your organization, for you are promoting one interpretation for the term, while forbidding or neglecting every other one. But, as a direct consequence, people who happen to depend on any of the other concepts that used to be named by this standardized term are left on their own. They will have to find another term, which will very likely not be that clear and simple, or even create a description instead. This is undesirable. Moreover, it would be an unacceptably naïve and simplistic idea to assume that these other concepts are of no value to the organization, or that there is no legitimacy for their existence, and hence they should not be used. So what to do?

For solving communication problems caused by ambiguity, we need not rely on standardization alone. What we need to do is to harmonize the use of terms. There exist three different methods for harmonization in general (see figure 1). Standardization is only one of them, and it’s not appropriate in all situations. Still, I find that people often suggest this method as the way to go. Table 1 summarizes some pros and cons of standardization and loose coupling.

I believe we need some sort of mind shift here. Instead of being allergic to differences, trying to standardize everything like a person with a dirt obsession would continuously clean things, we should embrace the idea of local freedom to enable local optimization for local concerns, combined with global harmonization to put things together. This is what loose coupling is about. It results in flexibility, and it keeps the people involved from feeling threatened and creating roadblocks against change.
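One way to picture loose coupling of terminology is a mapping from local terms, each valid within its own context, to shared concepts. The sketch below is only an illustration under assumptions: the contexts, terms and concept names are invented, and a real CDM would of course hold far richer descriptions than a lookup table.

```python
# Each local context keeps its own meaning for a term, so homonyms are allowed.
# Contexts, terms and concept names are invented for illustration.
LOCAL_TO_CANONICAL = {
    ("sales", "customer"): "PartyPlacingOrder",
    ("support", "customer"): "PartyHoldingContract",
    ("finance", "debtor"): "PartyPlacingOrder",
}


def to_canonical(context, term):
    """Translate a locally used term into the shared concept it denotes."""
    return LOCAL_TO_CANONICAL[(context, term)]


def same_concept(a, b):
    """Two (context, term) pairs match only if they denote the same concept."""
    return to_canonical(*a) == to_canonical(*b)


# Different terms, same concept: harmonized without renaming anything locally.
synonyms_match = same_concept(("sales", "customer"), ("finance", "debtor"))

# Same term, different concepts: the homonym is detected, not forbidden.
homonyms_match = same_concept(("sales", "customer"), ("support", "customer"))
```

Nobody has to give up a familiar term. The harmonization lives in the mapping, which can grow bit by bit as agreements are reached.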

Standardization as a means to create harmony should not be put aside, however, because in certain cases, it remains a good and practical solution. We must learn to understand which harmonization method to use best in any specific type of situation. But, IMHO, the attitude should be that ‘it’s loose coupling, unless…’

Standardization of terms should be regarded as a challenge for a long term trajectory (anyone noticed an ambiguity here?) alongside the regular work that is best supported by the loose coupling mechanism. A canonical data model (CDM) is a means well suited to support such a trajectory, because it can take your standards as you create them, bit by bit, without affecting the communication partners internally. In addition to that, it can start to fulfill a number of other applications, like a data dictionary to support data reuse or a thermometer to provide you with an insight into the quality of your business language.

So, let’s not fight but co-operate, leaving each other the freedom needed for good performance.

Friday, September 29, 2006

EDA: are you polling for states or are you interrupted by events?

At NS, we are preparing our organization for Event Driven Architecture, or EDA. There’s a growing interest in this subject in the literature, but some discussions are still open. One of them is about what it is that we should be interested in, having adopted EDA as a main approach to our design of and thinking about information processes and the automated systems meant to support them.


There seems to exist an obscurity in the realm of Event Driven Architecture, or EDA. What are the building blocks of EDA? What should we be interested in, if we enter this event driven world? Are they states in our environment or are they things called ‘events’? Or are these two terms referring to the same thing?

It has been common practice for a long time to let states be the things that drive systems to act or react. People and other systems are oftentimes perceived to be reacting to states in their environment, once noticed, of course. If the traffic light is green (which is a state), you drive, right? And if you notice that a customer has ordered something (yet another state), you deliver. States inform you about appropriate actions to perform.

But there’s a drawback to the use of these states. Though states can tell you what to do, they don’t tell you when to react. Are other drivers blowing their horns because the traffic light has been green for a while before you accelerate? Did your customer turn to your competitor because you didn’t react quickly enough to her order?

If interpreted appropriately, events imply conditions in the environment just as states do. But events offer a bit more than just states. They include information about timeliness as well. It shouldn’t be your perception of a state in your environment that you react to or act upon. You should be reacting to perceived events. That’s what EDA is about, IMHO. You needn’t be polling your environment for states of interest to you. Instead, you should let yourself be interrupted by events in your environment.
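The difference between polling for a state and being interrupted by an event can be sketched in a few lines of Python. The `TrafficLight` class and its methods are hypothetical, written only to contrast the two styles; this is not how any particular EDA product works.

```python
class TrafficLight:
    """Holds a state and notifies subscribers at the moment it changes."""

    def __init__(self):
        self.state = "red"
        self._listeners = []

    def subscribe(self, listener):
        self._listeners.append(listener)

    def change(self, new_state):
        self.state = new_state
        # The event: every subscriber is interrupted right away.
        for listener in self._listeners:
            listener(new_state)


light = TrafficLight()
log = []


def poll():
    # Polling: you only learn the state at the moments you happen to ask.
    return light.state


# Event-driven: register once, then simply wait to be interrupted.
light.subscribe(lambda state: log.append(f"interrupted: light turned {state}"))

polled_before = poll()
light.change("green")
polled_after = poll()
```

The poller knows the light is green only because it happened to ask again. The subscriber was told at the moment of the change, which is the timeliness that a state alone does not give you.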

This story implies, however, that in order to fully leverage the events you perceive, you need to understand not only what these events themselves are about. You also need to understand events in their context, in terms of their implications. You must be able to infer the conditions they create. You must form an idea about what these events mean to you. This way, events inform you about what happened, when it happened, and what state resulted, thereby enabling you to select the appropriate reactions AND the timeliness needed to be agile. In other words: they give the what and the when.
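As a rough sketch of ‘the what and the when’, the fragment below models an event as a record of what happened and when, and infers the resulting state plus a simple timeliness check from it. The event names, the inference table and the two-hour threshold are assumptions made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    what: str        # what happened
    when: datetime   # when it happened


# Hypothetical inference rules: the state that results from each kind of event.
RESULTING_STATE = {
    "OrderPlaced": "order open",
    "OrderDelivered": "order closed",
}


def interpret(event, now, max_age):
    """Infer the resulting state and whether a reaction is already overdue."""
    state = RESULTING_STATE[event.what]
    overdue = (now - event.when) > max_age
    return state, overdue


placed = Event("OrderPlaced", datetime(2006, 9, 29, 9, 0))
state, overdue = interpret(
    placed,
    now=datetime(2006, 9, 29, 12, 0),
    max_age=timedelta(hours=2),
)
```

A bare state such as ‘order open’ could not tell you that three hours have already passed. The timestamp carried by the event is what makes the timeliness check possible.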

So if you ask me, I would be much more interested in events than in states, because events are more informative, providing you with the triggering logic you need to be agile.

This view comes with a consistent naming convention for events. I hope to present you with an insight into this naming convention soon.

Tuesday, September 12, 2006

Are you a Symbol Juggler?

People in organizations say they are convinced that their data are their primary business asset, or at least, a very important one. When examining the way people use this asset, a very different picture pops up. Saying your data are important is something completely different from acting according to that insight. One of the roadblocks that needs to be cleared is the symbol juggling practice you can find in all organizations.

Are you a Symbol Juggler?
There’s something tricky about data. What you see of them, as it happens, is not what they are all about. The only things you really see are symbols that represent the data in a certain context, presented on a monitor, on a piece of paper, or on some other medium.

But the data in fact ARE not symbols. They are propositions about objects you perceive in your reality. Data describe a world, whether it is real or fictional. It’s just unfortunate that we need symbols for visualizing these descriptions.

The way we use symbols to represent and present data to a perceiver may easily lead you astray. It tends to make you believe that you’re only acting upon some values (the symbols!) that are arranged in some way and in a specific context, like this table in this database. The idea that you are actually manipulating descriptions of some reality, descriptions that are supposed to be true, often doesn’t come to mind.

Although presentations of large data sets can be quite helpful to people now and then, for example for quick searches when analyzing troublesome data, they prevent us from constructing the semantics of the data in our heads. Our brains are just not powerful enough to handle large amounts of information with a deep understanding, especially when handling a data set over a significant period of time.

The danger of all this is that people using the data might make wrong decisions that don’t match reality or are not appropriate for the real situation at hand.

If you symbol-juggle instead of handling semantics carefully, you’re simply not aware of what you’re actually doing, and you might harm reality by describing a world that is not correct. As a consequence of this, you or someone else might end up making the wrong decisions.

People who do this are either mindless symbol jugglers who don’t know what they are doing, or evil manipulators who know it all too well! In my opinion, these evil types belong in jail, whereas the mindless jugglers should take a course on a topic like Data Awareness!

But that probably isn’t a complete solution to this problem. I think we can do better. Data Awareness can also be good advice for those who design user interfaces. I’m sure design guidelines can be found that help human users correctly handle the information that systems present to them. Any ideas?

I believe certain sorts of data maltreatment could be prevented if only people improved their sense of data and our systems were adapted to this juggling tendency. Because if there is one lesson to be learned from decades of automation, it would be that oftentimes creative users act according to a rule like this one:

IF I CAN DO THIS TRICK, I WILL DO THIS TRICK, REGARDLESS OF WHETHER OR NOT I SHOULD DO THIS TRICK!

Hey, she's got to do her work!

Yeah, that’s our juggler!

Tuesday, September 05, 2006

The Architect

In IT, we are more and more aware of the need for flexibility. In the systems integration domain, this is one of our main concepts. Sometimes there are very simple techniques to achieve a goal. One of them, often used but not often enough, is the subject of this posting.

The Architect
An architect, of whatever sort, is a remarkable person. She has a number of powerful tools at her disposal. The one that intrigues me the most is her ability to distinguish between things that, in her eyes, are different in nature and have different concerns. She can then conclude that you should separate these assets, and she can advise you as to how you should relate the separated assets, in order to enable you to continue your work, albeit better than before, of course.

This ability is highly underestimated. In my opinion, however, it might be her primary trick. The way I think of an architect is as a person who brings order, by preventing people who are too close to reality to keep an overview of it from throwing everything of interest to them onto one big heap.

Some people aren’t very good at considering what exactly they are doing and how, and if you are doing the actual work in any kind of domain, you tend to put everything you need within close range, especially when you don’t find the time now and then to sit back and reorganize. If you can’t do this, for whatever reason, you need an architect!

In my job, I frequently feel the need to separate. It has become second nature to me. Am I becoming an architect? Not so long ago, I didn’t have much affinity with these extraordinarily wise people, who lived on a cloud high above me, far out of reach. But this urge of mine to separate-and-relate makes me see them quite differently now. It feels like an elevator is bringing me closer to the heavens of the gods.

Some things REALLY ought to be separated. If not, they get you into trouble. You can probably think of some. Would you feel comfortable in a house without any internal walls to divide your toilet from your living room? How about a city with an airport next to your favorite restaurant? Do you keep your financial administration paperwork between your novels on the same bookshelf? Maybe you once did, as a student, and your father took the role of an architect, unasked, for your sake…

Though good architects can advise you beforehand to prevent you getting into trouble, it’s sad to see that they are most often consulted only when you’ve already been plagued for a long time.

Information Technology has its own nasty plagues. But it seems to be very difficult to get a grip on them, and make them explicit. Things get even more difficult when you start looking for their causes. Sometimes it’s better to take an architect’s intuition and indicate a solution-for-almost-everything. I know of such a medicine. I would recommend it to everyone in IT, especially those who have a say in this. So, if you’re experiencing any feelings of discomfort in your work, try this:

SEPARATE CONCEPTS FROM THEIR TECHNICAL REALIZATIONS!

I’m glad I did this while designing our data modeling world. I thought it wise to separate the conceptual aspects of data from the technical ones. And of course, wherever you separate, you should relate, or should I say re-relate. This is how the concept of the corridor was invented. Or, closer to IT, the concept of middleware. In our CDM, this re-relation is done by our third type of data model: The Realization Model (see also the picture in Where do you live and what do you do? -Why, does it matter?). Do you know of any other kind of re-relaters, things specialized in exactly this function? You know, the world is full of them, thanks to our friends on the clouds.

Friday, September 01, 2006

Where do you live and what do you do? –Why, does it matter?

Many organizations keep multiple data models, each one specialized for a specific use. These models are hardly maintained in harmony. This situation is unwanted: it results in a large number of overlapping but inconsistent models, while maintenance of these models is costly. This posting is intended to put forward a CDM as a possible solution to this problem. At least, it makes a start.

Where do you live and what do you do? –Why, does it matter?
Data live in many different houses. It’s where we put them: in external memories (in database tables or in data files) for data storage and retrieval, in message fields for data exchange, in user interface windows and on paper printouts for data presentation to human users, in microprocessor registers for direct data processing and in internal memory structures for indirect data processing. And I don’t pretend to be exhaustive here.

So, we can do many different things with data, and as a consequence of that, we put them in different places. For anything we can do with our data, one place is more suitable than another. Processing data directly from a hard drive? Better not. Storing data in internal memory? Risky. And automated systems read data from a piece of paper reliably and efficiently only with a lot of effort.

What’s the constant factor in this story? It’s the data themselves. Whatever we intend to do with them, wherever we put them, they still remain the same data. At least, their semantics remain the same. It’s only after some processing has been carried out on the data, that we can expect a change in semantics in the results.

There are, however, some aspects to our data that may very well be adapted to what we want to do with them or to where we want to put them. These aspects have nothing to do with data semantics. Instead, they are the citizens of the technical data model, describing the technical formats of the data. Data oftentimes take on a different format when moving to another house, just like you would put on a jacket when leaving home for work.
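A small sketch may help here. It takes one datum, a departure date, and lets it move between three houses: a message field, a database column and a screen. The three formats chosen below are assumptions for the sake of the example; the point is only that the format changes while the datum, once read back, is still the same.

```python
from datetime import date

# The datum itself: one value with one meaning, wherever it lives.
departure = date(2006, 9, 1)


def to_message_field(d: date) -> str:
    # Format assumed for data exchange: an ISO 8601 string.
    return d.isoformat()


def to_db_column(d: date) -> int:
    # Format assumed for storage: a day number (proleptic Gregorian ordinal).
    return d.toordinal()


def to_screen(d: date) -> str:
    # Format assumed for presentation to a human user: day-month-year.
    return d.strftime("%d-%m-%Y")


def from_message_field(text: str) -> date:
    return date.fromisoformat(text)


def from_db_column(number: int) -> date:
    return date.fromordinal(number)


# Moving house and back again leaves the datum unchanged.
same_after_message = from_message_field(to_message_field(departure)) == departure
same_after_storage = from_db_column(to_db_column(departure)) == departure
```

The jacket differs per house (a string, an integer, a differently ordered string), but both round trips give back exactly the date we started with.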

So where is all this leading? Well, this story contributed to the global design of our world of canonical data modeling. We make an explicit distinction between semantics and technical formats. We use different model types for them. Semantics are described in a conceptual model, whereas descriptions of technical formats can be found in a technical model. Because we model data living in a large number of houses, even houses of different types, we create multiple technical models.

But what about our conceptual models? Well, there is only one. For the data we want to describe, it doesn’t matter where they are, or what we intend to do with them. This one conceptual model supports many of our Applications of a CDM. For example, it eases correct data mapping (see app 2), and it gives a practical data catalogue to search for data living anywhere (see app 3).

Oh yes, there’s a third type of model involved, one that relates the semantics to the technical formats. It describes the relationships between the elements in the technical models and the elements in the conceptual model.
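The three model types can be pictured with a few small data structures. This is only a sketch under my own assumptions: the concept, the houses, the element names and the formats are invented for the example, and a real CDM would of course hold far more metadata.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Concept:
    """An element of the one conceptual model: semantics only."""
    name: str
    definition: str


@dataclass(frozen=True)
class TechnicalElement:
    """An element of one of the many technical models: format only."""
    house: str    # hypothetical database table or message type
    element: str  # column or field name
    fmt: str      # technical format in that house


CONCEPTUAL_MODEL: Dict[str, Concept] = {
    "DepartureDate": Concept(
        "DepartureDate", "The calendar date on which a train leaves a station."
    ),
}

TECHNICAL_MODELS: List[TechnicalElement] = [
    TechnicalElement("planning_db.TRIP", "DEP_DT", "DATE"),
    TechnicalElement("TripNotification", "departureDate", "xs:date"),
]

# Realization model: relates each technical element to the concept it realizes.
REALIZATION_MODEL: Dict[Tuple[str, str], str] = {
    ("planning_db.TRIP", "DEP_DT"): "DepartureDate",
    ("TripNotification", "departureDate"): "DepartureDate",
}


def where_does_it_live(concept_name: str) -> List[Tuple[str, str]]:
    """Data catalogue lookup: every (house, element) that realizes a concept."""
    return sorted(
        key for key, concept in REALIZATION_MODEL.items() if concept == concept_name
    )


houses = where_does_it_live("DepartureDate")
```

Two technical elements that point to the same concept are candidates for a data mapping, and the lookup itself is a tiny version of a data catalogue.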

Your data systems, whether they be business applications, databases, data warehouses, message types or whatever else your data can live in, should all have their own technical data model, all linked to your one-and-only CDM! This will help reduce cost and it will give your data models more value.

Friday, August 25, 2006

Applications of a Canonical Data Model

A little more than a year ago, NS (Dutch Railway Company) started initiatives for the implementation of a better, up-to-date system integration platform. In the beginning we were mainly exploring the realm of Enterprise Application Integration (EAI). Because IT flexibility was one of the main goals back then (and it still is!), we were also aiming at the use of a canonical data model (CDM) in order to loosely couple our applications on the data aspect. As time passed and system integration articles piled up on our desks (and throats got sore!), we became able to detach the idea of a CDM from its original use in a loose coupling mechanism. This opened doors to new worlds. New ways to apply a CDM popped up, and a better understanding of its nature arose. This second posting gives a brief overview of some of our results…

Applications of a CDM
The big insight was, at least to me, that a CDM is a data model, and although it’s a rather special type of data model, it’s NOT limited to a specific use. In our early enthusiastic days of systems integration, we learned from all sorts of literature, whether in books or on the Internet, that a CDM was a data model to be used as an intermediate data standard to loosely couple applications. A picture like the one below is probably well known to you (if you’re an integration-idiot like we are!). There is IMHO a lot more to gain from using a CDM in your organization! And, BTW, would you like to create and maintain data models specialized for each type of application you’re interested in? Think about systems integration, business intelligence, application development and the like… Probably not! And what to say about trying to maintain these multiple, overlapping models in concordance! Good Heavens no!

Picture: Vision of a CDM*, taken from Enterprise Service Bus by David Chappell (great book!), ISBN 0-596-00675-6 by O’Reilly
* In this picture, the CDM can be found in what is called the Canonical Message Format, because, as far as I know, most companies using a message format this way do the data modeling for this message format in the message format itself, mostly within some XSD schema document, and not in a separate data model like we do. So, in those cases, this canonical message format plays the role of a CDM as well as the role of a canonical message format proper (Canonical Message Model or CMM in our terminology).

It is my conviction that you should develop and maintain your CDM to support a large number of possible applications. It will be hard enough to create only ONE enterprise wide model!

When you work this one out a little bit further, you will probably find the requirements for these applications to be very much alike. It’s ‘just’ proper data modeling that forms the basis, and a few types of metadata will have to be added for specific use. I will publish more on this later. So, here’s my list:

If set up appropriately, a CDM can be used as
1. an intermediate data standard to achieve loose coupling on the data aspect
2. a commonly accepted business language for improving any kind of communication process
3. a data catalogue that supports data sharing
4. a data model catalogue to improve reuse of data models
5. a thermometer in your IT environment to find weaknesses and opportunities to improve your IT
6. a tracing tool for making impact analyses
7. a link between the data aspect and the organizational aspect of your organization to help set up registrations for authorization, data ownership, data maintenance and so on

I plan to further discuss these applications with you in the near future. Although some of them may sound a little exotic, the first three in particular are already of high interest to us right now.
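For the first application in the list, a little arithmetic and a toy example may illustrate why an intermediate data standard loosens the coupling. The sketch assumes that every application exchanges data with every other one in both directions; the two applications and their field names are invented for the example.

```python
from typing import Dict


def point_to_point_mappings(applications: int) -> int:
    # Every application maps directly to every other one, in each direction.
    return applications * (applications - 1)


def canonical_mappings(applications: int) -> int:
    # Every application maps only to and from the canonical format.
    return 2 * applications


# Hypothetical adapters around a canonical record with one field.
def crm_to_canonical(record: Dict[str, str]) -> Dict[str, str]:
    return {"customer_name": record["custName"]}


def canonical_to_billing(canonical: Dict[str, str]) -> Dict[str, str]:
    return {"NAME": canonical["customer_name"]}


# The CRM system and the billing system never see each other's format.
billing_record = canonical_to_billing(crm_to_canonical({"custName": "J. Jansen"}))
```

With 10 applications this comes down to 90 direct mappings against 20 mappings via the canonical format, and replacing one application only touches its own two adapters.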

Thursday, August 24, 2006

Definition of Canonical Data Model

As a way of introduction, this first posting will give you a rough insight into my vision of what a canonical data model is (or a ‘CDM’ for short).

Definition of a CDM
As I see it, and like its name would suggest, a CDM is a data model. Hence, it should give a vision, in data modeling terms, on a specific domain of interest. It could, for example, list

· what kinds of things are perceived as relevant to the domain at hand (and what we call them and how we define them)
· what sorts of information about these things are of interest to the domain
· how and where these information types are represented in our IT systems

In addition to this, a CDM might tell you, among other things
· who in your organization should be allowed to do what with these types (or who should not!)

The exact content of a CDM is of course strongly influenced by the applications you have in mind (see Applications of a CDM), but it seems to me that the above mentioned information is kind of basic.

However, not all data models that contain all of this information about a domain are a CDM. The addition of the term ‘canonical’ indicates that this model has a special status within an organization, or at least, it is expected to have such status. Whether it actually has is quite something else ;-)

This special status means that this model, or more specifically its content, is to a large degree accepted as a common data standard within the organization. There are, however, a lot of pitfalls to this simple statement, enough for its own series of blog postings and plenty of discussion.

For now it suits me to state that a CDM is a data model that is intended to be as commonly accepted as possible at any moment in time, within the limits of an accepted level of ambition and the assets available (for instance: time).

This simply means that your canonical model will not only grow in size, as time progresses, but also in quality and ‘acceptability as a common standard’. It seems very unrealistic to me to expect any model to be commonly accepted from its conception, not even a small model.

Lastly, IMHO, a CDM is intended to eventually cover the complete organization, and to become an enterprise wide data model.