Wednesday, January 17, 2007

Data Formats, Semantics and Meanings in a CDM

In earlier blog postings, I’ve argued that in order to enable powerful appliances for a canonical data model, it will have to contain semantic data descriptions next to the mere technical formatting descriptions that can be found in almost any data model. This idea has yet to become commonplace. One cause for our reluctance to incorporate semantic descriptions probably is the big melting pot of buzz words that seems to inevitably have to accompany each new idea in IT (SOA is another one!). The idea of a CDM comes with its own pot, of course, thereby creating a lot of confusion. This posting is meant to give some essential terms a place, hopefully increasing people’s understanding of the world of the CDM.

Data Formats, Semantics and Meanings in a CDM
This one melting pot contains, among others, such terms as data semantics, data meaning and data format. What’s their significance, and how should they be applied?

One important notion is, that data have a whole lot of characteristics. Therefore, you can find a number of different types of data models, each providing you with its own set of metadata, data describing your business data. You can, for example, think of the data’s data types, their field lengths (sounds familiar this far?) and their semantics, their owners and users, etcetera. A simple organizing model that you might find useful is the one below.


From this little model, it can be read that data
1. are communicated to perceivers
2. refer to things in a ‘world’
3. are represented (stored and/or presented) in information systems

These three aspects are fundamental to any type of data or data use. This view can shed light on the way our terms can be used best. For instance, each fundamental can be associated with one of our three buzz words. Here’s the same picture, now with the buzz words filled in.


So, according to this model, the perception of data is linked to data meaning, the data’s reference to a world is associated with data semantics, and data representations have something to do with data formatting. Let’s delve a little deeper into the nature of these associations.

Technical Formatting
This probably is the only aspect of data that most of us are familiar with. It results from the fact that data only ‘exist’ in our world if they are represented somewhere physically, in a computer system’s memory, on a piece of paper, in a brain or in whatever thing that can hold or represent data. In this light it’s not surprising that almost all data models we have are directed to this aspect of the data, giving insight to their data types, data masks, units of measurement, and the like.

Technical formatting can be divided into two parts: one to describe those data structure characteristics that are independent from the physical systems that actually represent or hold the data, and one that covers physical implementations in specific physical systems. All abovementioned formatting attributes are of the system-independent type. Examples of physical implementation attributes are endianism and byte word length.

It’s important to realize that formatting has no impact whatsoever on semantics. A specific semantics can be realized technically in a vast number of ways. As an example, metrical objects (a matter of semantics) can be realized using numerical data types, but also by applying non-numerical types like strings.

Semantics
It’s too bad that, unlike data formatting, data semantics are so invisible and intangible to us, because this is really what data are all about. Data semantics constitute a world of concepts (ideas, which we cannot see), rather than of symbols. Symbol juggling is probably due to this invisibility, giving rise to a number of problems related to the way we treat our data.

At this point, it’s important to have a good understanding of what data actually are. As you can make up from the symbol juggling posting, they are not the symbols that we see on our computer screens, for instance. Let’s define data as being statements about objects in a ‘world’, whether true or false. This ‘world’ can be either real or imaginative. The objects are whatever ‘things’ that are relevant to an intelligent actor in, or perceiver of that world. So, data always describe small parts of situations in that world. Data semantics, then, should be conceived of as the content of these statements, that what is being stated about that object in that world. This is the reference-to-things-in-a-world aspect of data.

Meaning
Just like data semantics, meaning in itself is invisible. Meaning is, in my view, defined roughly as ‘the consequence of the perception of an event’. Although I use this definition in a broader, more philosophical sense, in this discussion it’s restricted to the world of psychology: When you perceive an event that happens, you extract information (data) from it in the form of semantics, among other things like tone of voice or timing. Then, on the basis of all information extracted, your brain starts to construct meaning by activating inference processes. This meaning you construct is what the event means to you, now. Hence, meaning is extremely depending on the availability and accessibility of knowledge in an intelligent object (not necessarily a human being!) at a very specific moment in time. It’s clear that meaning understood in this way cannot be modeled formally, not even in the most sophisticated CDM we can think of. Still, this concept is relevant in the domain of the CDM, partly because it plays a role in communication processes. Next to that, it can be used to make clear that this is NOT what we’re trying to model in our CDM.

Conclusions
Now, to conclude, data formatting is what you describe in a technical data model (logical and maybe also physical, if need be), data semantics is what you clarify in a semantical (or ‘conceptual’) data model, and data meaning is what I hope you’ll never try to formalize in any kind of data model, but what you should keep in mind in order to create really effective communication processes. Although effective communication strongly leans on shared data semantics, it really is much more than that, resulting from the fact that its effectiveness is not so much measured in terms of a mutual understanding of the things stated as it is in terms of its effect on the listener being in accord with the speaker's intention. Communication is, after all, a means for an actor to indirectly create an effect in a perceiver.

By the way, for our CDM we’re using the model types briefly described in this posting.

Friday, January 12, 2007

I’m Starting to Feel Sick Now

It made me sad, you know. I recently read an article written by an accepted expert in the field, about ‘What Everyone Needs to Know about SOA’. The author wrote about the value of reusability of software functionality and described how we’ve been improving the way functional components can be ‘picked up’ for reuse.

A nice story indeed, but, given the header of the article, which promises to explain the major drawbacks of the use of Service Oriented Architecture, things everyone needs to know about, IMHO the author refuses to mention the possibly one most important issue left.

Even if you can really easily pick up your functional components and use them in a new context (and SOA might make this possible), if you don’t have a way to agree on exactly WHAT functionality you need and exactly WHAT functionality is offered, you will not be able to leverage the possibilities of reuse whatsoever.

The article I read gives a highly technical view on SOA, however the author added some catches, challenges and advice from a broader perspective. But still, no reference to data semantics, to the idea of WHAT data exactly to exchange. We see this happen in so many writings and presentations about systems integration in general and SOA in particular. And it makes me sad, because every expert in the field can at least see and understand the need for semantic agreement in any kind of collaboration, or am I wrong? For how long have we been connecting systems now?

Why then does almost everyone in the field neglect this subject completely? How can it be that an article about What Everyone Needs to Know about SOA lacks even a simple reference to this matter?

I accuse SOA vendors of making big promises while intentionally leaving their customers alone with this important issue. Not one of the vendors I spoke to could be of any significance here, for example by giving insight in what it’s all about, or by offering powerful semantic tooling functionality. So, I undertook this endeavor myself, not being a complete greenhorn in the matter myself. But this practice is not helping SOA to be accepted and leveraged!

I can’t state it often enough: Reuse of functionality and communication in general can be of any value only if both the HOW and the WHAT are taken into account. You will need semantic descriptions. Be prepared for this! That’s a big part of what this web log is about.

You know, in the Dutch language, ‘SOA’ also stands for the range of sexually transmittable diseases. And being in this world of SOA for some time now, I am starting to feel bad now…

If you want to read the specific article yourself, it’s located
here.