The Benefits of Separating Content and Semantics

Note: The following article assumes you understand HTML5 and CSS3.

Your multilingual web site is ready to go in 49 languages. Wait, an article has an h3 element that should be an h2. That’ll be 49 HTML changes to make. This just in: a new image for devices with a max-width of 480px is ready. That’s only one CSS change to make. Too bad the latest responsive design techniques will slow everything down, because a third picture now gets downloaded even though only one is displayed per device.

I've been pondering the above multilingual site situation since HTML 3.2 in 1997 and the responsive design issue for about two years. I believe the best way to solve both issues is to separate the content from the semantics.

 

What’s Content and What’s Semantics

<h1>Main Heading</h1>
<h2>Sub Heading</h2>
<p>This is some text on the page.</p>
<p>
   <img src="/funnycat.jpg" alt="Text that some browsers convey to help their users" />
</p>

In this example the h1, h2, p, and img elements provide the semantics. Semantics tell the browser, and search engines, what type of content is being conveyed to the user. Everything inside the h1, h2, and p elements is content. The values of the img element attributes src and alt are also content. Content is what we are conveying to the user.

We can clearly see above how the content and semantics are woven together. No matter how we prettify our code with indenting and text coloring to help us see each part separately, they remain woven together.

In plain HTML web sites, content is typically separated by language by organizing each language’s pages into separate directories. An article will have duplicate HTML in each language directory, with only the content changed for that language. Keeping the web site consistent means that an HTML change to one page must be repeated in each of the article’s alternative language pages.
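
As a rough illustration of that duplication, here is how the sub heading and paragraph from the earlier example might look in two language directories. The /en/ and /fr/ paths, the file name, and the French wording are assumptions made up for this sketch.

```html
<!-- /en/article.html (hypothetical path) -->
<h2>Sub Heading</h2>
<p>This is some text on the page.</p>

<!-- /fr/article.html (hypothetical path): identical elements, translated content -->
<h2>Sous-titre</h2>
<p>Voici du texte sur la page.</p>
```

Changing the h2 to an h3 means editing both files, even though none of the content changed.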

My search for a CMS that separated content and semantics at the HTML level turned up nothing. A CMS may have a field to record what language a page is written in. Some can link an article in one language directly to the same article in another language. But within their databases they still keep the content and semantics woven together. Even when using a CMS, changing an unordered list to an ordered list requires going to that article in each language to change the HTML. A five-language web site has five articles to change.

Separation of content by device is done with CSS media queries and/or JavaScript. This works well for separating layout and style from the semantics. However, there are limitations for content. The same content, woven into the semantics, is served to every device. CSS and/or JavaScript can then hide or resize some of that content. Hidden content takes up valuable bandwidth and time but goes unseen, and content scaled down in the browser is still downloaded at full size. As users continue to move to mobile devices, which generally have slower connections, it’s becoming more important to keep bandwidth use to a minimum.
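
A minimal sketch of that device-based approach follows. The three image file names, the alt text, and the 1024px tablet breakpoint are assumptions for illustration; only the 480px breakpoint comes from this article.

```html
<style>
   /* Wide screens (default): show only the large image. */
   .phone, .tablet { display: none; }

   /* Tablets (assumed breakpoint): swap the large image for the medium one. */
   @media all and (max-width: 1024px) {
      .screen { display: none; }
      .tablet { display: inline; }
   }

   /* Phones: swap the medium image for the small one. */
   @media all and (max-width: 480px) {
      .tablet { display: none; }
      .phone { display: inline; }
   }
</style>

<p>
   <img class="phone" src="/funnycat_small.jpg" alt="A funny cat" />
   <img class="tablet" src="/funnycat_medium.jpg" alt="A funny cat" />
   <img class="screen" src="/funnycat_large.jpg" alt="A funny cat" />
</p>
```

Only one image is visible at any width, but because all three img elements are in the markup, a browser will typically still request all three files.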

We need to support alternate versions of all content types (e.g., text, images, video, and tables) based on language queries, media queries, or both. Different languages need different text, but you may also want to show different article or paragraph lengths on different display sizes in the same language. Images and videos not only need to be the right size; those with text or speech also need to be in the right language.

A Proposed Method to Separate Content and Semantics

<html lang="en">

This existing HTML attribute indicates the language of the current document, in this case English, and provides compatibility with current browsers. Once all major browsers support this proposal, the lang attribute could be left out and the page itself would have no content in it.

@media all and (max-width: 480px) and @language fr:(catalog: href="/catalogs/fr/french.cat") {
   article {
      padding: 5px;
   }
}

The above query example combines a media query and a language query, but you could use either query on its own. Notice that the language query links to a catalog file containing the French language content. More about catalog files in a moment. The current method of using CSS media queries remains unchanged.
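
To illustrate using either query by itself, the sketch below splits the combined rule in two. The first rule is ordinary CSS. The second is an extrapolation of the proposed @language syntax from the combined example; it is only an assumption about how a standalone language query might be written, and no current browser understands it.

```html
<style>
   /* Media query only: standard CSS, unchanged by the proposal. */
   @media all and (max-width: 480px) {
      article {
         padding: 5px;
      }
   }

   /* Language query only: PROPOSED syntax, assumed from the combined
      example. Current browsers would ignore or reject this rule. */
   @language fr:(catalog: href="/catalogs/fr/french.cat") {
      article {
         padding: 5px;
      }
   }
</style>
```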

<article entry="example">
   <h1>English title</h1>
   <p>
      <img class="phone" src="/example_en_small.png" alt="Waist up picture of person holding an English sign" />
      <img class="tablet" src="/example_en_medium.png" alt="Full body picture of person holding an English sign" />
      <img class="screen" src="/example_en_large.png" alt="Picture of office with person holding an English sign" />
   </p>
   <p>Current browsers will show this English content. Notice, there are three img elements. All three images are downloaded and CSS media queries use the class attribute to hide two images and leave only the proper sized image visible. Browsers that understand the new code will replace this content with the appropriate alternative content.</p>
   <p>HTML elements are allowed inside the catalog file. This gives the developer the choice to replace the entire article element, as done here, or each of its children individually.
   </p>
</article>

In the above example you’ll see my article element has a new attribute, entry. If the French language and small screen size query conditions are met, the browser downloads the /catalogs/fr/french.cat file, looks up the "example" entry, and replaces everything inside the article element with the following content from that entry:

example {
   <h1>Titre français</h1>
   <p>
      <img class="phone" src="/example_fr_small.png" alt="Photo d'une personne, cadrée à la taille, tenant une pancarte en français" /> Les navigateurs actuels afficheront ce contenu anglais. Remarquez qu'il y a trois éléments img. Les trois images sont téléchargées, et les requêtes média CSS utilisent l'attribut class pour masquer deux images et ne laisser visible que l'image de la bonne taille. Les navigateurs qui comprennent le nouveau code remplaceront ce contenu par le contenu alternatif approprié.
   </p>
   <p>Les éléments HTML sont autorisés à l'intérieur du fichier catalogue. Cela donne au développeur le choix de remplacer l'élément article en entier, comme c'est le cas ici, ou chacun de ses enfants individuellement.
   </p>
}

The page now conveys the proper French language content, and only one image of suitable content and size for the device gets downloaded. If we need to change the article element itself, we only have one place to make the change. In this example, however, adding a class to the h1 element would require us to change it in each language’s catalog file. For this reason my recommended practice is to use entry attributes on each child block-level element. Let’s see how that looks:

<article entry="example">
   <h1 entry="example_h1">English title</h1>
   <p entry="example_p1">
      <img class="phone" src="/example_en_small.png" alt="Waist up picture of person holding an English sign" />
      <img class="tablet" src="/example_en_medium.png" alt="Full body picture of person holding an English sign" />
      <img class="screen" src="/example_en_large.png" alt="Picture of office with person holding an English sign" />
   </p>
   <p>Current browsers will show this English content. Notice, there are three img elements. All three images are downloaded and CSS media queries use the class attribute to hide two images and leave only the proper sized image visible. Browsers that understand the new code will replace this content with the appropriate alternative content.
   </p>
   <p entry="example_p2">HTML elements are allowed inside the catalog file. This gives the developer the choice to replace the entire article element or each of its children individually.
   </p>
</article>

By keeping the entry attribute on the article element we keep the option of replacing this entire block of code at once. Adding entry attributes to the child block level elements increases our content control, allowing us to add an attribute to the h1 element by changing only the one html file.

   <h1 entry="example_h1" class="important">English title</h1>

Browsers replace element content starting at the body element and proceed down through each level of child elements. So if our catalog file has alternative content for both the article element and its child h1 element, the article element is changed first, followed by the child h1 element. Both changes therefore take place.
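
For the child-level entries above, the French catalog file might look like the sketch below. The article only shows a catalog entry for the whole article element, so this assumes that child entries use the same name { ... } form and hold just the inner content of their element. The French wording is illustrative.

```html
example_h1 {
   Titre français
}

example_p1 {
   <img class="phone" src="/example_fr_small.png" alt="Photo d'une personne, cadrée à la taille, tenant une pancarte en français" />
}

example_p2 {
   Les éléments HTML sont autorisés à l'intérieur du fichier catalogue.
}
```

With entries like these, the h1 element and its attributes live only in the HTML file, so adding class="important" touches one file and no catalog.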

The New Catalog File

Adding additional languages becomes a matter of translating the catalog files and adding the appropriate language queries. You can also combine language and media queries to serve different versions of the same language to different devices. This allows you to tailor not only your style and layout but also page terminology and length to each device. I can only speculate about how and why smartphone users might convert better or worse than other users depending on article length and terminology, but I’m sure marketing and SEO experts would love to explore this option.

The new catalog file format, with a MIME type of text/catalog and a .cat extension, would hold the alternative content in external files. Other ways to supply language query content could be offered much as CSS does, with @import, head blocks, and inline methods.

From the Present to the Future

Overall, CSS has been a big success at separating the semantics from the style and layout. Now we need to start the next leg of our journey and separate the semantics from the content. There seems to be resistance to ever closing the HTML5 standard, but I believe my proposal would be a large enough structural change to justify a major version change to HTML6.

<!DOCTYPE html version="6">

The possible methods and syntax for doing this are many and need to be debated and experimented with. This is a huge undertaking with many aspects to consider. A couple of examples: is there a need for entry attributes on inline elements, and do we need a syntax to add or edit attributes on elements? As always, if we are going to do this we need to do it right the first time.

My goal here is to get more people thinking about these issues and how to best move the web forward. Responsive design isn’t about responding to the needs of a user’s device. It’s about responding to the needs of a user.

Author Bio

Tim Trepanier [http://www.trep4.com] is a retired web master of ABB Inc. [http://www.abb.com]. He fondly remembers when the img element was state-of-the-art HTML that his browser couldn't display. Now he is pondering the future when IPv6 doesn't have enough addresses for us to integrate with the IFS (Intergalactic Federation of Sentients).