Extensible Markup Language (XML)


What is XML?

What are the main XML features?

The main difference between XML and HTML

  1. Purpose:
    • XML for data: XML is primarily designed for storing and transporting data in a structured format. It focuses on describing the structure and content of data without specifying how it should be displayed.
    • HTML for web content: HTML is designed for creating web pages and specifying the structure and presentation of content within those pages. It defines elements and attributes that determine how content should be displayed in a web browser.
  2. Content and Structure:
    • XML defines data structure: XML allows users to define their own tags and structure to represent any kind of data, such as configuration data, documents, records, etc. It does not prescribe any specific tags or elements.
    • HTML defines web page structure: HTML is predefined with a set of tags and elements specifically tailored for creating web pages. These tags define the structure of a web page, such as headings, paragraphs, lists, tables, etc., and how content is displayed in a browser.
  3. Presentation vs. Data:
    • HTML focuses on presentation: HTML specifies how content should be displayed in a web browser. It includes tags for formatting text, adding images, creating links, etc.
    • XML separates content from presentation: XML describes the structure and meaning of data but does not dictate how it should be presented. Presentation of XML data is typically handled separately, often using XSLT (eXtensible Stylesheet Language Transformations) to transform XML into HTML or other formats for display.

HTML code (displaying information)

<h1>Book</h1>
<h2>New Perspectives of XML</h2>
<h3>Patrick Carey</h3>
<ol>Chapters
    <li>Creating an XML Document</li>
    <li>Binding XML Data with IE</li>
    <li>Creating a valid XML Document</li>
    <li>Working with namespaces and Schemas</li>
    <li>Working with Cascading Style Sheets</li>
    <li>Working with XSLT</li>
</ol>

XML code (describing information)

<book>
    <title>New Perspectives of XML</title>
    <author>Patrick Carey</author>
    <contents>
        <chapter>Creating an XML Document</chapter>
        <chapter>Binding XML Data with IE</chapter>
        <chapter>Creating a valid XML Document</chapter>
        <chapter>Working with namespaces and Schemas</chapter>
        <chapter>Working with Cascading Style Sheets</chapter>
        <chapter>Working with XSLT</chapter>
    </contents>
</book>
The HTML document has an h1, an h2, an h3 and an ol. The XML document has a book, a title, an author, a contents element and a list of chapters. The XML document does not provide any formatting; it is pure information wrapped in XML tags. Additional software is required to send, receive, process or display the XML.
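As an illustration of the "additional software" mentioned above, here is a minimal sketch in Python that reads a shortened version of the book document and prints its contents. It uses the standard library module xml.etree.ElementTree; the function name summarize_book and the two-chapter sample are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the book document (only two chapters, for brevity).
BOOK_XML = """<book>
    <title>New Perspectives of XML</title>
    <author>Patrick Carey</author>
    <contents>
        <chapter>Creating an XML Document</chapter>
        <chapter>Working with XSLT</chapter>
    </contents>
</book>"""


def summarize_book(xml_text):
    """Return (title, author, list of chapter titles) from a <book> document."""
    book = ET.fromstring(xml_text)
    title = book.findtext("title")
    author = book.findtext("author")
    chapters = [chapter.text for chapter in book.findall("contents/chapter")]
    return title, author, chapters


title, author, chapters = summarize_book(BOOK_XML)
print(f"{title} by {author}")
for number, chapter in enumerate(chapters, start=1):
    print(f"  {number}. {chapter}")
```

The program decides how the data is presented (here, a numbered list on the console); the XML itself only says what each piece of data is.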

Both use tags. However, HTML tags are predefined. With XML you can define your own tags.

HTML is designed to display data in a web page. Therefore, HTML tags specify how data should be displayed (e.g. display this data as a heading with h1, h2 etc.).

XML is designed to describe data. There are no built-in presentation features. Therefore, XML uses tags to identify a piece of data rather than to specify how to display it (e.g. <author></author>).

XML is extensible

We can extend our XML document to carry more information (e.g. add more chapter elements or define new tags).

In the example above, the tags <title> and <author> do not come from any standard; we invented them. Although title is an existing HTML tag, in XML it means something different.
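A small sketch of what extensibility means in practice, assuming Python's standard xml.etree.ElementTree module: a reader that only knows about <title> keeps working after the document gains a new element. The <publisher> tag and its value are hypothetical, invented for this example.

```python
import xml.etree.ElementTree as ET


def read_title(xml_text):
    """A reader that only knows about the <title> element."""
    return ET.fromstring(xml_text).findtext("title")


ORIGINAL = (
    "<book><title>New Perspectives of XML</title>"
    "<author>Patrick Carey</author></book>"
)

# Extend the document with a tag of our own (hypothetical <publisher>).
book = ET.fromstring(ORIGINAL)
publisher = ET.SubElement(book, "publisher")
publisher.text = "Example Press"
EXTENDED = ET.tostring(book, encoding="unicode")

print(EXTENDED)
# The old reader ignores the new element and still finds the title.
print(read_title(ORIGINAL) == read_title(EXTENDED))
```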

Use of XML

XML is used to store data, share data and transport data.

XML can be used to create new languages

XML was used to create:

Associated Technologies

XML has spawned a wide range of associated technologies that complement its capabilities and enable various functionalities. Here are some key XML-associated technologies:

An example XML document

XML documents use a self-describing and simple syntax.


  <?xml version="1.0" encoding="UTF-8"?>
  <book>
      <title>New Perspectives of XML</title>
      <author>Patrick Carey</author>
      <contents>
          <chapter>Creating an XML Document</chapter>
          <chapter>Binding XML Data with IE</chapter>
          <chapter>Creating a valid XML Document</chapter>
          <chapter>Working with namespaces and Schemas</chapter>
          <chapter>Working with Cascading Style Sheets</chapter>
          <chapter>Working with XSLT</chapter>
      </contents>
  </book>
  

The first line of the document, the XML declaration, defines the XML version and the character encoding used in the document. In this case the document conforms to the XML 1.0 specification and uses UTF-8. UTF-8 (Unicode Transformation Format-8) is a variable-width character encoding capable of encoding every character defined by Unicode. It is widely used on the internet and in computer systems to represent text, and it supports a wide range of languages, scripts and alphabets.

The following two declarations are equivalent, because XML parsers assume UTF-8 by default when no encoding is declared:

<?xml version="1.0"?>

<?xml version="1.0" encoding="UTF-8"?>
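The equivalence of the two declarations can be checked with a short sketch, assuming Python's standard xml.etree.ElementTree parser. The author name with non-ASCII letters is a hypothetical value chosen so that the encoding actually matters.

```python
import xml.etree.ElementTree as ET

# Hypothetical element content containing non-ASCII characters.
body = "<author>Zoë Müller</author>"

# The same document stored as UTF-8 bytes, with and without the
# encoding attribute in the XML declaration.
without_encoding = ('<?xml version="1.0"?>' + body).encode("utf-8")
with_encoding = ('<?xml version="1.0" encoding="UTF-8"?>' + body).encode("utf-8")

name_a = ET.fromstring(without_encoding).text
name_b = ET.fromstring(with_encoding).text

# Both documents decode to the same text.
print(name_a == name_b)
```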

The next line describes the root element of the document (it is like saying: "this document is a book"):

<book></book>

The next 10 lines describe the 3 child elements of the root (title, author and contents) and the 6 sub-elements of contents:

      <title>New Perspectives of XML</title>
      <author>Patrick Carey</author>
      <contents>
          <chapter>Creating an XML Document</chapter>
          <chapter>Binding XML Data with IE</chapter>
          <chapter>Creating a valid XML Document</chapter>
          <chapter>Working with namespaces and Schemas</chapter>
          <chapter>Working with Cascading Style Sheets</chapter>
          <chapter>Working with XSLT</chapter>
      </contents>

The next line describes the end of the root element:

</book>

Reading the XML document, we can tell that it contains a book with the title New Perspectives of XML, written by the author Patrick Carey, with a list of chapters.

XML elements

XML well-formed rules

XML documents must follow a set of syntax rules to be well-formed. Here are the key rules of XML:

In addition, element names must be valid.
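A parser will reject any document that is not well-formed. The sketch below, assuming Python's standard xml.etree.ElementTree module, tries a few commonly cited rules (matching closing tags, proper nesting, case sensitivity, a single root element, quoted attribute values). The helper name is_well_formed and the sample documents are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if it raises ParseError."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


samples = {
    "properly nested": "<book><title>XML</title></book>",
    "missing closing tag": "<book><title>XML</book>",
    "overlapping elements": "<book><title>XML</book></title>",
    "case mismatch": "<book></Book>",
    "two root elements": "<book/><book/>",
    "unquoted attribute": "<book id=1/>",
}

# Only the first sample is well-formed; the rest break one rule each.
for label, text in samples.items():
    print(f"{label}: {is_well_formed(text)}")
```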

Element naming

Make tag names descriptive: they should describe the data they contain.

Element names can be as long as you like, but do not exaggerate. Names should be short and simple, like this: <book_title></book_title>, not like this: <the_title_of_the_book></the_title_of_the_book>

Valid XML

Valid XML refers to XML documents that are well-formed and also conform to the structure declared in a DTD (Document Type Definition) or schema:

XML - DTD Elements

Example - DTD

<!ELEMENT shoplist (item+)>
<!ELEMENT item (#PCDATA)>

The DTD defines an element called shoplist containing another element called item. How many items? One or more, as the + indicates.

The item element contains text (#PCDATA, parsed character data).
Element declarations, <!ELEMENT ...>, make up the main block of the DTD.
The <!DOCTYPE ...> declaration introduces an internal or external DTD.

XML - DTDs embedded

DTDs are introduced into XML documents by using the <!DOCTYPE> declaration.

A DTD can be embedded in an XML document, e.g.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE shoplist [
    <!ELEMENT shoplist (item+)>
    <!ELEMENT item (#PCDATA)>
]>
<shoplist>
    <item>Bread</item>
    <item>Butter</item>
</shoplist>
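An embedded DTD travels with the document, but not every parser enforces it. As a hedged illustration, Python's standard xml.etree.ElementTree parser reads the DOCTYPE block yet only checks well-formedness, so it also accepts a shoplist with no item at all. Checking validity needs a validating parser, which is outside this sketch.

```python
import xml.etree.ElementTree as ET

# Internal DTD shared by both sample documents.
DTD = """<!DOCTYPE shoplist [
<!ELEMENT shoplist (item+)>
<!ELEMENT item (#PCDATA)>
]>
"""

valid_doc = DTD + "<shoplist><item>Bread</item><item>Butter</item></shoplist>"
# Well-formed, but not valid: (item+) requires at least one <item>.
invalid_doc = DTD + "<shoplist></shoplist>"

for doc in (valid_doc, invalid_doc):
    root = ET.fromstring(doc)  # no error: this parser does not validate
    print(root.tag, [item.text for item in root.findall("item")])
```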
  

XML - DTDs external

The PUBLIC type locates a DTD in a known repository of DTDs.

The SYSTEM type locates a DTD via a URL, e.g.:

<?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE shoplist SYSTEM "shoplist.dtd">

    <shoplist>
      <item>Bread</item>
      <item>Butter</item>
    </shoplist>
  

Specifying sequence of elements

Specifying sequence of elements with: comma

Comma means?

, (comma): Separates child elements within parentheses and specifies the order in which they must appear within the parent element.

    <!ELEMENT module (code, name)>
    <!ELEMENT code (#PCDATA)>
    <!ELEMENT name (#PCDATA)>
    <module>
      <code>4COSC011</code>
      <name>Web Design and Development</name>
    </module>
  

Specifying sequence of elements with: + ? *

+ (plus): Specifies that the preceding element can appear one or more times.

? (question mark): Specifies that the preceding element is optional and can appear zero or one time.

* (asterisk): Specifies that the preceding element can appear zero or more times.

      <!ELEMENT filmlist (film+)>
      <!ELEMENT film (title+, year?, actor*)>
      <!ELEMENT title (#PCDATA)>
      <!ELEMENT year (#PCDATA)>
      <!ELEMENT actor (#PCDATA)>

      <filmlist>
        <film>
            <title>Poor Things</title>
            <year>2023</year>
            <actor>Emma Stone</actor>
            <actor>Willem Dafoe</actor>
            <actor>Vicki Pepperdine</actor>
            <actor>Ramy Youssef</actor>
        </film>
        <film>
            <title>Spaceman I</title>
            <title>Spaceman II</title>
        </film>
      </filmlist>
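The comma, +, ? and * operators behave much like a regular expression over the names of the child elements. The sketch below makes that analogy concrete in Python. It is not a DTD validator: the helper children_match and the hand-written pattern FILM_MODEL are assumptions for illustration, and the last two film elements are extra test cases that break the content model on purpose.

```python
import re
import xml.etree.ElementTree as ET


def children_match(element, pattern):
    """Match the element's child tag names, in order, against a regex.

    Every child name is followed by one space, so "title title year "
    stands for two <title> children followed by one <year>.
    """
    names = "".join(child.tag + " " for child in element)
    return re.fullmatch(pattern, names) is not None


# Hand-written regex counterpart of the DTD model (title+, year?, actor*).
FILM_MODEL = r"(title )+(year )?(actor )*"

filmlist = ET.fromstring("""<filmlist>
  <film><title>Poor Things</title><year>2023</year><actor>Emma Stone</actor></film>
  <film><title>Spaceman I</title><title>Spaceman II</title></film>
  <film><year>2023</year><title>Poor Things</title></film>
  <film><actor>Emma Stone</actor></film>
</filmlist>""")

results = [children_match(film, FILM_MODEL) for film in filmlist]
print(results)  # [True, True, False, False]
```

The third film fails because year comes before title (the comma fixes the order), and the fourth fails because title+ requires at least one title.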
    

Specifying sequence of elements with: |

| (pipe): Specifies that exactly one of the listed elements can appear (like or).

      <!ELEMENT dessert (cream | fruit)>
      <!ELEMENT cream (#PCDATA)>
      <!ELEMENT fruit (#PCDATA)>

      <!-- A dessert holds either cream or fruit, not both -->
      <dessert>
        <fruit>Strawberries</fruit>
      </dessert>

Elements

() (parentheses): Specifies a group.

 <!ELEMENT first_course ((bread, soup) | starter)>

A heading is made up of a title and an address in any order

 <!ELEMENT heading ((title, address) | (address, title))>
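Choice and grouping can be read the same way as alternation and groups in a regular expression over child element names. The following Python sketch illustrates that reading; it is not a DTD validator. The helper children_match, the regex patterns and the sample fragments are assumptions written by hand for this example.

```python
import re
import xml.etree.ElementTree as ET


def children_match(xml_text, pattern):
    """Match the root's child tag names (each followed by a space) against a regex."""
    root = ET.fromstring(xml_text)
    names = "".join(child.tag + " " for child in root)
    return re.fullmatch(pattern, names) is not None


# Hand-written regex counterparts of the DTD content models.
DESSERT = r"(cream |fruit )"                     # (cream | fruit)
FIRST_COURSE = r"((bread soup )|starter )"       # ((bread, soup) | starter)
HEADING = r"((title address )|(address title ))"  # ((title, address) | (address, title))

checks = [
    # One alternative only: accepted.
    children_match("<dessert><cream>Vanilla</cream></dessert>", DESSERT),
    # Both alternatives together: rejected.
    children_match(
        "<dessert><cream>Vanilla</cream><fruit>Strawberries</fruit></dessert>",
        DESSERT,
    ),
    # The group (bread, soup) counts as a single alternative.
    children_match("<first_course><bread/><soup/></first_course>", FIRST_COURSE),
    # Half of one group plus the other alternative: rejected.
    children_match("<first_course><bread/><starter/></first_course>", FIRST_COURSE),
    # Either order of title and address is accepted.
    children_match("<heading><address/><title/></heading>", HEADING),
]
print(checks)  # [True, False, True, False, True]
```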

An empty element

 <!ELEMENT img EMPTY>

In the XML document:

 <img/>

Recap

Note that the syntax for writing comments in XML is the same as in HTML:
<!-- This is a comment -->