eXtensible Markup Language (XML)
What is XML?
- XML stands for eXtensible Markup Language.
- Became a W3C Recommendation in 1998.
- Is a subset of Standard Generalized Markup Language (SGML)
- XML is a meta-markup language
- we can define rules for encoding documents;
- we can include metadata to describe the structure and content of the data within the document.
- XML is used for data interchange between different systems.
What are the main XML features?
- Extensibility: Users can define custom tags and structures to suit their specific needs, making XML adaptable to various data formats and applications.
- Self-descriptive: XML documents contain metadata that describes the structure and meaning of the data they contain, making them self-descriptive and facilitating understanding and processing by both humans and machines.
- Hierarchical structure: XML organizes data in a hierarchical structure using nested elements or tags, allowing representation of complex relationships between data entities.
- Platform independence: XML documents can be created, read, and processed on different operating systems and platforms, ensuring platform independence.
- Interoperability: XML facilitates data exchange between different systems and platforms, enabling seamless communication and integration by adhering to standardized syntax and structure.
- Well-defined syntax: XML has a clear and well-defined syntax based on angle brackets (< >) for marking up elements and tags, making it easy to parse and process by software applications.
- Support for Unicode: XML supports Unicode character encoding, ensuring internationalization and multilingual support.
- Validation: XML documents can be validated against a defined schema to ensure their structure and content adhere to predefined rules and constraints.
- Transformation: XML documents can be transformed into different formats using technologies like XSLT (eXtensible Stylesheet Language Transformations), enabling the presentation and transformation of data for various purposes.
- Querying: XML documents can be queried using technologies like XPath and XQuery, allowing selective retrieval of data based on specific criteria.
- Serialization: XML documents can be serialized into a text-based format for storage or transmission over networks, and deserialized back into their original hierarchical structure when needed.
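As a rough illustration of the hierarchical structure and serialization features listed above, the following Python sketch uses the standard library module xml.etree.ElementTree to build a small tree, turn it into text, and parse that text back. The names order, item and id are invented for this example.

```python
import xml.etree.ElementTree as ET

# Build a small hierarchy in memory (element and attribute names are made up).
order = ET.Element("order", {"id": "42"})
item = ET.SubElement(order, "item")
item.text = "Notebook"

# Serialize: the tree becomes plain text that can be stored or sent over a network.
xml_text = ET.tostring(order, encoding="unicode")
print(xml_text)

# Deserialize: parsing the text rebuilds the same hierarchy.
restored = ET.fromstring(xml_text)
print(restored.tag, restored.attrib["id"], restored[0].text)
```

Because the serialized form is plain text, any system with an XML parser can rebuild the same structure.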
The main difference between XML and HTML
- Purpose:
- XML for data: XML is primarily designed for storing and transporting data in a structured format. It focuses on describing the structure and content of data without specifying how it should be displayed.
- HTML for web content: HTML is designed for creating web pages and specifying the structure and presentation of content within those pages. It defines elements and attributes that determine how content should be displayed in a web browser.
- Content and Structure:
- XML defines data structure: XML allows users to define their own tags and structure to represent any kind of data, such as configuration data, documents, records, etc. It does not prescribe any specific tags or elements.
- HTML defines web page structure: HTML is predefined with a set of tags and elements specifically tailored for creating web pages. These tags define the structure of a web page, such as headings, paragraphs, lists, tables, etc., and how content is displayed in a browser.
- Presentation vs. Data:
- HTML focuses on presentation: HTML specifies how content should be displayed in a web browser. It includes tags for formatting text, adding images, creating links, etc.
- XML separates content from presentation: XML describes the structure and meaning of data but does not dictate how it should be presented. Presentation of XML data is typically handled separately, often using XSLT (eXtensible Stylesheet Language Transformations) to transform XML into HTML or other formats for display.
| HTML code (displaying information) | XML code (describing information) |
|---|---|
| The HTML document has an h1, h2, h3 and an ol | The XML document has a book, an author, a contents and a list of chapters. The XML document does not provide any formatting. It is just pure information wrapped in XML tags. Additional software is required to send, receive and process or display the XML. |
Both use tags. However, HTML tags are predefined. With XML you can define your own tags.
HTML is designed to display data in a web page. Therefore, HTML tags specify how data should be displayed (e.g. display this data as a heading with h1, h2 etc.).
XML is designed to describe data. There are no built-in presentation features. Therefore, XML uses tags to identify a piece of data rather than specify how to display it (e.g. <author></author>).
XML is extensible
We can extend our XML document to carry more information (e.g. add more chapter elements or define new tags).
In the example above the tags <title> and <author> do not exist in any standards, we invented them. Although title is an existing HTML tag, in XML it means something different.
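One practical consequence of extensibility can be sketched in Python: a program that reads only the tags it knows about keeps working when the document gains new ones. The title and author values are the ones used for the book example in these notes. The chapter titles, the publisher tag and its value are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Original document (chapter title is a placeholder).
original = """<book>
  <title>New Perspectives of XML</title>
  <author>Patrick Carey</author>
  <contents>
    <chapter>Chapter one</chapter>
  </contents>
</book>"""

# Extended document: one more chapter and a brand-new <publisher> tag.
# "publisher" and "Example Press" are hypothetical.
extended = """<book>
  <title>New Perspectives of XML</title>
  <author>Patrick Carey</author>
  <publisher>Example Press</publisher>
  <contents>
    <chapter>Chapter one</chapter>
    <chapter>Chapter two</chapter>
  </contents>
</book>"""

def describe(xml_text):
    """Read only the fields this program knows about and ignore everything else."""
    book = ET.fromstring(xml_text)
    return book.findtext("title"), len(book.findall("contents/chapter"))

print(describe(original))
print(describe(extended))
```

The same describe function handles both versions, because the unknown publisher element is simply skipped.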
Use of XML
XML is used to store, share and transport data
- Since XML is stored in plain text format, XML provides a software- and hardware-independent way of storing and sharing data.
- With XML, data can be exchanged between incompatible systems. We can share data between different types of applications.
- XML is often used for distributing data over the Internet.
XML can be used to create New Languages
XML was used to create:
- SVG is an XML-based language for describing two-dimensional vector graphics.
- XHTML (HTML reformulated in XML), which was later superseded by HTML5
- RSS (Really Simple Syndication) - a web feed format used to publish frequently updated web content in a standardized format (e.g. news headlines, blog entries etc.)
- XSLT (eXtensible Stylesheet Language Transformations) a language for transforming XML documents into other formats, such as HTML, text, or XML.
- XQuery a query language designed for querying and extracting data from XML documents. It provides powerful capabilities for searching, filtering, and manipulating XML data, similar to SQL for relational databases.
- and others
Associated Technologies
XML has spawned a wide range of associated technologies that complement its capabilities and enable various functionalities. Here are some key XML-associated technologies:
- Parser - is an application (software component or library) that reads the XML document and analyzes its structure according to the XML specification.
- XML DTDs and Schemas - to validate that an XML document conforms to a predetermined structure.
- XSLT is a language for transforming XML documents into other formats, such as HTML, text, or XML itself.
- XPath is a language used for navigating and querying XML documents.
- XQuery is a query language designed for querying and extracting data from XML documents.
- SOAP is a protocol for exchanging structured information between distributed systems using XML-based messages.
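To make the parser and XPath entries above more concrete, here is a minimal Python sketch using the standard library parser in xml.etree.ElementTree. That module supports only a subset of XPath, so this is an illustration rather than a full XPath engine. The chapter titles and the number attribute are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

document = """<book>
  <title>New Perspectives of XML</title>
  <author>Patrick Carey</author>
  <contents>
    <chapter number="1">First placeholder chapter</chapter>
    <chapter number="2">Second placeholder chapter</chapter>
  </contents>
</book>"""

# The parser reads the text and builds a tree of elements.
book = ET.fromstring(document)

# XPath-style expressions then select parts of that tree.
all_chapters = [chapter.text for chapter in book.findall(".//chapter")]
second = book.find(".//chapter[@number='2']").text

print(all_chapters)
print(second)
```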
An example XML document
XML documents use a self-describing and simple syntax.
The first line of the document - the XML declaration - defines the XML version and the character encoding used in the document. In this case the document conforms to version 1.0 of the XML specification and uses UTF-8 (a Unicode encoding). UTF-8 (Unicode Transformation Format-8) is a variable-width character encoding capable of encoding every character defined by Unicode. It is widely used on the internet and in computer systems to represent text, supporting languages and characters from many scripts and alphabets.
The following are equivalent, because by default an XML parser assumes UTF-8 when no encoding is declared:
<?xml version="1.0"?>
<?xml version="1.0" encoding="UTF-8"?>
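A quick way to check this equivalence is to parse the same UTF-8 bytes with and without the encoding attribute. The sketch below uses Python's standard library. The name element and its value are made up, chosen only because the character ë takes two bytes in UTF-8.

```python
import xml.etree.ElementTree as ET

# The same document as UTF-8 bytes, with and without an explicit encoding.
without_encoding = '<?xml version="1.0"?><name>Zoë</name>'.encode("utf-8")
with_encoding = '<?xml version="1.0" encoding="UTF-8"?><name>Zoë</name>'.encode("utf-8")

text_a = ET.fromstring(without_encoding).text
text_b = ET.fromstring(with_encoding).text

# Both declarations lead to the same decoded text.
print(text_a == text_b)
```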
The next line describes the root element of the document (it is like saying: "this document is a book"):
<book>
The next 10 lines describe the 3 child elements of the root (title, author and contents) and the 5 sub-elements of contents:
The next line describes the end of the root element:
</book>
Reading the XML document, we can see that it contains a book with the title New Perspectives of XML, written by the author Patrick Carey, with a list of chapters.
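The walkthrough above can be reproduced with a short Python sketch. The document below follows the structure described in the text (a book with a title, an author and a contents element holding five chapters). The title and author are taken from the text, but the chapter titles are placeholders, not the original ones.

```python
import xml.etree.ElementTree as ET

# Chapter titles are placeholders; only the structure follows the text.
document = """<?xml version="1.0" encoding="UTF-8"?>
<book>
  <title>New Perspectives of XML</title>
  <author>Patrick Carey</author>
  <contents>
    <chapter>Placeholder chapter 1</chapter>
    <chapter>Placeholder chapter 2</chapter>
    <chapter>Placeholder chapter 3</chapter>
    <chapter>Placeholder chapter 4</chapter>
    <chapter>Placeholder chapter 5</chapter>
  </contents>
</book>"""

book = ET.fromstring(document.encode("utf-8"))
child_tags = [child.tag for child in book]
chapter_count = len(book.find("contents"))

print(book.tag)    # the root element
print(child_tags)  # the children of the root
print(book.findtext("title"), "by", book.findtext("author"))
print(chapter_count)
```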
XML elements
- XML documents have relationships
Elements are related as parents and children. In the previous example:
- book is the root element
- title, author, and contents are child elements of book.
- book is the parent element to title, author, and contents
- title, author, and contents are siblings because they have the same parent.
- Elements have content
An XML element is a fundamental building block used to define the structure and content of an XML document. An element consists of a start tag, an end tag, and the content (text, other elements, or both) enclosed between them. An element can have:
- mixed content (containing text and other elements)
- simple content (containing only text)
- empty content (containing nothing)
- elements can have attributes
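The relationships and content types above can be inspected with a small Python sketch. The fragment is invented for this purpose: the note, em and cover elements and the isbn attribute are assumptions, not part of the original example.

```python
import xml.etree.ElementTree as ET

# A made-up fragment showing simple, mixed and empty content plus an attribute.
document = """<book isbn="000-0">
  <title>New Perspectives of XML</title>
  <note>Read the <em>first</em> chapter</note>
  <cover/>
</book>"""

book = ET.fromstring(document)
title, note, cover = list(book)

# book is the parent; title, note and cover are siblings.
print([child.tag for child in book])

# Simple content: only text, no child elements.
print(title.text, len(title))

# Mixed content: text plus a child element (the text after <em> is its "tail").
print(note.text, note[0].text, note[0].tail)

# Empty content: no text and no child elements.
print(cover.text, len(cover))

# Attributes are exposed as a dictionary.
print(book.attrib)
```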
XML well-formed rules
XML defines a set of rules that every document must follow to be well-formed. Here are the key rules:
- XML documents must have a root element (like the book element in the example above)
- All XML elements must have a matching closing tag
- Empty elements can be written as <price></price> or <price/> (known as a self-closing tag)
- All XML names are case sensitive
- <Price></price> does not match; only <price></price> does
- XML must be properly nested
- XML attribute values must be quoted
- This is correct <order date = " 24/03/2024 ">
- This is incorrect <order date = 24/03/2024 >
In addition, element names must be valid.
- Names can contain letters, numbers, and other characters
- Names must start with a letter or an underscore, not with a number or another punctuation character
- Names must not start with the letters xml (or XML, or Xml etc.)
- Names cannot contain spaces
- Not allowed: <My element>
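Most of these rules can be checked mechanically, because a conforming parser must reject any document that is not well-formed. The following Python sketch feeds a few samples to the standard library parser. The helper name is_well_formed and the samples themselves are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if it reports an error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

samples = [
    "<price></price>",                        # matching tags
    "<price/>",                               # self-closing empty element
    "<Price></price>",                        # names are case sensitive
    "<a><b></a></b>",                         # improperly nested
    '<order date = " 24/03/2024 "></order>',  # quoted attribute value
    "<order date = 24/03/2024 ></order>",     # unquoted attribute value
    "<My element></My element>",              # space inside a name
    "<1price/>",                              # name starts with a number
    "<a/><b/>",                               # more than one root element
]

results = {text: is_well_formed(text) for text in samples}
for text, ok in results.items():
    print("well-formed" if ok else "rejected   ", text)
```

Note that this only checks well-formedness. The reserved xml prefix rule is a naming convention that a parser may not enforce.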
Element naming
Make tag names descriptive - they should describe the data they contain.
Element names can be as long as you like, but do not exaggerate. Names should be short and simple, like this: <book_title></book_title> not like <the_title_of_the_book></the_title_of_the_book>
Valid XML
Valid XML refers to XML documents that are well-formed and also follow the rules of a declared DTD or schema:
- Well-formed documents contain properly written XML tags conforming to XML syntax
- Valid documents also comply with a set of rules defined in the DTD (or schema)
- Why validate?
- We enforce a structure and thereby the integrity of the data
- Helps manage large documents
XML - DTD Elements
Example - DTD
<!DOCTYPE shoplist [
<!ELEMENT shoplist (item+)>
<!ELEMENT item (#PCDATA)>
]>
The DTD defines an element called shoplist containing another element called item. How many items?
Item contains text (parsed character data)
The main building block of a DTD is the element declaration <!ELEMENT ...>
An internal or external DTD is declared with <!DOCTYPE ...>
XML - DTDs embedded
DTDs are introduced into XML documents by using the <!DOCTYPE> declaration
A DTD can be embedded in an XML document, e.g.
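As a sketch of what an embedded DTD can look like, the Python snippet below holds an XML document whose DOCTYPE declaration carries the shoplist and item declarations inline. The item values are invented. One caveat: the parser in Python's standard library reads the DTD but does not validate against it, so it only confirms that the document is well-formed. A validating parser, which the standard library does not provide, would be needed to enforce the DTD.

```python
import xml.etree.ElementTree as ET

# An XML document with its DTD embedded in the DOCTYPE declaration.
document = """<?xml version="1.0"?>
<!DOCTYPE shoplist [
  <!ELEMENT shoplist (item+)>
  <!ELEMENT item (#PCDATA)>
]>
<shoplist>
  <item>Milk</item>
  <item>Bread</item>
</shoplist>"""

shoplist = ET.fromstring(document)
print(shoplist.tag, [item.text for item in shoplist])

# Well-formed but NOT valid: the DTD does not allow a <note> inside <shoplist>.
# The non-validating parser still accepts it.
invalid = document.replace("<item>Milk</item>", "<note>Milk</note>")
first_child = ET.fromstring(invalid)[0].tag
print(first_child)
```

The second parse succeeding is exactly the difference between a well-formed document and a valid one.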
XML - DTDs external
The PUBLIC type can locate a DTD from a known repository of DTDs.
The SYSTEM type locates a DTD via a URL, e.g.:
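A minimal sketch of the SYSTEM form, assuming a hypothetical DTD file called shoplist.dtd stored next to the document. The Python snippet only reads the declaration through xml.dom.minidom. It does not fetch or apply the external DTD.

```python
from xml.dom import minidom

# "shoplist.dtd" is a hypothetical file name; the parser below does not fetch it.
document = """<?xml version="1.0"?>
<!DOCTYPE shoplist SYSTEM "shoplist.dtd">
<shoplist>
  <item>Milk</item>
</shoplist>"""

doctype = minidom.parseString(document).doctype

print(doctype.name)      # the root element the DTD applies to
print(doctype.systemId)  # where the external DTD is located
```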
Specifying sequence of elements
Specifying sequence of elements with: comma
Comma means?
, (comma): Separates multiple child elements within parentheses and specifies their order within a parent element.
Specifying sequence of elements with: + ? *
+ (plus): Specifies that the preceding element can appear one or more times.
? (question mark): Specifies that the preceding element is optional and can appear zero or one time.
* (asterisk): Specifies that the preceding element can appear zero or more times.
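The comma and the + ? * indicators behave much like concatenation and repetition in regular expressions. The Python sketch below uses that similarity to check the order of an element's children against a DTD-style content model. It is not a DTD validator: the helper children_match is hypothetical, handles only element names combined with , | ( ) + ? * and ignores #PCDATA, EMPTY and ANY. The book, title, author and chapter names are used purely as an example.

```python
import re
import xml.etree.ElementTree as ET

def children_match(model, xml_text):
    """Check the sequence of child elements against a DTD-style content model."""
    # Turn every element name into a regex group, e.g. title -> (?:title;)
    pattern = re.sub(r"[A-Za-z_][\w.-]*", lambda m: "(?:" + m.group(0) + ";)", model)
    # A comma means "followed by", which in a regex is plain concatenation.
    pattern = re.sub(r"[,\s]", "", pattern)
    parent = ET.fromstring(xml_text)
    names = "".join(child.tag + ";" for child in parent)
    return re.fullmatch(pattern, names) is not None

full = "<book><title/><author/><chapter/><chapter/></book>"
no_author = "<book><title/><chapter/></book>"
wrong_order = "<book><author/><title/><chapter/></book>"

strict = "(title, author, chapter+)"   # exactly one title, one author, 1..n chapters
relaxed = "(title, author?, chapter*)"  # author optional, 0..n chapters

for model in (strict, relaxed):
    for sample in (full, no_author, wrong_order):
        print(model, sample, children_match(model, sample))
```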
Specifying sequence of elements with: |
| (pipe): Specifies that one of the listed elements can appear in the sequence (like or).
Elements
() (parentheses): Specifies a group.
A heading is made up of a title and an address in any order
<!ELEMENT heading ((title, address) | (address, title))>
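The "title and address in any order" rule combines a choice with two groups. As an illustration, the Python sketch below translates that rule by hand into a regular expression over the child names and tests a few headings. The function name heading_children_ok and the sample documents are assumptions made for this example, and the check covers only the order of the children.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of the content model ((title, address) | (address, title)).
# Each child name is written with a trailing ";" so names cannot run together.
HEADING_MODEL = re.compile(r"(?:title;address;)|(?:address;title;)")

def heading_children_ok(xml_text):
    """Return True if the heading holds one title and one address, in either order."""
    heading = ET.fromstring(xml_text)
    names = "".join(child.tag + ";" for child in heading)
    return HEADING_MODEL.fullmatch(names) is not None

print(heading_children_ok("<heading><title/><address/></heading>"))
print(heading_children_ok("<heading><address/><title/></heading>"))
print(heading_children_ok("<heading><title/></heading>"))
print(heading_children_ok("<heading><title/><title/></heading>"))
```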
An empty element
<!ELEMENT img EMPTY>
In XML document
<img/>
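An empty element can be written in the long form or the self-closing form, and a parser treats both the same way. A small Python check, using img as the example name:

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring("<img></img>")
short_form = ET.fromstring("<img/>")

for element in (long_form, short_form):
    # No text and no child elements: the element is empty.
    print(element.tag, element.text, len(element))

# Both spellings serialize to exactly the same text.
same_output = ET.tostring(long_form) == ET.tostring(short_form)
print(same_output)
```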
Recap
- Well-formed documents vs valid documents
- Contents of DTD elements
- Elements contain child elements <!ELEMENT shoplist (item+)>
- Elements contain text (Parsed Character Data) <!ELEMENT item (#PCDATA)>
- Elements are empty <!ELEMENT img EMPTY>
- Combination rules
- A, B
- A?
- A+
- A*
- A|B
- To declare an internal or external DTD <!DOCTYPE ...>
Note that the syntax for writing comments in XML is similar to the one for HTML
<!-- This is a comment -->
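Comments are meant for human readers and are not part of the data. As a small illustration, Python's standard library parser skips them with its default settings. The shoplist content below is made up.

```python
import xml.etree.ElementTree as ET

document = """<shoplist>
  <!-- This is a comment -->
  <item>Milk</item>
</shoplist>"""

shoplist = ET.fromstring(document)

# With the default parser settings the comment is skipped, so only <item> remains.
child_tags = [child.tag for child in shoplist]
print(child_tags)
```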