ISO 8859

ISO 8859, more formally ISO/IEC 8859, is a joint ISO and IEC standard for 8-bit character encodings for use by computers. The standard is divided into numbered, separately published parts, such as ISO/IEC 8859-1, ISO/IEC 8859-2, etc., each of which may be informally referred to as a standard in itself. As of 2006 there are 15 parts, excluding the abandoned ISO/IEC 8859-12.

Introduction

While the bit patterns of the 95 printable ASCII characters are sufficient to exchange information in modern English, most other languages that use the Latin alphabet need additional symbols not covered by ASCII, such as ß (German), ñ (Spanish), å (Swedish and other Nordic languages) and ő (Hungarian). ISO 8859 sought to remedy this problem by using the eighth bit of an 8-bit byte to allow positions for another 128 characters. (This bit was previously used for data transmission protocol information, or was left unused.) However, more characters were needed than could fit in a single 8-bit character encoding, so several mappings were developed, including at least 10 just to cover the Latin script.
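As a rough illustration of the eighth-bit idea, the following Python sketch encodes the four example letters and checks that each lands in the upper half of the byte range. The codec names (`ascii`, `iso8859_1`, `iso8859_2`) are Python's own spellings, and the helper `fits_ascii` is a hypothetical name introduced here.

```python
# Example letters from the paragraph above, paired with an ISO 8859 part
# (as named by Python's codec registry) that contains them.
samples = {"ß": "iso8859_1", "ñ": "iso8859_1", "å": "iso8859_1", "ő": "iso8859_2"}

def fits_ascii(ch):
    """Return True if the character is representable in 7-bit ASCII."""
    try:
        ch.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

encoded = {}
for ch, codec in samples.items():
    data = ch.encode(codec)      # one byte per character in every ISO 8859 part
    encoded[ch] = data[0]
    high_bit_set = data[0] >= 0x80
    print(f"{ch}: 0x{data[0]:02X} in {codec}, eighth bit set: {high_bit_set}, "
          f"fits ASCII: {fits_ascii(ch)}")
```

Note that ő is absent from part 1 and needs part 2, which is the single-byte limitation that led to several mappings.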

The ISO 8859-n encodings contain only printable characters, and were designed to be used in conjunction with control characters mapped to the unassigned bytes. To this end, a series of encodings registered with the IANA add the C0 control set (control characters mapped to bytes 0 to 31) from ISO 646 and the C1 control set (control characters mapped to bytes 128 to 159) from ISO 6429, resulting in full 8-bit character maps with most, if not all, bytes assigned. These sets have ISO-8859-n as their preferred MIME name or, in cases where a preferred MIME name isn't specified, their canonical name. Many people use the terms ISO 8859-n and ISO-8859-n interchangeably. ISO 8859-11 did not get such a charset assigned, presumably because it was almost identical to TIS 620.
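The byte ranges described above can be checked with a short Python sketch. It assumes that Python's `iso8859_1` codec corresponds to the IANA charset ISO-8859-1 (printable part plus C0 and C1 controls), and it uses the Unicode general category `Cc` as the test for "control character".

```python
import unicodedata

# Decode every possible byte value with Python's ISO-8859-1 codec.
table = bytes(range(256)).decode("iso8859_1")

# C0 controls: bytes 0 to 31; C1 controls: bytes 128 to 159.
c0 = [b for b in range(0x00, 0x20) if unicodedata.category(table[b]) == "Cc"]
c1 = [b for b in range(0x80, 0xA0) if unicodedata.category(table[b]) == "Cc"]

# The 96 positions 0xA0 to 0xFF are the part defined by ISO 8859-1 itself.
upper_graphic = [b for b in range(0xA0, 0x100)
                 if unicodedata.category(table[b]) != "Cc"]

print(len(c0), len(c1), len(upper_graphic))
```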

Characters

The ISO 8859 standard is designed for reliable information exchange, not typography; the standard omits symbols needed for high-quality typography, such as optional ligatures, curly quotation marks, dashes, etc. As a result, high-quality typesetting systems often use proprietary or idiosyncratic extensions on top of the ASCII and ISO 8859 standards, or use Unicode instead.

As a rule of thumb, if a character or symbol was not already part of a widely used data-processing character set and was also not usually provided on typewriter keyboards for a national language, it didn't get in. Hence the directional double quotation marks « and » used for some European languages were included, but not the directional double quotation marks “ and ” used for English and some other languages. French didn't get its œ and Œ ligatures because they could be typed as 'oe'. Ÿ, needed for all-caps text, was left out as well. These characters were, however, included later with ISO 8859-15, which also introduced the new euro sign character €. Likewise Dutch did not get the 'ij' and 'IJ' letters, because Dutch speakers had gotten used to typing these as two letters instead. Romanian did not initially get its 'Ș/ș' and 'Ț/ț' (with comma) letters, because these letters were initially unified with 'Ş/ş' and 'Ţ/ţ' (with cedilla) by the Unicode Consortium, which considered the shapes with comma below to be glyph variants of the shapes with cedilla. However, the letters with explicit comma below were later added to the Unicode standard and are also in ISO 8859-16.
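The later additions mentioned above can be probed with Python's built-in codecs. This is only a sketch: `can_encode` is a hypothetical helper, and the codec names are Python's spellings of the ISO 8859 parts.

```python
def can_encode(text, codec):
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

latin9_additions = "€ŒœŸ"                    # added with ISO 8859-15
romanian_comma = "\u0218\u0219\u021A\u021B"  # Ș ș Ț ț with comma below

for ch in latin9_additions:
    print(ch, "part 1:", can_encode(ch, "iso8859_1"),
          "part 15:", can_encode(ch, "iso8859_15"))

# Part 2 only has the cedilla forms; part 16 has the comma-below forms.
print("comma-below letters in part 2: ", can_encode(romanian_comma, "iso8859_2"))
print("comma-below letters in part 16:", can_encode(romanian_comma, "iso8859_16"))
```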

Most of the ISO 8859 encodings provide diacritic marks required for various European languages. Others provide non-Latin alphabets: Greek, Cyrillic, Hebrew, Arabic and Thai. Most of the encodings contain only spacing characters, although the Hebrew and Arabic ones also contain combining characters. However, the standard makes no provision for the scripts of East Asian languages (CJK), as their ideographic writing systems require many thousands of code points. Although it uses Latin-based characters, Vietnamese does not fit into 96 positions (without using combining diacritics) either. Each Japanese syllabic alphabet (hiragana or katakana, see Kana) would fit, but like several other alphabets of the world they aren't encoded in the ISO 8859 system.

The Parts of ISO 8859

ISO 8859 is divided into the following parts:
Part 1: Latin-1 (Western European)
Perhaps the most widely used part of ISO 8859, covering most Western European languages: Danish, Dutch (partial[1]), English, Faeroese, Finnish (partial[2]), French (partial[2]), German, Icelandic, Irish, Italian, Norwegian, Portuguese, Rhaeto-Romanic, Scottish Gaelic, Spanish, and Swedish. Languages from other parts of the world are also covered, including: Eastern European Albanian, Southeast Asian Indonesian, as well as the African languages Afrikaans and Swahili. The missing euro sign and capital Ÿ are in the revised version ISO 8859-15. The corresponding IANA-approved character set ISO-8859-1 is the default encoding for legacy HTML documents and for documents transmitted via MIME messages, such as HTTP responses when the document's media type is "text" (as in "text/html").

Part 2: Latin-2 (Central European)
Supports those Central and Eastern European languages that use the Latin alphabet, including Bosnian, Polish, Croatian, Czech, Slovak, Slovenian, and Hungarian. The missing euro sign can be found in version ISO 8859-16.

Part 3: Latin-3 (South European)
Turkish, Maltese, and Esperanto. Largely superseded by ISO 8859-9 for Turkish and Unicode for Esperanto.

Part 4: Latin-4 (North European)
Estonian, Latvian, Lithuanian, Greenlandic, and Sami.

Part 5: Latin/Cyrillic
Covers mostly Slavic languages that use a Cyrillic alphabet, including Belarusian, Bulgarian, Macedonian, Russian, Serbian, and Ukrainian (partial[3]).

Part 6: Latin/Arabic
Covers the most common Arabic language characters. Doesn't support other languages using the Arabic script. Needs to be BiDi and cursive joining processed for display.

Part 7: Latin/Greek
Covers the modern Greek language (monotonic orthography). Can also be used for Ancient Greek written without accents or in monotonic orthography, but lacks the diacritics for polytonic orthography. These were introduced with Unicode.

Part 8: Latin/Hebrew
Covers the modern Hebrew alphabet as used in Israel. In practice two different encodings exist, logical order (needs to be BiDi processed for display) and visual (left-to-right) order (in effect, after BiDi processing and line breaking).

Part 9: Latin-5 (Turkish)
Largely the same as ISO 8859-1, replacing the rarely used Icelandic letters with Turkish ones. It is also used for Kurdish.

Part 10: Latin-6 (Nordic)
A rearrangement of Latin-4. Considered more useful for Nordic languages. Baltic languages use Latin-4 more.

Part 11: Latin/Thai
Contains most glyphs needed for the Thai language. Same as TIS 620.

Part 12: Latin/Devanagari (non-existent)
The work on a part of 8859 for Devanagari was officially abandoned in 1997. ISCII and Unicode/ISO/IEC 10646 cover Devanagari.

Part 13: Latin-7 (Baltic Rim)
Added some characters for Baltic languages which were missing from Latin-4 and Latin-6.

Part 14: Latin-8 (Celtic)
Covers Celtic languages such as Gaelic and the Breton language.

Part 15: Latin-9
A revision of 8859-1 that removes some little-used symbols, replacing them with the euro sign and the letters Š, š, Ž, ž, Œ, œ, and Ÿ, which completes the coverage of French, Finnish and Estonian.

Part 16: Latin-10 (South-Eastern European)
Intended for Albanian, Croatian, Hungarian, Italian, Polish, Romanian and Slovenian, but also Finnish, French, German and Irish Gaelic (new orthography). The focus lies more on letters than symbols. The currency sign is replaced with the euro sign.

Notes
1. Only the IJ/ij (letter IJ) is missing, which is usually represented as IJ.
2. Missing characters are in ISO 8859-15.
3. Missing Ґ/ґ characters were reintroduced into Ukrainian in 1991.
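To see which parts cover a given piece of text, one can try every codec in turn. The sketch below relies on Python's `iso8859_N` codecs as stand-ins for the parts listed above; `parts_covering` is a hypothetical helper, and Python's tables may follow particular editions of each part.

```python
# Published part numbers; part 12 was abandoned and has no codec.
PARTS = [n for n in range(1, 17) if n != 12]

def parts_covering(text):
    """Return the ISO 8859 part numbers whose codec can encode all of text."""
    covering = []
    for n in PARTS:
        try:
            text.encode(f"iso8859_{n}")
        except UnicodeEncodeError:
            continue  # at least one character is missing from this part
        covering.append(n)
    return covering

print(parts_covering("ő"))       # Hungarian letter
print(parts_covering("Москва"))  # Cyrillic text
```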


Each part of ISO 8859 is designed to support languages that often borrow from each other, so the characters needed by each language are usually accommodated by a single part. However, there are some characters and language combinations that are not accommodated without transcriptions. Efforts were made to make conversions as smooth as possible. For example, German has all seven of its special characters at the same positions in all Latin variants (1-4, 9-10, 13-16), and in many positions the characters differ only in their diacritics between the sets. In particular, variants 1-4 were designed jointly, and have the property that every encoded character appears either at a given position or not at all.
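The claim about the German letters can be checked mechanically. The following sketch assumes Python's `iso8859_N` codecs faithfully reflect the Latin parts; the variable names are illustrative only.

```python
GERMAN = "ÄÖÜäöüß"                              # the seven special characters
LATIN_PARTS = [1, 2, 3, 4, 9, 10, 13, 14, 15, 16]

# Encode the same string under every Latin part.
encodings = {n: GERMAN.encode(f"iso8859_{n}") for n in LATIN_PARTS}

# If every part places the letters at the same positions,
# all byte strings are identical.
same_everywhere = len(set(encodings.values())) == 1
print(same_everywhere, encodings[1].hex(" "))
```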

Table

Comparison of the various parts of ISO 8859
Binary Oct Dec Hex 1 2 3 4 5 6 7 8 9 10 11 13 14 15 16
10100000 240 160 A0 Non-breaking space (NBSP)
10100001 241 161 A1¡ĄĦĄЁ  ¡Ą¡Ą
10100010 242 162 A2¢˘ĸЂ ¢¢Ē¢¢ą
10100011 243 163 A3£Ł£ŖЃ £Ģ£Ł
10100100 244 164 A4 ¤Є¤¤Ī¤Ċ
10100101 245 165 A5¥Ľ ĨЅ ¥Ĩċ¥
10100110 246 166 A6¦ŚĤĻІ ¦Ķ¦Š
10100111 247 167 A7 §Ї §§
10101000 250 168 A8 ¨Ј ¨ĻØš
10101001 251 169 A9©ŠİŠЉ ©Đ©
10101010 252 170 AAªŞĒЊ ͺ×ªŠŖªȘ
10101011 253 171 AB«ŤĞĢЋ «Ŧ««
10101100 254 172 AC¬ŹĴŦЌ،¬Ž¬¬Ź
10101101 255 173 AD soft hyphen (SHY)SHY
10101110 256 174 AE®Ž ŽЎ  ®Ū®ź
10101111 257 175 AF¯Ż¯Џ ¯ŊƟ¯Ż
10110000 260 176 B0 °А °°°
10110001 261 177 B1±ąħąБ ±ą±±
10110010 262 178 B2²˛²˛В ²ē²Ġ²Č
10110011 263 179 B3³ł³ŗГ ³ģ³ġ³ł
10110100 264 180 B4 ´Д ΄´īŽ
10110101 265 181 B5µľµĩЕ ΅µĩµµ
10110110 266 182 B6¶śĥļЖ Ά¶ķ¶
10110111 267 183 B7·ˇ·ˇЗ ···
10111000 270 184 B8 ¸И Έ¸ļøž
10111001 271 185 B9¹šıšЙ Ή¹đ¹¹č
10111010 272 186 BAºşēК Ί÷ºšŗºș
10111011 273 187 BB»ťğģЛ؛»ŧ»»
10111100 274 188 BC¼źĵŧМ Ό¼ž¼Œ
10111101 275 189 BD½˝½ŊН ½½œ
10111110 276 190 BE¾ž žО Ύ¾ū¾Ÿ
10111111 277 191 BF¿żŋП؟Ώ ¿ŋæ¿ż
11000000 300 192 C0ÀŔÀĀР ΐ ÀĀĄÀ
11000001 301 193 C1 ÁСءΑ ÁĮÁ
11000010 302 194 C2 ÂТآΒ ÂĀÂ
11000011 303 195 C3ÃĂ ÃУأΓ ÃĆÃĂ
11000100 304 196 C4 ÄФؤΔ ÄÄ
11000101 305 197 C5ÅĹĊÅХإΕ ÅÅĆ
11000110 306 198 C6ÆĆĈÆЦئΖ ÆĘÆ
11000111 307 199 C7 ÇĮЧاΗ ÇĮĒÇ
11001000 310 200 C8ÈČÈČШبΘ ÈČČÈ
11001001 311 201 C9 ÉЩةΙ ÉÉ
11001010 312 202 CAÊĘÊĘЪتΚ ÊĘŹÊ
11001011 313 203 CB ËЫثΛ ËĖË
11001100 314 204 CCÌĚÌĖЬجΜ ÌĖĢÌ
11001101 315 205 CD ÍЭحΝ ÍĶÍ
11001110 316 206 CE ÎЮخΞ ÎĪÎ
11001111 317 207 CFÏĎÏĪЯدΟ ÏĻÏ
11010000 320 208 D0ÐĐ ĐаذΠ ĞЊŴÐ
11010001 321 209 D1ÑŃÑŅбرΡ ÑŅŃÑŃ
11010010 322 210 D2ÒŇÒŌвز  ÒŌŅÒ
11010011 323 211 D3 ÓĶгسΣ ÓÓ
11010100 324 212 D4 ÔдشΤ ÔŌÔ
11010101 325 213 D5ÕŐĠÕеصΥ ÕŐ
11010110 326 214 D6 ÖжضΦ ÖÖ
11010111 327 215 D7 ×зطΧ ×Ũ×׌
11011000 330 216 D8ØŘĜØиظΨ ØŲØŰ
11011001 331 217 D9ÙŮÙŲйعΩ ÙŲŁÙ
11011010 332 218 DA ÚкغΪ ÚŚÚ
11011011 333 219 DBÛŰÛл Ϋ Û ŪÛ
11011100 334 220 DC Üм ά Ü Ü
11011101 335 221 DD ÝŬŨн έ İÝ ŻÝĘ
11011110 336 222 DEÞŢŜŪо ή ŞÞ ŽŶÞȚ
11011111 337 223 DF ßп ίß฿ß
11100000 340 224 E0àŕàāрـΰאàāąà
11100001 341 225 E1 áсفαבáįá
11100010 342 226 E2 âтقβגâāâ
11100011 343 227 E3ãă ãуكγדãćãă
11100100 344 228 E4 äфلδהää
11100101 345 229 E5åĺċåхمεוååć
11100110 346 230 E6æćĉæцنζזæęæ
11100111 347 231 E7 çįчهηחçįēç
11101000 350 232 E8èčèčшوθטèččè
11101001 351 233 E9 éщىιיéé
11101010 352 234 EAêęêęъيκךêęźê
11101011 353 235 EB ëыًλכëėë
11101100 354 236 ECìěìėьٌμלìėģì
11101101 355 237 ED íэٍνםíķí
11101110 356 238 EE îюَξמîīî
11101111 357 239 EFïďïīяُοןïļï
11110000 360 240 F0ðđ đȑِπנğðšŵðđ
11110001 361 241 F1ñńñņёّρסñņńñń
11110010 362 242 F2òňòōђْςעòōņò
11110011 363 243 F3 óķѓ σףóó
11110100 364 244 F4 ôє τפôōô
11110101 365 245 F5õőġõѕ υץõő
11110110 366 246 F6 öі φצöö
11110111 367 247 F7 ÷ї χק÷ũ÷÷ś
11111000 370 248 F8øřĝøј ψרøųøű
11111001 371 249 F9ùůùųљ ωשùųłù
11111010 372 250 FA úњ ϊתúśú
11111011 373 251 FBûűûћ ϋ ûūû
11111100 374 252 FC üќ ό ü ü
11111101 375 253 FD ýŭũ§ ύLRMıý żýę
11111110 376 254 FEþţŝūў ώRLMşþ žŷþț
11111111 377 255 FFÿ˙џ   ÿĸ ÿ


Position 0xA0 is always the non-breaking space, and 0xAD is in most parts the soft hyphen, which only shows at line breaks. Other empty fields are either unassigned or cannot be displayed by the system used.
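The statement about 0xA0 and 0xAD can be spot-checked with Python's codecs, on the assumption that they mirror the published parts. The variable names below are illustrative.

```python
# Published part numbers; part 12 was abandoned and has no codec.
PARTS = [n for n in range(1, 17) if n != 12]

# Parts in which byte 0xA0 decodes to U+00A0 NO-BREAK SPACE.
nbsp_parts = [n for n in PARTS
              if b"\xa0".decode(f"iso8859_{n}") == "\u00a0"]

# Parts in which byte 0xAD decodes to U+00AD SOFT HYPHEN.
shy_parts = [n for n in PARTS
             if b"\xad".decode(f"iso8859_{n}") == "\u00ad"]

print("NBSP at 0xA0 in every part:", nbsp_parts == PARTS)
print("Parts without SHY at 0xAD:", [n for n in PARTS if n not in shy_parts])
```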

The table includes the additions made in the ISO/IEC 8859-7:2003 and ISO/IEC 8859-8:1999 versions. LRM stands for left-to-right mark (U+200E) and RLM stands for right-to-left mark (U+200F).

Relationship to Unicode and the UCS

Since 1991, the Unicode Consortium has been working with ISO to develop the Unicode Standard and ISO/IEC 10646: the Universal Character Set (UCS) in tandem. This pair of standards was created to unify the ISO 8859 character repertoire, among others, by assigning each character, initially, to a 16-bit code value, with some code values left unassigned. Over time, their models adapted to map characters to abstract numeric code points rather than fixed bit-width values, so that more code points and encoding methods could be supported.

Unicode and ISO/IEC 10646 currently assign about 100,000 characters to a code space consisting of over a million code points, and they define several standard encodings that are capable of representing every available code point. The standard encodings of Unicode and the UCS use sequences of one to four 8-bit code values (UTF-8), sequences of one or two 16-bit code values (UTF-16), or one 32-bit code value (UTF-32 or UCS-4). There is also an older encoding that uses one 16-bit code value (UCS-2), capable of representing one-seventeenth of the available code points. Of these encoding forms, only UTF-8's byte sequences are in a fixed order; the others are subject to platform-dependent byte ordering issues that may be addressed via special codes or indicated via out-of-band means.
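The code-unit counts and the one-seventeenth figure can be reproduced with a few lines of Python. This is a sketch using Python's standard `utf-8`, `utf-16-le`/`utf-16-be`, and `utf-32-le` codecs; the sample characters are arbitrary choices.

```python
samples = ["A", "ß", "€", "\U0001F600"]  # U+0041, U+00DF, U+20AC, U+1F600

units = {}
for ch in samples:
    utf8_units = len(ch.encode("utf-8"))            # 8-bit code units
    utf16_units = len(ch.encode("utf-16-le")) // 2  # 16-bit code units
    utf32_units = len(ch.encode("utf-32-le")) // 4  # 32-bit code units
    units[ch] = (utf8_units, utf16_units, utf32_units)
    print(f"U+{ord(ch):04X}: {units[ch]}")

# UCS-2 reaches only the first 0x10000 of the 0x110000 code points.
ucs2_fraction = 0x110000 // 0x10000
print("UCS-2 covers 1 /", ucs2_fraction, "of the code space")

# Byte order matters for 16-bit units but not for UTF-8.
little, big = "A".encode("utf-16-le"), "A".encode("utf-16-be")
print(little, big)
```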

Newer editions of ISO 8859 express characters in terms of their Unicode/UCS names and the U+nnnn notation, effectively causing each part of ISO 8859 to be a Unicode/UCS character encoding scheme that maps a very small subset of the UCS to single 8-bit bytes. The first 256 characters in Unicode and the UCS are identical to those in ISO-8859-1.
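The identity between ISO-8859-1 and the first 256 Unicode code points is easy to confirm in Python, assuming the `iso8859_1` codec implements the IANA charset with both control sets:

```python
# Decode every byte value and compare each result with its own byte number.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("iso8859_1")

identical = all(ord(ch) == b for b, ch in zip(all_bytes, decoded))
print(identical)
```

In other words, converting ISO-8859-1 text to Unicode code points amounts to zero-extending each byte.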

Single-byte character sets, including the parts of ISO 8859 and their derivatives, were favored throughout the 1990s, having the advantages of being well-established and more easily implemented in software: the equation of one byte to one character is simple and adequate for most single-language applications, and there are no combining characters or variant forms.

As the relative cost, in computing resources, of using more than one byte per character began to diminish, programming languages and operating systems added native support for Unicode alongside their system of code pages. Windows NT was quite an early adopter of Unicode. However, Unicode support in Windows 9x required linking with a special compatibility layer or restricting the design to a very small subset of the Windows API, which discouraged its use. As Unicode-enabled operating systems became more widespread, ISO 8859 and other legacy encodings became less popular. While remnants of ISO 8859 and single-byte character models remain entrenched in many operating systems, programming languages, data storage systems, networking applications, display hardware, and end-user application software, most modern computing applications use Unicode internally, and rely on conversion tables to map to and from other encodings when necessary.
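The "Unicode internally, conversion tables at the edges" pattern might look like the following Python sketch. The function `transcode` is a hypothetical helper; Python's codecs play the role of the conversion tables.

```python
def transcode(data, source, target):
    """Convert bytes between encodings by way of Unicode text."""
    text = data.decode(source)   # legacy bytes -> internal Unicode string
    return text.encode(target)   # Unicode string -> bytes in the target encoding

# A name stored in a legacy ISO 8859-2 file, converted to UTF-8 and back.
legacy = "Erdős".encode("iso8859_2")
as_utf8 = transcode(legacy, "iso8859_2", "utf-8")
round_trip = transcode(as_utf8, "utf-8", "iso8859_2")

print(legacy, as_utf8, round_trip == legacy)
```

Converting in the other direction fails with `UnicodeEncodeError` if the text contains characters outside the target part, so real code would need an error-handling policy.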

Development status

The ISO/IEC 8859 standard was maintained by ISO/IEC Joint Technical Committee 1, Subcommittee 2, Working Group 3 (ISO/IEC JTC 1/SC 2/WG 3). In June 2004, WG 3 disbanded, and maintenance duties were transferred to SC 2. The standard is not currently being updated, as the Subcommittee's only remaining working group, WG 2, is concentrating on development of ISO/IEC 10646.
