Character Sets
- Text is a sequence of characters, and a computer stores and processes each character as a binary number
- To represent text in binary, a computer uses a character set, which is a collection of characters and the corresponding binary codes that represent them
- One of the most commonly used character sets is the American Standard Code for Information Interchange (ASCII), which assigns a unique 7-bit binary code to each of its 128 characters, including uppercase and lowercase letters, digits, punctuation marks, and control characters
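- E.g. The 7-bit ASCII code for the uppercase letter 'A' is 1000001 (65 in decimal), while the code for the character '?' is 0111111 (63 in decimal); when stored in an 8-bit byte these gain a leading zero, giving 01000001 and 00111111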
- ASCII is limited to 128 characters, so it cannot represent accented letters, non-Latin scripts, or most symbols, and is only suitable for basic English text
- To address these limitations, Unicode was developed as a character encoding standard that covers a far greater range of characters than ASCII, including the characters of most of the world's writing systems, symbols, and emojis; its first 128 codes are the same as ASCII
- Unicode assigns a unique number, called a code point, to each character; an encoding such as UTF-8 then stores that code point in binary, using a variable length of one to four bytes per character
- E.g. The Unicode code point for the heart symbol is U+2665, which is stored in the UTF-8 encoding as the three bytes 11100010 10011001 10100101
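- As Unicode encodings can use more than one byte per character (in UTF-8, ASCII characters still take one byte, but all other characters take two to four), text containing many non-ASCII characters can result in larger file sizes and may take longer to process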
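
The examples above can be checked with a short Python sketch. It is only an illustration, not part of any standard: the helper name to_bits is made up for this example, and UTF-8 is assumed as the Unicode encoding.

```python
def to_bits(data: bytes) -> str:
    """Return the bytes as space-separated 8-bit binary strings (hypothetical helper)."""
    return " ".join(format(byte, "08b") for byte in data)


# ASCII: every character has a code in the range 0-127, which fits in 7 bits.
print(ord("A"), format(ord("A"), "07b"))  # 65 1000001
print(ord("?"), format(ord("?"), "07b"))  # 63 0111111

# Stored in an 8-bit byte, the 7-bit code gains a leading zero.
print(to_bits("A".encode("ascii")))       # 01000001

# Unicode: the heart symbol has the code point U+2665.
heart = "\u2665"
print(f"U+{ord(heart):04X}")              # U+2665

# UTF-8 (assumed encoding) stores this code point in three bytes.
print(to_bits(heart.encode("utf-8")))     # 11100010 10011001 10100101

# Size comparison in bytes: 'A' in UTF-8, the heart in UTF-8, 'A' in UTF-32.
print(len("A".encode("utf-8")),
      len(heart.encode("utf-8")),
      len("A".encode("utf-32-be")))       # 1 3 4
```

The last line shows why file size depends on the encoding chosen: UTF-8 keeps ASCII characters at one byte each, while a fixed-width encoding such as UTF-32 spends four bytes on every character.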