Strings, Character Arrays, and Encoding

C Strings

Before C++ introduced its string data type, C had only character arrays. In C, a string is simply a char array whose last character is 0.

To get the length of a string, you loop through the characters until you hit 0. To add a character to the string, you write it over the terminating 0 and then write a new 0 right after it, provided the array has room for it. It’s all very simple. And if you have a C++ string, you can get a read-only pointer to an equivalent 0-terminated character array using s.c_str().

We bring this up because we want to talk about how characters are represented in computers.

Computers only work with numbers. They know how to copy, add, subtract, and so on. But characters are not numbers. What we really need to do is decide on a mapping from each number to a character. Thankfully, other people have decided that mapping for us. The most important mapping of numbers to characters to learn is called ASCII. There are tables online for ASCII, but memorizing the table isn’t very useful. It’s enough to know that the mapping exists and that computers look up characters in the table to decide what to display on your screen.

By looking up some values in the table, we can learn that 32 maps to the space character (20 in hexadecimal). 10 maps to the new line character (the "enter" character). 65 maps to A. 97 maps to a. In general, ASCII maps the numbers from 0 to 127 to characters, so every ASCII character fits in one byte. The meaning of the numbers from 128 to 255 varies wildly between systems, so we won’t go into that. The following code demonstrates my point about the equivalence of characters and numbers:

  #include <stdio.h>

  int main() {
      /* A character literal is just the number from the ASCII table. */
      if ('A' == 65) {
          printf("'A' is equal to 65\n");
      }
      if ('z' == 122) {
          printf("'z' is equal to 122\n");
      }
      /* %c prints the character that the number 65 maps to. */
      printf("Let's print the %c character\n", 65);
  }

You might notice from this that the 'x' notation for characters is just a convenience that lets the compiler look up the number for you at compile time. There’s not much reason to look up the number yourself when the compiler can do it, but the equivalence of letters and numbers allows you to do arithmetic on characters. The ASCII table was designed with a purpose, so it’s not just jumbled garbage. Digits are consecutive in the table, and so are capital letters and small letters. For example, if you wanted to capitalize letters, you would just need some arithmetic to convert the range 97-122 ('a' to 'z') to 65-90 ('A' to 'Z').

As an aside, not all schemes convert one byte into one character. There are encodings such as UTF-32 that take 4 bytes per character and look up the character in a table called Unicode, which includes Chinese and Japanese characters and even emoji. I won’t go into the details of the different encodings and lookup tables because they’re of no use in programming contests. All that’s important is knowing that each one-byte char in C refers to one character, found by looking it up in the ASCII table.

Sample String Upper Case

To demonstrate what I mean about using ASCII with arithmetic, here is some example code that capitalizes the input line.

  #include <iostream>
  using namespace std;

  int main() {
      // Named "line" rather than "string" to avoid confusion with the C++ string type.
      char line[1000];
      cin.getline(line, 1000);
      for (int i = 0; i < 1000; i++) {
          // Terminating character
          if (line[i] == 0) {
              break;
          }
          // if within the range of 'a' to 'z'
          if ('a' <= line[i] && line[i] <= 'z') {
              // Subtract 'a' to make it 0 to 25, then add 'A' to capitalize.
              line[i] = line[i] - 'a' + 'A';
          }
          // equivalent to the code above
          // if (97 <= line[i] && line[i] <= 122) {
          //     line[i] = line[i] - 97 + 65;
          // }
      }
      cout << line << endl;
  }

Problems