10. String and &str in Depth — UTF-8, indexing
Rust strings are always valid UTF-8. `String` is an owned, growable buffer on the heap; `&str` is a borrowed view into UTF-8 bytes that live elsewhere (in a `String`, in a string literal, or in another buffer). This lesson explains why integer indexing is intentionally disallowed and how `.chars()` / `.bytes()` / `.char_indices()` give you explicit control over Unicode iteration.
What you'll learn
- Describe the memory layouts of `String` and `&str`
- Build Strings and extend them with `push_str` / `push` / `+`
- Explain why integer indexing on strings is blocked
- Pick between `.chars()` / `.bytes()` / `.char_indices()`
- Compose strings with the `format!` macro
Overview
If you're used to `s[0]` for the first character, Rust will feel awkward at first, because that doesn't compile. The reason is principled: in UTF-8 a character takes 1 to 4 bytes, so an integer index is ambiguous (bytes or characters?), and finding the n-th character means scanning from the start rather than doing a constant-time lookup. Rust refuses to guess and gives you `.chars()`, `.bytes()`, and `.char_indices()` instead.
Core Concepts
1) String's memory layout
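A `String` wraps a `Vec<u8>`: a heap-allocated, growable byte buffer described by a pointer, a length, and a capacity. The bytes are always valid UTF-8, and the safe methods on `String` preserve that guarantee.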
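To make the layout concrete, here is a minimal sketch that watches length and capacity as bytes are appended. The helper name `build_greeting` is made up for illustration, and the three-word size of the `String` handle is what current compilers produce rather than a documented guarantee.

```rust
/// Hypothetical helper: builds a small String with room reserved up front.
fn build_greeting() -> String {
    // Reserve space for at least 16 bytes so the pushes below need no reallocation.
    let mut s = String::with_capacity(16);
    s.push_str("hello"); // 5 ASCII bytes
    s.push('한');        // 3 more bytes in UTF-8
    s
}

fn main() {
    let s = build_greeting();

    // len() counts bytes in use; capacity() counts bytes allocated.
    println!("len = {}, capacity = {}", s.len(), s.capacity());

    // The handle itself holds pointer + length + capacity.
    // Assumption: three machine words, as observed on current compilers.
    println!("size_of::<String>() = {}", std::mem::size_of::<String>());

    // The underlying storage really is a byte vector.
    let bytes: Vec<u8> = s.into_bytes();
    println!("{:?}", bytes);
}
```

Because `capacity()` may be larger than what was requested, the sketch prints it instead of assuming an exact value.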
2) &str's memory layout
A `&str` is a fat pointer: (data pointer, byte length). It can point into a `String`, at a `'static` string literal, or into any other buffer of valid UTF-8.
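The sketch below borrows `&str` views from two different owners and checks that the reference is two machine words wide. The helper `prefix` is a hypothetical name introduced only for this example.

```rust
/// Hypothetical helper: borrows the first `n` bytes of `s` as a &str view.
/// Returns None if `n` is out of range or would split a character.
fn prefix(s: &str, n: usize) -> Option<&str> {
    s.get(..n)
}

fn main() {
    let owned = String::from("hello world");
    let literal: &'static str = "hello world";

    let from_string: &str = &owned[..5];    // view into the String's heap buffer
    let from_literal: &str = &literal[6..]; // view into the program's static data
    println!("{} {}", from_string, from_literal);

    // Pointer + length: twice the size of a plain pointer.
    println!("size_of::<&str>() = {}", std::mem::size_of::<&str>());

    println!("{:?}", prefix("한글", 3)); // the first character is 3 bytes long
    println!("{:?}", prefix("한글", 1)); // byte 1 is in the middle of '한'
}
```

Neither view copies any text; each one only records where the bytes start and how many there are.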
3) UTF-8 variable width
| Character | Bytes | Example |
|---|---|---|
| ASCII | 1 | 'a' = 0x61 |
| Accented Latin, Greek, Cyrillic | 2 | 'é' = 0xC3 0xA9 |
| Korean / CJK | 3 | '한' = 0xED 0x95 0x9C |
| Some emoji | 4 | '😀' = 0xF0 0x9F 0x98 0x80 |
`s[0]` could mean "first byte" or "first character". Rust does not pick one: `str` does not implement indexing by a single integer, so the expression is a compile error.
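The widths in the table can be checked directly with `char::len_utf8`. This is a small sketch; `utf8_widths` is a hypothetical helper name, not a standard library function.

```rust
/// Hypothetical helper: pairs each char with the number of bytes it takes in UTF-8.
fn utf8_widths(s: &str) -> Vec<(char, usize)> {
    s.chars().map(|c| (c, c.len_utf8())).collect()
}

fn main() {
    // '\u{e9}' is 'é', written as an escape so the code point is unambiguous.
    for (c, width) in utf8_widths("a\u{e9}한😀") {
        println!("{:?} takes {} byte(s)", c, width);
    }

    let s = "한글";
    // let first = s[0]; // does not compile: `str` cannot be indexed by an integer

    // Being explicit about bytes is allowed:
    println!("first byte: {:#04X}", s.as_bytes()[0]); // 0xED

    // Only some byte offsets are character boundaries.
    println!("boundary at 1? {}", s.is_char_boundary(1)); // false
    println!("boundary at 3? {}", s.is_char_boundary(3)); // true
}
```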
4) Iteration choices
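- **.chars()**: yields each `char` (a Unicode scalar value)
- **.bytes()**: yields each raw UTF-8 byte (`u8`)
- **.char_indices()**: yields (byte index, `char`) pairs, where the index is the byte offset at which the character starts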
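The three iterators can be compared side by side on one short string. The helper `byte_offset_of` is a hypothetical name used only to show how the byte index from `.char_indices()` feeds into slicing.

```rust
/// Hypothetical helper: byte offset of the first occurrence of `target`, if any.
fn byte_offset_of(s: &str, target: char) -> Option<usize> {
    s.char_indices()
        .find(|&(_, c)| c == target)
        .map(|(idx, _)| idx)
}

fn main() {
    let s = "hi한";

    let chars: Vec<char> = s.chars().collect();
    let bytes: Vec<u8> = s.bytes().collect();
    let indexed: Vec<(usize, char)> = s.char_indices().collect();

    println!("{:?}", chars);   // ['h', 'i', '한']
    println!("{:?}", bytes);   // [104, 105, 237, 149, 156]
    println!("{:?}", indexed); // [(0, 'h'), (1, 'i'), (2, '한')]

    // An offset produced by char_indices() is always a valid slicing point.
    if let Some(idx) = byte_offset_of(s, '한') {
        println!("{}", &s[idx..]); // 한
    }
}
```

As a rule of thumb, reach for `.chars()` when working with text, `.bytes()` for protocol-level or ASCII-only work, and `.char_indices()` when you need to slice the original string afterwards.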
Hands-on Examples
Building and concatenating strings:
```rust
fn main() {
    let mut s = String::new();
    s.push_str("hello"); // append a &str
    s.push(' ');         // append a single char
    s.push_str("world");
    println!("{}", s); // hello world

    let a = String::from("Hello, ");
    let b = String::from("world!");
    let c = a + &b; // a is moved, b is borrowed
    println!("{}", c);
}
```

Per-character processing:
```rust
fn main() {
    let s = "안녕 hi😀";
    println!("byte length: {}", s.len()); // 13
    println!("char count: {}", s.chars().count()); // 6
    for c in s.chars() {
        print!("[{}]", c);
    }
    println!();
}
```

`format!`, the idiomatic string builder:
```rust
fn main() {
    let name = "Rust";
    let n = 22;
    let msg = format!("{} track {} lessons", name, n);
    println!("{}", msg);
}
```

Common Mistakes
Q. Why can't I grab the first character with s[0]?
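A. `str` does not implement indexing by a single integer, so `s[0]` is rejected at compile time. Use `.chars().next()`, which returns an `Option<char>` (`None` for an empty string); `.chars().nth(0)` is equivalent. For pure ASCII input you can read the first byte with `.as_bytes()[0]`, which panics on an empty string.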
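A sketch of the non-panicking variants, assuming you would rather handle the empty-string case than call `.unwrap()`. Both function names are hypothetical.

```rust
/// Hypothetical helper: first character, or None for an empty string.
fn first_char(s: &str) -> Option<char> {
    s.chars().next()
}

/// Hypothetical helper: first byte, but only if it is an ASCII byte.
fn first_ascii_byte(s: &str) -> Option<u8> {
    s.as_bytes().first().copied().filter(|b| b.is_ascii())
}

fn main() {
    println!("{:?}", first_char("한글")); // Some('한')
    println!("{:?}", first_char(""));     // None

    println!("{:?}", first_ascii_byte("abc")); // Some(97)
    println!("{:?}", first_ascii_byte("한글")); // None: the first byte starts a 3-byte sequence

    // Handling the empty case explicitly instead of unwrapping:
    match first_char("") {
        Some(c) => println!("starts with {}", c),
        None => println!("empty input"),
    }
}
```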
Q. Adding two Strings with + made the first one disappear
A. `+` takes ownership of the left operand (it is moved) and borrows the right one. To keep both alive, use `format!("{}{}", a, b)`, which only borrows its arguments, or clone the left side: `a.clone() + &b`.
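The three options look like this side by side. `join_keep_both` is a hypothetical wrapper added only to give the `format!` variant a name.

```rust
/// Hypothetical helper: concatenates without taking ownership of either input.
fn join_keep_both(a: &str, b: &str) -> String {
    format!("{}{}", a, b)
}

fn main() {
    let a = String::from("Hello, ");
    let b = String::from("world!");

    // Option 1: format! borrows both, so a and b stay usable.
    let c1 = join_keep_both(&a, &b);

    // Option 2: clone the left side, then let `+` consume the clone.
    let c2 = a.clone() + &b;

    // Option 3: plain `+` moves a into the result.
    let c3 = a + &b;
    // println!("{}", a); // does not compile: a was moved on the line above

    println!("{}\n{}\n{}", c1, c2, c3);
    println!("b is still alive: {}", b);
}
```

Plain `+` reuses the left operand's buffer, so it avoids one allocation when you no longer need `a`.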
Q. The length of a Korean string looks wrong
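A. `.len()` is the **byte length**. For the character count use `.chars().count()`. Mixing the two up is a common source of panics in Korean text processing: slicing with a byte range such as `&s[0..4]` panics when the range ends in the middle of a character.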
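One way to avoid that panic is to derive the byte offset from `.char_indices()` rather than guessing it. The sketch below keeps the first `n` characters of a string; `take_chars` is a hypothetical helper name.

```rust
/// Hypothetical helper: returns the prefix of `s` containing at most `n` characters.
fn take_chars(s: &str, n: usize) -> &str {
    // The n-th char's byte index is exactly where the prefix must end.
    match s.char_indices().nth(n) {
        Some((idx, _)) => &s[..idx],
        None => s, // fewer than n + 1 characters: keep everything
    }
}

fn main() {
    let s = "안녕하세요";
    println!("bytes: {}", s.len());           // 15
    println!("chars: {}", s.chars().count()); // 5

    // .get() returns None instead of panicking on a bad boundary.
    println!("{:?}", s.get(0..4)); // None
    println!("{:?}", s.get(0..6)); // Some("안녕")

    // let bad = &s[0..4]; // would panic at runtime: byte 4 is inside '녕'

    println!("{}", take_chars(s, 2)); // 안녕
}
```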
Recap
- String = heap-owned Vec<u8>, &str = fat-pointer view
- UTF-8 variable width means integer indexing is blocked on purpose
- Use .chars() / .bytes() / .char_indices() for explicit iteration
- `format!` is the idiomatic way to build a new String
Try It Yourself
- Read a Korean sentence and print byte length vs. character count
- Write `fn reverse(s: &str) -> String` using `.chars().rev()`
- Split a string on spaces and uppercase the first letter of each word
All lecture materials and example code are openly available on GitHub.