Emoji for Developers: Unicode, Regex & Code Points

If you’ve ever tried to count the characters in a string containing emoji and gotten a wrong number, you know emoji are deceptively complicated under the hood. This guide covers what developers need to know: code points, surrogate pairs, regex patterns, HTML entities, and the weird world of ZWJ sequences.

How Emoji Work in Unicode

Every emoji is one or more Unicode code points. The simplest emoji are single code points in the range U+1F600 to U+1F64F (emoticons), U+1F300 to U+1F5FF (miscellaneous symbols and pictographs), and a few other blocks. But that’s just the starting point.

A "simple" emoji like 😀 is a single code point: U+1F600. Easy enough. But 👍🏽 is actually two code points: U+1F44D (thumbs up) + U+1F3FD (medium skin tone modifier). And 👨‍💻 (man technologist) is three code points, two emoji joined by a Zero Width Joiner: U+1F468 + U+200D + U+1F4BB.
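
To see those breakdowns for yourself, here is a minimal sketch that lists the code points in a string. The helper name codePoints is made up for illustration, and the emoji are written as \u{...} escapes so nothing depends on how your editor saves the file.

```javascript
// Hypothetical helper: list the code points of a string as "U+XXXX" labels.
// Array.from iterates a string by code point, not by UTF-16 code unit.
function codePoints(str) {
  return Array.from(str, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

const grinning = "\u{1F600}";
const thumbsMedium = "\u{1F44D}\u{1F3FD}";
const manTechnologist = "\u{1F468}\u{200D}\u{1F4BB}";

console.log(codePoints(grinning));        // ["U+1F600"]
console.log(codePoints(thumbsMedium));    // ["U+1F44D", "U+1F3FD"]
console.log(codePoints(manTechnologist)); // ["U+1F468", "U+200D", "U+1F4BB"]
```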

As of Unicode 16.0, there are nearly 3,800 emoji including all skin tone and gender variations. Most of them are not single code points but sequences built from a smaller set of base characters plus modifiers and joiners.

Code Points and Surrogate Pairs in JavaScript

JavaScript strings use UTF-16 encoding internally. Most emoji have code points above U+FFFF, which means they need two 16-bit code units (a surrogate pair) to represent. This is where things get tricky:

// The length trap
"😀".length          // 2 (not 1!)
"👨‍💻".length          // 5 (not 1!)

// What's actually happening
"😀".charCodeAt(0)   // 55357 (high surrogate: 0xD83D)
"😀".charCodeAt(1)   // 56832 (low surrogate: 0xDE00)

// Getting the real code point
"😀".codePointAt(0)  // 128512 (0x1F600)

// Counting code points instead of code units
[..."😀"].length     // 1 - spread uses the string iterator, which keeps surrogate pairs together
[..."👨‍💻"].length     // 3 - code points, not grapheme clusters, so ZWJ sequences still overcount

// The proper way: Intl.Segmenter (supported in all modern browsers)
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
[...segmenter.segment("👨‍💻")].length  // 1 - correct!

Intl.Segmenter is the only built-in API that reliably counts user-perceived characters (grapheme clusters) in JavaScript. Spreading a string counts code points, so it works for single-code-point emoji but overcounts ZWJ sequences, flag sequences, and modified emoji. If you’re doing anything with string length that involves user input, reach for Intl.Segmenter.
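
Here is a small sketch that puts the three ways of counting side by side for a flag, a skin-toned emoji, and a ZWJ sequence. The helper name countGraphemes is hypothetical, and the results assume a runtime with full ICU data (current browsers and Node.js builds ship it).

```javascript
// One segmenter can be reused across calls; the locale is left to the runtime default.
const graphemeSegmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Hypothetical helper: count user-perceived characters (grapheme clusters).
function countGraphemes(str) {
  let count = 0;
  for (const _ of graphemeSegmenter.segment(str)) count++;
  return count;
}

const samples = {
  flag: "\u{1F1EF}\u{1F1F5}",                  // two regional indicators
  thumbs: "\u{1F44D}\u{1F3FD}",                // base emoji + skin tone modifier
  technologist: "\u{1F468}\u{200D}\u{1F4BB}",  // ZWJ sequence
};

for (const [name, s] of Object.entries(samples)) {
  // code units, code points, grapheme clusters
  console.log(name, s.length, [...s].length, countGraphemes(s));
}
// flag 4 2 1
// thumbs 4 2 1
// technologist 5 3 1
```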

Emoji Regex Patterns

Matching emoji with regex is notoriously hard because emoji aren’t in a single contiguous range. Here are patterns that work:

JavaScript (ES2018+)

// Using Unicode property escapes (best approach)
const emojiRegex = /\p{Emoji_Presentation}/gu;

// Approximate match for emoji with modifiers and ZWJ sequences
// (note that \p{Emoji} also matches digits, # and *)
const fullEmojiRegex = /\p{Emoji}(\p{Emoji_Modifier}|\uFE0F|\u200D\p{Emoji})*/gu;

// Usage
"Hello 🌍! How are you 😊?".match(emojiRegex);
// ["🌍", "😊"]

// Check if a string contains any emoji
function hasEmoji(str) {
  return /\p{Emoji_Presentation}/u.test(str);
}

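// Remove emoji from a string.
// \p{Extended_Pictographic} is used instead of \p{Emoji} so that digits,
// # and * are left alone; \p{Regional_Indicator} covers flags.
function stripEmoji(str) {
  return str.replace(
    /(?:\p{Extended_Pictographic}|\p{Regional_Indicator})(?:\p{Emoji_Modifier}|\uFE0F|\u200D\p{Extended_Pictographic})*/gu,
    ""
  ).trim();
}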

stripEmoji("Hello 🌍! 👋");  // "Hello !"

The u flag enables Unicode mode, and \p{Emoji_Presentation} matches characters that are displayed as emoji by default. The g flag gives you all matches, not just the first.

Fair warning: \p{Emoji} by itself matches some characters you might not expect, like digits (0-9) and # and *. That’s because these have emoji representations (0️⃣, #️⃣). Use \p{Emoji_Presentation} if you only want characters that are emoji by default.
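
A quick way to convince yourself of that difference is to run both properties over the same text. This is only a sketch with a made-up sample string:

```javascript
// Sample text with digits, a number sign, and one emoji (U+1F600).
const sampleText = "Room 42 #1 \u{1F600}";

const byEmoji = sampleText.match(/\p{Emoji}/gu);
const byPresentation = sampleText.match(/\p{Emoji_Presentation}/gu);

console.log(byEmoji);        // ["4", "2", "#", "1", "😀"]
console.log(byPresentation); // ["😀"]
```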

Python

# Python's built-in re module doesn't support \p{Emoji}.
# Use the third-party 'regex' package (pip install regex) for Unicode properties:
import regex

# Match emoji with the regex package
emoji_pattern = regex.compile(r'\p{Emoji_Presentation}+', regex.UNICODE)
matches = emoji_pattern.findall("Hello 🌍! 😊")
# ['🌍', '😊']

# Without the regex package, use explicit ranges:
import re
emoji_re = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & maps
    "\U0001F1E0-\U0001F1FF"  # flags
    "\U00002702-\U000027B0"  # dingbats
    "\U0001F900-\U0001F9FF"  # supplemental symbols
    "\U0001FA00-\U0001FA6F"  # chess symbols
    "\U0001FA70-\U0001FAFF"  # symbols extended-A
    "\U00002600-\U000026FF"  # misc symbols
    "]+",
    re.UNICODE,
)
emoji_re.findall("Hello 🌍! 😊")
# ['🌍', '😊']

The regex package (not the built-in re) is the cleanest approach in Python. It supports Unicode property escapes similar to JavaScript. If you can’t install third-party packages, the explicit range approach works but needs updating as Unicode adds new emoji.

HTML Entities for Emoji

You can insert emoji in HTML using numeric character references. This is useful when you don’t want to include the literal emoji character in your source code, or when your editor or build tool might mangle Unicode:

<!-- Decimal reference -->
<p>&#128512;</p>  <!-- 😀 Grinning Face -->

<!-- Hexadecimal reference -->
<p>&#x1F600;</p>  <!-- 😀 Grinning Face -->
<p>&#x1F525;</p>  <!-- 🔥 Fire -->
<p>&#x2764;&#xFE0F;</p>   <!-- ❤️ Red Heart (heart + variation selector 16) -->

<!-- In practice, just use the UTF-8 character directly -->
<p>😀 🔥 ❤️</p>  <!-- This is fine if your charset is UTF-8 -->

<!-- Make sure your HTML declares UTF-8 -->
<meta charset="UTF-8">

In 2026, there’s rarely a reason to use HTML entities for emoji. Just make sure your document is served as UTF-8 (it almost certainly is) and paste the emoji directly. The entity approach is useful in edge cases: RSS feeds, email HTML, or any context where encoding might get mangled.
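
If you do land in one of those edge cases, the conversion is easy to script. The sketch below uses a hypothetical helper, toNumericReferences, that rewrites every non-ASCII code point as a hexadecimal reference. It assumes well-formed input (no lone surrogates) and does not escape &, < or >, so it is not a substitute for real HTML escaping.

```javascript
// Hypothetical helper: replace each non-ASCII code point with a
// hexadecimal numeric character reference such as &#x1F600;.
function toNumericReferences(str) {
  return Array.from(str, (ch) => {
    const cp = ch.codePointAt(0);
    return cp > 0x7f ? `&#x${cp.toString(16).toUpperCase()};` : ch;
  }).join("");
}

console.log(toNumericReferences("Hot \u{1F525}"));    // "Hot &#x1F525;"
console.log(toNumericReferences("\u{2764}\u{FE0F}")); // "&#x2764;&#xFE0F;"
```

Because it works per code point, sequences come out as several references in a row, which browsers put back together when they render the page.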

CSS Content Property

You can use emoji in CSS content property values for pseudo-elements. This is handy for decorative emoji that shouldn’t be in the HTML:

/* Using the emoji directly */
.warning::before {
  content: "⚠️ ";
}

/* Using Unicode escape */
.fire::before {
  content: "\1F525  ";  /* 🔥 plus a space; the first space only ends the escape */
}

/* Combining with text */
.new-badge::after {
  content: " \1F195";  /* 🆕 */
}

/* Code points above U+FFFF need no surrogate pairs in CSS */
.rocket::before {
  content: "\1F680";  /* 🚀 - the escape is the code point itself */
}

/* Practical example: custom list markers */
ul.emoji-list {
  list-style: none;
  padding-left: 1.5em;
}
ul.emoji-list li::before {
  content: "\1F44D";  /* 👍 */
  margin-right: 0.5em;
}

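In CSS, you write \1F600 (no "U+" prefix, no "0x" prefix); the escape is just the hex code point value. A single space directly after the escape is treated as its terminator and is not output, so add a second space if you want one to show. This is different from JavaScript’s \u{1F600} or HTML’s &#x1F600;. Each format has its own syntax for the same underlying code point.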
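
To keep the formats straight, here is a small sketch that prints all of them for one code point. The function name escapesFor and its property names are invented for this example.

```javascript
// Hypothetical helper: the same code point written in each syntax.
function escapesFor(codePoint) {
  const hex = codePoint.toString(16).toUpperCase();
  return {
    css: `\\${hex}`,               // \1F600
    js: `\\u{${hex}}`,             // \u{1F600}
    htmlHex: `&#x${hex};`,         // &#x1F600;
    htmlDecimal: `&#${codePoint};` // &#128512;
  };
}

console.log(escapesFor(0x1F600));
```

The CSS form is the only one without a closing delimiter, which is why a trailing space (or padding to six hex digits) matters when more text follows the escape.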

ZWJ Sequences Explained

Zero Width Joiner (U+200D) is a special character that glues emoji together to form combined characters. This is how we get emoji like 👩‍🚀 (woman astronaut), 🏳️‍🌈 (rainbow flag), and 👨‍👩‍👧‍👦 (family).

Here’s how ZWJ sequences are constructed:

// 👩‍🚀 Woman Astronaut = Woman + ZWJ + Rocket
// U+1F469 + U+200D + U+1F680
"\u{1F469}\u{200D}\u{1F680}"  // "👩‍🚀"

// 👨‍💻 Man Technologist = Man + ZWJ + Laptop
// U+1F468 + U+200D + U+1F4BB
"\u{1F468}\u{200D}\u{1F4BB}"  // "👨‍💻"

// 🏳️‍🌈 Rainbow Flag = White Flag + VS16 + ZWJ + Rainbow
// U+1F3F3 + U+FE0F + U+200D + U+1F308

// 👨‍👩‍👧‍👦 Family = Man + ZWJ + Woman + ZWJ + Girl + ZWJ + Boy
// U+1F468 + U+200D + U+1F469 + U+200D + U+1F467 + U+200D + U+1F466

// Check the actual length of these "single" emoji:
"👩‍🚀".length          // 5 in JavaScript (UTF-16 code units)
[..."👩‍🚀"].length      // 3 (code points, still not 1)
// Only Intl.Segmenter gives you 1

The important thing to understand: if a platform doesn’t support a ZWJ sequence, it’ll just display the component emoji side by side. So 👩‍🚀 would render as 👩🚀 on older systems. The ZWJ is gracefully ignored. This is by design, so new ZWJ emoji degrade naturally on older platforms instead of showing a broken character box.
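
That fallback is easy to reproduce in code: take the joiners out and you are left with the component emoji. A minimal sketch, with zwjComponents as a made-up helper name:

```javascript
const ZWJ = "\u200D";

// Hypothetical helper: split a ZWJ sequence into its component emoji.
function zwjComponents(sequence) {
  return sequence.split(ZWJ);
}

// Family: man, woman, girl, boy joined by three ZWJs.
const family = ["\u{1F468}", "\u{1F469}", "\u{1F467}", "\u{1F466}"].join(ZWJ);
const rainbowFlag = "\u{1F3F3}\u{FE0F}\u{200D}\u{1F308}";

console.log(family.length);              // 11 UTF-16 code units
console.log([...family].length);         // 7 code points
console.log(zwjComponents(family));      // 4 separate emoji
console.log(zwjComponents(rainbowFlag)); // white flag (with VS16) and rainbow

// What a platform without support for the sequence effectively shows:
console.log(zwjComponents(family).join(""));
```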

Variation Selectors: Text vs. Emoji Rendering

Some characters can be displayed as either text or emoji. The heart (❤), for example, has both a text form (❤) and an emoji form (❤️). The difference? A variation selector:

// U+FE0E = Variation Selector 15 (text presentation)
// U+FE0F = Variation Selector 16 (emoji presentation)

"❤"       // U+2764 - default presentation varies by platform
"❤️"      // U+2764 + U+FE0F - explicitly emoji presentation
"❤︎"      // U+2764 + U+FE0E - explicitly text presentation

// Characters with dual presentation include:
// ☺ ☹ ☠ ♠ ♣ ♥ ♦ ♨ ♻ ⚠ ⚡ ⚽ ⚾ ✈ ✉ ✏ ❤ ❣ ✂ and many more

// In your code, be aware that "❤️" is 2 code points, not 1
"❤️".length        // 2 (heart + variation selector)
"❤".length         // 1

This matters for database storage, string comparison, and display consistency. If your app stores emoji, normalize them to a consistent form. Otherwise you might have the same visual emoji stored as different byte sequences.
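
One thing that catches people out: Unicode normalization (NFC and friends) leaves variation selectors alone, so it will not make these forms equal. The sketch below shows one possible approach, a hypothetical comparisonKey helper that strips both selectors. Treat it as a key for lookups and deduplication only; storing the stripped string would change how keycaps and some ZWJ sequences render.

```javascript
const VARIATION_SELECTORS = /[\uFE0E\uFE0F]/g;

// Hypothetical helper: a key for comparing strings that ignores
// text/emoji presentation selectors. Not meant for display.
function comparisonKey(str) {
  return str.normalize("NFC").replace(VARIATION_SELECTORS, "");
}

const plainHeart = "\u2764";
const emojiHeart = "\u2764\uFE0F";
const textHeart = "\u2764\uFE0E";

console.log(plainHeart === emojiHeart);                  // false
console.log(emojiHeart.normalize("NFC") === plainHeart); // false, NFC keeps U+FE0F
console.log(comparisonKey(emojiHeart) === comparisonKey(textHeart)); // true
```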

Practical Tips for Handling Emoji in Production

Database: Use utf8mb4 in MySQL, not utf8. MySQL’s utf8 only supports characters of up to 3 bytes, and most emoji need 4 bytes. PostgreSQL stores emoji correctly as long as the database encoding is UTF8.
String truncation: Never truncate a string in the middle of a surrogate pair or ZWJ sequence. Use Intl.Segmenter to find safe truncation points.
Input validation: If you’re validating that a field contains "only letters," decide whether emoji should pass. /^\p{Letter}+$/u excludes emoji. /^[\p{Letter}\p{Emoji_Presentation}]+$/u allows default-presentation emoji, but still rejects ZWJ sequences and anything carrying a variation selector.
URL encoding: Emoji in URLs get percent-encoded. 😀 becomes %F0%9F%98%80. This is the UTF-8 byte sequence for U+1F600. Most frameworks handle this automatically, but be aware of it for manual URL construction.
JSON: Emoji in JSON can be represented as-is (if the file is UTF-8) or as escape sequences: \uD83D\uDE00 for 😀. JSON.stringify in JavaScript preserves emoji characters directly.
Rendering: Don’t assume emoji are a fixed width. They render differently across operating systems, browsers, and fonts. A string with emoji can visually span different widths on Windows vs. macOS vs. Android.
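
A few of these tips are quick to check in code. The sketch below shows a grapheme-safe truncation (truncateGraphemes is a hypothetical helper, not a library function) next to what a naive slice does, plus the URL and JSON forms mentioned above.

```javascript
// Hypothetical helper: keep at most `max` grapheme clusters.
function truncateGraphemes(str, max) {
  const seg = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let out = "";
  let count = 0;
  for (const { segment } of seg.segment(str)) {
    if (count === max) break;
    out += segment;
    count++;
  }
  return out;
}

const greeting = "Hi \u{1F468}\u{200D}\u{1F4BB}!";

// Naive slice cuts the surrogate pair in half and leaves a lone surrogate.
console.log(greeting.slice(0, 4) === "Hi \uD83D"); // true
// Grapheme-aware truncation keeps the whole ZWJ sequence.
console.log(truncateGraphemes(greeting, 4));       // "Hi 👨‍💻"

// URL encoding uses the UTF-8 bytes of the code point.
console.log(encodeURIComponent("\u{1F600}"));      // "%F0%9F%98%80"

// JSON accepts the surrogate-pair escape and the literal character alike.
console.log(JSON.parse('"\\uD83D\\uDE00"') === "\u{1F600}"); // true
console.log(JSON.stringify("\u{1F600}"));          // "😀" wrapped in double quotes
```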

Useful Libraries and Tools

Instead of reinventing emoji handling, consider these battle-tested tools:

emoji-regex (npm): A maintained regex pattern that matches all Unicode emoji. Updated with each Unicode release. Way more reliable than writing your own.
grapheme-splitter or Intl.Segmenter: For accurate character counting. Intl.Segmenter is native and preferred now that browser support is universal.
twemoji (Twitter/X): Renders emoji as consistent SVG/PNG images across all platforms. Useful if you need pixel-perfect emoji that look the same everywhere.
unicode-emoji-json (npm): A comprehensive JSON dataset of all emoji with names, categories, groups, and code points. Great for building custom emoji pickers.
