As we go into the New Year, the Unicode team thought we’d share
some highlights from this past year. From source-code spoofing to preserving
indigenous languages, the Unicode team has had another full year, including
expanding the number of characters that appear on billions of devices
around the world.
Nearly 150,000 characters!
On the character side, we reached a total just shy of 150,000
characters (149,186 to be exact). Of the 4,489 characters added in the 15.0
release, the biggest set was 4,192 ideographs for use in Chinese, Japanese, and
Korean. There are also two new scripts, Nag Mundari and Kawi. Nag Mundari is a
script used to write the Mundari language of India, a language with 1.1 million
speakers. Kawi is an important historic script of insular Southeast Asia, found
in inscriptions and on artifacts in several languages dating from the 8th to the
16th centuries — and is undergoing a revival today amongst enthusiasts.
And we can’t forget the 20
emoji characters — we’re looking forward to seeing which are the most
popular: shaking face? Goose? Maracas? Pink heart? If you’re involved in
implementing emoji, you’ll also want to look at the latest changes in
UTS #51, Unicode Emoji.
The Launch of ICU4X
ICU is used in every major device and operating system; it’s how
you see a date or number on your phone, for example. This new project,
ICU4X, was created to meet the needs of clients who wish to provide client-side
internationalization for their products in resource-constrained environments and
across many programming languages. After 2½ years of work by Google,
Mozilla, Amazon, and community partners, the Unicode Consortium has published
ICU4X 1.0, its first stable release. Built from the ground up to be lightweight,
portable, and secure, ICU4X learns from decades of experience to bring localized
date formatting, number formatting, collation, text segmentation, and more to
devices that, until now, did not have a suitable solution.
When does i ≠ і?
Can you tell the difference between i and і? Yeah, most people
can’t. The first set of changes to help counter source-code spoofing was
included in the 15.0 versions of
UAX #9, Unicode Bidirectional Algorithm,
UAX #31, Unicode Identifier and Pattern Syntax, and
UTS #39, Unicode Security Mechanisms.
For 2023, there is a new draft
UTS #55, Unicode
Source Code Handling, providing guidance for programming language designers
and tooling developers, and specifying mechanisms to avoid usability and
security issues arising from improper handling of Unicode. More changes are on
their way for UAX #9,
UAX #31, and
UTS #39 as well.
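To make the i versus і problem concrete, here is a minimal Python sketch using only the standard library. It shows that the two letters are different code points, that normalization does not merge them, and that a crude script check can flag an identifier mixing them. The helper names (`describe`, `same_after_nfkc`, `rough_scripts`) are hypothetical, and the script check is only a rough stand-in based on character names; it is not the mechanism specified in UTS #39.

```python
import unicodedata

latin_i = "\u0069"     # i, Latin
cyrillic_i = "\u0456"  # і, Cyrillic

def describe(ch):
    """Return the code point and Unicode character name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

def same_after_nfkc(a, b):
    """Check whether two strings become identical after NFKC normalization."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

def rough_scripts(text):
    """Crude heuristic (an assumption for illustration): treat the first word
    of each letter's character name as its script."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0] for ch in text if ch.isalpha()}

print(describe(latin_i))
print(describe(cyrillic_i))
print(latin_i == cyrillic_i)                  # the strings differ despite looking alike
print(same_after_nfkc(latin_i, cyrillic_i))   # normalization does not unify them
print(rough_scripts("f\u0456le"))             # an identifier that mixes two scripts
```

Because normalization alone does not catch this kind of lookalike, tooling needs dedicated checks of the sort the specifications above describe.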
Åge Møller, Πέτρος Νικόλαος Καρατζής, ராஜேந்திர சோழன்
We’re making great progress on internationalized formatting of
people’s names. What does that mean? Software needs to be able to format
people’s names, such as John Smith or 宮崎駿. The formatting can be surprisingly
complicated: for example, people may have a different number of names, depending
on their culture — they might have only one name (“Zendaya”), only two (“Albert
Einstein”), or three or more. So the software needs to handle missing or extra
name fields gracefully.
There are many more complexities — for more details, see
Formatting people’s names.
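As a rough illustration of handling missing or extra name fields gracefully, the Python sketch below skips any field that is absent instead of leaving stray spaces. The `PersonName` fields, the `format_name` function, and its `surname_first` and `separator` parameters are all hypothetical names chosen for this example; this is not the CLDR person name formatting algorithm, which covers far more cases.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PersonName:
    # Hypothetical field names; any of them may be missing.
    given: Optional[str] = None
    given2: Optional[str] = None
    surname: Optional[str] = None

def format_name(name, surname_first=False, separator=" "):
    """Join whichever fields are present, in the requested order."""
    if surname_first:
        parts = [name.surname, name.given, name.given2]
    else:
        parts = [name.given, name.given2, name.surname]
    # Drop missing fields so a one-name person gets no stray separators.
    return separator.join(p for p in parts if p)

print(format_name(PersonName(given="Zendaya")))
print(format_name(PersonName(given="Albert", surname="Einstein")))
print(format_name(PersonName(given="Πέτρος", given2="Νικόλαος", surname="Καρατζής")))
print(format_name(PersonName(given="駿", surname="宮崎"), surname_first=True, separator=""))
```

Even this small sketch needs per-locale choices for field order and separators, which hints at why a shared, data-driven approach is useful.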
You have 2 unread messages.
Or, you have 3 items in your cart. Whenever a computer needs to
construct a sentence using “placeholders” such as 3, it is formatting a message.
The current industry standard is ICU’s message formatting. A project started
3 years ago with the goal of improving on it to build a more robust and
extensible mechanism. There is now a Tech Preview in ICU — we’d urge developers
to try it out!
See message-format-wg for details on the syntax,
message2/package-summary.html for the API (note that ICU’s convention
for tech previews is to mark them as Deprecated), and the test code in
MessageFormat2Test.java for examples of usage.
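To show what formatting a message with a placeholder involves, here is a small Python sketch that picks a message variant by plural category and then substitutes the number. It is not the ICU API and does not use the new message syntax; `MESSAGES`, `plural_category_en`, and `format_message` are hypothetical names, and the plural rule shown is a simplified English-only rule for whole numbers.

```python
# Hypothetical message catalog: each message has one variant per plural category.
MESSAGES = {
    "unread": {
        "one": "You have {count} unread message.",
        "other": "You have {count} unread messages.",
    },
    "cart": {
        "one": "You have {count} item in your cart.",
        "other": "You have {count} items in your cart.",
    },
}

def plural_category_en(count):
    """Simplified English rule for whole numbers: 1 is 'one', everything else 'other'."""
    return "one" if count == 1 else "other"

def format_message(key, count):
    """Select the variant for this count, then fill in the placeholder."""
    variants = MESSAGES[key]
    pattern = variants.get(plural_category_en(count), variants["other"])
    return pattern.format(count=count)

print(format_message("unread", 2))
print(format_message("unread", 1))
print(format_message("cart", 3))
```

Other languages have more plural categories than English, so hard-coding two variants like this does not generalize. That gap is one reason a shared, extensible message formatting mechanism is valuable.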
Māori, Wolof, тоҷикӣ, کٲشُر, ትግርኛ, कॉशुर, মৈতৈলোন্, ᱥᱟᱱᱛᱟᱲᱤ
In CLDR, we now have 95 languages at the Modern level (suitable for
full UI internationalization), 6 at the Moderate level (suitable for “document
content” internationalization), and 29 at the Basic level (suitable for locale
selection). We added a tech preview of formatting for person names, plus
additions for Unicode 15.0 (emoji names and search keywords), names for new
scripts, new CJK collation, and so on.
Revitalization and Preservation of Indigenous Languages
The Nattilik language community was unable to use their language
reliably for even simple, everyday digital text exchanges such as email or text
messaging. The Typotheque Syllabics Project, an initiative based out of Toronto
and The Hague, Netherlands, undertook research with language keepers across
various Syllabics-using Indigenous communities in Canada. By collaborating with
Nattilik language keepers and elders in the community, the project identified
key issues faced by the Nattilik community of Western Nunavut and discovered
that 12 syllabic characters were missing from the Unicode Standard. The
Consortium worked with the Typotheque Syllabics Project to add 16 characters to
the script to support Nattilik and other languages in Unicode version
14.0, and improved the glyphs in Unicode version 15.0. See
this blog post from June.
The Past and Future of Flag Emoji
Despite being the largest emoji category with a strong association
tied to identity, flags are
by far the least used.
Flag emoji have always been subject to special criteria due to their open-ended
nature, infrequent use, and burden on implementations. The addition of other
flags and thousands of valid sequences into the Unicode Standard has not
resulted in wider adoption. Flags don’t stand still: they are constantly
evolving, and because the set is open-ended, adding one creates exclusivity
at the expense of others. Curious to learn more?
Read more about the Past and Future of Flag Emoji.
Available Now! New YouTube Playlist and Technical Quick Start Guide
On September 28th, Unicode held a webinar on the “Overview of
Internationalization and Unicode Projects” for Unicode enthusiasts. Unicode
technical leadership and other experts shared background on our core projects
with participants from more than 30 countries. If you missed the webinar,
no worries! The recorded sessions are available on this
YouTube playlist. And if you are new to Unicode and internationalization or
simply want a refresher, you can also check out our Technical Quick
Start Guide. This handy guide explains what Unicode is, including answering
the question, “What is Internationalization and Why it Matters.” There are
also useful links to more detailed information and how you can get involved.
Support Unicode 💞💕💌💯✨🌟🤠🎁
Finally, if you are already a contributor to, or
member of, Unicode (or
your company or organization is!), thank you, Danke, Děkuju, धन्यवाद,
merci, 谢谢你, grazie, நன்றி, and gracias! What we have accomplished is only
possible because of supporters like you.
And if you want to support Unicode’s mission to ensure everyone can communicate
in their languages across all devices, please consider
adopting a character, making a gift
of stock, or making a
donation. As Unicode is a US-based nonprofit 501(c)(3) organization, your
contribution may be eligible for a tax deduction. Please consult with a tax
advisor for details.