return Is<LETTER>(code);
// Read the bytes of the UTF-8 sequence. Every case deliberately falls
// through to the next one, so entry at case N consumes N + 1 bytes.
uint32_t decoded_symbol = 0;
switch (extra_bytes_to_read) {
  case 5:
    symbol_.chain_[0] = *iter_;
    decoded_symbol += static_cast<uint8_t>(*iter_);
    decoded_symbol <<= 6;
    if (not Next()) {
      return;
    }
    // Fall through.
  case 4:
    symbol_.chain_[extra_bytes_to_read-4] = *iter_;
    decoded_symbol += static_cast<uint8_t>(*iter_);
    decoded_symbol <<= 6;
    if (not Next()) {
      return;
    }
    // Fall through.
  case 3:
    symbol_.chain_[extra_bytes_to_read-3] = *iter_;
    decoded_symbol += static_cast<uint8_t>(*iter_);
    decoded_symbol <<= 6;
    if (not Next()) {
      return;
    }
    // Fall through.
  case 2:
    symbol_.chain_[extra_bytes_to_read-2] = *iter_;
    decoded_symbol += static_cast<uint8_t>(*iter_);
    decoded_symbol <<= 6;
    if (not Next()) {
      return;
    }
    // Fall through.
  case 1:
    symbol_.chain_[extra_bytes_to_read-1] = *iter_;
    decoded_symbol += static_cast<uint8_t>(*iter_);
    decoded_symbol <<= 6;
    if (not Next()) {
      return;
    }
    // Fall through.
  case 0:
    symbol_.chain_[extra_bytes_to_read] = *iter_;
    decoded_symbol += static_cast<uint8_t>(*iter_);
}
// Offsets subtracted after accumulation to strip the UTF-8 marker bits,
// indexed by the number of extra bytes in the sequence.
static const uint32_t OFFSETS_FROM_UTF8[6] = {
  0x00000000UL, 0x00003080UL, 0x000E2080UL, 0x03C82080UL, 0xFA082080UL, 0x82082080UL
};
decoded_symbol -= OFFSETS_FROM_UTF8[extra_bytes_to_read];
symbol_.len_ = extra_bytes_to_read + 1;
// Is the sequence legal?
if (IsLegalUtf8(reinterpret_cast<const uint8_t*>(symbol_.chain_), symbol_.len_)) {
  // Increase the symbol counter only if a correct symbol was extracted.
  symbol_.utf32_ = decoded_symbol;
  ++sym_pos_;
}
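The accumulate-then-subtract trick used above can be tried in isolation. The following is a minimal sketch, not the Strutext iterator itself: the names ExtraBytes and DecodeOne are hypothetical, the input is assumed to be a std::string_view rather than an iterator pair, and only the 1 to 4 byte forms permitted by RFC 3629 are accepted, so the last two offsets from the table above are not needed.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical helper: number of continuation bytes implied by a lead byte,
// or -1 if the byte cannot start a well-formed sequence.
inline int ExtraBytes(uint8_t lead) {
  if (lead < 0x80) return 0;
  if (lead >= 0xC2 && lead <= 0xDF) return 1;
  if (lead >= 0xE0 && lead <= 0xEF) return 2;
  if (lead >= 0xF0 && lead <= 0xF4) return 3;
  return -1;
}

// Hypothetical helper: decodes one code point starting at text[pos].
// On success returns the code point and advances pos; on failure pos is unchanged.
inline std::optional<uint32_t> DecodeOne(std::string_view text, std::size_t& pos) {
  // Same constants as the first four entries of OFFSETS_FROM_UTF8.
  static const uint32_t kOffsets[4] = {0x00000000UL, 0x00003080UL,
                                       0x000E2080UL, 0x03C82080UL};
  // Smallest code point each sequence length may encode (rejects overlong forms).
  static const uint32_t kMinValue[4] = {0x0, 0x80, 0x800, 0x10000};

  if (pos >= text.size()) return std::nullopt;
  const int extra = ExtraBytes(static_cast<uint8_t>(text[pos]));
  if (extra < 0) return std::nullopt;
  if (pos + static_cast<std::size_t>(extra) >= text.size()) return std::nullopt;

  uint32_t value = 0;
  for (int i = 0; i <= extra; ++i) {
    const uint8_t byte = static_cast<uint8_t>(text[pos + static_cast<std::size_t>(i)]);
    // Every byte after the lead must look like 10xxxxxx.
    if (i > 0 && (byte & 0xC0) != 0x80) return std::nullopt;
    value = (value << 6) + byte;
  }
  // The marker bits of all bytes were added in; remove them in one step.
  value -= kOffsets[extra];

  const bool surrogate = value >= 0xD800 && value <= 0xDFFF;
  if (value < kMinValue[extra] || value > 0x10FFFF || surrogate) return std::nullopt;

  pos += static_cast<std::size_t>(extra) + 1;
  return value;
}
```

For the two-byte sequence D1 8F the accumulator ends up at 0x34CF, and subtracting 0x3080 leaves 0x44F, the code point of the Cyrillic letter "я". The loop replaces the fall-through switch purely for brevity; the arithmetic is the same.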
2FA1C;CJK COMPATIBILITY IDEOGRAPH-2FA1C;Lo;0;L;9F3B;;;;N;;;;;
2FA1D;CJK COMPATIBILITY IDEOGRAPH-2FA1D;Lo;0;L;2A600;;;;N;;;;;
E0001;LANGUAGE TAG;Cf;0;BN;;;;;N;;;;;
E0020;TAG SPACE;Cf;0;BN;;;;;N;;;;;
E0021;TAG EXCLAMATION MARK;Cf;0;BN;;;;;N;;;;;
10FFFD;<Plane 16 Private Use, Last>;Co;0;L;;;;;N;;;;;
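The lines above follow the UnicodeData.txt layout: 15 fields separated by semicolons, with the code point in hexadecimal first, then the character name, then the general category (Lo, Cf, Co and so on). As an illustration only, here is a small sketch of reading one such line; the UnicodeRecord struct and the SplitFields, ParseHex and ParseRecord functions are hypothetical names and are not taken from Strutext.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record holding the three fields used in this sketch.
struct UnicodeRecord {
  uint32_t code = 0;      // field 0: code point
  std::string name;       // field 1: character name
  std::string category;   // field 2: general category, e.g. "Lo" or "Cf"
};

// Splits a line on ';', keeping empty fields.
inline std::vector<std::string> SplitFields(const std::string& line) {
  std::vector<std::string> fields;
  std::string::size_type start = 0;
  while (true) {
    const std::string::size_type end = line.find(';', start);
    if (end == std::string::npos) {
      fields.push_back(line.substr(start));
      break;
    }
    fields.push_back(line.substr(start, end - start));
    start = end + 1;
  }
  return fields;
}

// Parses up to six hexadecimal digits into a code point.
inline bool ParseHex(const std::string& text, uint32_t& out) {
  if (text.empty() || text.size() > 6) return false;
  uint32_t value = 0;
  for (char c : text) {
    uint32_t digit;
    if (c >= '0' && c <= '9') digit = static_cast<uint32_t>(c - '0');
    else if (c >= 'A' && c <= 'F') digit = static_cast<uint32_t>(c - 'A') + 10;
    else if (c >= 'a' && c <= 'f') digit = static_cast<uint32_t>(c - 'a') + 10;
    else return false;
    value = value * 16 + digit;
  }
  out = value;
  return true;
}

// Returns false if the line does not have the expected 15 fields
// or the code point is not valid hexadecimal.
inline bool ParseRecord(const std::string& line, UnicodeRecord& out) {
  const std::vector<std::string> fields = SplitFields(line);
  if (fields.size() != 15) return false;
  if (!ParseHex(fields[0], out.code)) return false;
  out.name = fields[1];
  out.category = fields[2];
  return true;
}
```

A classifier along the lines of Is<LETTER>(code) could then be driven by a table built from the category field, although how Strutext actually builds its tables is not shown in this excerpt.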
After converting all of SpiderMonkey’s string code, I had to make Gecko work with Latin1 JS strings and unstable string characters. Gecko has its own TwoByte string types and in many cases it used to avoid copying the JS characters by using a nsDependentString.
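To make the Latin1 versus TwoByte distinction concrete: a string can be kept in one byte per character only if every 16-bit code unit is at most 0xFF. The sketch below illustrates that check with standard C++ strings; FitsLatin1 and DeflateToLatin1 are hypothetical names, and this is not SpiderMonkey or Gecko code.

```cpp
#include <algorithm>
#include <optional>
#include <string>

// True if every UTF-16 code unit is in the Latin1 range U+0000..U+00FF.
inline bool FitsLatin1(const std::u16string& text) {
  return std::all_of(text.begin(), text.end(),
                     [](char16_t unit) { return unit <= 0xFF; });
}

// Narrows a two-byte string to one byte per character when that is lossless.
inline std::optional<std::string> DeflateToLatin1(const std::u16string& text) {
  if (!FitsLatin1(text)) return std::nullopt;
  std::string out;
  out.reserve(text.size());
  for (char16_t unit : text) {
    out.push_back(static_cast<char>(static_cast<unsigned char>(unit)));
  }
  return out;
}
```

Going the other way, from Latin1 to two-byte characters, always requires a copy, which is why code that used to borrow the JS characters directly has to change once strings may be stored as Latin1.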
8.4 The String Type
The String type is the set of all finite ordered sequences of zero or more 16-bit unsigned integer values ("elements"). The String type is generally used to represent textual data in a running ECMAScript program, in which case each element in the String is treated as a code unit value (see Clause 6). Each element is regarded as occupying a position within the sequence. These positions are indexed with nonnegative integers. The first element (if any) is at position 0, the next element (if any) at position 1, and so on. The length of a String is the number of elements (i.e., 16-bit values) within it. The empty String has length zero and therefore contains no elements.
When a String contains actual textual data, each element is considered to be a single UTF-16 code unit. Whether or not this is the actual storage format of a String, the characters within a String are numbered by their initial code unit element position as though they were represented using UTF-16. All operations on Strings (except as otherwise stated) treat them as sequences of undifferentiated 16-bit unsigned integers; they do not ensure the resulting String is in normalised form, nor do they ensure language-sensitive results.
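The practical consequence of counting 16-bit elements is that a character outside the Basic Multilingual Plane contributes two to the length. C++ std::u16string behaves the same way, so it can serve as a model. The sketch below is an illustration under that assumption; CountCodePoints is a hypothetical helper, and it counts an unpaired surrogate as one character.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-16 string. A high surrogate followed
// by a low surrogate is one code point; anything else is counted per unit.
inline std::size_t CountCodePoints(const std::u16string& text) {
  std::size_t count = 0;
  for (std::size_t i = 0; i < text.size(); ++i) {
    const bool high = text[i] >= 0xD800 && text[i] <= 0xDBFF;
    const bool next_low = i + 1 < text.size() &&
                          text[i + 1] >= 0xDC00 && text[i + 1] <= 0xDFFF;
    if (high && next_low) ++i;  // skip the second half of the pair
    ++count;
  }
  return count;
}
```

For the string u"a\U0001D306", size() reports 3 code units, the elements at positions 1 and 2 are the surrogates 0xD834 and 0xDF06, and CountCodePoints reports 2.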
Conclusion
JavaScript engines are free to use UCS-2 or UTF-16 internally. Most engines that I know of use UTF-16, but whatever choice they made, it’s just an implementation detail that won’t affect the language’s characteristics.
The ECMAScript/JavaScript language itself, however, exposes characters according to UCS-2, not UTF-16.
Strutext, a C++ text processing library