Discussion:
Unicode strings and invalid codepoints
Chris Angelico
2015-03-09 02:52:14 UTC
There's a bit of a discussion happening on python-list about whether
or not it should be legal to have codepoints like U+D800 in Unicode
strings. Currently, both Python and Pike permit them, but reject them
if you try to, for example, convert to UTF-8. But a suggestion has
been made that the mere presence of \uD800 in a string literal should
be a syntax error, and I'm wondering: Has anyone considered and
rejected this, or is it simply something that nobody's thought to
disallow?

Are there situations in which it's necessary to be able to store these
kinds of noncharacters in a string?

ChrisA
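
The Python half of the behaviour described above can be reproduced with a short sketch. It assumes CPython 3; the helper name `try_utf8` is invented for this illustration and is not part of any library.

```python
from typing import Optional


def try_utf8(text: str) -> Optional[bytes]:
    """Return the strict UTF-8 encoding of text, or None if it is refused."""
    try:
        return text.encode("utf-8")
    except UnicodeEncodeError:
        return None


# A lone surrogate is accepted in a str literal and stored as one codepoint.
lone = "\ud800"
print(len(lone), hex(ord(lone)))

# Strict UTF-8 encoding refuses it, so the helper returns None.
print(try_utf8(lone))

# An ordinary string encodes normally.
print(try_utf8("caf\u00e9"))
```

The literal itself is accepted; the error only appears at the encoding boundary, which is the point under discussion.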
Martin Bähr
2015-03-09 08:08:39 UTC
Post by Chris Angelico
Are there situations in which it's necessary to be able to store these
kinds of noncharacters in a string?
strings may be just binary data.
why should binary data in a string literal be an error?

greetings, martin.
--
eKita - the online platform for your entire academic life
--
chief engineer eKita.co
pike programmer pike.lysator.liu.se caudium.net societyserver.org
secretary beijinglug.org
mentor fossasia.org
foresight developer foresightlinux.org realss.com
unix sysadmin
Martin Bähr working in china http://societyserver.org/mbaehr/
Chris Angelico
2015-03-09 08:11:00 UTC
Post by Martin Bähr
Post by Chris Angelico
Are there situations in which it's necessary to be able to store these
kinds of noncharacters in a string?
strings may be just binary data.
why should binary data in a string literal be an error?
greetings, martin.
Indeed, so it's not safe to disallow anything 00-FF; but D800-DFFF
can't be binary data.

ChrisA
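
The distinction between 00-FF and D800-DFFF can be sketched in Python, where the Latin-1 codec maps each byte value to the codepoint with the same number. This is only an analogy to Pike's 8-bit strings; the function names are made up for the example.

```python
def bytes_to_str(data: bytes) -> str:
    """Map each byte 0x00-0xFF to the codepoint with the same number."""
    return data.decode("latin-1")


def str_to_bytes(text: str) -> bytes:
    """Inverse mapping; raises UnicodeEncodeError above U+00FF."""
    return text.encode("latin-1")


# Every possible byte value survives the round trip through a string.
payload = bytes(range(256))
assert str_to_bytes(bytes_to_str(payload)) == payload

# A surrogate codepoint does not correspond to any single byte.
try:
    str_to_bytes("\ud800")
except UnicodeEncodeError as exc:
    print("not byte data:", exc.reason)
```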
Robert J Budzynski
2015-03-09 11:09:02 UTC
Post by Martin Bähr
Post by Chris Angelico
Are there situations in which it's necessary to be able to store these
kinds of noncharacters in a string?
strings may be just binary data.
why should binary data in a string literal be an error?
greetings, martin.
Python (v3) has different datatypes for (Unicode) strings - sequences of
characters, and byte strings. I don't think anyone has suggested that
any sort of binary data in a (byte) string should be an error.
--
RJ Budzyński
Arne Goedeke
2015-03-10 10:09:39 UTC
I think it could be a compat problem: someone might be using wide
strings for non-Unicode data (e.g. as an efficient way to store
integers for a bitmask). The other issue is that implementing this
correctly would require checking all chars when hashing the string,
which would probably make it rather slow for wide strings.

What's the reasoning for restricting strings in Python? What happens when
a new language is added to Unicode and the next Pike release is still a
decade away?

arne
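
The cost concern about checking every character can be illustrated with a minimal scan, written here in Python rather than Pike. `first_surrogate` is a hypothetical helper, not an existing API; the point is only that a validity check has to visit each character of a string that turns out to be clean.

```python
def first_surrogate(text: str) -> int:
    """Return the index of the first codepoint in U+D800..U+DFFF, or -1."""
    for index, char in enumerate(text):
        if 0xD800 <= ord(char) <= 0xDFFF:
            return index
    return -1


# A clean string forces a full pass before the answer is known.
print(first_surrogate("x" * 1000))

# A string carrying a lone surrogate is caught where it occurs.
print(first_surrogate("abc\udd00def"))
```

A check like this is linear in the string length, so doing it on every string creation or hash would add a pass over every wide string.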
Chris Angelico
2015-03-10 10:44:46 UTC
Post by Arne Goedeke
I think it could be a compat problem: someone might be using wide
strings for non-Unicode data (e.g. as an efficient way to store
integers for a bitmask). The other issue is that implementing this
correctly would require checking all chars when hashing the string,
which would probably make it rather slow for wide strings.
Fair enough. I'm more looking for philosophical arguments; backward
compatibility is, of course, arguing for no change.
Post by Arne Goedeke
What's the reasoning for restricting strings in Python? What happens when
a new language is added to Unicode and the next Pike release is still a
decade away?
Nothing there; unallocated codepoints would be perfectly acceptable,
to ensure forward compatibility. It's only the blocks that are
strictly disallowed (such as surrogates) which would be forbidden.

There's currently no plan to actually make this restriction, just some
broad discussion about concepts. For instance, both Python and Pike
reject an attempt to UTF-8 encode the string "\uDD00"; but the
question was raised, is that string actually itself the problem?
Should the error have been raised earlier? Hence this query.

ChrisA
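
The difference between "unallocated" and "strictly disallowed" can be sketched in Python. The assumption here is that U+0378 stands in for an unassigned codepoint (it was unassigned in the Unicode versions this sketch was written against); the helper names are illustrative only.

```python
import unicodedata


def is_surrogate(char: str) -> bool:
    """True if the character has general category Cs (surrogate)."""
    return unicodedata.category(char) == "Cs"


def utf8_encodable(char: str) -> bool:
    """True if strict UTF-8 encoding accepts the character."""
    try:
        char.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


for char in ("\u0378", "\udd00"):
    # Print the codepoint, its general category, and whether UTF-8 accepts it.
    print(f"U+{ord(char):04X}", unicodedata.category(char), utf8_encodable(char))
```

UTF-8 encoding does not depend on whether a codepoint has been assigned a character, so a rule that forbids only surrogates would not need updating when Unicode grows.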
Henrik Grubbström
2015-03-16 17:30:17 UTC
Post by Chris Angelico
There's a bit of a discussion happening on python-list about whether
or not it should be legal to have codepoints like U+D800 in Unicode
strings. Currently, both Python and Pike permit them, but reject them
if you try to, for example, convert to UTF-8. But a suggestion has
been made that the mere presence of \uD800 in a string literal should
be a syntax error, and I'm wondering: Has anyone considered and
rejected this, or is it simply something that nobody's thought to
disallow?
Well, as U+D800 is legal in UTF-16 strings (which BTW is why it is
reserved), I don't see a reason to prohibit it.
Post by Chris Angelico
Are there situations in which it's necessary to be able to store these
kinds of noncharacters in a string?
Yes.
--
Henrik Grubbström ***@roxen.com
Roxen Internet Software AB
Chris Angelico
2015-03-16 21:11:35 UTC
Post by Henrik Grubbström
Post by Chris Angelico
There's a bit of a discussion happening on python-list about whether
or not it should be legal to have codepoints like U+D800 in Unicode
strings. Currently, both Python and Pike permit them, but reject them
if you try to, for example, convert to UTF-8. But a suggestion has
been made that the mere presence of \uD800 in a string literal should
be a syntax error, and I'm wondering: Has anyone considered and
rejected this, or is it simply something that nobody's thought to
disallow?
Well, as U+D800 is legal in UTF-16 strings (which BTW is why it is
reserved), I don't see a reason to prohibit it.
Post by Chris Angelico
Are there situations in which it's necessary to be able to store these
kinds of noncharacters in a string?
Yes.
> string_to_unicode("\U00012345");
(1) Result: "\330\b\337E"
> String.string2hex(string_to_unicode("\U00012345"));
(2) Result: "d808df45"

So that's an eight-bit string.

ChrisA
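
The hex value in that result can be checked by hand. The following Python sketch derives the surrogate pair for U+12345 and compares it with Python's own UTF-16 (big-endian) encoder; `to_surrogate_pair` is a name chosen for this example.

```python
from typing import Tuple


def to_surrogate_pair(codepoint: int) -> Tuple[int, int]:
    """Split a supplementary-plane codepoint into UTF-16 high/low surrogates."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only codepoints above U+FFFF need a surrogate pair")
    offset = codepoint - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x12345)
print(f"{high:04x}{low:04x}")

# The codec produces the same four bytes as a byte string.
print("\U00012345".encode("utf-16-be").hex())
```

Both lines print `d808df45`, matching the Pike output: the surrogates appear as pairs of bytes in an encoded byte string, not as codepoints in the text string.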
Henrik Grubbström
2015-03-20 15:24:09 UTC
Post by Chris Angelico
Post by Henrik Grubbström
Post by Chris Angelico
There's a bit of a discussion happening on python-list about whether
or not it should be legal to have codepoints like U+D800 in Unicode
strings. Currently, both Python and Pike permit them, but reject them
if you try to, for example, convert to UTF-8. But a suggestion has
been made that the mere presence of \uD800 in a string literal should
be a syntax error, and I'm wondering: Has anyone considered and
rejected this, or is it simply something that nobody's thought to
disallow?
Well, as U+D800 is legal in UTF-16 strings (which BTW is why it is
reserved), I don't see a reason to prohibit it.
Post by Chris Angelico
Are there situations in which it's necessary to be able to store these
kinds of noncharacters in a string?
Yes.
> string_to_unicode("\U00012345");
(1) Result: "\330\b\337E"
> String.string2hex(string_to_unicode("\U00012345"));
(2) Result: "d808df45"
Well, take UCS-2 then.
--
Henrik Grubbström ***@roxen.com
Roxen Internet Software AB
Chris Angelico
2015-03-20 17:08:24 UTC
Post by Henrik Grubbström
Post by Chris Angelico
Post by Henrik Grubbström
Well, as U+D800 is legal in UTF-16 strings (which BTW is why it is
reserved), I don't see a reason to prohibit it.
Post by Chris Angelico
Are there situations in which it's necessary to be able to store these
kinds of noncharacters in a string?
Yes.
> string_to_unicode("\U00012345");
(1) Result: "\330\b\337E"
> String.string2hex(string_to_unicode("\U00012345"));
(2) Result: "d808df45"
Well, take UCS-2 then.
Which, I believe, disallows U+D800, bringing us back to the start.

ChrisA
Fredrik Hubinette
2015-03-20 17:27:14 UTC
Pike strings are arrays of 32-bit numbers.
Some functions assume that they contain unicode characters, most don't.

What you suggest requires implementing a way to type strings depending on
their content, and then enforce the validity of the content based on the
type. Doing so would seem to be a lot of work for very little gain. How
many minutes/hours of developer time have you personally lost because pike
didn't detect U+D800 in unicode strings early enough?

/Hubbe
Chris Angelico
2015-03-20 17:31:03 UTC
Post by Fredrik Hubinette
Pike strings are arrays of 32-bit numbers.
Some functions assume that they contain unicode characters, most don't.
What you suggest requires implementing a way to type strings depending on
their content, and then enforce the validity of the content based on the
type. Doing so would seem to be a lot of work for very little gain. How many
minutes/hours of developer time have you personally lost because pike didn't
detect U+D800 in unicode strings early enough?
That was where this question started. Is it something that's simply
"not worth the effort of disallowing", or is there actually a solid
use-case for needing those codepoints? If the easy implementation had
been to disallow them, would it have been worth putting effort into
allowing them?

So far, all I'm seeing is that it's not going to make any difference
either way, so the better option is the easier one, i.e. check nothing
and permit them all. I haven't heard from anyone who actually needs
them.

ChrisA
Fredrik Hubinette
2015-03-20 17:46:48 UTC
When working with arrays of integers, strings are much more efficient than
array(int). They use less memory, are generally faster and can share memory
if many identical strings are created. This makes them ideal for storing
byte data (files), audio samples, vector graphics and many other things.
If array(int) was more cleverly implemented (I made some early attempts at
this, but eventually gave up) then the use case for string-as-data pretty
much goes away. In fact, if array() was smart enough, then string would be
an alias for array(UCS32Character).

/Hubbe
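
The string-as-data use case can be approximated in Python, with the caveat that this is an analogy to Pike's strings and that the memory figures are CPython implementation details. `pack` and `unpack` are hypothetical helpers for this sketch.

```python
import sys
from typing import List


def pack(values: List[int]) -> str:
    """Store each integer as the codepoint with the same number."""
    return "".join(map(chr, values))


def unpack(packed: str) -> List[int]:
    """Recover the integers from a packed string."""
    return [ord(char) for char in packed]


# 16-bit samples, deliberately including values in the surrogate range.
samples = [0x0000, 0x7FFF, 0xD800, 0xDFFF, 0xFFFF] * 200
packed = pack(samples)
assert unpack(packed) == samples

# Container sizes in bytes; the list figure excludes the int objects it holds.
print(sys.getsizeof(packed), sys.getsizeof(samples))
```

Data of this kind will contain values in D800-DFFF as a matter of course, so rejecting those codepoints would rule out this use of strings.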
Henrik Grubbström
2015-03-20 18:05:32 UTC
Post by Chris Angelico
Post by Henrik Grubbström
Post by Chris Angelico
Post by Henrik Grubbström
Well, as U+D800 is legal in UTF-16 strings (which BTW is why it is
reserved), I don't see a reason to prohibit it.
Post by Chris Angelico
Are there situations in which it's necessary to be able to store these
kinds of noncharacters in a string?
Yes.
For transport, yes. When actually used (cf NT) it is an array of
16-bit integers.
Post by Chris Angelico
Post by Henrik Grubbström
Post by Chris Angelico
Post by Henrik Grubbström
> string_to_unicode("\U00012345");
(1) Result: "\330\b\337E"
> String.string2hex(string_to_unicode("\U00012345"));
(2) Result: "d808df45"
Well, take UCS-2 then.
Which, I believe, disallows U+D800, bringing us back to the start.
Not quite. U+D800 (and the other surrogates) are intended to be used to
encode the full Unicode set with 16-bit integers, which is a perfectly
good use of wide strings.
--
Henrik Grubbström ***@grubba.org
Roxen Internet Software AB ***@roxen.com
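
The use described in this last post, a wide string holding 16-bit UTF-16 code units, can be sketched in Python. The decoder below is a minimal illustration, not how Pike or any library implements it, and `decode_utf16_units` is a made-up name.

```python
from typing import List


def decode_utf16_units(units: List[int]) -> str:
    """Combine 16-bit code units into codepoints, pairing surrogates."""
    chars = []
    index = 0
    while index < len(units):
        unit = units[index]
        has_next = index + 1 < len(units)
        if 0xD800 <= unit <= 0xDBFF and has_next and 0xDC00 <= units[index + 1] <= 0xDFFF:
            low = units[index + 1]
            chars.append(chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)))
            index += 2
        else:
            # A BMP code unit, or an unpaired surrogate passed through as-is.
            chars.append(chr(unit))
            index += 1
    return "".join(chars)


# A two-codepoint string whose elements are UTF-16 code units.
wide = "\ud808\udf45"
units = [ord(char) for char in wide]
print(decode_utf16_units(units) == "\U00012345")
```

The intermediate string `wide` is only representable if the string type allows surrogate codepoints.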