Unicode strings and invalid codepoints

Post by Chris Angelico
Are there situations in which it's necessary to be able to store these
kinds of noncharacters in a string?

strings may be just binary data.
why should binary data in a string literal be an error?

greetings, martin.
Martin Bähr working in china http://societyserver.org/mbaehr/

Chris Angelico

2015-03-09 08:11:00 UTC

On Mon, Mar 9, 2015 at 7:08 PM, Martin Bähr

Post by Martin Bähr

Post by Chris Angelico
Are there situations in which it's necessary to be able to store these
kinds of noncharacters in a string?

strings may be just binary data.
why should binary data in a string literal be an error?
greetings, martin.

Indeed, so it's not safe to disallow anything 00-FF; but D800-DFFF
can't be binary data.

ChrisA

Robert J Budzynski

2015-03-09 11:09:02 UTC

Post by Martin Bähr

Post by Chris Angelico
Are there situations in which it's necessary to be able to store these
kinds of noncharacters in a string?

strings may be just binary data.
why should binary data in a string literal be an error?
greetings, martin.

Python (v3) has different datatypes for (Unicode) strings - sequences of
characters, and byte strings. I don't think anyone has suggested that
any sort of binary data in a (byte) string should be an error.

--
RJ Budzyński
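
For readers less familiar with the distinction drawn above, here is a minimal Python 3 sketch of the two types. The variable names are illustrative only, and this shows Python, not Pike.

```python
# Python 3 keeps text and binary data in separate types.
text = "caf\u00e9"                       # str: a sequence of Unicode code points
data = bytes([0x00, 0xFF, 0xD8, 0x00])   # bytes: arbitrary 8-bit values

print(type(text).__name__, len(text))
print(type(data).__name__, len(data))

# Every value 0x00-0xFF is acceptable in a bytes object.
all_bytes = bytes(range(256))

# Mixing the two types is an error rather than an implicit conversion.
try:
    text + data
    mixing_error = None
except TypeError as exc:
    mixing_error = exc
    print("cannot mix str and bytes:", exc)
```

Because binary data lives in `bytes`, a restriction on surrogates in `str` would not affect it in Python. Pike has a single string type, which is why the question plays out differently there.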

Arne Goedeke

2015-03-10 10:09:39 UTC

I think it could be a compat problem: someone might be using wide
strings for non-Unicode data (e.g. as an efficient way to store
integers for a bitmask). The other issue is that implementing this
correctly would require checking all chars when hashing the string,
which would probably make it rather slow for wide strings.

What's the reasoning for restricting strings in Python? What happens when
a new language is added to Unicode and the next Pike release is still a
decade away?

arne

Post by Chris Angelico
There's a bit of a discussion happening on python-list about whether
or not it should be legal to have codepoints like U+D800 in Unicode
strings. Currently, both Python and Pike permit them, but reject them
if you try to, for example, convert to UTF-8. But a suggestion has
been made that the mere presence of \uD800 in a string literal should
be a syntax error, and I'm wondering: Has anyone considered and
rejected this, or is it simply something that nobody's thought to
disallow?
Are there situations in which it's necessary to be able to store these
kinds of noncharacters in a string?
ChrisA

Chris Angelico

2015-03-10 10:44:46 UTC

Post by Arne Goedeke
I think it could be a compat problem: someone might be using wide
strings for non-Unicode data (e.g. as an efficient way to store
integers for a bitmask). The other issue is that implementing this
correctly would require checking all chars when hashing the string,
which would probably make it rather slow for wide strings.

Fair enough. I'm more looking for philosophical arguments; backward
compatibility is, of course, arguing for no change.

Post by Arne Goedeke
What's the reasoning for restricting strings in Python? What happens when
a new language is added to Unicode and the next Pike release is still a
decade away?

Nothing there; unallocated codepoints would be perfectly acceptable,
to ensure forward compatibility. It's only the blocks that are
strictly disallowed (such as surrogates) which would be forbidden.

There's currently no plan to actually make this restriction, just some
broad discussion about concepts. For instance, both Python and Pike
reject an attempt to UTF-8 encode the string "\uDD00"; but the
question was raised, is that string actually itself the problem?
Should the error have been raised earlier? Hence this query.

ChrisA
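
The Python half of the behaviour described above can be sketched as follows. This is only an illustration of current CPython behaviour, not a proposal. U+E0FFF is used as an example of a code point assumed to be unassigned, and `encoded_ok` and `unassigned_bytes` are names chosen for the example.

```python
# A lone surrogate is accepted in a str literal and stored as one code point.
s = "\uDD00"
print(len(s), hex(ord(s)))  # 1 0xdd00

# The strict UTF-8 encoder is where the error shows up.
try:
    s.encode("utf-8")
    encoded_ok = True
except UnicodeEncodeError as exc:
    encoded_ok = False
    print("encode failed:", exc.reason)

# A code point outside the surrogate range encodes normally, whether or not
# Unicode has assigned a character to it yet (U+E0FFF is assumed unassigned).
unassigned_bytes = "\U000E0FFF".encode("utf-8")
print(unassigned_bytes.hex())
```

So the error is raised at the encoding boundary, not when the string is created, which is exactly the point under discussion.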

Henrik Grubbström

2015-03-16 17:30:17 UTC

Well, as U+D800 is legal in UTF-16 strings (which BTW is why it is
reserved), I don't see a reason to prohibit it.

Post by Chris Angelico
Are there situations in which it's necessary to be able to store these
kinds of noncharacters in a string?

Yes.

--
Henrik Grubbström ***@roxen.com
Roxen Internet Software AB

Chris Angelico

2015-03-16 21:11:35 UTC

Post by Henrik Grubbström
Well, as U+D800 is legal in UTF-16 strings (which BTW is why it is
reserved), I don't see a reason to prohibit it.

Post by Chris Angelico
Are there situations in which it's necessary to be able to store these
kinds of noncharacters in a string?

Yes.

> string_to_unicode("\U00012345");
(1) Result: "\330\b\337E"
> String.string2hex(string_to_unicode("\U00012345"));
(2) Result: "d808df45"

So that's an eight-bit string.

ChrisA
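
The bytes d8 08 df 45 in the Hilfe output above are the UTF-16 surrogate pair for U+12345. A short Python sketch of the arithmetic follows; `surrogate_pair` is a helper written for this illustration, not a library function.

```python
def surrogate_pair(cp):
    """Split a supplementary-plane code point into UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = cp - 0x10000              # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

high, low = surrogate_pair(0x12345)
print(f"{high:04x}{low:04x}")  # d808df45

# Cross-check against Python's own big-endian UTF-16 encoder.
encoded = "\U00012345".encode("utf-16-be")
print(encoded.hex())           # d808df45
```

The encoded result is a sequence of bytes (an eight-bit string), with each surrogate spread over two of them.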

Henrik Grubbström

2015-03-20 15:24:09 UTC

Post by Chris Angelico
> string_to_unicode("\U00012345");
(1) Result: "\330\b\337E"
> String.string2hex(string_to_unicode("\U00012345"));
(2) Result: "d808df45"

Well, take UCS-2 then.

--
Henrik Grubbström ***@roxen.com
Roxen Internet Software AB

Chris Angelico

2015-03-20 17:08:24 UTC

Post by Henrik Grubbström
Well, take UCS-2 then.

Which, I believe, disallows U+D800, bringing us back to the start.

ChrisA

Fredrik Hubinette

2015-03-20 17:27:14 UTC

Pike strings are arrays of 32-bit numbers.
Some functions assume that they contain unicode characters, most don't.

What you suggest requires implementing a way to type strings depending on
their content, and then enforce the validity of the content based on the
type. Doing so would seem to be a lot of work for very little gain. How
many minutes/hours of developer time have you personally lost because pike
didn't detect U+D800 in unicode strings early enough?

/Hubbe

Chris Angelico

2015-03-20 17:31:03 UTC

Post by Fredrik Hubinette
Pike strings are arrays of 32-bit numbers.
Some functions assume that they contain unicode characters, most don't.
What you suggest requires implementing a way to type strings depending on
their content, and then enforce the validity of the content based on the
type. Doing so would seem to be a lot of work for very little gain. How many
minutes/hours of developer time have you personally lost because pike didn't
detect U+D800 in unicode strings early enough?

That was where this question started. Is it something that's simply
"not worth the effort of disallowing", or is there actually a solid
use-case for needing those codepoints? If the easy implementation had
been to disallow them, would it have been worth putting effort into
allowing them?

So far, all I'm seeing is that it's not going to make any difference
either way, so the better option is the easier one, i.e. check nothing
and permit them all. Haven't heard from anyone who actually needs
them.

ChrisA

Fredrik Hubinette

2015-03-20 17:46:48 UTC

When working with arrays of integers, strings are much more efficient than
array(int). They use less memory, are generally faster and can share memory
if many identical strings are created. This makes them ideal for storing
byte data (files), audio samples, vector graphics and many other things.
If array(int) was more cleverly implemented (I made some early attempts at
this, but eventually gave up) then the use case for string-as-data pretty
much goes away. In fact, if array() was smart enough, then string would be
an alias for array(UCS32Character).

/Hubbe

Henrik Grubbström

2015-03-20 18:05:32 UTC

Post by Henrik Grubbström
Well, as U+D800 is legal in UTF-16 strings (which BTW is why it is
reserved), I don't see a reason to prohibit it.

Post by Chris Angelico
Are there situations in which it's necessary to be able to store these
kinds of noncharacters in a string?

Yes.

For transport, yes. When actually used (cf NT) it is an array of
16-bit integers.
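
To illustrate the "array of 16-bit integers" view, here is a hedged Python sketch in which a lone surrogate survives a round trip through 16-bit code units. It relies on Python's `surrogatepass` error handler, which is a Python facility and says nothing about how Pike or NT implement this. The variable names are illustrative.

```python
import struct

lone = "\ud800"  # an unpaired high surrogate held in a str

# The strict codec would reject this; "surrogatepass" lets it through.
raw = lone.encode("utf-16-le", "surrogatepass")

# View the encoded bytes as little-endian 16-bit code units.
units = struct.unpack("<%dH" % (len(raw) // 2), raw)
print([hex(u) for u in units])  # ['0xd800']

# Decoding with the same handler restores the original string.
restored = raw.decode("utf-16-le", "surrogatepass")
print(restored == lone)  # True
```

This is one case where a string type that can hold U+D800 is useful: data that is nominally UTF-16 but not guaranteed to be well-formed can still be carried around losslessly.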