[boost] [strings][unicode] Proposals for Improved String Interoperability in a Unicode World

On Sat, 28 Jan 2012 11:46:25 -0500, Beman Dawes <...@acm.org

http://beman.github.com/string-interoperability/interop_white_paper.html
describes Boost components intended to ease string interoperability in
general and Unicode string interoperability in particular.

These proposals are the Boost version of the TR2 proposals made in
N3336, Adapting Standard Library Strings and I/O to a Unicode World.
See http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3336.html.

I'm very interested in hearing comments about either the Boost or the
TR2 proposal. Are these useful additions? Is there a better way to
achieve the same easy interoperability goals?

Where is the best home for the Boost proposals? A separate library?
Part of some existing library?

Are these proposals orthogonal to the need for deeper Unicode
functionality, such as Mathias Gaunard's Unicode components?

--Beman

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost



On Sun, 29 Jan 2012 02:12:34 +0100, Mathias Gaunard <...@ens-lyon.org

I think you should consider the points being made in N3334.
While that proposal is in my opinion not good enough, it raises an
important issue that is often present with std::string-based or similar
designs.

A function that takes a std::string, or a boost::filesystem::path for
that matter, necessarily causes the callee to copy the data into a
heap-allocated buffer, even if there is no need to.

Use of the range concept would solve that issue, but then that requires
making the function a template. A type-erased range would be possible,
but that has significant performance overhead.
A string_ref or path_ref is maybe the lesser evil.
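
As an illustration of the trade-off described above, here is a minimal sketch of a non-owning string view. The class name `string_ref` is borrowed from the thread, but the members and the `count_char` function are hypothetical and are not the N3334 interface. The idea is that a non-template function can accept a literal, a std::string, or a sub-range without a temporary std::string being constructed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical non-owning view: a pointer plus a length, no allocation.
class string_ref {
public:
    string_ref(const char* s) : data_(s), size_(std::strlen(s)) {}
    string_ref(const std::string& s) : data_(s.data()), size_(s.size()) {}
    string_ref(const char* s, std::size_t n) : data_(s), size_(n) {}

    const char* begin() const { return data_; }
    const char* end() const { return data_ + size_; }
    std::size_t size() const { return size_; }

private:
    const char* data_;
    std::size_t size_;
};

// A plain (non-template) function that never copies the character data.
std::size_t count_char(string_ref text, char c) {
    std::size_t n = 0;
    for (const char* p = text.begin(); p != text.end(); ++p) {
        if (*p == c) ++n;
    }
    return n;
}
```

The view does not own the characters, so it must not outlive the buffer it refers to; that lifetime burden is the price paid for avoiding the copy.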

It seems all you really care about is having iterator adaptors that do
character set conversion, allowing you to lazily convert any range of any
encoding to a particular Unicode encoding.
This has always been the goal of my library, which somewhat provides
that along with more advanced Unicode features. Those two things could
live separately though.
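
To make the idea of a lazily converting iterator adaptor concrete, here is a rough sketch of an input iterator that decodes UTF-8 to code points on the fly. It is neither the adaptor from the white paper nor the one from the Unicode library mentioned above: the names `utf8_decoding_iterator` and `decode_utf8` are made up for illustration, the input is assumed to be well-formed UTF-8, and the underlying iterator is assumed to be multi-pass.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical adaptor: reads UTF-8 bytes, yields one code point per step.
// Assumes well-formed UTF-8; a real adaptor would validate its input.
template <class ByteIt>
class utf8_decoding_iterator {
public:
    typedef std::input_iterator_tag iterator_category;
    typedef char32_t value_type;
    typedef std::ptrdiff_t difference_type;
    typedef const char32_t* pointer;
    typedef char32_t reference;

    explicit utf8_decoding_iterator(ByteIt pos) : pos_(pos) {}

    // Decode the sequence at the current position; nothing is buffered.
    char32_t operator*() const {
        ByteIt p = pos_;
        const unsigned lead = static_cast<unsigned char>(*p);
        const std::size_t len = sequence_length(lead);
        char32_t cp = (len == 1) ? lead : (lead & (0x7Fu >> len));
        for (std::size_t i = 1; i < len; ++i) {
            ++p;
            cp = (cp << 6) | (static_cast<unsigned char>(*p) & 0x3Fu);
        }
        return cp;
    }

    // Skip over every byte of the current sequence.
    utf8_decoding_iterator& operator++() {
        const std::size_t len =
            sequence_length(static_cast<unsigned char>(*pos_));
        for (std::size_t i = 0; i < len; ++i) ++pos_;
        return *this;
    }
    utf8_decoding_iterator operator++(int) {
        utf8_decoding_iterator old(*this);
        ++*this;
        return old;
    }

    friend bool operator==(const utf8_decoding_iterator& a,
                           const utf8_decoding_iterator& b) {
        return a.pos_ == b.pos_;
    }
    friend bool operator!=(const utf8_decoding_iterator& a,
                           const utf8_decoding_iterator& b) {
        return !(a == b);
    }

private:
    // Sequence length is determined by the lead byte alone.
    static std::size_t sequence_length(unsigned lead) {
        if (lead < 0x80u) return 1;
        if ((lead >> 5) == 0x6u) return 2;
        if ((lead >> 4) == 0xEu) return 3;
        return 4;
    }
    ByteIt pos_;
};

// Convenience wrapper: materialize the lazily decoded range.
inline std::u32string decode_utf8(const std::string& s) {
    typedef utf8_decoding_iterator<std::string::const_iterator> It;
    return std::u32string(It(s.begin()), It(s.end()));
}
```

Because the adaptor is a template over the byte iterator, all of the decoding logic lives in headers, which is exactly the property discussed in the following paragraph.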

For standardization, the problem with iterator adaptors is that they
cannot be as fast as free functions operating on pointers, unless the
optimizer is pretty darn good. The conversion algorithms are also fully
template and cannot be put in the library binary.
Those are disadvantages compared to the mechanisms that exist today in
the standard.

By the way you only have input iterator adaptors. In my library I've
implemented bidirectional iterator adaptors and output iterator adaptors.
You've only been considering input, but output can also be useful
depending on the situation.

On Mon, 30 Jan 2012 09:50:42 -0500, Beman Dawes <...@acm.org

On Sat, Jan 28, 2012 at 8:12 PM, Mathias Gaunard
<...@ens-lyon.org
Ah, thanks! Yes, that's a very interesting proposal. I've started a
separate thread to discuss it, so won't repeat that discussion here.

Yes, that's a fair summary.

I'm still feeling my way. I'd actually prefer to leave the encoding
conversion to someone else. It's like my POD relaxation proposal that
went into C++11 - I really didn't feel qualified to do that work, but
none of the experts stepped forward. So I got sucked into the problem.

Yes, but the optimizers are often "pretty darn good", and iterator
adapters are very flexible.

That may well be correct for the general algorithms, but I'd be
surprised if specializations for the most common cases couldn't call
down to compiled binary functions.

There is a to-do list work item to implement bidirectional iterator
adapters. And output iterator adapters are worth some work too.

Thanks for your comments,

--Beman

On Sun, 29 Jan 2012 02:22:00 +0100, Mathias Gaunard <...@ens-lyon.org

The caller, rather.

On Sun, 29 Jan 2012 08:21:55 -0000, "Keith Burton" <...@xtramax.co.uk

-----Original Message-----
[snip]
These proposals are the Boost version of the TR2 proposals made in
N3336, Adapting Standard Library Strings and I/O to a Unicode World.
See http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3336.html.

I'm very interested in hearing comments about either the Boost or the
TR2 proposal

[snip]
-----Original Message-----

Beman

I do not understand how the converting c_str template can be useful in what,
for me, is the normal usage of the c_str function.

Given existing code

std::string stdstr;
const char * cstr = stdstr.c_str();

third_party_api( cstr );

and moving to general use of a wide string type e.g.

std::u32string stdstr;
const char * cstr = stdstr.c_str< char >();
third_party_api( cstr );

clearly it is possible to make third_party_api( stdstr.c_str< char

Keith Burton

On Sun, 29 Jan 2012 09:43:36 -0500, Beman Dawes <...@acm.org

On Sun, Jan 29, 2012 at 3:21 AM, Keith Burton <...@xtramax.co.uk
That's a compile time error. The unspecified iterator type returned
will not be const char*. It will be a conversion iterator with a value
type of char, and thus only useful directly in purpose written code or
in generic algorithms templated on iterator type.

One possible problem with conversion iterators with a value type of
char is that they can be passed to functions that don't work with
UTF-8 encoded data because of its multibyte nature. But UTF-8 is so
craftily designed that many functions do work as intended, even though
the functions were designed without multibyte encodings in mind.
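
A small sketch of the point about UTF-8's design: every byte of a multibyte sequence has its high bit set, so byte-oriented code that searches for an ASCII delimiter cannot match in the middle of a character. The function names below are hypothetical and chosen only for this illustration. The second function shows where byte-oriented code does go wrong: it confuses bytes with characters.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Byte-oriented split on an ASCII delimiter. Safe on UTF-8 because no
// byte of a multibyte sequence can compare equal to '/'.
std::string last_component(const std::string& utf8_path) {
    const std::size_t slash = utf8_path.rfind('/');
    return slash == std::string::npos ? utf8_path
                                      : utf8_path.substr(slash + 1);
}

// Counts code points by skipping continuation bytes (bit pattern 10xxxxxx).
// size() would report bytes instead, which differs for non-ASCII text.
std::size_t code_point_count(const std::string& utf8) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        const unsigned char byte = static_cast<unsigned char>(utf8[i]);
        if ((byte & 0xC0u) != 0x80u) ++n;
    }
    return n;
}
```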

--Beman

On Mon, 30 Jan 2012 07:04:51 -0000, "Keith Burton" <...@xtramax.co.uk

?????????

In that case, perhaps a more appropriate name would be cbegin<c_str<
Keith

On Sun, 29 Jan 2012 00:44:30 -0800 (PST), Artyom Beilis <...@yahoo.com

----- Original Message -----

Before I address specific points in the draft I'd like to say: it is
not the way to go.

In order to make Unicode work, we need two things:

1. First of all, the standard should define
   that any compiler should be able to treat literals as UTF-8 and the input
   text as UTF-8 text, and recommend that this be the default.

   This would make developers' lives much easier whether they
   develop for "Wide" Unicode or for narrow UTF-8.

2. The standard does not define what locales are actually supported and
   how they are defined.

   The standard should define explicitly that UTF-8 locales must be
   supported.

The rest become trivial:

   std::wcout << L"שלום"

and

   std::cout << "שלום"

Would work and much more.

We are all working so hard to work around a design flaw of C++ and the C++ standard
library, which allow ANSI encodings and work with them.

If the standard required and recommended handling UTF-8 by default
we would not have all the boost::filesystem::path::imbue and other
stuff that makes life a nightmare.

If we want to go forward with Unicode we need to deprecate non-UTF encodings;
we should have UTF-8, UTF-16 and UTF-32 by default, or defined
at compilation time in C++, and let the standard library handle it.

Take a look at what Go did. All modern languages are Unicode by their
nature; let C++ be as well.

All other stuff is just a workaround of a deeper problem and makes the
programming harder.

Now it would be possible if the standard committee would vote for it.

--------------------------------------------

Now some specific points about converting iterator:

It is fine for Unicode encoding conversion but it is very problematic
for non-Unicode encodings.

Small note:

An iterator is a bad design for general encoding conversion for several reasons:
In many cases conversion is stateful, and an iterator is in this case not the best concept.

Some conversions require complex algorithms that should not be inlined
but rather implemented with inheritance. Using an iterator would require
several virtual function calls per character with a trivial implementation,
and would be very complex as it would require buffering techniques within the
iterator. That is why the codecvt interface is actually good for encoding
conversion (even though it has a design flaw with mbstate_t, which
is useless for implementing stateful encoders; if mbstate_t
was something reasonable it would be a very good interface).

Now I explain why.

1. In some cases you may want to perform normalization before conversion
   or some other operations because it is not always a correct assumption

   that  XYZ-encoding-character <-
   Sometimes several characters may be joined into a single code point and
   the other way around.

2. When you operate on complex encodings it is better to pass a buffer
   for performance, because a conversion algorithm works
   much better on a chunk of text rather than over some API.

   Even take a look at the MSVC standard library. The wide to narrow
   conversion calls codecvt for **every** code point rather
   than using buffers. So do you really expect that
   implementations would actually create efficient iterators?
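
For comparison with the iterator approach, a buffer-oriented conversion of the kind argued for in point 2 might look roughly like the sketch below. The function name `append_utf8` and its signature are hypothetical, and the input is assumed to contain only valid Unicode scalar values. The whole chunk is handled in a single call, so the loop can live in a compiled library and there is no per-character call across an interface.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical chunk-oriented converter: encodes a buffer of UTF-32 code
// points as UTF-8 and appends the bytes to `out`.
// Assumes every input value is a valid Unicode scalar value.
void append_utf8(const char32_t* first, const char32_t* last,
                 std::string& out) {
    // At least one byte per code point; a cheap lower bound for reserve.
    out.reserve(out.size() + static_cast<std::size_t>(last - first));
    for (; first != last; ++first) {
        const char32_t cp = *first;
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
}
```

UTF-8 is a stateless encoding, so this sketch does not exercise the stateful case raised above; a stateful encoder would additionally need to carry conversion state between chunks.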

Bottom line.

The "iterators range" is not a good method for handling this.

-------------------------------------

Iterator concept.

The paper does not say which iterator concept is required.
Input? Output? Forward? Bidirectional? Random access?

For some encodings it can work as bidirectional or even random
iterator, for some it may be forward iterator only.

-----------------------------

So I don't really think this is a way to go.

 
Artyom Beilis
--------------
CppCMS - C++ Web Framework:   http://cppcms.com/
CppDB - C++ SQL Connectivity: http://cppcms.com/sql/cppdb/

On Sat, 28 Jan 2012 21:48:08 +0200, Yakov Galka <...@gmail.com

My opinion:

1. You shall not use any char type other than char and wchar_t for
working with strings. Using the char type and/or char_traits to mark the
encoding doesn't work. This is because the standard provided facets, C
standard library functions etc. are provided almost only for char and
wchar_t types. And we *don't want* to specialize all possible facets for
each possible encoding, just as we don't want to add u16sprintf,
u32sprintf, u16cout, u32cout, etc... This would effectively increase the
size of the interface to ϴ(number-of-entities × number-of-encodings).
Following the above you won't use char32_t and char16_t added in C++11
either. You will use just one or two encodings internally that will be
those used for char and wchar_t according to the conventions in your code
and/or the platform you work with. The only place you may need the char**_t
types is when converting from UTF-16/UTF-32 into the internal encoding you
use for your strings (either narrow or wide). But in those conversion
algorithms uint_least32_t and uint_least16_t suit your needs just fine.

2. "Standard library strings with different character encodings have
different types that do not interoperate." It's good. There shall not be
implicit conversions in user code. If the user wants, she shall specify the
conversion explicitly, as in:

s2 = convert-with-whatever-explicit-interface-you-like("foo");

3. "...class path solves some of the string interoperability
problems..." Class path forces the user to use a specific encoding that she
even may not be willing to hear of. It manifests in the following ways:
- The 'default' interface returns the encoding used by the system,
requiring the user to use a verbose interface to get the
encoding she uses.
- If the user needs to get the path encoded in her favorite encoding
*by reference* with a lifetime of the path (e.g. as a parameter
to an async
call), she must maintain a long living *copy* of the temporary returned
from the said interface.
- Getting the extension from a narrow-string path using boost::path
on Windows involves *two* conversions although the system is never called
in the middle.
- Library code can't use path::imbue(). It must pass the
corresponding codecvt facet everywhere to use anything but the
(implementation defined and volatile at runtime) default.

4. "Can be called like this: (example)" So we had 2 encodings to
consider before C++11, 4 after the additions in C++11 and you're proposing
additions to make it easier to work with any number of encodings. We are
moving towards encoding HELL.

5. "A "Hello World" program using a C++11 Unicode string literal
illustrates this frustration:" Unicode string literal (except u8)
illustrates how adding yet another unneeded feature to the C++ standard
complicates the language, adds problems, adds frustration and solves
nothing. The user can just write

cout << u8"您好世界";

Even better is:

cout << "您好世界";

which *just works* on most compilers (e.g. GCC: http://ideone.com/lBpMJ)
and needs some trickery on others (MSVC: save as UTF-8 without BOM). A much
simpler solution is to standardize narrow string literals to be UTF-8
encoded (or a better phrasing would be "capable of storing any Unicode
data" so this will work with UTF-EBCDIC where needed), but I know it's too
much to ask.

6. "String conversion iterators are not provided (minus Example)" This
section *I fully support*. The additions to C++11 pushed by Dinkumware are
heavy, not general enough, and badly designed. C++11 still lacks convenient
conversion between different Unicode encodings, which is a must in today's
world. Just a few notes:
- "Interfaces work at the level of entire strings rather than
characters," This *is* desired since the overhead of the temporary
allocations is repaid by the fact that optimized UTF-8↔UTF-16↔UTF-32
conversions need large chunks of data. Nevertheless I agree that iterator
access is sometimes preferred.
- Instead of the c_str() from "Example" a better approach is to
provide a convenience non-member function that can work on any range of
chars. E.g. using the "char type specifies the encoding" approach this
would be:

std::wstring wstr = convert<wchar_t construct an std::string
std::string u8str = convert<char

7. True interoperability, portability and conciseness will come when
we standardize on *one* encoding.
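
The non-member convert function suggested in point 6, under the "char type specifies the encoding" approach, could be sketched as follows. This is only an illustration under stated assumptions: it works eagerly on whole strings rather than returning a range of converting iterators, it covers only UTF-16 and UTF-32, it assumes well-formed input, and the `sketch` namespace, the `to` tag, and `do_convert` are invented names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

namespace sketch {

// Tag used only to pick an overload by target character type.
template <class To> struct to {};

// UTF-16 -> UTF-32: combine surrogate pairs into one code point.
inline std::u32string do_convert(const std::u16string& in, to<char32_t>) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            const char32_t low = in[++i];
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
        }
        out.push_back(cp);
    }
    return out;
}

// UTF-32 -> UTF-16: split code points above U+FFFF into surrogate pairs.
inline std::u16string do_convert(const std::u32string& in, to<char16_t>) {
    std::u16string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0x10000) {
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}

// The target type is named explicitly; the source type is deduced.
template <class To, class From>
std::basic_string<To> convert(const std::basic_string<From>& in) {
    return do_convert(in, to<To>());
}

}  // namespace sketch
```

With this shape the conversion is always visible at the call site, for example `sketch::convert<char16_t>(s32)`, which matches the preference for explicit conversions stated in point 2.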

On Sat, Jan 28, 2012 at 18:46, Beman Dawes <...@acm.org

Sincerely,
--
Yakov

On Sun, 29 Jan 2012 10:52:28 -0500, Beman Dawes <...@acm.org

On Sat, Jan 28, 2012 at 2:48 PM, Yakov Galka <...@gmail.com
I agree with you that "we *don't want* to specialize all possible facets for
each possible encoding, just as we don't want to add u16sprintf,
u32sprintf, u16cout, u32cout, etc...". Hopefully someone will step
forward with a set of deeply Unicode aware generic algorithms to take
advantage of Unicode specific functionality.

I personally prefer char32_t and char16_t to uint_least32_t and
uint_least16_t, but don't have enough experience with the C++11 types to
make blanket recommendations.

int x;
long y;
...
y = x;
...
x = y;

Nothing controversial here, and very convenient. The x = y conversion
is lossy, but the semantics are well defined and you can always use a
function call if you want different semantics.

string x;
u32string y;
...
y = x;
...
x = y;

Why is this any different? It is very convenient. We can argue about
the best semantics for the x = y conversion, but once those semantics
are settled you can always use a function call if you want different
semantics.

My contention is that class path is having to take on conversion
responsibilities that are better performed by basic_string. That's part
of the motivation for exploring ways string classes could take on some
of those responsibilities.

The number of encodings isn't a function of C++, it is a function of
the real-world. Traditionally, there were many encodings in wide use,
and then Unicode came along with a few more. But the Unicode encodings
have enough advantages that users are gradually moving away from
non-Unicode encodings. C++ needs to accommodate that trend by becoming
friendlier to the Unicode encodings.

I'm not sure that is too much to ask for the C++ standard after C++11,
whatever it ends up being called. It would take a lot of careful
work to bring the various interests on board. A year ago was the wrong
point in the C++ standard revision cycle to even talk about such a
change. But C++11 has shipped. Now is the time to start the process of
moving the problem onto the committee's radar screen.

While I'm totally convinced that conversion iterators would be very
useful, the exact form is an open question. Could you be more
specific about the details of your convert suggestion?

Even if we are only talking about Unicode, multiple encodings still
seem a necessity.

--Beman

On Mon, 30 Jan 2012 19:00:53 +0200, Yakov Galka <...@gmail.com

On Sun, Jan 29, 2012 at 17:52, Beman Dawes <...@acm.org[...]

I don't care for the name. I claim that we don't need a distinct type with
a keyword for that.

It is controversial. It was inherited from C where even void* -conversion was possible. Some argue that x = y should be an error. See D&E
14.3.5.2. Most compilers issue a warning for this. Note that where
compatibility with C is not a concern, C++ prohibits narrowing conversions:

vector<intvector<shortvector<long
Btw, x = y is implementation-defined if y is a large negative, not "well
defined".
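
For the integer case, the "use a function call if you want different semantics" option mentioned earlier in the thread could look like the sketch below. The helper name `checked_narrow` is hypothetical, and `long long` is used instead of `long` so that the wider type really is wider on common platforms; the example additionally assumes a 32-bit `int`.

```cpp
#include <cassert>
#include <limits>
#include <stdexcept>

// Hypothetical helper: converts to int, but throws instead of silently
// producing an implementation-defined value when the input does not fit.
int checked_narrow(long long y) {
    if (y < std::numeric_limits<int>::min() ||
        y > std::numeric_limits<int>::max()) {
        throw std::range_error("value does not fit in int");
    }
    return static_cast<int>(y);
}
```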

string x;

Convenient: yes. But not every convenient feature is good. It can do harm.
First two things that come to mind are:

1. Overload resolution ambiguity or surprising results.
2. It hides potentially expensive conversions (I agree to do these
implicitly only when interacting with 3rd-party code).
3. It eases different encodings interoperability, thus postponing
one-encoding standardization, yet doesn't solve the headache completely
(still the user has to think about encodings and choose a string she needs
from this zoo: string, u16string, u32string...).

And why don't we have std::string::operator const char*()?

Good. But my intent is to move the conversions inside operational
functions (preferable). Until we can standardize on a Unicode execution
character set, let the conversion happen when calling those functions
(perhaps use a path_ref that does it implicitly if we don't want the FS v2
templated functions). I remind you that class path is used not just for calling
the system.

Sure. But it doesn't mean that it has to be friendlier to ALL Unicode
encodings.

Thanks for the forecast!

The point is that it's more like a free-standing version of the c_str() you proposed.
Unlike the c_str() member function it would work on any character range, and
return a range of converting iterators. We don't need to extend
basic_string, which is already too big, for this.

Unicode algorithms work on code points (UCS-4) internally. Everything else
can be encoded in some (narrow) execution character set capable of storing
Unicode. Almost no-one implements Unicode algorithms, thus we can
practically assume that one encoding is sufficient on each platform.

--
Yakov

On Tue, 31 Jan 2012 10:33:41 -0500, Beman Dawes <...@acm.org

On Mon, Jan 30, 2012 at 12:00 PM, Yakov Galka <...@gmail.com
The only way I can see that work with a totally unchanged basic_string
would involve a temporary, which I was trying to avoid. Although with
move semantics the temporary isn't as expensive as it used to be.

If basic_string changed to accept range templates (which others may
propose), a free-function approach would work (pending the details of
the range proposal).

If basic_string changed to accept single iterator templates, a
free-function conversion iterator generator approach would work.

I've added these three alternative solutions to the paper, and given
you credit in the acknowledgments. Thanks!

See http://beman.github.com/string-interoperability/tr2-proposal.html

--Beman

On Tue, 31 Jan 2012 10:41:51 -0500, Beman Dawes <...@acm.org

On Mon, Jan 30, 2012 at 12:00 PM, Yakov Galka <...@gmail.com

That's totally at odds with my experience. A client deals with many
database files every day from many different sources. Most are encoded
in UTF-8, but some are encoded in UTF-16 or non-Unicode schemes.
That's life. Get over it:-)

--Beman

On Sun, 29 Jan 2012 01:49:16 +0100, Mathias Gaunard <...@ens-lyon.org

No, that's just wrong.
That's not the model that C++ uses. By not storing it with the BOM,
you're essentially tricking MSVC into believing it is ANSI (windows-1252
on western systems), and thus avoiding the conversion from the source character
set to the execution character set, since those happen to be the same.

The way a C++ compiler is supposed to work is that all of your source is
in the source character set, regardless of the type of string literal
you use.
Then the compiler will convert your source character set to the
execution character set for narrow string literals, to the wide
execution character set for wide string literals, to UTF-8 for u8
literals, etc.
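
A small sketch of that model: the same code point, written with a universal character name so that the source encoding does not matter, occupies a different number of code units depending on the literal prefix. The helper `code_units` is a hypothetical name introduced for this example. The narrow literal without a prefix is deliberately left out of the checks, because its size depends on the execution character set.

```cpp
#include <cassert>
#include <cstddef>

// Number of code units in a string literal, excluding the terminating NUL.
template <class CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// U+20AC (euro sign) in each Unicode literal form.
static_assert(code_units(u8"\u20AC") == 3, "UTF-8 uses three code units");
static_assert(code_units(u"\u20AC") == 1, "UTF-16 uses one code unit");
static_assert(code_units(U"\u20AC") == 1, "UTF-32 uses one code unit");

// U+1F600 lies outside the BMP, so UTF-16 needs a surrogate pair.
static_assert(code_units(u"\U0001F600") == 2, "UTF-16 surrogate pair");
static_assert(code_units(U"\U0001F600") == 1, "UTF-32 uses one code unit");
```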

The correct way to portably use Unicode characters in a C++ source is to
write it as UTF-8 and ensure that all compilers will consider the source
character set to be UTF-8. Then use the appropriate literal types
depending on what encoding you want your string literals to end up in.
Of course, in the real world, it causes two practical problems:
- MSVC requires a BOM to be present, but GCC will choke if there is one
- In the absence of u8 string literals, you're stuck with wide string
literals if you want something resembling Unicode, unless you use narrow
string literals with just ASCII and escape sequences (\xYY, \u and \U
will not work since it will convert)

What probably should be done is that compilers should be compelled to
support UTF-8 as the source character set in a unified way.

I once asked volodya if it were feasible to implement this in the build
system (add a BOM for MSVC), but he didn't seem to think it was worth it.

On Sun, 29 Jan 2012 08:08:33 +0200, Yakov Galka <...@gmail.com

On Sun, Jan 29, 2012 at 02:49, Mathias Gaunard <...@ens-lyon.org

Sorry for not being clear enough. I agree and I've not said otherwise. The
second 'cout' line *is* a hack. I admit it won't work if you mix such
string literals with wide literals or external identifiers containing
Unicode. The intent was to show how it could be done if the effort was
focused on making narrow string literals "Unicode compatible".

[...] What probably should be done is that compilers should be compelled to

Yes, it could be nice. It would solve half the problem, which is a huge
step forward given the current mood of the committee. However, embedding
Unicode string literals in source code is still not something you routinely
do. Internationalization usually uses external string tables.

I once asked volodya if it were feasible to implement this in the build

I don't understand. MSVC already understands BOM, and GCC has already been
fixed according to
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33415 (didn't test it).

On Sun, Jan 29, 2012 at 03:12, Mathias Gaunard <...@ens-lyon.org

+1
This topic has been raised here in program-options context:
http://boost.2283326.n4.nabble.com/program-options-Some-methods-take-const-char -others-take-std-string-td3733894.html

--
Yakov

On Sun, 29 Jan 2012 00:13:01 -0800 (PST), Artyom Beilis <...@yahoo.com

Not right. Sometimes you do want non-ASCII symbols in the source code;
what is wrong with having © in the text or a € symbol in the code?

Also, the fact that C++ does not define Unicode source code is
a standard design problem; there is nothing wrong with having
Unicode literals in the source code.

In fact the ONLY modern compiler that does not support them is Visual Studio;
all others I have ever used (gcc, clang, intel, sunstudio) work fine
with UTF-8.

A few points.

1. BOM should not be used in source code; no compiler except MSVC uses it and most
   do not support it.

   BOM is totally stupid for UTF-8 as it does not have a "byte order", so it should
   just die for UTF-8.

2. Setting the UTF-8 BOM makes narrow literals be encoded in the ANSI encoding, which
   makes BOM even more useless (crap... sorry) with MSVC.

 
Artyom Beilis
--------------
CppCMS - C++ Web Framework:   http://cppcms.com/
CppDB - C++ SQL Connectivity: http://cppcms.com/sql/cppdb/

On Sun, 29 Jan 2012 14:33:26 +0100, Mathias Gaunard <...@ens-lyon.org

They all support it, the problem is that they require different things
to use it.

According to Yakov, GCC supports it now.
It would be nice if it could work without any BOM though.

That's the correct behaviour. Use u8 string literals if you want UTF-8.
The problem is only present if the compiler doesn't have those string
literals.

On Sun, 29 Jan 2012 05:53:29 -0800 (PST), Artyom Beilis <...@yahoo.com

----- Original Message -----

No, MSVC does not allow creating both "שלום" and L"שלום" literals
as Unicode (UTF-8, UTF-16); for all other compilers it is the default
behavior.

GCC's default input and literal encoding is UTF-8. BOM is not needed.

No, it is unspecified behavior according to the standard.

Standard does not specify what narrow encoding should be used, that
is why u8"" was created.

All (but MSVC) compilers create UTF-8 literals and use UTF-8 input
and this is the default.

Why on earth should I do this?

All the world around uses UTF-8. Why should I specify u8"" if it is
something that can be easily defined at compiler level?

All we need is some flag for MSVC that tells that string
literals encoding is UTF-8.

I think the standard should require a method for specification
of input encoding and literals encoding and require UTF-8 input
and literal encoding support whether it is by adding
some flag or by providing some pragma.

AFAIR, neither gcc4.6 nor msvc10 supports u8"".

Artyom

On Sun, 29 Jan 2012 15:28:57 +0100, Mathias Gaunard <...@ens-lyon.org

And it shouldn't.
String literals are in the execution character set. On Windows the
execution character set is what it calls ANSI. That much is not going to
change.

That's not what I'm saying. What we want is a unified way to set UTF-8
as the source character set.
The problem is that MSVC requires BOM, but GCC used to not allow it.

It isn't.

The standard specifies that it is the execution character set. MSVC
specifies that for its implementation, the execution character set is ANSI.

That's because for those other compilers, you are in a case where the
source character set is the same as the execution character set.

With MSVC, if you don't do anything, both your source and execution
character sets are ANSI. If you set your source character set to UTF-8,
your execution character set remains ANSI still.

On non-Windows platforms, UTF-8 is the most common execution character
set, so you can have a setup where source = execution = UTF-8, but you
can't do that on Windows.
But that is irrelevant to the standard.

Because it makes perfect sense and it's the way it's supposed to work.

Because otherwise you're not independent from the execution character set.
Writing your program with Unicode allows you to not depend on
platform-specific encodings, that doesn't mean it makes them go away.

I repeat, narrow string literals are and will remain in the execution
character set. Expecting those to end up as UTF-8 data is wrong and not
portable.

That "flag" is using the u8 prefix on those string literals.
Remember: the encoding used for the data in a string literal is
independent from the encoding used to write the source.

Unicode string literals have been in GCC since 4.5.

However there are indeed practical problems with using the standard
mechanisms because they're not always implemented.

On Sun, 29 Jan 2012 17:11:24 +0200, Yakov Galka <...@gmail.com

On Sun, Jan 29, 2012 at 16:28, Mathias Gaunard <...@ens-lyon.org

Execution character set is defined by the implementation, that is the
compiler and the runtime library. It has nothing to do with the system
underneath. That is the implementation is free to decide that execution
character set is UTF-8, even though Windows narrow strings are some 'ANSI'.
Standard library interfaces then would accept UTF-8 (fopen, fstream, etc..).

As said above you can't deduce from the standard what is the "execution
character set for Windows". MSVC defines it to be 'ANSI', which is the
source of all problems. But it is unspecified behavior according to the
standard.

Standard does not specify what narrow encoding should be used, that

Yes, and we would like to at least have a flag that overrides the execution
character set to UTF-8.

As per C++11 it doesn't make sense to use any other narrow string literal
but u8"". Why would you use plain "" on Windows?

[...]

Yes, it will remain independent even with "" meaning u8"". Even if the
source character set was UTF-32 it would mean UTF-8.

Sincerely,
--
Yakov

On Sun, 29 Jan 2012 07:14:57 -0800 (PST), Artyom Beilis <...@yahoo.com

----- Original Message -----

It depends on the point of view. (see below)

Execution character set is host dependent and the ANSI code
page differs from one host to another. When you compile
the program on one host with one character set the program
will not behave correctly on another host.

This is a huge bug in C++ design.
That is why it should be fixed.

Most compilers around already did this...

It can be done in backward compatible way by requiring
compilation time option and deprecating the concept
of "execution character set"

The problem is not BOM or not BOM. BOM is not the way to fix the problem.

The whole concept of a "BOM" to distinguish between ANSI encoding and UTF-8
exists only on Windows. It is not portable and, most importantly, it is a
stupid thing to provide a "Byte-Order-Mark" for UTF-8, which does
not have a byte order. GCC provides a flag to specify the encoding;
AFAIR most other compilers do the same.

It is, because the host character set is not well defined and
it varies from host to host. So the result is just not specified.

I'll make it more clear: **It is not well defined**.

So maybe the standard should add an option to specify the input
character set explicitly, so that it would not vary from host to host?

GCC allows specifying both the "" literal encoding and the input encoding,

via the -fexec-charset and -finput-charset options respectively.
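
A small sketch of what -fexec-charset changes, assuming GCC. The function names are made up for this illustration. A plain "" literal containing U+00E9 is emitted in whatever execution character set the compiler was told to use, so the same source yields different bytes under, for example, -fexec-charset=UTF-8 and -fexec-charset=ISO-8859-1.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of bytes the compiler emitted for the plain narrow literal "\u00e9",
// not counting the terminating NUL. Depends on the execution character set
// selected at compile time.
inline std::size_t narrow_e_acute_size() {
    return sizeof("\u00e9") - 1;
}

// True if the plain narrow literal was encoded as UTF-8 (bytes C3 A9).
inline bool narrow_literals_look_like_utf8() {
    const std::string lit = "\u00e9";
    assert(!lit.empty());  // at least one byte is always emitted
    return lit == "\xC3\xA9";
}
```

With UTF-8 as the execution character set the size is 2 and the check returns true. With a single-byte execution character set that contains the character, the size is 1 and the check returns false.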

No, it will remain in the original ANSI encoding, which may not match the host
ANSI encoding.

Input CP-XXX  "test" -
But at run time it would be CP-YYY != CP-XXX
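
To make the CP-XXX versus CP-YYY mismatch concrete, here is a sketch that takes Windows-1252 and Windows-1251 as the two code pages. They are chosen only as an example and the thread does not name specific ones. The decoders are deliberately partial and their names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Windows-1252 coincides with ISO-8859-1 for bytes 0xA0..0xFF, so in that
// range the byte value equals the Unicode code point.
inline std::uint32_t cp1252_high_to_unicode(unsigned char b) {
    assert(b >= 0xA0);
    return b;
}

// Windows-1251 maps bytes 0xC0..0xFF to the Cyrillic letters U+0410..U+044F.
inline std::uint32_t cp1251_letter_to_unicode(unsigned char b) {
    assert(b >= 0xC0);
    return 0x0410u + (b - 0xC0u);
}
```

The single byte 0xE9 decodes to U+00E9 (Latin small e with acute) under the first table and to U+0439 (Cyrillic small short i) under the second, so a literal compiled on one host displays as a different character on the other.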

As I told you, the standard should specify a way to define both the execution
and the input character set.

Except that it does not solve any real problem.

I remind you that UTF-8 is Unicode...

I think it is a bug in the design, and the programmer should be able
to override it.

Finally, the "execution" character set is meaningless, as it is host dependent;
the "narrow-literal" character set is meaningful.

I know

AFAIR GCC supports u"" and U""; when I checked, u8"" was not working,
but I may be wrong.

Artyom Beilis


On Sun, 29 Jan 2012 18:25:57 -0500, Beman Dawes <...@acm.org

On Sat, Jan 28, 2012 at 7:49 PM, Mathias Gaunard
<...@ens-lyon.org
Makes sense to me.

Why don't you write up an issue for the C and C++ committees? My
guess is that it would be well received as long as (1) C and C++ stay in
sync (or at least don't conflict), and (2) compiler vendors aren't required
to do anything that would cause existing source files that work with
their compiler to stop working. This issue might well attract
national body support, which increases the chance the committee will
take action.

It would be helpful if the issue write up included a survey of current
compilers so that committee members not familiar with various
compilers could see that UTF-8 is already widely supported modulo the
BOM issue.

Another possibility is to start lobbying compiler vendors, or at least
Microsoft, to support UTF-8 both with and without BOM.

--Beman


On Mon, 30 Jan 2012 00:24:30 -0800 (PST), Artyom Beilis <...@yahoo.com

It is not only a BOM-or-no-BOM issue. It is mostly about the ability
to define the execution character set, i.e. the character set for
normal "some text" literals, and the input character set.
Even more important, C++ compilers must
support UTF-8 for both of them.

Artyom


On Tue, 31 Jan 2012 03:57:58 -0500, Daryle Walker <...@hotmail.com

This probably isn't the right post to respond to, but I don't want to spend forever figuring it out.

Not every system is an 8/16/32(/64)-bit computer using ASCII/Latin-1/UTF-8.  C++ (from C) was designed so that a user with a 9/36/81-bit EBCDIC system and one with an 8/16/32/64-bit UTF-16 system can write programs for each other (with the appropriate cross-compiler).  We don't want to be obnoxiously prejudiced against systems that don't match the current configuration trends.

(I was originally going to write "9/36/72", but then realized that higher types only have to be a multiple of char, not of each other, so my new system breaks more common-programmer assumptions.  BTW, that's 9-bit bytes (char), 36-bit words (short and int), and 81-bit long-words (long and long long).  I wonder if anyone here can fabricate this custom hardware, to mess people up.)

Daryle W.




On Tue, 31 Jan 2012 11:52:30 +0200, Yakov Galka <...@gmail.com

On Tue, Jan 31, 2012 at 10:57, Daryle Walker <...@hotmail.com
Thanks Daryle. I'm aware of this issue and thus refrained from talking
about UTF-8 only. The wording I'm interested in is "execution character set
is capable of storing any Unicode data". This would mean that it will be
UTF-8 on systems having CHAR_BIT==8 and compatible with ASCII, UTF-EBCDIC
on IBM mainframes, and perhaps UTF-32 on a DSP with CHAR_BIT==32 and
sizeof(char) == sizeof(long). Yet another option is to restrict the
requirement to hosted implementations only.
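
A rough sketch of how the suggested wording could play out per platform. It is keyed on CHAR_BIT alone, which is a simplification: the UTF-EBCDIC case depends on the basic character set rather than on the byte width, so it is not modelled here. The function candidate_encoding is hypothetical and only restates the examples given above.

```cpp
#include <cassert>
#include <climits>
#include <string>

static_assert(sizeof(char) == 1, "sizeof(char) is 1 by definition");
static_assert(CHAR_BIT >= 8, "a char has at least 8 bits");

// Hypothetical mapping from the width of char to a Unicode-capable narrow
// encoding, following the examples in the message above.
inline std::string candidate_encoding(int char_bit) {
    assert(char_bit >= 8);
    if (char_bit == 8)  return "UTF-8";   // assuming an ASCII-compatible host
    if (char_bit >= 32) return "UTF-32";  // one code point fits in one char
    return "implementation-defined";      // not covered by the examples
}

// What the mapping yields for the platform this is compiled on.
inline std::string candidate_encoding_here() {
    return candidate_encoding(CHAR_BIT);
}
```

On a typical desktop platform CHAR_BIT is 8, so candidate_encoding_here() returns "UTF-8". The 9-bit-byte machine from the earlier message falls into the last branch.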

--
Yakov


On Tue, 31 Jan 2012 13:12:54 +0100, Mathias Gaunard <...@ens-lyon.org

Which is exactly why forcing a particular execution character set is a
bad idea.
Forcing a particular source character set, however, may be another
matter, as it only affects the compiler itself.


On Tue, 31 Jan 2012 15:13:11 +0100, Olaf van der Spek <...@vdspek.org

On Tue, Jan 31, 2012 at 1:12 PM, Mathias Gaunard
<...@ens-lyon.org
Wouldn't it affect editors and other utilities too?

--
Olaf


On Tue, 31 Jan 2012 21:33:39 +0100, Mathias Gaunard <...@ens-lyon.org

Not necessarily, a compiler can support multiple source character sets.
