
[boost] boost filesystem path as utf-8?

On Mon, 23 Jan 2012 00:15:33 -0800, Emil Dotchevski <...@gmail.com>

Hello,

I understand that the path class .native() member function's return type
differs depending on the platform (wstring on windows, for example.) Is
there a way to get the path as a utf-8 string regardless of the platform?
Likewise, is there a way to construct a path object from a utf-8 string
regardless of the platform?

Emil Dotchevski
Reverge Studios, Inc.
http://www.revergestudios.com/reblog/index.php?n=ReCode

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost



On Mon, 23 Jan 2012 00:33:42 -0800 (PST), Artyom Beilis <...@yahoo.com>

When you are using Boost.Filesystem v3 you can imbue
a locale with a UTF-8 codecvt facet globally using:

   boost::filesystem::path::imbue()

Note that path::imbue is a static member function.

 
Artyom Beilis
--------------
CppCMS - C++ Web Framework:   http://cppcms.sf.net/
CppDB - C++ SQL Connectivity: http://cppcms.sf.net/sql/cppdb/


On Mon, 23 Jan 2012 11:46:33 +0200, Yakov Galka <...@gmail.com>

As Artyom said, you can imbue whatever locale you want to specify the
conversion from narrow to wide strings. It will make almost all the
conversions transparent, except that the path will still be stored as
UTF-16 on Windows. Unfortunately it boils down to the interface whence you can
get a c_str() to a UTF-16 string only.

You may want to revert to Boost.Filesystem.v2 (afaik removed completely in
1.48 so you'll need to merge from the old release), it is better designed
in the sense that it has a templatized basic_path that allows you to store
utf-8 encoding internally (once you imbue the correct locale) and convert
to UTF-16 on demand.

--
Yakov


On Mon, 23 Jan 2012 07:47:02 -0500, Beman Dawes <...@acm.org>

On Mon, Jan 23, 2012 at 4:46 AM, Yakov Galka <...@gmail.com> wrote:
So far, so good.

That's not correct.

If you have a path p, and the imbued codecvt is UTF-8, you can always
get a UTF-8 narrow string by writing p.string<std::string>(). You
can always write p.string<std::string>().c_str() to get a const
char* to a UTF-8 encoded narrow string.

If your app mostly needs UTF-8 strings, use std::string and only
convert to a path when a path is actually needed.

If your app mostly needs paths, use boost::filesystem::path and only
convert to std::string when a std::string or const char* is actually
needed.

V2 is no longer supported and bugs are not being fixed.

--Beman


On Mon, 23 Jan 2012 16:28:13 +0200, Yakov Galka <...@gmail.com>

On Mon, Jan 23, 2012 at 14:47, Beman Dawes <...@acm.org> wrote:

It's correct. I state that path::c_str() returns UTF-16 on Windows. It's a
fact. So the encoding isn't an implementation detail but a part of the
interface. So you can do a conversion, but it has different semantics
because....

...it has a different lifetime. path::c_str() has the same lifetime as the
path, and so would a UTF-8 path's c_str().

> If your app mostly needs UTF-8 strings, use std::string and only
> convert to a path when a path is actually needed.

My app needs UTF-8 paths. Don't use the term 'path' as a synonym for
'boost::filesystem::path'. There are other paths in the world (QDir,
Poco::Path) and yours are neither special nor better.

--
Yakov


On Mon, 23 Jan 2012 14:52:48 -0500, Beman Dawes <...@acm.org>

On Mon, Jan 23, 2012 at 9:28 AM, Yakov Galka <...@gmail.com> wrote:

As quoted above, you said only that "...the interface whence you can get a
c_str() to a UTF-16 string only."

The interface includes multiple observers, which return values with various
encodings other than UTF-16. The return types from the observers allow
c_str() to access those values.

During the design discussions, two other alternatives were discussed. (1)
Always hold the path internally in a char string encoded UTF-8. The cost on
Windows is that a conversion has to be done before every file system
operation. The cost on POSIX is that a double conversion has to be done
before every file system operation if the encoding is not UTF-8. (2) Hold
two strings internally, one in the native type and encoding, the other in
UTF-8. The cost is trying to keep them in sync, with the conversions that
implies, for some definition of "in sync".
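
Alternative (1) can be pictured with a small sketch. Everything here is hypothetical and not part of Boost.Filesystem: the class name utf8_path, the helper utf8_to_utf16, and the member native_utf16 are invented for illustration, and the hand-written converter is deliberately minimal (it rejects truncated or malformed sequences but does not check for overlong encodings).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical helper: decode UTF-8 and re-encode it as UTF-16.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (i + extra >= in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;
        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Hypothetical path-like class: UTF-8 is the stored form, and the UTF-16
// form is produced only when a Windows-style system call would need it.
class utf8_path {
public:
    explicit utf8_path(std::string utf8) : utf8_(std::move(utf8)) {}
    // Points into the object's own storage, so it lives as long as the path.
    const char* c_str() const { return utf8_.c_str(); }
    // The conversion cost is paid on every call, i.e. per file system operation.
    std::u16string native_utf16() const { return utf8_to_utf16(utf8_); }
private:
    std::string utf8_;
};
```

In this layout the narrow c_str() is free and the wide form costs a conversion each time, which is the trade-off described for Windows above.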

If class std::basic_string itself had better support for string
interoperability, class path would be able to side step at least some of
the conversion headaches.

--Beman


On Tue, 24 Jan 2012 11:44:20 +0200, Yakov Galka <...@gmail.com>

On Mon, Jan 23, 2012 at 21:52, Beman Dawes <...@acm.org> wrote:
Don't be picky about words. Yes, this sentence might be ambiguous. But I
say that the correct resolution, using C++ name lookup rules, is "you
can get a path::c_str() to a UTF-16 string only".

Since you didn't read it, I'll repeat it: path::string().c_str()
is a *temporary*. path::c_str() is NOT. The two have different
semantics, and your library starting with version 3 doesn't let the
user choose what string path holds inside. As said above, it's not an
implementation detail since it's observable from the interface.
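
The lifetime difference can be shown with a minimal sketch. The class wide_path below is a hypothetical stand-in, not boost::filesystem::path: it stores a wide string and converts to a narrow one on request, and its conversion handles ASCII only, where real code would run the imbued codecvt.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in for a path that stores a wide string internally.
class wide_path {
public:
    explicit wide_path(std::wstring s) : native_(std::move(s)) {}

    // Points into the object's own storage: valid while the path is alive
    // and unmodified.
    const wchar_t* c_str() const { return native_.c_str(); }

    // Returns a new narrow string by value. ASCII-only for this sketch;
    // anything else is replaced by '?'.
    std::string string() const {
        std::string out;
        for (wchar_t ch : native_)
            out.push_back(ch >= 0 && ch < 0x80 ? static_cast<char>(ch) : '?');
        return out;
    }

private:
    std::wstring native_;
};

// Safe pattern: keep the converted string in a named variable so that the
// pointer taken from it stays valid for as long as it is used.
std::size_t narrow_length(const wide_path& p) {
    const std::string narrow = p.string();  // owns the converted bytes
    const char* s = narrow.c_str();         // valid until narrow is destroyed
    return std::char_traits<char>::length(s);
}

// By contrast, writing
//     const char* s = p.string().c_str();
// leaves s dangling: the temporary std::string is destroyed at the end of
// that statement.
```

The caller has to own the converted string somewhere, whereas the pointer from the path's own c_str() needs no extra owner.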

Not an issue, because:
1) last time I measured with CreateFile and a naive implementation
using MultiByteToWideChar it took less than 3% overhead. Faster
conversion routines exist and you will have to do the conversions
anyway when you communicate with the external world.
2) Let the user choose between narrow chars and wide chars. Why do you
force me to use the latter? Why must getting the filename from a UTF-8
std::string involve 2 conversions (to and from UTF-16) even if I
don't pass anything to the system?

1) Most POSIX systems use UTF-8 these days.
2) It's fine if it will be the native encoding on POSIX, as long as
the user can override it. On Windows she just can't do this because
boost::filesystem::path uses a wide string.

I 100% agree (2) is not an option.

Maybe, but almost surely not. It would just shift the burden to another
place: the user.

What you didn't say is that *during original filesystem review* it had
a templatized basic_path and the user *could choose* between narrow
and wide strings. Add this option to the list above.

--
Yakov


On Mon, 23 Jan 2012 14:15:58 -0800, Emil Dotchevski <...@gmail.com>

On Mon, Jan 23, 2012 at 4:47 AM, Beman Dawes <...@acm.org> wrote:
How exactly do I imbue a UTF-8 codecvt in a path? I Googled around and
couldn't find anything.

Emil Dotchevski
Reverge Studios, Inc.
http://www.revergestudios.com/reblog/index.php?n=ReCode


On Mon, 23 Jan 2012 18:41:32 -0500, Beman Dawes <...@acm.org>

On Mon, Jan 23, 2012 at 5:15 PM, Emil Dotchevski <...@gmail.com> wrote:

There are two approaches:

* If you always want all class path arguments and returned values with
a value type of char to be treated as being UTF-8 encoded, and aren't
worried about changing a potentially dangerous global, then do this:

#include <boost/filesystem/detail/utf8_codecvt_facet.hpp>
...
std::locale global_loc = std::locale();
std::locale loc(global_loc, new boost::filesystem::detail::utf8_codecvt_facet);
boost::filesystem::path::imbue(loc);

* If you only want one specific path to treat its narrow character
arguments and returns as UTF-8, do this:

boost::filesystem::detail::utf8_codecvt_facet utf8;
...
boost::filesystem::path p;
...
// many other path functions can take a codecvt argument, too
p.assign(u8"...", utf8);

By the way, you can use a UTF-8 codecvt facet from someone else if you prefer.
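
As a rough illustration of "a UTF-8 codecvt facet from someone else", the sketch below uses the standard library's std::codecvt_utf8<wchar_t> and shows only the locale mechanics, without Boost. The function names make_utf8_locale and widen_with are hypothetical. It assumes, as the snippets above suggest, that the facet is looked up through the locale as std::codecvt<wchar_t, char, std::mbstate_t>. Note that <codecvt> is deprecated since C++17, and that where wchar_t is 16 bits this particular facet only covers the Basic Multilingual Plane.

```cpp
#include <codecvt>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

// Build a locale that is a copy of the global one, except that its
// narrow/wide conversion facet decodes and encodes UTF-8.
std::locale make_utf8_locale() {
    return std::locale(std::locale(), new std::codecvt_utf8<wchar_t>);
}

// Widen a narrow string using whatever codecvt facet the locale carries.
// Sketch only: any conversion failure yields an empty string.
std::wstring widen_with(const std::locale& loc, const std::string& narrow) {
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_t;
    if (narrow.empty()) return std::wstring();
    const facet_t& cvt = std::use_facet<facet_t>(loc);

    // One output unit per code point, so never more units than input bytes.
    std::wstring wide(narrow.size(), L'\0');
    std::mbstate_t state = std::mbstate_t();
    const char* from_next = nullptr;
    wchar_t* to_next = nullptr;
    const std::codecvt_base::result r = cvt.in(
        state,
        narrow.data(), narrow.data() + narrow.size(), from_next,
        &wide[0], &wide[0] + wide.size(), to_next);
    if (r != std::codecvt_base::ok) return std::wstring();
    wide.resize(static_cast<std::size_t>(to_next - &wide[0]));
    return wide;
}
```

A locale built this way could presumably be passed to boost::filesystem::path::imbue in place of the one built from utf8_codecvt_facet, though that combination is not verified here.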

HTH,

--Beman


On Tue, 24 Jan 2012 22:55:37 -0800, Emil Dotchevski <...@gmail.com>

On Mon, Jan 23, 2012 at 3:41 PM, Beman Dawes <...@acm.org> wrote:
Beman, thanks for your detailed answer! I have one more question: is
it possible to get a grammatically native string, encoded in UTF-8
regardless of the platform?

Emil Dotchevski
Reverge Studios, Inc.
http://www.revergestudios.com/reblog/index.php?n=ReCode

