Discussion:
[soci-users] Support of MSSQL Server
h***@code-styling.de
2013-06-23 11:03:32 UTC
Permalink
I need to support MSSQL Server with proper Microsoft UNICODE usage (nvarchar
column etc.) from UCS2 to UTF8 and reverse,
cause MSSQL Server is the only one database engine, the is still not able to
store/retreive UTF8 (even if the actual server is 2012)!
My plan is to clone the ODBC backend and come up with a first version of a new
MSSQL interface.
Later on i would modifiy it to use the native client capabilities for better
performance.
This also will come up with conversation helper from UTF16 (UCS2) to UTF8 and
reverse.

But there is a global issue doing so. I need the extends the core with support
of

#define SQL_WCHAR (-8)
#define SQL_WVARCHAR (-9)

with a new core data_type

dt_wstring

and exchanged to

std::wstring.


Does somebody have had a problem doing so?
I'm comming from the Windows world so any objections in terms of Unix world are
welcome upfront.
Should the support additionally enrolled to any other backend too if the backend
is able to support such types using std::wstring?

regards

Heiko
Mateusz Loskot
2013-06-23 23:57:49 UTC
Permalink
Post by h***@code-styling.de
I need to support MSSQL Server with proper Microsoft UNICODE usage (nvarchar
column etc.) from UCS2 to UTF8 and reverse,
cause MSSQL Server is the only one database engine, the is still not able to
store/retreive UTF8 (even if the actual server is 2012)!
My plan is to clone the ODBC backend and come up with a first version of a
new MSSQL interface.
Later on i would modifiy it to use the native client capabilities for better
performance.
This also will come up with conversation helper from UTF16 (UCS2) to UTF8
and reverse.
But there is a global issue doing so. I need the extends the core with
support of
#define SQL_WCHAR (-8)
#define SQL_WVARCHAR (-9)
with a new core data_type
dt_wstring
and exchanged to
std::wstring.
Does somebody have had a problem doing so?
Generally, don't have any problem with supporting std::wstring as
additional string type.
Specifically, there are number of issues thet need to be consider:

http://thread.gmane.org/gmane.comp.db.soci.user/397/focus=399


You may try to contact Sören Meyer-Eppler who seemed
to be working on wstring long time ago:

http://soci.6940.n7.nabble.com/SOCI-users-SOCI-unicode-again-td954.html#a955

Unfortunately, I don't have his patches.
Post by h***@code-styling.de
I'm comming from the Windows world so any objections in terms of Unix world
are welcome upfront.
Should the support additionally enrolled to any other backend too if the
backend is able to support such types using std::wstring?
Yes, it would be necessary, at least if into(wstring) or use(wstring) is used,
those strings would have to be converted to UTF-8, I imagine.

In any case, feel free to submit pull request (please, use dedicated branch),
so anyone interested can help you with reviewing and testing.
I hope others will chime in with helpful ideas.

Best regards,
--
Mateusz Loskot, http://mateusz.loskot.net
Mateusz Loskot
2013-06-24 00:05:00 UTC
Permalink
Post by h***@code-styling.de
I need to support MSSQL Server with proper Microsoft UNICODE usage (nvarchar
column etc.) from UCS2 to UTF8 and reverse,
cause MSSQL Server is the only one database engine, the is still not able to
store/retreive UTF8 (even if the actual server is 2012)!
My plan is to clone the ODBC backend and come up with a first version of a
new MSSQL interface.
Later on i would modifiy it to use the native client capabilities for better
performance.
I forgot to address the use of native client, I think it's a very good idea,
it's been around for a while:

https://github.com/SOCI/soci/wiki/Roadmap

Best regards,
--
Mateusz Loskot, http://mateusz.loskot.net
Vadim Zeitlin
2013-06-24 13:36:52 UTC
Permalink
On Sun, 23 Jun 2013 13:03:32 +0200 (CEST) "***@code-styling.de" <***@code-styling.de> wrote:

hs> My plan is to clone the ODBC backend and come up with a first version
hs> of a new MSSQL interface.

Hello,

Having a native MS SQL backend would be very nice to have in any case,
even independently of the Unicode issues, so it would be definitely great
to at least start working on this.

hs> But there is a global issue doing so. I need the extends the core with
hs> support of
hs>
hs> #define SQL_WCHAR (-8)
hs> #define SQL_WVARCHAR (-9)
hs>
hs> with a new core data_type
hs>
hs> dt_wstring
hs>
hs> and exchanged to
hs>
hs> std::wstring.

This would be already useful and AFAICS shouldn't be a big problem to
implement but I think it would be really useful to also allow exchange
between Unicode database fields and UTF-8-encoded std::string (no "w").
What do you think?

hs> Should the support additionally enrolled to any other backend too if
hs> the backend is able to support such types using std::wstring?

Ideally all this should be backend-independent, of course. Even if the
backend doesn't specifically support Unicode at all, we could just encode
std::wstring in UTF-8 and use the existing code. In fact, I'd think that
this would be the right thing to do for all the other backends but I didn't
actually check it.

Regards,
VZ
Sergey Stepanov
2013-06-24 13:50:20 UTC
Permalink
Hi, Everyone

If you really interested, I have fully working version of the ODBC backend with Unicode fields implemented.
I can share the code, but it would be really cool if someone will help me pushing those changes to SOCI repository.
In my implementation I used UTF8 based columns support with translation to MSSQL`s UCS2 and back.

Regards,
Stepanov Sergey
XMPP: ***@jabber.ru
SKYPE: sergey.y.stepanov
mobile: +7(921)345-78-22
Post by Vadim Zeitlin
hs> My plan is to clone the ODBC backend and come up with a first version
hs> of a new MSSQL interface.
Hello,
Having a native MS SQL backend would be very nice to have in any case,
even independently of the Unicode issues, so it would be definitely great
to at least start working on this.
hs> But there is a global issue doing so. I need the extends the core with
hs> support of
hs>
hs> #define SQL_WCHAR               (-8)
hs> #define SQL_WVARCHAR          (-9)
hs>
hs> with a new core data_type
hs>
hs>        dt_wstring
hs>
hs> and exchanged to
hs>
hs>       std::wstring.
This would be already useful and AFAICS shouldn't be a big problem to
implement but I think it would be really useful to also allow exchange
between Unicode database fields and UTF-8-encoded std::string (no "w").
What do you think?
hs> Should the support additionally enrolled to any other backend too if
hs> the backend is able to support such types using std::wstring?
Ideally all this should be backend-independent, of course. Even if the
backend doesn't specifically support Unicode at all, we could just encode
std::wstring in UTF-8 and use the existing code. In fact, I'd think that
this would be the right thing to do for all the other backends but I didn't
actually check it.
Regards,
VZ
Mateusz Loskot
2013-06-24 15:51:55 UTC
Permalink
Post by Sergey Stepanov
If you really interested, I have fully working version of the ODBC backend with Unicode fields implemented.
I can share the code, but it would be really cool if someone will help me pushing those changes to SOCI repository.
In my implementation I used UTF8 based columns support with translation to MSSQL`s UCS2 and back.
Sergey, creating dedicated branch in your fork and posting URL here
would be a good start point for reviewing and discussing it.

In next weeks, I'll be busy with preparing 3.2.2 and applying the two structural
changes in the source tree: new layout and buried headers.
Once that's done, I think we can start accepting pull requests for SOCI 4.0.0,
or we can release wstring support even earlier, as semi-big milestone, in 3.5.0.

Best regards,
--
Mateusz Loskot, http://mateusz.loskot.net
h***@code-styling.de
2013-06-24 16:04:51 UTC
Permalink
Hi Sergey,

I'm interested into the modified version, you can send me a zip file if you
want.
In meantime i have done a complete implementation of a second string type
(std::wstring) beside the existing std::string type.
If have tested it with vectors, use/into and also with row and rowset support.

I would prefere to have this as additional type beside the standard one to have
full control over converting. Otherwise we have to introduce a new dependency to
one of the conversion libs out there.

Furthermore the problem at MSSQL Server is:

example case 1: varchar(50) column
You have to know wether the collation inside is "ISO-8859-1" or any other to
deal correctly with the std::string data because it's simply ascii at the given
encoding.

example case 2: nvarchar(50) column
You have to interprete at MSSQL the result as UCS2 coded widestring
(std::wstring) and you are now reposible to convert to requested format.

I'm not a friend of behind the scenes conversions, because this may always lead
to unpredictable results in future or real live systems.

I have tested conversions with help of Boost Library as follows:

example case 1: varchar(50) column
string utf8_string =
boost::locale::conv::to_utf<char>(names[i],"ISO-8859-1");
wstring wide_string =
boost::locale::conv::to_utf<wchar_t>(names[i],"ISO-8859-1");

example case 2: nvarchar(50) column
std::string utf8_string =
boost::locale::conv::utf_to_utf<char>(names[i]);

This works properly as expected. At first case you need to know exactly the
collation of the column to get a qualified converted utf8.
So if you have to deal with N databases, all having different ANSI encodings
(ru,pl,ger, sv, etc....) then you need a qualified convert.

If anyone have good reasons for doing implicite converts, please let me know.
Keep in mind, that above converting is portable and WideCharToMultiByte() stuff
isn't.
So I would delegate (rarely) necessary convert to the enduser of library.

BTW: Some C++ middleware code may internally use always std::wstring management
because of sorting, searching, up/downcases and special sorting like
Lexicograpical sorting. (lex sorting: as example the german 'Ì' will not be
sorted to the end of alphabet but within the 'u', this also happens at svenska
and other languages)

regards

Heiko

PS: a push to github will follow during this week, that's my current timeline so
far.
Post by Sergey Stepanov
Hi, Everyone
If you really interested, I have fully working version of the ODBC backend
with Unicode fields implemented.
I can share the code, but it would be really cool if someone will help me
pushing those changes to SOCI repository.
In my implementation I used UTF8 based columns support with translation to
MSSQL`s UCS2 and back.
Sergey Stepanov
2013-06-24 18:43:27 UTC
Permalink
Hi, Heiko

Yes, I can understand, that conversion can be a problem, but std::widestring is a
broblem itself, actually.

Only on windows it represents UCS2, but on all other platforms it is UCS32!

So both ways are error-prone. Question is how to choose between two evils.

I prefer utf8 based implementation cause it is more general and corresponds the
rules of SOCI library from top point of view.

Choosing std::widestring as a base type will cause user of the library to handle
all sort of cross-platform problems instead.

By the way, I tested SOCI with my ODBC backend on about a hundred MSSQL servers -
all working just fine.

As for collation, I think this is internal thing of server itself, ODBC takes
care on conversion in this case (AFAIR).

You can check it by writing simple ODBC application receiving a string from come
DB.

Anyway, will sand you the file tomorrow.

Regards,

Stepanov Sergey

XMPP: ***@jabber.ru

SKYPE: sergey.y.stepanov

mobile: +7(921)345-78-22

Monday June 24, 20:19:20 GMT+0400 2013 пПльзПватель
***@code-styling.de (***@code-styling.de) МапОсал:


Hi Sergey, I'm interested into the modified version, you can send me a zip
file if you want.
In meantime i have done a complete implementation of a second string type
(std::wstring) beside the existing std::string type.
If have tested it with vectors, use/into and also with row and rowset
support. I would prefere to have this as additional type beside the standard
one to have full control over converting. Otherwise we have to introduce a
new dependency to one of the conversion libs out there. Furthermore the
problem at MSSQL Server is:
example case 1: varchar(50) column You have to know wether the collation
inside is "ISO-8859-1" or any other to deal correctly with the std::string
data because it's simply ascii at the given encoding. example case 2:
nvarchar(50) column You have to interprete at MSSQL the result as UCS2 coded
widestring (std::wstring) and you are now reposible to convert to requested
format. I'm not a friend of behind the scenes conversions, because this may
always lead to unpredictable results in future or real live systems. I have
tested conversions with help of Boost Library as follows: example case 1:
varchar(50) column string utf8_string =
boost::locale::conv::to_utf<char>(names[i],"ISO-8859-1");
wstring wide_string =
boost::locale::conv::to_utf<wchar_t>(names[i],"ISO-8859-1");
example case 2: nvarchar(50) column std::string utf8_string =
boost::locale::conv::utf_to_utf<char>(names[i]); This works properly as
expected. At first case you need to know exactly the collation of the column
to get a qualified converted utf8.
So if you have to deal with N databases, all having different ANSI encodings
(ru,pl,ger, sv, etc....) then you need a qualified convert. If anyone have
good reasons for doing implicite converts, please let me know. Keep in mind,
that above converting is portable and WideCharToMultiByte() stuff isn't.
So I would delegate (rarely) necessary convert to the enduser of library.
BTW: Some C++ middleware code may internally use always std::wstring
management because of sorting, searching, up/downcases and special sorting
like Lexicograpical sorting. (lex sorting: as example the german 'Ì' will not
be sorted to the end of alphabet but within the 'u', this also happens at
svenska and other languages) regards Heiko PS: a push to github will follow
during this week, that's my current timeline so far.
Post by Sergey Stepanov
Hi, Everyone
If you really interested, I have fully working version of the ODBC backend
with Unicode fields implemented.
Post by Sergey Stepanov
I can share the code, but it would be really cool if someone will help me
pushing those changes to SOCI repository.
Post by Sergey Stepanov
In my implementation I used UTF8 based columns support with translation to
MSSQL`s UCS2 and back.
Mateusz Loskot
2013-06-24 15:47:32 UTC
Permalink
Post by Vadim Zeitlin
hs> My plan is to clone the ODBC backend and come up with a first version
hs> of a new MSSQL interface.
Hello,
Having a native MS SQL backend would be very nice to have in any case,
even independently of the Unicode issues, so it would be definitely great
to at least start working on this.
I agree.
Post by Vadim Zeitlin
hs> But there is a global issue doing so. I need the extends the core with
hs> support of
hs>
hs> #define SQL_WCHAR (-8)
hs> #define SQL_WVARCHAR (-9)
hs>
hs> with a new core data_type
hs>
hs> dt_wstring
hs>
hs> and exchanged to
hs>
hs> std::wstring.
This would be already useful and AFAICS shouldn't be a big problem to
implement but I think it would be really useful to also allow exchange
between Unicode database fields and UTF-8-encoded std::string (no "w").
What do you think?
UTF-8 is Unicode, but I guess it's a shortcut.

Do you mean a field of UTF-8 and select into(std::wstring)
or field of UTF-16 and insert with use(std::string) ?
Post by Vadim Zeitlin
hs> Should the support additionally enrolled to any other backend too if
hs> the backend is able to support such types using std::wstring?
Ideally all this should be backend-independent, of course.
Yes, but there is set of common types and conversions.
Post by Vadim Zeitlin
Even if the backend doesn't specifically support Unicode at all, we could just encode
std::wstring in UTF-8 and use the existing code. In fact, I'd think that
this would be the right thing to do for all the other backends but I didn't
actually check it.
In fact, currently, we somewhat implicitly assume all std::sting are UTF-8.

If a database speaks UTF-8 only, then it surely is feasible
to use std::wstring as into/use type, as we can apply conversion.
Semantically, it would be equivlent to integer promotion or conversion.
But, that is not the end of the story.

In limited scenario, we can assume only MSSQL/ODBC on Windows
exchange UTF-16 and all the rest exchange UTF-8, then we know
statically what conversion to apply.

The problem is if we want to support conversions like:
anything <->UTF-8
anything <->UTF-16

Then we have to take care of checking what lingo database/client/backend
speak per session. Don't we?

Best regards,
--
Mateusz Loskot, http://mateusz.loskot.net
Vadim Zeitlin
2013-06-24 16:14:05 UTC
Permalink
On Mon, 24 Jun 2013 16:47:32 +0100 Mateusz Loskot <***@loskot.net> wrote:

ML> > hs> But there is a global issue doing so. I need the extends the core with
ML> > hs> support of
ML> > hs>
ML> > hs> #define SQL_WCHAR (-8)
ML> > hs> #define SQL_WVARCHAR (-9)
ML> > hs>
ML> > hs> with a new core data_type
ML> > hs>
ML> > hs> dt_wstring
ML> > hs>
ML> > hs> and exchanged to
ML> > hs>
ML> > hs> std::wstring.
ML> >
ML> > This would be already useful and AFAICS shouldn't be a big problem to
ML> > implement but I think it would be really useful to also allow exchange
ML> > between Unicode database fields and UTF-8-encoded std::string (no "w").
ML> > What do you think?
ML>
ML> UTF-8 is Unicode, but I guess it's a shortcut.

Yes, sorry, I used "Unicode" in Windows sense of the word where it
typically means "wchar_t" (i.e. UTF-16).

ML> Do you mean a field of UTF-8 and select into(std::wstring)
ML> or field of UTF-16 and insert with use(std::string) ?

Both. I.e. IMO ideally any text database fields, be they UTF-8-encoded,
UTF-16-encoded or whatever else[*], should be exchangeable with both
std::string and std::wstring.

ML> In fact, currently, we somewhat implicitly assume all std::sting are UTF-8.

I think it's a perfectly reasonable assumption to make nowadays. It just
needs to be explicit IMO.

ML> In limited scenario, we can assume only MSSQL/ODBC on Windows
ML> exchange UTF-16 and all the rest exchange UTF-8, then we know
ML> statically what conversion to apply.
ML>
ML> The problem is if we want to support conversions like:
ML> anything <->UTF-8
ML> anything <->UTF-16
ML>
ML> Then we have to take care of checking what lingo database/client/backend
ML> speak per session. Don't we?

If we want to support arbitrary encodings, then we definitely need to do
this, but I don't think we need to do this right now. I'd like to determine
the kind of text based solely on the column type. E.g. for MS SQL we have
NCHAR/NVARCHAR/NTEXT which are always UTF-16 (AFAIK) and CHAR/VARCHAR/TEXT
which always some multibyte encoding. Again, I think this is actually the
only one requiring special treatment, the others all use UTF-8 or another
multibyte encoding -- which we don't support right now.

Of course, it would be ideal to support all the different encodings too.
But, again, I don't think anybody needs this right now and doing this would
be much more complicated as it would require querying the database
charset/encoding in each session and doing the conversions (which would in
turn probably require linking with ICU).

So my suggestion would be:

1. Handle only UTF-8 for multibyte encodings right now and, perhaps, throw
an error if we can detect that the database uses anything else.
Formalize this by documenting that std::string used by SOCI is supposed
to always be in UTF-8.

2. Add UTF-16 support for MS SQL and ODBC backends by converting data
to/from UTF-8.

3. Add support for exchanging data with std::wstring. Do it directly for
MS SQL/ODBC or via UTF-8 for all the others.

And the point I was trying to make in my original reply was that IMHO it's
the step (2) that is the most interesting, not so much the step (3) (even
if it would be useful to have it too, but I think any C++ programmer
already uses some library allowing him to easily convert between
UTF-8-encoded std::string and std::wstring anyhow).

Regards,
VZ
Mateusz Loskot
2013-06-25 23:37:39 UTC
Permalink
Post by Vadim Zeitlin
ML> Do you mean a field of UTF-8 and select into(std::wstring)
ML> or field of UTF-16 and insert with use(std::string) ?
Both. I.e. IMO ideally any text database fields, be they UTF-8-encoded,
UTF-16-encoded or whatever else[*], should be exchangeable with both
std::string and std::wstring.
Yes, that's the ideal.
Post by Vadim Zeitlin
ML> In fact, currently, we somewhat implicitly assume all std::sting are UTF-8.
I think it's a perfectly reasonable assumption to make nowadays. It just
needs to be explicit IMO.
Right. Do you mean documentation or implementation is lacking too?
Post by Vadim Zeitlin
ML> In limited scenario, we can assume only MSSQL/ODBC on Windows
ML> exchange UTF-16 and all the rest exchange UTF-8, then we know
ML> statically what conversion to apply.
ML>
ML> anything <->UTF-8
ML> anything <->UTF-16
ML>
ML> Then we have to take care of checking what lingo database/client/backend
ML> speak per session. Don't we?
If we want to support arbitrary encodings, then we definitely need to do
this, but I don't think we need to do this right now.
I agree straight away.
Post by Vadim Zeitlin
I'd like to determine the kind of text based solely on the column type. E.g. for MS SQL we have
NCHAR/NVARCHAR/NTEXT which are always UTF-16 (AFAIK) and CHAR/VARCHAR/TEXT
which always some multibyte encoding. Again, I think this is actually the
only one requiring special treatment, the others all use UTF-8 or another
multibyte encoding -- which we don't support right now.
If we can rely on column types, then it seems we're halfway there, indeed.
I'm not experiencd with types in SQL Server, so I missed that.

Regarding multi byte encoding, we just take what we get and hand over to
client or database, using std::string as array of bytes.
But then, we can not ofer reliable conversion between narrow and wide
characters, of those multi-byte encoded strings, of course.
Post by Vadim Zeitlin
Of course, it would be ideal to support all the different encodings too.
But, again, I don't think anybody needs this right now and doing this would
be much more complicated as it would require querying the database
charset/encoding in each session and doing the conversions (which would in
turn probably require linking with ICU).
(Boost.Locale may be interesting too.)
Post by Vadim Zeitlin
1. Handle only UTF-8 for multibyte encodings right now and, perhaps, throw
an error if we can detect that the database uses anything else.
Formalize this by documenting that std::string used by SOCI is supposed
to always be in UTF-8.
Yes and this answers my question from the beginning.
Post by Vadim Zeitlin
2. Add UTF-16 support for MS SQL and ODBC backends by converting data
to/from UTF-8.
Yes.
Post by Vadim Zeitlin
3. Add support for exchanging data with std::wstring. Do it directly for
MS SQL/ODBC or via UTF-8 for all the others.
Yes.
Post by Vadim Zeitlin
And the point I was trying to make in my original reply was that IMHO it's
the step (2) that is the most interesting, not so much the step (3) (even
if it would be useful to have it too, but I think any C++ programmer
already uses some library allowing him to easily convert between
UTF-8-encoded std::string and std::wstring anyhow).
You are right. Also, being not able to throw solid amount of manpower,
let's make smallest steps possible.

Best regards,
--
Mateusz Loskot, http://mateusz.loskot.net
Mateusz Loskot
2013-06-26 00:53:16 UTC
Permalink
Post by Mateusz Loskot
Post by Vadim Zeitlin
1. Handle only UTF-8 for multibyte encodings right now and, perhaps, throw
an error if we can detect that the database uses anything else.
Formalize this by documenting that std::string used by SOCI is supposed
to always be in UTF-8.
Yes and this answers my question from the beginning.
This part is unclear thus controversial.
See comments in the issue I opened for the documentation task:

https://github.com/SOCI/soci/issues/166

I agree, that if we need to re-encode and source encoding is unknown,
we can assume UTF-8, but we should not assume that std::string carries
UTF-8 for any purpose always. As Aleksander pointer, we still want to be
able to use it as arbitrary bytes container (i.e. binary data).

/me 's getting confused :)

Best regards,
--
Mateusz Loskot, http://mateusz.loskot.net

Loading...