Properly managing memory returned by transcode() in the Xerces XML Library: from memory leak to clean, leak-free exception safety.
Michael D. Crawford
GoingWare Inc.
August 27, 2000
Copyright © 2000 Michael D. Crawford. All Rights Reserved.
Originally posted to xerces-c-dev@xml.apache.org, this is a discussion of how to handle data that is allocated internally to a library function but that is expected to be deleted by the caller.
While apparently a simple problem,
I went through a number of different solutions ranging from just getting the
code to work but leaking memory badly, through something that seemed clean
but is actually just plain
wrong (you cannot use auto_ptr to manage pointers to arrays - it happens to work
with the development system I'm using but will crash on other systems) until with the
help of the folks on
comp.lang.c++.moderated
I came up with a really nice solution that is both easy to use, easy to read in the
source and is exception-safe to boot.
While this might seem to be of the most benefit to folks using the Xerces library for the XML Document Object Model and SAX events from the Apache XML Project, I think it could also be worth reading to anyone interested in how to improve their code, and as an example of the benefit of Refactoring Mercilessly as discussed in Extreme Programming - and a convincing argument against a firm commandment a friend believes in: "Never touch working code".
The two functions DOMString.transcode() and
XMLString::transcode()
return arrays of raw data, in the first case a char * (an old-fashioned
C string) and in the second case a zero-terminated array of XMLCh's.
You need to use the first to get data from your XML file into the native character set to display in a GUI. In my case I build clickable lists of items in my Mac and Windows app from stuff that is spelled out in an XML file, and the labels of buttons and so on are taken from element attributes.
You need the second to tranform C string constants into DOMStrings so
you can use them to find elements and attributes - the DOMString
constructor takes an XMLCh* as its parameter.
For the first case, my first try at rolling my GUI went something like this (my actual "Button" class is something a little more complicated but I'm leaving out the details):
This actually functioned quite well in my code, and I could create really complex UI arrangements with it. But a problem surfaced when I ran the Mac version of my program under the Spotlight memory debugger from Onyx Technology (this is similar to Purify and Boundschecker that are used on other platforms). It leaked memory like a seive. In fact, I was measuring the leaks log from running the program for a few minutes in terms of hundreds of kilobytes, and as I would improve the leaks the log file would shrink by tens or hundreds of K.
(Another problem is that the implementation of atomic decrement in the MacOSPlatformUtils.cpp as distributed is incorrect so none of the refcounts ever hit zero and nothing at all ever gets deallocated - anyone trying out a Mac port needs to change that. I'll be posting my Mac and Windows Codewarrior patches Real Soon Now).
My next try was to save the pointer and deallocate it after I was done:
This at least doesn't leak memory but it's really tedious. The code is also really ugly.
A further problem I get into later is that it's not exception-safe; it will still leak memory if an exception is thrown in one of the constructors. That wasn't my first concern but I've been putting a lot of study into becoming a better C++ programmer and I'm trying to move towards complete exception safety.
A considerable improvement is to make all the element names and attribute values public member variables in a class. That way they can be used like named C string constants. It also makes the code a little smaller, a lot cleaner and avoids trouble with mistyped names (this is only really useful if you have a fixed DTD that you're working with; more general-purpose XML programs may not find it useful to have all your strings as members in a fixed class).
My bag of strings is defined as follows:
The problem now is how to initialize the const DOMStrings without
leaking memory (in my initial definition of the class, sGet didn't
return a const reference, but a non-const reference to a class all of
whose members were const. Even so, I think it's important to be able to
properly initialize const members even when the initialization is
complicated.)
My first try was this:
the sGet function just returns the singleton in the usual way:
Well this improves a lot of my code but is still a bit ugly. I hate
using const_cast, but I need to hold onto the result of
XMLString::transcode and delete it after constructing the actual value,
and couldn't see a decent way to do this in an initializer list. (It's
also not exception safe, although in this case it's not too bad, as
failing to initialize these strings would abort the program so leaking
one string isn't a big deal).
Effective C++: 50 Specific Ways to Improve Your Programs and Design (2nd Edition)
[
Buy at Amazon]
[
Buy at Powell's]
After reading up on auto_ptr in Scott Meyers'
Effective C++ I started out to improve
the above code just a little bit.
auto_ptr is a class that you usually create on the stack or as a member
variable. It takes a pointer in its constructor. auto_ptr's destructor
deletes the pointer it holds. auto_ptr::operator-> and
operator* are
defined to do what they do with the pointer. Sometimes you need to
actually use the pointer itself as a value, and for that it has the
auto_ptr::get() function. auto_ptr's
help make your code exception safe
because their destructors are guaranteed to be called when they go out
of scope (including when an exception is thrown) which is not the case
with a naked pointer.
So next I had
This makes the code a little cleaner, and exception safe... but hey! now there's a way to do it in the initializer list. (It's also wrong - read all the way to the end!):
This worked great for me and resulted in an empty constructor body. I like putting things in initializer lists, and if you don't use them you should - I worked in a place where they had a single application with 70 MB of C++ and almost nothing was initialized in an initializer list - and it was buggy as all get out.
It happened that at that same company there was a single really sharp programmer named Haim Zamir who was so religious about writing correct C++ code that he coded an algorithm that got better results if it could use more memory (this was on a Mac where programs run in fixed memory allocations and might not have virtual memory), but the memory usage depended in a very complicated way on some input parameters so it really wasn't possible to predict ahead of time how much memory would be used.
So he implemented his algorithm so that it could throw exceptions anywhere without leaking, and just kept retrying repeatedly, starting with the "highest quality" setting, catching exceptions, turning down the quality and retrying until he could run all the way through. Can your code do that? Mine can't, not yet, but I'm determined to get there.
I found auto_ptr's to be very nice to use and use them throughout my
code now, so much so that if I allocate a pointer in a constructor, I
always construct an auto_ptr instead and now just don't bother
mentioning the pointer in the destructor.
(Another tip - when using initializer lists, you must write the list in the same order that the members are listed in your class declaration, because that's the order the compiler constructs them. You can write it out of order, but they will still be constructed in order. Being correct about the order of your member initialization allows a later member to take earlier members as parameters in its constructor).
Anyway, back to our regular program... auto_ptr's
are great, and for
most purposed will greatly improve the reliability of your code. Using
a corresponding class to acquire mutexes is important rather than
explicitly acquiring and releasing a lock, as this prevents a lock from
staying held in the event of an exception.
But in our case here, using the auto_ptr is just plain wrong.
It happens that it works on my compiler, and maybe it will usually work to provide upward compatibility with older C++ practice, but I know for sure in the general case it's wrong.
The problem is that you need to call delete [] on a pointer to an array,
not just delete. delete [] guarantees
that the destructor is called on
each element of the array, which is not important in the case of
primitive data types, but more importantly a pointer to a char array is
allocated with the new [] operator ( like
char *foo = new char[ 10 ]; )
rather than simple new, and since the number of array elements needs to
be stored somewhere it is possible that the pointer passed to delete
may not point to the beginning of the actual heap block that was given
to operator new [] by the underlying memory manager (usually
malloc).
More Effective C++: 35 New Ways to Improve Your Programs and Designs
[
Buy at Amazon]
[
Buy at Powell's]
It may be that the number of array elements is stored separately, so that doing this doesn't corrupt the heap, but even so you'll leak memory as that number stays allocated in some magic place known about only by the compiler vendor.
I looked all through Effective C++,
More Effective C++, and Bjarne Stroustrup's
The C++ Programming Language and didn't find any mention of using
auto_ptr's for arrays. Because auto_ptr is a template, I have its
source code with my development system, and ~auto_ptr simply calls
delete on the pointer it possesses as a member.
I asked about this on comp.lang.c++.moderated yesterday under the topic "auto_ptr's of arrays" and the basic answer was "no, you can't use auto_ptr for array pointers."
It was suggested that the reason auto_ptr's of arrays
(it would probably
need to be a separate class, like auto_array or
auto_array_ptr) weren't
provided in the standard library is that you can do most of what you'd
need to do with them with the vector template.
vector will do for most such purposes, but it can't
manage the results of transcoding C strings and DOMStrings. To
do what I have been doing, one would really want to roll a custom
template that looked just like auto_ptr except that it calls
delete []
on its member.
The best solution I've seen was suggested to me by Ted Byers,
rtbyers at bconnex.net - what he suggested was to write an inline function
that would transcode a DOMString, construct a C++ standard library
string from it, and delete the pointer returned from transcode. This is
what I have in the following, with a little bit added for exception
safety (I use namespaces, which is not normally done in the Xerces
library, but it would be easy to remove the namespace):
So now whenever I need to construct a button I just do the following:
Exceptional C++: 47 Engineering Puzzles, Programming Problems, and Solutions
[
Buy at Amazon]
[
Buy at Powell's]
Much cleaner and exception safe too. The only thing that really needs
to be done still is to test ( attrVal != NULL ) when it is constructed.
It would be null if the element did noy have the attribute.
I had thought of doing something very similar to manage the result of
XMLString::transcode(), except that I would return a
vector< XMLCh >.
The problem is you'd need to access the data of the vector as a raw
pointer; I think vector::front returns a reference to the first element
(haven't tried it) but I'm leary of ever using pointers inside a vector
because it can move its contents in the even it grows (I know this
wouldn't happen in our case, I just think it's a bad habit to get into).
Using a vector< XMLCh > is also inefficient because you'd be copying the
string passed back by transcode(), rather than using the string itself.
This isn't a big deal in my conversion to std::string because all the
functions I normally use those strings with take std::string or const
std::string & as parameters so I'd be constructing temporaries anyway.
It happens that in my program I never need to keep the result of
XMLString::transcode() around as a raw array anyway - you may need to
but for me it works great to take it straight to a DOMString:
And now the constructor to MyDOMStrings becomes:
and that makes it very simple.
Well I guess that about beats the subject to death. This is the end result of working with the Xerces library for six months, and although I've been concentrating on other things I've been trying to incrementally improve the quality of my code and learning to write better C++.
One last question I have though, is how I can say "using
MCLib::StringToDomString" in a class initializer list so I don't have to
keep repeating the MCLib:: in each initializer. Haven't figured that
one out yet (this wouldn't be necessary if you don't put it in a
namespace).
Michael D. Crawford GoingWare Inc. - Expert Software Development and Consulting http://www.goingware.com crawford@goingware.com Tilting at Windmills for a Better Tomorrow.
The answer to my last question - wrap the constructor in its own namespace declaration and put a using clause outside the constructor. I don't like to do "using" globally, which would also work:
Voting for GoingWare at The Programming Pages will encourage more people to read these articles.
Please see my Reciprocal Links page for more information