A mini-tutorial on how to write a simple SAX parser using SaxObjC's
SaxMethodCallHandler. The example implements a very simple RSS newsfeed
parser.
This example is the "follow-up" to rss2plist1,
which uses a "raw" SAX handler and needs to check the tags itself.
Sources
rss2plist2.m - a tool
slashdot.rss - a SlashDot example feed
ToDo: the following is a copy of rss2plist1.m ...
Introduction
When writing an XML processor using SAX, you are dealing with two entities:
the SAX parser (in SAX slang called a "reader")
the SAX handler
The parser is the part which actually reads an XML file and produces a
sequence of "SAX events" which are sent to the handler. If you know AppKit,
a SAX handler is basically a "delegate" of the parser.
So for doing XML processing, you only need to know how to instantiate a parser
and how to write a handler dealing with SAX events.
What do we want to parse ?
The example implements a simple RSS parser which collects just the RSS item
information and gives it back to the processor as an NSArray of
NSDictionaries.
Sample RSS section:
...
<item rdf:about="http://slashdot.org/article.pl?sid=03/01/22/1413238">
  <title>Elect Steve Jobs President of the United States</title>
  <link>http://slashdot.org/article.pl?sid=03/01/22/1413238</link>
  <description>
  Will Foster writes "There is a groundswell of support for electing Steve Jobs
  President of the United States." I'll vote for him if I can write in my vote
  -- ...
  </description>
  <dc:subject>humor</dc:subject>
  <dc:date>2003-01-22T23:12:48+00:00</dc:date>
  ...
</item>
...
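To make the target concrete: for the sample item above, the handler described in this tutorial should hand back roughly the following structure. This is a hand-built sketch, not parser output. The function name SampleRSSEntries is made up for illustration, the keys are the local tag names (so dc:subject becomes subject), and the description key is left out because the sample text is abbreviated.

```objc
#import <Foundation/Foundation.h>

/* Hypothetical helper: builds by hand the kind of result the tutorial's
   handler is meant to produce for the sample item (one dictionary per
   <item>, keyed by the local names of its subtags). */
static NSArray *SampleRSSEntries(void) {
  NSDictionary *item;

  item = [NSDictionary dictionaryWithObjectsAndKeys:
    @"Elect Steve Jobs President of the United States",     @"title",
    @"http://slashdot.org/article.pl?sid=03/01/22/1413238", @"link",
    @"humor",                                               @"subject",
    @"2003-01-22T23:12:48+00:00",                           @"date",
    nil];
  return [NSArray arrayWithObject:item];
}
```

The function returns autoreleased objects, so it needs to be called with an NSAutoreleasePool in place.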
1. Instantiating the Parser
SaxObjC parsers are usually implemented as bundles and are managed by the
SaxXMLReaderFactory class which can locate and load an appropriate SAX
parser bundle for you.
id parser;

parser = [[SaxXMLReaderFactory standardXMLReaderFactory]
           createXMLReaderForMimeType:@"text/xml"];
You can reuse a parser instantiated like this as many times as you wish,
but you can use it in only one thread (the object itself is not reentrant).
2. Writing a SAX Handler
To do the actual XML processing, you need to write a SAX handler class. What
we show here is only a very simplified one for RSS, but it shows the concepts
pretty well.
SAX handlers usually inherit from the SaxDefaultHandler class, which
already implements all SAX handler protocols with (usually empty) default
implementations.
So do we:
@interface RSSSaxHandler : SaxDefaultHandler
{
  NSMutableArray      *entries;
  /* parsing state */
  NSMutableDictionary *entry;
  BOOL                isInItem; /* are we inside an 'item' tag ? */
  NSString            *value;   /* the (PCDATA) content of a tag */
}

- (NSArray *)rssEntries;

@end
We have one array variable entries for keeping the results of the
processing. The other variables are required for tracking where we are in
the XML document and for collecting data, see below ...
Next comes the usual boilerplate: the @implementation of the handler class,
an -init which sets up the objects used for processing and a -dealloc which
ensures that they are released again:
@implementation RSSSaxHandler

- (id)init {
  if ((self = [super init])) {
    self->entries = [[NSMutableArray alloc] initWithCapacity:16];
    self->entry   = [[NSMutableDictionary alloc] initWithCapacity:8];
  }
  return self;
}

- (void)dealloc {
  [self->value   release]; /* may still hold text if parsing was aborted */
  [self->entry   release];
  [self->entries release];
  [super dealloc];
}
Once parsing is done, all <item> information has been collected in the
entries array of the handler. Because that array is reused for each parsing
run, the results accessor -rssEntries gives back a copy of it:
- (NSArray *)rssEntries {
  return [[self->entries copy] autorelease];
}
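Why the copy matters can be shown with plain Foundation objects. The following sketch (the function name SnapshotSurvivesReset is made up for illustration) takes a copy the way -rssEntries does, then empties the mutable array the way the next parsing run would, and checks that the copy is unaffected.

```objc
#import <Foundation/Foundation.h>

/* Hypothetical demo function: returns YES if a copy taken before the
   mutable array is emptied still holds its element afterwards. */
static BOOL SnapshotSurvivesReset(void) {
  NSMutableArray *entries;
  NSArray        *snapshot;
  BOOL           ok;

  entries = [[NSMutableArray alloc] initWithCapacity:16];
  [entries addObject:@"item from first run"];

  snapshot = [[entries copy] autorelease]; /* same idiom as -rssEntries  */
  [entries removeAllObjects];              /* a new run empties the array */

  ok = ([snapshot count] == 1 && [entries count] == 0) ? YES : NO;
  [entries release];
  return ok;
}
```

Had the accessor returned the mutable array itself, a caller holding on to the result would see it change under its feet as soon as the handler is reused.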
The SAX reader sends the handler a startDocument message before parsing and
an endDocument message when it is done. Those callbacks are useful to set up
and tear down per-document processing state. In this startDocument
implementation we ensure that the entries array is empty (eg if a processing
error occurred in the previous run, it might contain partial results).
- (void)startDocument {
  [self->entries removeAllObjects];
  self->isInItem = NO;
}
The SAX parser triggers callbacks
when it encounters tags, processing instructions, content, errors, namespace
declarations, etc. All callbacks are implemented by our superclass
SaxDefaultHandler, so we only need to override the callbacks we are interested
in: tags and content.
If the SAX parser encounters a start tag (eg <item>) it calls the
startElement callback and passes in the tagname - as it exists in the file
in rawName, and after XML namespace processing in localName and
ns. The attributes of the tag are provided in the attributes
object - but since this example does not need any of them (the rdf:about of
the sample item just repeats the link), we can ignore them.
In the case of an <item> tag, we clear the entry record and
set a marker (isInItem). The entry dictionary is used to
collect the information of all subtags of <item>.
We also clear the value on every tag we enter; the variable is explained
with the -characters callback.
- (void)startElement:(NSString *)_localName
  namespace:(NSString *)_ns
  rawName:(NSString *)_rawName
  attributes:(id)_attributes
{
  if ([_localName isEqualToString:@"item"]) {
    [self->entry removeAllObjects];
    self->isInItem = YES;
  }

  /* always reset content when entering a new tag */
  [self->value release]; self->value = nil;
}
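The handler in this tutorial looks at the local name only, which is enough for the sample feed. If a feed used the same local name in two namespaces, the namespace argument could be used to tell them apart. The helper below is a sketch of that idea and not part of the tutorial's handler: the function name RSSKeyForTag and the "dc:" key prefix are made up for illustration, and it assumes the parser reports the Dublin Core namespace URI of the feed in the namespace argument.

```objc
#import <Foundation/Foundation.h>

/* The Dublin Core element namespace commonly bound to the dc: prefix. */
static NSString *DCNamespace = @"http://purl.org/dc/elements/1.1/";

/* Hypothetical helper: derives a dictionary key from the local name and
   the namespace URI, so that Dublin Core tags get their own keys. */
static NSString *RSSKeyForTag(NSString *_localName, NSString *_ns) {
  if ([_ns isEqualToString:DCNamespace])
    return [@"dc:" stringByAppendingString:_localName];

  /* any other (or no) namespace: plain local name, as in the tutorial */
  return _localName;
}
```

With such a helper, a dc:date tag would be stored under "dc:date" while a plain title tag keeps the key "title".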
The endElement callback is called for every closing tag. It has to handle
three cases: a) the item section is closed by a </item>, b) we are
inside of an item section, c) we are outside of an item section.
In case a) we add the record containing the item information to the
entries array. We make a copy of entry since we reuse that
dictionary for every item.
In case b) we use the tagname of the subtag as the key for the collected
character data and store that data inside the entry record. We
only add the key if the subtag actually contained some character data.
In case c) we do nothing ;-) we are only interested in information contained
inside an <item> section.
- (void)endElement:(NSString *)_localName
  namespace:(NSString *)_ns
  rawName:(NSString *)_rawName
{
  if ([_localName isEqualToString:@"item"]) {
    /* found end of item */
    self->isInItem = NO;
    [self->entries addObject:[[self->entry copy] autorelease]];
  }
  else if (self->isInItem) {
    /* any tag inside of an item is a key for the entry dict */
    if (self->value) {
      /* if we collected a PCDATA value, add it */
      [self->entry setObject:self->value forKey:_localName];
      [self->value release]; self->value = nil;
    }
  }
}
Finally the PCDATA (non-tag content) callback. If we encounter a
<i>hello</i> the SAX parser will call the -characters callback
with "hello" as the string.
For our example we collect all PCDATA in the value variable for later
addition to the entry record in the -endElement callback.
Attention: it is not guaranteed that the SAX parser calls the callback only
once per text section! Eg you might well get two calls like characters:"he"
and characters:"llo". This complicates the handler (we need to append the
string if we already stored one), but makes it easier to write parsers.
By checking whether we are in an <item> section, we ensure that we don't
collect unnecessary content.
- (void)characters:(unichar *)_chars length:(int)_len {
  NSString *s;

  if (!self->isInItem) return;

  s = [[NSString alloc] initWithCharacters:_chars length:_len];
  if (self->value) {
    /* append to the text collected so far and release the old string,
       which would otherwise leak */
    NSString *old = self->value;

    self->value = [[old stringByAppendingString:s] copy];
    [old release];
    [s release];
  }
  else
    self->value = s;
}
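An alternative way to deal with chunked character data is to collect it in an NSMutableString, so that there is only one object to manage. The following is a minimal sketch using manual reference counting like the rest of the tutorial. The class ChunkCollector is a made-up stand-in which contains just the text collection, not the tutorial's handler.

```objc
#import <Foundation/Foundation.h>

/* Hypothetical stand-in for a handler, reduced to text collection. */
@interface ChunkCollector : NSObject
{
  NSMutableString *value; /* text collected so far, nil if none */
}
- (void)characters:(unichar *)_chars length:(int)_len;
- (NSString *)value;
@end

@implementation ChunkCollector

- (void)dealloc {
  [self->value release];
  [super dealloc];
}

- (void)characters:(unichar *)_chars length:(int)_len {
  NSString *s;

  if (_len <= 0) return;

  s = [[NSString alloc] initWithCharacters:_chars length:_len];
  if (self->value == nil)
    self->value = [[NSMutableString alloc] initWithCapacity:64];
  [self->value appendString:s]; /* works for the first and any later chunk */
  [s release];
}

- (NSString *)value {
  /* hand out an immutable snapshot, nil if nothing was collected */
  return [[self->value copy] autorelease];
}

@end
```

Feeding such an object the two chunks "he" and "llo" in sequence leaves it with the single value "hello", no matter how the parser splits the text.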
Close the implementation, that's it ;-)
@end /* RSSSaxHandler */
3. Connecting the SAX Handler to the Parser
Now that you have the parser and a handler, you need to connect the two:
RSSSaxHandler *sax;

sax = [[[RSSSaxHandler alloc] init] autorelease];
[parser setContentHandler:sax];
[parser setErrorHandler:sax];
A SAX parser can actually have different kinds of handlers - eg separate
handlers for errors, for DTD information, for the content - but in practice
you almost always use a single handler which inherits from the
SaxDefaultHandler class.
4. Start the Parsing
Easy. Let the parser do the parsing by passing it a URL, then query the
results from the handler.
NSArray *entries;
[parser parseFromSource:[NSURL URLWithString:@"file:///...."]];
entries = [sax rssEntries];
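Put together, the four steps could be wrapped in a small command line tool along the following lines. This is a sketch under several assumptions, not the actual rss2plist source: it assumes that the RSSSaxHandler class from section 2 is defined earlier in the same file, that the SaxObjC declarations are available as SaxObjC/SaxObjC.h, that the factory returns nil if it finds no parser, and that the tool gets the feed files as command line arguments.

```objc
#import <Foundation/Foundation.h>
#import <SaxObjC/SaxObjC.h> /* assumed header location */
#include <stdio.h>

/* assumes the RSSSaxHandler class from section 2 is defined above */

int main(int argc, char **argv) {
  NSAutoreleasePool *pool;
  RSSSaxHandler     *sax;
  id                parser;
  int               i;

  pool = [[NSAutoreleasePool alloc] init];

  /* step 1: instantiate the parser */
  parser = [[SaxXMLReaderFactory standardXMLReaderFactory]
             createXMLReaderForMimeType:@"text/xml"];
  if (parser == nil) {
    /* assumption: nil means that no parser bundle was found */
    fprintf(stderr, "found no SAX parser for text/xml\n");
    [pool release];
    return 2;
  }

  /* steps 2 and 3: create the handler and connect it to the parser */
  sax = [[[RSSSaxHandler alloc] init] autorelease];
  [parser setContentHandler:sax];
  [parser setErrorHandler:sax];

  /* step 4: parse each file given on the (hypothetical) command line */
  for (i = 1; i < argc; i++) {
    NSString *path;

    path = [NSString stringWithUTF8String:argv[i]];
    [parser parseFromSource:[NSURL fileURLWithPath:path]];

    /* -description prints the collected array in a plist-like notation */
    printf("%s\n", [[[sax rssEntries] description] UTF8String]);
  }

  [pool release];
  return 0;
}
```

Parser and handler are created once and reused for all files, which is the reason why the handler empties its entries array in startDocument.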
Note: You can also pass the parser an NSData or NSString object containing
an XML document for parsing.
Note: You can also parse "plist", "pyx", "iCalendar" and "vCalendar" files
using the specialized SAX parsers that come with SaxObjC (SAX works well for
processing many different XML-like structured text formats).
What's next ?
"Raw" SAX handlers are usually only used if you need to process very large
documents or if you need to process documents before you have the whole
data available (in a streaming fashion).
So for doing "real" work, take a look at SaxObjectDecoder or DOM - much
easier.
Note: before you start implementing an RSS reader using the tutorial
as a starting point, take a look at the excellent MulleNewz application
available for MacOSX, for .NET and for GNUstep!
Written by
Helge Heß