Removing entities from HTML in Cocoa

To display accented characters and certain symbols in an HTML or XML document you need to encode them as entities. For example, the copyright symbol © is represented in HTML as &copy;.

Applications like NewsMac Pro need to be able to decode these entities and translate them to the appropriate character. Straightforward, you might think, but actually it isn’t. There are multiple ways in which a character can be encoded: with a textual name, as above, but also with a decimal or hex value (&#169; and &#xA9; for the same copyright symbol). In NewsMac Pro I used to use NSAttributedString’s initWithHTML method; however, for whatever reason this seems to lock up under Tiger, so I had to find an alternative solution. I thought I’d post the following code to help out other developers, because if you go searching on this topic you will most likely find people telling you to use the NSAttributedString method.

This probably isn’t the most elegant bit of code ever, but it serves its purpose:

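+ (NSString *) decodeCharacterEntitiesIn:(NSString *)source
{
  if(!source) return nil;
  else if([source rangeOfString: @"&"].location == NSNotFound) return source;
  else
  {
    // Named entities for the Latin-1 characters 160-255, in code point order,
    // so that the entity at index i decodes to character 160 + i
    NSArray *codes = [NSArray arrayWithObjects: 
      @"&nbsp;", @"&iexcl;", @"&cent;", @"&pound;", @"&curren;", @"&yen;", @"&brvbar;",
      @"&sect;", @"&uml;", @"&copy;", @"&ordf;", @"&laquo;", @"&not;", @"&shy;", @"&reg;",
      @"&macr;", @"&deg;", @"&plusmn;", @"&sup2;", @"&sup3;", @"&acute;", @"&micro;",
      @"&para;", @"&middot;", @"&cedil;", @"&sup1;", @"&ordm;", @"&raquo;", @"&frac14;",
      @"&frac12;", @"&frac34;", @"&iquest;", @"&Agrave;", @"&Aacute;", @"&Acirc;",
      @"&Atilde;", @"&Auml;", @"&Aring;", @"&AElig;", @"&Ccedil;", @"&Egrave;",
      @"&Eacute;", @"&Ecirc;", @"&Euml;", @"&Igrave;", @"&Iacute;", @"&Icirc;", @"&Iuml;",
      @"&ETH;", @"&Ntilde;", @"&Ograve;", @"&Oacute;", @"&Ocirc;", @"&Otilde;", @"&Ouml;",
      @"&times;", @"&Oslash;", @"&Ugrave;", @"&Uacute;", @"&Ucirc;", @"&Uuml;", @"&Yacute;",
      @"&THORN;", @"&szlig;", @"&agrave;", @"&aacute;", @"&acirc;", @"&atilde;", @"&auml;",
      @"&aring;", @"&aelig;", @"&ccedil;", @"&egrave;", @"&eacute;", @"&ecirc;", @"&euml;",
      @"&igrave;", @"&iacute;", @"&icirc;", @"&iuml;", @"&eth;", @"&ntilde;", @"&ograve;",
      @"&oacute;", @"&ocirc;", @"&otilde;", @"&ouml;", @"&divide;", @"&oslash;", @"&ugrave;",
      @"&uacute;", @"&ucirc;", @"&uuml;", @"&yacute;", @"&thorn;", @"&yuml;", nil];

    // Named entities above Latin-1; highCodeNumbers holds the matching code points
    NSArray *highCodes = [NSArray arrayWithObjects: @"&OElig;",   // 338
                                                    @"&oelig;",   // 339
                                                    @"&Scaron;",  // 352
                                                    @"&scaron;",  // 353
                                                    @"&Yuml;",    // 376
                                                    @"&circ;",    // 710
                                                    @"&tilde;",   // 732
                                                    @"&ndash;",   // 8211
                                                    @"&mdash;",   // 8212
                                                    @"&lsquo;",   // 8216
                                                    @"&rsquo;",   // 8217
                                                    @"&sbquo;",   // 8218
                                                    @"&ldquo;",   // 8220
                                                    @"&rdquo;",   // 8221
                                                    @"&bdquo;",   // 8222
                                                    @"&dagger;",  // 8224
                                                    @"&Dagger;",  // 8225
                                                    @"&hellip;",  // 8230
                                                    @"&permil;",  // 8240
                                                    @"&lsaquo;",  // 8249
                                                    @"&rsaquo;",  // 8250
                                                    @"&euro;",    // 8364
                                                    nil];
    int highCodeNumbers[22] = { 338, 339, 352, 353, 376, 710, 732, 8211, 8212,
                              8216, 8217, 8218, 8220, 8221, 8222, 8224, 8225,
                              8230, 8240, 8249, 8250, 8364 }; // 22 ints

    // decode basic XML entities (written for manual reference counting).
    // Note: because this runs first, "&amp;copy;" ends up decoded twice, to ©
    CFStringRef unescaped = CFXMLCreateStringByUnescapingEntities (NULL, (CFStringRef)source, NULL);
    NSMutableString *escaped = [NSMutableString stringWithString: (NSString *)unescaped];
    CFRelease(unescaped); // the Create function returns an owned reference

    // Html
    int i, count = [codes count];
    for(i = 0; i < count; i++)
    {
      NSRange range = [source rangeOfString: [codes objectAtIndex: i]];
      if(range.location != NSNotFound)
      {
        [escaped replaceOccurrencesOfString: [codes objectAtIndex: i] 
                                 withString: [NSString stringWithFormat: @"%C", (unichar)(160 + i)] 
                                    options: NSLiteralSearch 
                                      range: NSMakeRange(0, [escaped length])];
      }
    }

    // Html high codes
    count = [highCodes count];
    for(i = 0; i < count; i++)
    {
      NSRange range = [source rangeOfString: [highCodes objectAtIndex: i]];
      if(range.location != NSNotFound)
      {
        [escaped replaceOccurrencesOfString: [highCodes objectAtIndex: i] 
                                 withString: [NSString stringWithFormat: @"%C", (unichar)highCodeNumbers[i]] 
                                    options: NSLiteralSearch 
                                      range: NSMakeRange(0, [escaped length])];
      }
    }

    // Decimal & Hex, e.g. &#169; or &#xA9;
    NSRange start, finish, searchRange = NSMakeRange(0, [escaped length]);
    i = 0;
    while(i < [escaped length])
    {
      start = [escaped rangeOfString: @"&#" options: NSLiteralSearch range: searchRange];
      if(start.location == NSNotFound) break; // nothing left to decode

      // find the first semicolon after the "&#", ignoring any earlier ones
      finish = [escaped rangeOfString: @";" options: NSLiteralSearch
                                 range: NSMakeRange(start.location, [escaped length] - start.location)];

      // the longest reference that fits in a unichar is &#65535; (finish - start == 7);
      // a bare "&#;" (finish - start == 2) has no digits and is skipped
      if(finish.location != NSNotFound && finish.location - start.location > 2
         && finish.location - start.location <= 7)
      {
        NSRange entityRange = NSMakeRange(start.location, (finish.location - start.location) + 1);
        NSString *entity = [escaped substringWithRange: entityRange];
        // strip the leading "&#" and the trailing ";"
        NSString *value = [entity substringWithRange: NSMakeRange(2, [entity length] - 3)];
        [escaped deleteCharactersInRange: entityRange];
        if([value hasPrefix: @"x"])
        {
          unsigned int tempInt = 0;
          NSScanner *scanner = [NSScanner scannerWithString: [value substringFromIndex: 1]];
          [scanner scanHexInt: &tempInt];
          [escaped insertString: [NSString stringWithFormat: @"%C", (unichar)tempInt] atIndex: entityRange.location];
        }
        else
        {
          [escaped insertString: [NSString stringWithFormat: @"%C", (unichar)[value intValue]] atIndex: entityRange.location];
        }
        i = start.location + 1; // carry on after the decoded character
      }
      else i = start.location + 2; // not a usable reference, skip past the "&#"
      searchRange = NSMakeRange(i, [escaped length] - i);
    }
    return escaped;
  }
}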
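
The decimal and hex forms can also be handled on their own, without Cocoa. What follows is only a rough sketch in plain C (which compiles as Objective-C too), not the code NewsMac Pro uses. The function names parse_code_point, put_utf8 and decode_numeric_entities are made up for this example, the output is assumed to be UTF-8, and named entities such as &copy; are deliberately left alone.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Parse digits in the given base (10 or 16) starting at p.
   Returns a pointer just past the digits, or NULL if there are no digits
   or the value is beyond the Unicode range. */
static const char *parse_code_point(const char *p, unsigned base, unsigned long *cp)
{
    const char *first = p;
    unsigned long value = 0;
    for (;; p++) {
        unsigned digit;
        if (*p >= '0' && *p <= '9') digit = (unsigned)(*p - '0');
        else if (base == 16 && *p >= 'a' && *p <= 'f') digit = (unsigned)(*p - 'a') + 10;
        else if (base == 16 && *p >= 'A' && *p <= 'F') digit = (unsigned)(*p - 'A') + 10;
        else break;
        value = value * base + digit;
        if (value > 0x10FFFF) return NULL;
    }
    if (p == first) return NULL;
    *cp = value;
    return p;
}

/* Write cp to out as UTF-8. Returns the byte count, or 0 for a surrogate
   or out-of-range value. */
static size_t put_utf8(unsigned long cp, char *out)
{
    if (cp < 0x80) { out[0] = (char)cp; return 1; }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (char)(0xF0 | (cp >> 18));
        out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Decode &#NNN; and &#xHHH; references from src into dst.
   dst needs room for strlen(src) + 1 bytes: a reference is never shorter
   than the UTF-8 bytes it decodes to. Anything malformed is copied as-is. */
void decode_numeric_entities(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        if (src[0] == '&' && src[1] == '#') {
            const char *digits = src + 2;
            unsigned base = 10;
            unsigned long cp = 0;
            const char *end;
            if (*digits == 'x' || *digits == 'X') { base = 16; digits++; }
            end = parse_code_point(digits, base, &cp);
            if (end != NULL && *end == ';' && cp != 0) {
                size_t written = put_utf8(cp, dst);
                if (written > 0) {
                    dst += written;
                    src = end + 1;      /* skip past the ';' */
                    continue;
                }
            }
        }
        *dst++ = *src++;                /* ordinary byte or malformed reference */
    }
    *dst = '\0';
}
```

Passing "&#169; &#xA9;" to decode_numeric_entities leaves two copyright symbols separated by a space in the output buffer, while a malformed reference such as &#; is copied through untouched. Unlike the %C approach above, this version is not limited to 16-bit characters.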

The tyranny of broken HTML in RSS

One of the problems with rendering RSS content nicely is broken HTML tag pairs. It seems certain RSS generators are very careless when it comes to preparing item summaries, often chopping through the middle of link tags when snatching the first few lines of an article. This isn’t such a big deal if you’re just displaying one item, but if you’ve got a whole bunch of these displayed one after another a single broken anchor (link) tag or stray blockquote (indentation tag) can really mess things up. I really don’t want to have to get into HTML parsing but it looks like I’m not going to have much choice at this rate.

A couple of offenders I spotted today are BoingBoing and AppleMatters. There are many more, though; it’s just down to luck which ones get away with it and which ones don’t.
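
Short of full HTML parsing, one cheap defence would be to count how many tags of a given kind are still open at the end of a summary, then append that many closing tags before rendering it. The sketch below is plain C with hypothetical function names (starts_with_tag, unclosed_tags). It assumes otherwise sane tag syntax, ignores comments and any '>' inside attribute values, and simply skips a tag that was cut off before its closing '>'.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive check that s begins with the tag name and that the name
   ends there (whitespace, '/', '>' or end of string follows). */
static int starts_with_tag(const char *s, const char *name)
{
    size_t i;
    for (i = 0; name[i] != '\0'; i++) {
        if (tolower((unsigned char)s[i]) != tolower((unsigned char)name[i]))
            return 0;
    }
    return s[i] == '\0' || s[i] == '>' || s[i] == '/' || isspace((unsigned char)s[i]);
}

/* Returns how many <name ...> tags are still open at the end of the fragment. */
int unclosed_tags(const char *html, const char *name)
{
    int open = 0;
    const char *p = html;
    assert(html != NULL && name != NULL);
    while ((p = strchr(p, '<')) != NULL) {
        const char *end = strchr(p, '>');
        if (end == NULL)
            break;                      /* tag cut off mid-way: ignore it */
        if (p[1] == '/') {
            if (starts_with_tag(p + 2, name) && open > 0)
                open--;                 /* closer pairs with an earlier opener */
        } else if (starts_with_tag(p + 1, name)) {
            open++;
        }
        p = end + 1;
    }
    return open;
}
```

A summary chopped off inside a link, such as one ending right after an opening anchor tag and a few words of link text, gives a count of 1 for "a", so the renderer would append a single </a> before moving on to the next item. The same call with "blockquote" catches stray indentation tags.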

NSURLConnection woes

I’ve been trying to improve the speed at which things download in NewsMac Pro as well as provide support for things like feeds which require authentication. The logical choice seemed to be moving from using NSURLHandle and friends to NSURLConnection which was introduced with WebKit back in OS X 10.2.7.

The first thing that struck me about NSURLConnection was that it was very light on methods – still, I figured that would just make it a bit easier to use. Initially I tried using it synchronously (this means the thread that was doing the download would basically hang until the connection either finished downloading or failed). However, the performance wasn’t great, and I read on CocoaDev that this approach also leaked memory. So the other day I decided to do a pretty major overhaul of the download system to use the event-driven delegate methods. That wasn’t too hard and it only took a few hours to have it up and running, but then I discovered a huge caveat that seems to affect a lot of WebKit-related classes – it can’t cope at all well with threads. Now, in a networked application threads are essential unless you want the entire app to lock up for the duration of each burst of network activity. NSURLConnection does threading behind the scenes, but makes it very hard to actually run it from a thread of your own – which is more or less necessary if you want to have multiple concurrent downloads happening.

Anyway, I thought I’d solved this and performance was indeed better. Then I clicked on a freshly downloaded channel while others were still downloading, the new headlines popped up, I clicked one to see it displayed in the headline browser (a WebView) and boom, NewsMac crashed inside one of NSURLConnection’s threads. WTF? Clearly the WebView was creating its own NSURLConnection and that was conflicting with the ones I’d created, but I don’t see why it should. I really hope Apple fixes this ASAP because this is just shabby. I’m now left with the choice of going back to the old way of doing things or rewriting around something like CURLHandle, which I’ve just downloaded to estimate how much work it would take to integrate into NewsMac. That essential but broken classes like this make their way into the API of a shipping operating system and then remain unfixed for over a year strikes me as unacceptable, and NSURLConnection isn’t alone. At the very least Apple could provide a warning in the API that the class is still ‘experimental’.

While some of you might be horrified that NewsMac Pro seems this broken, let me reassure you that with an object-oriented language and a modular program like NewsMac, ripping out the engine and sticking a new one in isn’t that big of a deal – it’s just an annoyance because this is time I’d rather be spending on finishing other features.

Mixed metaphors

I’ve noticed that there might be some redundant functionality in NewsMac. The thing is, I’m not sure what the preferable solution to the problem is – what’s the best way of marking things that you use a lot and want quick access to?

Originally this problem was solved by the idea of favourites – you could mark any channel as a favourite and it would appear in the favourites bar and favourites collection. Then I introduced star ratings because I thought people like to be able to grade the usefulness of a given channel for future reference. But is there really enough difference between say a 4 or 5 star rating and having something marked as a favourite for quick access – surely you’ll probably want quick access to those sites you’ve rated so highly?

NewsMac Pro further complicates things because you can create any number of folders and drag any sets of channels into them that you desire for quick access. This essentially removes the original purpose of favourites. So I’m left wondering if it’s sufficient to just drop the whole favourites concept completely and just use star ratings, which allow more granular control over likes and dislikes. I can see a situation where you might only occasionally read a certain channel but still give it a high rating because of the quality and therefore not want it in your favourites listing, but even so it seems a bit tenuous.

The other thing is that NewsMac Pro also introduces the concept of bookmarks, and I can foresee there being confusion about the difference between bookmarks and favourites because the terminology is already mixed up by the different web browsers out there. In NewsMac Pro bookmarks apply to headlines – they let you keep a reference to a specific headline and they also override the automatic history removal, so a bookmarked headline will stay around indefinitely until you remove the bookmark. It probably won’t make the 1.0 release, but I want to have the ability to automatically export these bookmarks to Safari too.

Anyway on the topic of NewsMac oddities, the other thing that comes to mind is the way you synchronise things with an iPod or Palm – you have to separately mark it as ‘to be synchronised’. This concept caused confusion and will be removed in NewsMac Pro – instead you can just pick any folder to be the source of synchronised channels (well with the exception of the ‘All’ smart folder because it would exceed the capacity of an iPod or Palm to try and synchronise 100s of channels!).

Brushed metal and graphite blobs, whither the HIG?

Apple has long been known for devising and popularising good, solid computer user interfaces. However, there has been an increasing and worrying trend for Apple to throw conventional UI wisdom out the window, apparently to meet the whims of either Steve Jobs or the marketing department. The spread of the brushed metal look, like a plague, across every flagship Mac application is one example of this. Apple’s Human Interface Guidelines, for a long time seen as the bible of user interface design, kept contorting and twisting to add new reasons why brushed metal should be used rather than the standard Aqua look. First it was anything that interacted with a digital lifestyle device (in other words excusing all iLife apps), then it was anything with a ‘source list’, excusing the Finder. At this point it’s basically a free-for-all, visual consistency be damned.

That’s not to say brushed metal doesn’t have its place in the world. iTunes, the DVD player, the calculator and other simple applications that seek to mimic something from the real world make sense in looking a bit different because they are trying to build on people’s familiarity with those devices. But this doesn’t excuse applications like the Finder, which has been sorely lacking visually ever since OS X was released. The source list shortcut bar is useful, but is it really any different to what you could do in OS X 10.2 by dragging icons into the toolbar to act as shortcuts? Nope. It’s just a Windows XP inspired add-on, and unlike its Windows counterpart it’s arguably less useful because it’s not context sensitive. Wouldn’t it be cool if it displayed some photo editing features (like rotate, crop, etc.) when you were in your Pictures folder? How about the ability to quickly edit an MP3’s ID3 tags when you click on a music file in the Finder? This could easily be done through AppleScript and would allow for a whole new class of application development using the Finder as the backbone. But instead we just have a list of shortcut icons that sit there hoping they might come in handy, just like those in the Dock, on the Finder toolbar and any shortcuts you might have on your desktop. Sure, choice is a good thing, but at what point does it get in the way of adding some useful features for the rest of us?

Anyway, my issues with brushed metal aside, OS X 10.4 Tiger is continuing this downward spiral of user interface design. Look at the new interface in Mail (go check it out), where Apple has done away with the old drawer and introduced a more Outlook-style sidebar, or source list as we now call them. Of course it doesn’t look like any other source list in OS X, but at this point we should be used to each Apple application going its own way. The departure of the drawer, and with it the ability to place it on whichever side of the window one desired or to hide it completely, will no doubt be sorely missed by some. But drawers in and of themselves always looked a bit weird anyway, so I’m not entirely sure I’ll mourn them falling from favour in Mail, which I think is probably one of the only Apple applications to actually use one aside from iCal. I’ve decided to drop the drawer in NewsMac Pro as the source list makes it unnecessary now.

One glance at the toolbar and you immediately notice something odd: aside from the new ‘unified toolbar look’, the icons are now grouped together in graphite aqua blobs (which, as you can see from the colour of the window’s close/minimise/zoom widgets, ignore the user’s preferences). If you ask me it looks butt ugly compared with the current crop of elegant and colour-themed icons in Mail. The other oddity is the great big chunk of blank space to the left of the delete icon. Hopefully icons will move in to fill this as you add more to the toolbar, but still it looks damn weird.