A Learning Experience: Hack and Slash

Hello hello, dear readers! Sorry that I left you for a few days there; things stalled a little at the end of last week with my projects, and I didn't have much to say for myself.

Today, though, there is good news! After running into some snags with reformatting the data given to me in the last call from my Section Spinner, I've set out to parse it by brute force and get what I want from it. The HTML snippet returned is relatively well-formatted, but it does have some quirks. Right now I'm hacking and slashing my way through it with a standard SAX parser.

For the record, I did try to use TagSoup, but kept getting hung up on the same errors that I did with the normal SAX parser. I can only surmise that I'm doing it wrong, but as for doing it right... well, I couldn't find any good examples or concise documentation online to make that easy, so I'm just calling it off for now. Ha! If I do ever revisit it and/or get it working, I'll toss some code up to help any other wayward travellers.

During a brief test, it looks like I've managed to fully traverse the document! The following had to be done:

Due to XML not liking there to be more than one top-level element, I had to take the string I read the HTML into and surround it with <div></div> tags. That made it so that everything was read and not just the first set of top-level items.
Second was to do a replace on every &nbsp;, since the XML parser throws an exception when it sees one (the entity isn't defined in plain XML).
Next in the series of replace calls, all / > had to become /> as apparently that white space will make SAX ragequit without warning. Good to know!
Speaking of same-line closing tags, there was one unclosed tag that needed a special replace written just for it. For HTML from an untamed source, having just one was pretty ace.
Another thing SAX doesn't like is an 'attribute' without a value attached to it. In my case, the <option selected value="..."> really made it mad. Add another replace for "selected " to "".
Last but not least, I had to get rid of every ampersand. That was the one 'given' for all this alteration, haha.
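
Here's a rough sketch of what that chain of replaces might look like in Java, assuming the stock javax.xml.parsers SAX parser. The class and method names (SnippetCleaner, clean, countElements) and the sample HTML are made up for illustration and aren't the app's real code. The one-off fix for the unclosed tag is left out, since it depends entirely on which tag it was.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: the names and sample input are illustrative only.
class SnippetCleaner {

    // Applies the string fixes from the list above. The &nbsp; replace runs
    // before the blanket ampersand removal so the entity is matched whole.
    static String clean(String html) {
        String s = html;
        s = s.replace("&nbsp;", " ");   // entity that plain XML does not define
        s = s.replace("/ >", "/>");     // whitespace inside the self-closing marker
        s = s.replace("selected ", ""); // attribute with no value (blunt: also hits body text)
        s = s.replace("&", "");         // drop every remaining ampersand
        return "<div>" + s + "</div>";  // give the snippet a single top-level element
    }

    // Parses the cleaned string with the default SAX parser and counts start tags,
    // as a quick check that the whole document gets traversed.
    static int countElements(String xml) throws Exception {
        final int[] count = {0};
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                count[0]++;
            }
        });
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        String raw = "<p>Title&nbsp;A</p><select>"
                + "<option selected value=\"1\">One</option></select><br / >";
        String cleaned = clean(raw);
        System.out.println(cleaned);
        System.out.println(countElements(cleaned));
    }
}
```

Plain string replaces like these are blunt instruments: the "selected " one would also eat that word out of visible text, so this only holds up while the incoming HTML stays as predictable as it has been so far.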

So, not exactly an elegant series of solutions, but functional ones. I've only tested one path, so I'll have to make sure it's robust enough to stand on its feet with just these changes and then write the actual custom parser for the book results. That is, unfortunately, going to be a bit of a brute-force mess too. C'est la vie, I guess. We'll worry about making everything better once we are functional.

I'm getting clearance this week to make an XML repository to hold all of my currently hardcoded values. I'm pretty stoked about this, if only because dynamic is key. If and when I orphan this app, it needs to be able to stay on top of things, and if they don't have someone to change around hardcoded values... well, they shouldn't need to. So! That's a definite plus, and it's what I need for the next branch of functionality I have to program.

I also have a partner on my project now, who is a pretty cool guy. He's a veteran programmer with 12+ years of web dev experience and no experience on Android. We're going to be putting our heads together on this as best we can, probably starting next week. I don't know that he was entirely necessary, but the more the merrier, as they say!

So there's your update for the day! I'll try to stay more on top of things, lovely readers! Don't want to leave you hanging!

Tuesday, November 16, 2010