<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Something Similar &#187; rfeedparser</title>
	<atom:link href="http://somethingsimilar.com/category/rfeedparser/feed/" rel="self" type="application/rss+xml" />
	<link>http://somethingsimilar.com</link>
	<description>Just like it.</description>
	<lastBuildDate>Tue, 08 Sep 2009 16:30:30 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>libiconv and rFeedParser</title>
		<link>http://somethingsimilar.com/2007/07/22/libiconv-and-rfeedparser/</link>
		<comments>http://somethingsimilar.com/2007/07/22/libiconv-and-rfeedparser/#comments</comments>
		<pubDate>Sun, 22 Jul 2007 21:43:00 +0000</pubDate>
		<dc:creator>Jeff</dc:creator>
				<category><![CDATA[code]]></category>
		<category><![CDATA[iconv]]></category>
		<category><![CDATA[rfeedparser]]></category>

		<guid isPermaLink="false">http://www.somethingsimilar.com/wordpress/2007/07/22/libiconv-and-rfeedparser/</guid>
		<description><![CDATA[I got a chance to read libiconv&#8217;s DESIGN document (found in the tarball) and noticed this passage:


  Extensibility
  
  The dlopen(3) approach is good for guaranteeing extensibility if the iconv
   implementation is distributed without source. (Or when, as in glibc, you
   cannot rebuild iconv without rebuilding your libc, [...]]]></description>
			<content:encoded><![CDATA[<p>I got a chance to read <a href="http://www.gnu.org/software/libiconv/">libiconv</a>&#8217;s DESIGN document (found in the <a href="http://ftp.gnu.org/pub/gnu/libiconv/libiconv-1.11.tar.gz">tarball</a>) and noticed this passage:</p>

<blockquote>
  <p>Extensibility</p>
  
  <p>The dlopen(3) approach is good for guaranteeing extensibility if the iconv
   implementation is distributed without source. (Or when, as in glibc, you
   cannot rebuild iconv without rebuilding your libc, thus possibly
   destabilizing your system.)</p>
  
  <p>The libiconv package achieves extensibility through the LGPL license:
   Every user has access to the source of the package and can extend and
   replace just libiconv.so.</p>
  
  <p>The places which have to be modified when a new encoding is added are as
   follows: add an #include statement in iconv.c, add an entry in the table in
   iconv.c, and of course, update the README and iconv_open.3 manual page.</p>
</blockquote>

<p>The upshot of this is that adding new encodings through some <code>iconv-encodings</code> package will be a pain in the ass and would cause breakage in unexpected, fascinating ways.  But, there are smarter people than I out there, and maybe something can still be done.</p>

<p>Of course, this also means that we would not get FreeBSD &#8220;for free&#8221; (though, I imagine <code>xmlparser</code> doesn&#8217;t build on it, anyway) and we would have to come up with a solution for it as well.  </p>

<p>What a mess.</p>
]]></content:encoded>
			<wfw:commentRss>http://somethingsimilar.com/2007/07/22/libiconv-and-rfeedparser/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>On rFeedParser</title>
		<link>http://somethingsimilar.com/2007/07/22/on-rfeedparser/</link>
		<comments>http://somethingsimilar.com/2007/07/22/on-rfeedparser/#comments</comments>
		<pubDate>Sun, 22 Jul 2007 10:52:00 +0000</pubDate>
		<dc:creator>Jeff</dc:creator>
				<category><![CDATA[code]]></category>
		<category><![CDATA[rfeedparser]]></category>
		<category><![CDATA[ruby]]></category>

		<guid isPermaLink="false">http://www.somethingsimilar.com/wordpress/2007/07/22/on-rfeedparser/</guid>
		<description><![CDATA[This post is huge but I have not the time to make it smaller.  I&#8217;m so very tired.

A Quick Introduction

rFeedParser is an RSS/Atom feed parser.  It is a translation of Mark Pilgrim&#8217;s feedparser from Python to Ruby. It behaves almost exactly the same and passes somewhere near 99% of the tests on a [...]]]></description>
			<content:encoded><![CDATA[<p><em>This post is huge but I have not the time to make it smaller.  I&#8217;m so very tired.</em></p>

<h2>A Quick Introduction</h2>

<p><a href="http://rfeedparser.rubyforge.org">rFeedParser</a> is an RSS/Atom feed parser.  It is a translation of Mark Pilgrim&#8217;s <a href="http://feedparser.org">feedparser</a> from Python to Ruby. It behaves almost exactly the same and passes somewhere near 99% of the tests on an Ubuntu machine.  Other platforms have lower success rates due to differing Iconv installations.  The <a href="http://feedparser.org/docs">feedparser documentation</a> applies to this work, and almost any deviation from it should be considered a bug. Please <a href="http://rubyforge.org/tracker/?atid=12738&amp;group_id=3309&amp;func=browse">file any bugs</a> you find.</p>

<p>This project was inspired by Sam Ruby&#8217;s <a href="http://intertwingly.net/blog/2005/10/30/Testing-FeedTools-Dynamically">pirate testing</a> idea, one that I hope catches on beyond these feed parsers.</p>

<h2>The Basics</h2>

<pre><code>require 'rubygems'
require 'rfeedparser'

# Parse a feed from a URL or a local file path.
feed = FeedParser.parse('somefeedurlorfilepath')

# Results can be read with method calls or with Hash-style keys.
first = feed.entries.collect{|e| e['title'] }
second = feed['entries'].collect{|e| e.title }
if first == second
  puts "This is handy when dealing with e['id'], the guid of an item/entry"
end
</code></pre>

<h2>Installation</h2>

<p>Agh.  rFeedParser is a monster.  Tons of dependencies, some overlapping in areas, and one &#8220;not nice&#8221; dependency.  The &#8220;not nice&#8221; dependency is on Yoshida Masato&#8217;s <a href="http://www.yoshidam.net/Ruby.html#xmlparser">xmlparser</a>.  </p>

<p>You can install it by hand (be sure to add <code>return</code> in front of <code>stream</code> in saxdriver.rb, line 171), through &#8220;<code>sudo apt-get install libxml-parser-ruby1.8</code>&#8221; if you&#8217;re on Ubuntu or another Debian-based Linux, or through the xmlparser gem that I put together, which seems to work on only &#8220;some&#8221; Mac machines but all Linux boxes.  <code>xmlparser</code>, of course, depends on the <a href="http://expat.sourceforge.net/">Expat</a> XML parsing library.  If you install through MacPorts or by hand, be sure to install the <code>-dev</code>, <code>-devel</code> or whatever version has the full headers and libraries available for linking against.  </p>

<h2>The Latest and Greatest</h2>

<p>The latest version is 0.9.93&#8230; Okay, really, the latest version is <a href="http://rubyforge.org/frs/?group_id=3309&amp;release_id=13153">0.9.931</a>. There was a minor bug that, if it hadn&#8217;t been for the guilt of having put off the user who had brought it to me, I wouldn&#8217;t have worried about forgetting in 0.9.93. He/she (no name, just an email address) had been so nice about it. So, future users, take note: if you see a bug I haven&#8217;t fixed yet, guilt seems to work.  Also, bribery.  Patches certainly don&#8217;t hurt.</p>

<p>The 0.9.93 and 0.9.931 updates do a number of things:</p>

<ul>
<li>Fix a horrendous error when handling <code>content:encoded</code>, <code>body</code>, <code>xhtml:body</code>, <code>prodlink</code> and <code>fullitem</code></li>
<li>Added some further support of <a href="http://search.yahoo.com/mrss">Yahoo Media RSS</a>.  I&#8217;ve added support for <code>media:thumbnail</code> and <code>media:content</code> (the latter, only in its &#8220;two tag&#8221; form).  This came directly from a requirement in our project at <a href="http://activemediagroup.com">work</a>. <a href="http://diveintomark.org">Mark</a>, you should <a href="http://intertwingly.net/blog/2004/10/08/Unbundling-Pirate-Tests#c1097268531">admire my ability to get paid for this</a>. </li>
<li>Fixed up the lame ass headers code I had going.  I don&#8217;t remember what I was on when I wrote it, but it must have been fantastic.</li>
<li>py2rtime had some major bugs, and I can&#8217;t understand how it passed the tests with them.  I will give a dollar to anyone who figures it out, mainly because I don&#8217;t want to deal with it. See revision 57, and compare to both revision 58 and the current code in the repository.</li>
<li>Use rchardet 1.1. There was a rather serious bug in 1.0.  Never use <code>gsub!</code> ever, ever, ever, ever. <small>Maybe sometimes.</small></li>
<li>Some messed up indentation.  Neither vim nor Textmate can indent ruby code well, it seems.  Or maybe I write weird looking code.  Luckily, I&#8217;m reading the <a href="http://en.wikipedia.org/wiki/Compilers:_Principles%2C_Techniques%2C_and_Tools">Dragon book</a> and <a href="http://steve-yegge.blogspot.com/2007/06/rich-programmer-food.html">learning things</a> and I may decide to tackle it.</li>
<li><code>ForgivingURI</code> continues to be something I desperately want to see in the Ruby core libraries.  <code>URI.parse</code> shouldn&#8217;t puke every time some loser fucks up his syntax.  At least, give me something more than &#8220;bad URI(is not URI?)&#8221; no matter what the problem is.  It&#8217;s something I stole from <a href="http://sporkmonger.com">Bob Aman</a>&#8217;s <a href="http://sporkmonger.com/projects/feedtools">FeedTools</a>.  </li>
</ul>
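
<p>A minimal sketch of the kind of forgiveness wished for in the last item, assuming all you want is <code>nil</code> instead of an exception. The helper name <code>parse_uri_or_nil</code> is made up for illustration; it is not the <code>ForgivingURI</code> code from rFeedParser or FeedTools, and it relies only on <code>URI.parse</code> and <code>URI::InvalidURIError</code> from the standard library.</p>

```ruby
require 'uri'

# Hypothetical helper (not part of rFeedParser): return a URI object,
# or nil when URI.parse rejects the text as malformed.
def parse_uri_or_nil(text)
  URI.parse(text)
rescue URI::InvalidURIError
  nil
end

good = parse_uri_or_nil('http://somethingsimilar.com/feed/')
bad  = parse_uri_or_nil('http://somethingsimilar.com/a b') # unescaped space

puts good.host
puts bad.inspect
```

<p>Swallowing the error this way throws away the reason for the failure, which is the very complaint above, so a real replacement would probably want to repair or report the bad part rather than just return <code>nil</code>.</p>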

<p>Speaking of patches, those interested in helping development can find a <a href="http://bazaar-vcs.org/">bzr</a> repository for rfeedparser on this <a href="http://somethingsimilar.com/code/bzr/">very site</a>.  This is probably dumb, and a bandwidth hog, but I&#8217;m too lazy to either a) go to my workplace and log into my Ubuntu box with bzr-svn or b) patch svn on the Mac laptop I&#8217;m currently writing on to put it up on <a href="http://rubyforge.org">rubyforge</a>.</p>

<h2>Gotchas, Monkey Patches and Other Disgusting Things</h2>

<p>Now, on to the ugly.  </p>

<p>As Sam <a href="http://intertwingly.net/blog/2007/07/21/rFeedParser">points out me pointing out</a>, the original <a href="http://feedparser.org">feedparser</a> tests require the parsed times to be stored in Python&#8217;s 9-tuple format. For those of you who aren&#8217;t jargon whores, that&#8217;s basically a list of 9 integers <a href="http://feedparser.org/docs/date-parsing.html">specifying the date</a>. Unfortunately, Ruby doesn&#8217;t have a method in <code>Time</code> that can take that format.  The solution, for our purposes, is to use the <code>py2rtime</code> top-level method I wrote that does the (very easy) task of putting the 9-tuple in a form <code>Time.utc</code> can understand.  (Also, Sam&#8217;s suggestion of naming it <code>feeddate</code> sounds pretty damn good). </p>
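
<p>The conversion described above can be sketched in a few lines. This is not the actual <code>py2rtime</code> source; the name <code>nine_tuple_to_time</code> is hypothetical, and it assumes the tuple follows Python&#8217;s <code>struct_time</code> order (year, month, day, hour, minute, second, weekday, yearday, isdst) and is already in UTC.</p>

```ruby
# Hypothetical stand-in for py2rtime: build a UTC Time from a Python-style
# 9-tuple. Only the first six fields are needed; weekday, yearday and the
# DST flag can be derived from the date, so they are ignored here.
def nine_tuple_to_time(tuple)
  year, month, day, hour, minute, second = tuple.first(6)
  Time.utc(year, month, day, hour, minute, second)
end

# The pubDate of the libiconv post, written as a 9-tuple.
published = nine_tuple_to_time([2007, 7, 22, 21, 43, 0, 6, 203, 0])
puts published.inspect
```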

<p>Also, the <code>SGMLParser</code> in <code>HTMLTools</code> is kind of broken.  The <code>Regexp</code>s don&#8217;t really work as intended (which I really need to send in patches for) and it&#8217;s really, really not UTF-8 safe.  Oh, god. Making it UTF-8 safe involved code so ugly, so treacherous, that I will probably get cancer from it.  </p>

<p>The UTF-8 stuff, of course, isn&#8217;t the developers&#8217; fault.  Ruby&#8217;s encoding support sucks so much that it seems quite a few people thought it would make writing a decent feed parser nearly impossible.  </p>

<p>So, how did I do it?  Through beta software, overlapping dependencies, relying on <code>iconv</code> (which is always terribly configured in any operating system) and a total disregard for passing the encoding tests.  That&#8217;s right, <a href="http://rfeedparser.rubyforge.org">rfp</a> uses both the <code>character-encodings</code> gem and ActiveSupport and we still have dozens of failures and errors, the number of each depending on what OS we&#8217;re on! </p>

<p>So, most of the former Eastern Bloc just won&#8217;t get to use <a href="http://rfeedparser.rubyforge.org">rFeedParser</a> for a while. Sorry. (Hey, Hungary, it supports your datetimes! Does that make you feel better?) </p>

<p>If someone could magic up some sort of <code>iconv-encodings</code> gem or tarball that can give us a standard <code>iconv</code> install to work with, we might be able to make the encoding situation better.  I would do it, however, I have got shit to do that doesn&#8217;t make me want to gather up and shove ballpoint pens into my brainstem. Or slit my wrists with codepoints.  (I&#8217;m pretty sure I could come up with a physically realizable way to approximate the latter.)  Sigh, maybe I&#8217;ll get to it later, but I&#8217;d love to have some help.</p>

<p>On to the straight-up monkey patches.</p>

<p>There are a few on <code>Hpricot</code>, but they have very little impact.  They may make Hpricot load a bit slower on boot due to the huge element lists I put in there.  Also, there is a method called <code>Hpricot.scrub</code>, but it is no longer the <code>Hpricot.scrub</code> that <a href="http://underpantsgnome.com/2007/01/20/hpricot-scrub/">you know so well</a>. It originally was, but I needed to do some extra things that added a couple of scans on top of the two already in there and, suddenly, it was a bottleneck. So, apologies for the confusing name. </p>

<p>(Jeff Hodges&#8217; Trivia Time:  The guy who wrote Hpricot#scrub, Michael Moen, is the guy who &#8220;officially&#8221; put Jeff&#8217;s name in for the position at ICTV.  He and Jeff work together on the same Ruby on Rails application as members of the ActiveMedia Group.  When discussing new problems with Michael, Jeff is often boggled by Michael&#8217;s clarity of thought.)</p>

<p>Oh, and one more monkey patch.  <code>xmlparser</code> doesn&#8217;t return the attributes of the XML tags as a <code>Hash</code>, but <code>SGMLParser</code> does and it would have been pretty damn handy if it did, so I made it do that.  The code is in <code>better_attributelist.rb</code> (my filenames are full of ego), and it could be done better, but it suits my purpose.</p>
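
<p>The idea behind that patch can be sketched roughly as follows. This is not the code in <code>better_attributelist.rb</code>; it assumes, purely for illustration, that the attributes arrive as a flat list of alternating names and values, and the name <code>attributes_to_hash</code> is hypothetical.</p>

```ruby
# Hypothetical sketch: turn a flat [name, value, name, value, ...] list
# into a Hash so callers can look attributes up by name.
def attributes_to_hash(flat_list)
  flat_list.each_slice(2).to_h
end

attrs = attributes_to_hash(['href', 'http://somethingsimilar.com', 'rel', 'self'])
puts attrs['rel']
```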

<p>Other ugly things: <code>ForgivingURI</code> (as mentioned above) and the inconsistent naming of methods that came about after a few bad nights of hacking through Ruby&#8217;s inheritance problems.  I fixed the actual architectural problem long ago, but left the terrible names in there.  So, the <code>self.fooThing</code> and <code>_hasDumbPrefix</code> stuff is my bad.  Except for the methods in <code>FeedParserMixin</code> that are named after XML tags.  Those names are prefixed with &#8216;_&#8217; (as they are even in the original Python code) in order to work around the differences between the XML parser and SGML parser.  </p>

<p>I should also mention the metric ass load of datetime parsing regular expressions I had to write.  Another set of patches I need to write, this time to Ruby core.  I don&#8217;t even want to discuss them.  Go look at <code>time_helpers.rb</code> and see how many times I made one <a href="http://regex.info/blog/2006-09-15/247">problem into two</a>.  My code is grody.</p>
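
<p>For a taste of what those expressions look like, here is a minimal sketch of one regular expression for a single date layout. It is not taken from <code>time_helpers.rb</code>; the constant <code>W3C_UTC</code> and the method <code>parse_w3c_utc</code> are hypothetical, and it only handles timestamps of the form <code>2007-07-22T10:52:00Z</code>.</p>

```ruby
# Hypothetical sketch: one pattern for one layout. A real feed parser
# needs many of these, one per date format seen in the wild.
W3C_UTC = /\A(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})Z\z/

# Returns a UTC Time, or nil when the text does not match the layout.
def parse_w3c_utc(text)
  match = W3C_UTC.match(text)
  return nil unless match
  Time.utc(*match.captures.map { |part| part.to_i })
end

puts parse_w3c_utc('2007-07-22T10:52:00Z').inspect
puts parse_w3c_utc('last Tuesday').inspect
```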

<h2>The Future of the Tests</h2>

<p>Sam brings up the idea of making the tests from the Python <a href="http://feedparser.org">feedparser</a> less, er, Pythonic.  <del>We could speed up response time  If we change the expectations for dates to some method calling a 9-tuple (or rather, a 9-list or 9-Array, or 9-some-datastructure-with-brackets-not-parentheses.) we could get an instant win.</del> <ins>I have no idea what I was trying to say here.</ins></p>

<p>Also, the use of <code>u''</code>, <code>u""</code> and the <code>\unn</code> or <code>\unnnn</code> format for non-ASCII characters in Python had to be hacked around with regular expressions.  While the <code>character-encodings</code> gem provides something like the <code>u''</code> syntax, the <code>\u</code> characters are completely unsupported.  It&#8217;s really ugly, and kind of painful, esp. if a developer never had much experience with Python. Fortunately, I had a good deal, but probably not enough, considering the amount of time it took to write those Regexps.  </p>
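
<p>The <code>\unnnn</code> half of that hack can be sketched with one substitution. This is not the Regexp rFeedParser actually uses; the name <code>unescape_python_unicode</code> is hypothetical, and it assumes exactly four hex digits per escape and ignores surrogate pairs.</p>

```ruby
# Hypothetical sketch: replace each literal \uNNNN escape with the UTF-8
# encoding of that code point. pack('U') turns an Integer code point into
# a UTF-8 string.
def unescape_python_unicode(text)
  text.gsub(/\\u([0-9a-fA-F]{4})/) { [$1.hex].pack('U') }
end

# Single quotes keep the backslash literal, as it would appear in a test file.
puts unescape_python_unicode('caf\u00e9')
```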

<p>The XML test files are a huge boon, and making them more general would make it easier to maintain code equivalence across languages and allow those who are more comfortable in one language to help outside of that language&#8217;s project.  But, this is all just blue sky stuff for the moment. </p>

<h2>And Spent</h2>

<p>This post is huge and I need to stop writing.  I don&#8217;t think I&#8217;ve talked about everything I wanted to, but I&#8217;m shot.  rFeedParser is nice and you should use it and tell other people to use it.  Questions and comments are welcome.</p>

<p><em>Update</em>: A few grammar and spelling clean ups.  Sucktasia on ice.</p>
]]></content:encoded>
			<wfw:commentRss>http://somethingsimilar.com/2007/07/22/on-rfeedparser/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Friggin&#8217; Module Bundles</title>
		<link>http://somethingsimilar.com/2007/06/26/friggin-module-bundles/</link>
		<comments>http://somethingsimilar.com/2007/06/26/friggin-module-bundles/#comments</comments>
		<pubDate>Tue, 26 Jun 2007 06:59:08 +0000</pubDate>
		<dc:creator>Jeff</dc:creator>
				<category><![CDATA[code]]></category>
		<category><![CDATA[rfeedparser]]></category>
		<category><![CDATA[ruby]]></category>

		<guid isPermaLink="false">http://www.somethingsimilar.com/wordpress/2007/06/26/friggin-module-bundles/</guid>
		<description><![CDATA[What was one of the things I wanted the most when I started writing rFeedParser? 

This.
]]></description>
			<content:encoded><![CDATA[<p>What was one of the things I wanted the most when I started writing <a href="http://rfeedparser.rubyforge.org">rFeedParser</a>?</p> 

<p><a href="http://project.ioni.st/post/1305#snippet_1305">This</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://somethingsimilar.com/2007/06/26/friggin-module-bundles/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
