Quantifying Syndication Feeds Usage on the Web
How many websites provide syndication feeds?
This post is going to be a lot less research-oriented than what usually gets published on our blog. But since computer science research often relies on software engineering and its tooling, let’s expose some of that for once.
Anyway, …
Let’s find out.
I became very interested in web syndication feeds recently. It’s an old technology nobody talks about anymore, but everybody still provides them. A look at Google Trends shows how much people have lost interest in RSS since its popularity peak in 2006.
So it’s a technology people seem to have lost interest in, but can we still rely on it? Are people abandoning web syndication? With the advent of CMSs (eg. the hugely popular WordPress), static website generators and publishing platforms (eg. Medium), which all provide syndication feeds by default, RSS doesn’t look dead to me.
This is how I became interested in finding out how many websites actually provide at least one XML syndication feed. Builtwith has a pretty nifty “trends” section which states that 33% of the 1 million most visited websites have an RSS feed. While this gives us a pretty good idea, let’s see what we can do by ourselves.
We basically have two options here:
- Actually crawling the web - a costly and lengthy process
- Relying on existing web crawl data
Enter CommonCrawl. CommonCrawl is a non-profit founded on the exciting project of crawling tons of web pages and releasing the resulting dataset publicly and for free. Their latest dump was published in November 2015 and contains 1.82 billion web pages, amounting to over 151TB of highly compressed HTML.
Working with CommonCrawl
As mentioned before, the latest Common Crawl is 151TB of data hosted on S3. We decided to process it directly from a few EC2 spot instances using the WDC Extraction Framework. Actually this choice was pretty obvious to us because the syndication feeds are not the only thing we wanted to extract from CommonCrawl. My colleagues were interested in extracting all structured data embedded in any HTML page which is exactly what the WDC framework was designed for. We also wanted to extract all HTML anchors pointing to a Wikipedia page. We forked the WDC framework and modified it to extract these different things all at once, thus only needing to go through the CommonCrawl dataset once.
Now, the CommonCrawl dump consists of HTML web pages, and we want to extract parts of them, eg. any `<link rel="alternate" href="/blog/atom.xml" title="something" type="application/atom+xml">` tag with either an RSS/Atom `type` attribute or a `rel="alternate"` attribute. Although parsing HTML with regexes is a bad idea, constructing a full-blown DOM tree out of each of these 1.82B pages would require far too much time and processing power.
Here’s the regular expression I used to match feeds:
(<link[^>]*(?:\s(?:type=[\"']?(application\/rss\+xml|application\/atom\+xml|application\/rss|application\/atom|application\/rdf\+xml|application\/rdf|text\/rss\+xml|text\/atom\+xml|text\/rss|text\/atom|text\/rdf\+xml|text\/rdf|text\/xml|application\/xml)[\"']?|rel=[\"']?(?:alternate)[\"']?))[^>]*>)
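As a quick sanity check (a Python sketch, not how the WDC framework actually applies the pattern), here is the same regex run against a sample page header; group 1 captures the whole `<link>` tag and group 2 the MIME type when one is present:

```python
import re

# Same pattern as above, split across lines for readability.
FEED_LINK_RE = re.compile(
    r"(<link[^>]*(?:\s(?:type=[\"']?(application\/rss\+xml|application\/atom\+xml"
    r"|application\/rss|application\/atom|application\/rdf\+xml|application\/rdf"
    r"|text\/rss\+xml|text\/atom\+xml|text\/rss|text\/atom|text\/rdf\+xml|text\/rdf"
    r"|text\/xml|application\/xml)[\"']?|rel=[\"']?(?:alternate)[\"']?))[^>]*>)")

html = ('<head><link rel="alternate" href="/blog/atom.xml" '
        'title="something" type="application/atom+xml"></head>')

FEED_LINK_RE.findall(html)
# -> [('<link rel="alternate" href="/blog/atom.xml" title="something"
#       type="application/atom+xml">', 'application/atom+xml')]
```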
The WDC framework wrote its results to our S3 bucket in `csv.gz` format. For the whole run we budgeted 500 USD for AWS EC2 instances. We spawned 100 c3.4xlarge EC2 instances (16 cores each) to run our modified WDC framework; using spot instances saved us some money. It took around 30 hours to process the whole Common Crawl dump and cost less than 450 USD, so we were right on target.
The most interesting part is of course post-processing the data, not extracting it. Here is what our results S3 bucket looks like, together with a short explanation of what each of our extractions produced:
2.3G /WDC_112015/anchors
60G /WDC_112015/feeds
54G /WDC_112015/urls
346G /WDC_112015/data
408G /WDC_112015/anchor_pages
37G /WDC_112015/stats
905G total
- anchors (gzipped TSV): page_url | anchor text | wikipedia_url
- feeds (gzipped TSV): page_url | link_tag | link_type
- urls (gzipped TSV): page_url
- data (gzipped text): one (subject, predicate, object, page, extractor_used) quintuplet per line
- anchor_pages (gzipped JSON): one {"url": page_url, "content": full_html_page} object per line
- stats (gzipped TSV): arcFileName | arcFilePos | detectedMimeType | hostIp | html-head-meta | html-mf-adr | html-mf-geo | html-mf-hcalendar | html-mf-hcard | html-mf-hlisting | html-mf-hrecipe | html-mf-hresume | html-mf-hreview | html-mf-species | html-mf-xfn | html-microdata | html-rdfa | html-rdfa11 | mimeType | recordLength | referencedData | timestamp | totalTriples | uri
The first step was of course to sync this bucket locally and back it up to our NAS. Once this was done, and after triggering security warnings at the network admins’ office by downloading almost 1TB at full speed, we proceeded to copy this data to the DAPLAB Hadoop cluster. DAPLAB is an awesome project aiming at providing a powerful data processing cluster, on a freemium and premium basis, to companies that cannot afford their own cluster, to researchers and scientists, etc. They also organize weekly hacking sessions that anyone can attend to get free access to the cluster. Our research lab has been partnering with DAPLAB since the beginning, and the DAPLAB infrastructure is a very nice complement to our lab’s Hadoop cluster. (No more advertising in this post, I promise.)
We now have all the data on HDFS, so let’s process it. Keep in mind this blog post is about the XML syndication feeds, not about what we did with the other things we extracted (anchors, RDF triples, etc.).
[vfelder@daplab ~]$ hdfs dfs -ls /data/WDC_112015/data/feeds | head -n2
Found 35669 items
1.7M 2016-02-28 15:21 /data/WDC_112015/data/feeds/ex_common-crawl_crawl-data_CC-MAIN-2015-48_segments_1448398444047.40_warc_CC-MAIN-20151124205404-00000-ip-10-71-132-137.ec2.internal.warc.gz.csv.gz
We have 35,669 gzipped TSV files. We don’t want to work with this format and compression: gzip is slow, and CSV/TSV is not ideal for querying the data. Also, 35k files is a bit too much for a ~10-node cluster. We will convert these 35k gzipped TSV files into 1,000 snappy-compressed Parquet files.
Why convert 36k gzipped CSV files into 1,000 snappy Parquet files?
- Parquet is better suited for querying (eg. using Hive) than CSV:
- Parquet files contain their schema, CSV files don’t.
- Parquet files store data by column, while CSV is row-based.
- Columns of a Parquet file are compressed, each column according to its data type.
- Snappy also offers great advantages over gzip here:
- Snappy compression and decompression are much faster than gzip’s.
- Snappy happily trades compression ratio for read/write speed. After all, when running a job that loads terabytes of data from HDFS on a cluster, we care more about fast reads and writes than about sparing a few TB.
- Fewer files means more throughput (as long as we still have more files than processing cores, of course). If we assume decompressing a file takes 100ms, the decompression alone for all 36k files adds up to a cumulated hour. Of course the whole process is distributed and runs in parallel, but it still constitutes overhead compared to working on only 1,000 files.
- We could have chosen Avro, but since our schemas are very basic and our main need is to query columns, Parquet made more sense.
Remember our TSV “schema” is `page_url | link_tag | link_type`. It’s a good thing we saved the whole `<link>` tag, because I realized after the EC2 run that it might be interesting to extract a few other attributes from it. I wrote a short Scala script for Spark to extract these things and go from the `csv.gz` files to 1,000 `.snappy.parquet` files. These 35k `.csv.gz` files, 60G in total, once converted to 1k `.snappy.parquet` files with 4 additional columns, now take 187G. Not bad.
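Here is a rough pyspark sketch of what that Scala job does. The column names, the exact set of extra attributes pulled out of the `<link>` tag, and the output path are assumptions; the real job was written in Scala, not Python:

```python
import re
from urlparse import urlparse      # Python 2, as used on the cluster at the time
from pyspark.sql import Row

# Naive attribute parsing, good enough for a sketch.
ATTR_RE = re.compile(r'(\w+)=["\']?([^"\'>\s]+)')

def to_row(line):
    page, tag, link_type = line.split("\t")
    attrs = dict(ATTR_RE.findall(tag))
    return Row(page=page,
               hostname=urlparse(page).netloc,   # hostname of the page URL
               tag=tag,
               type=link_type,
               rel=attrs.get("rel"),             # extra columns pulled out of the tag
               href=attrs.get("href"),
               title=attrs.get("title"))

rows = sc.textFile("/data/WDC_112015/data/feeds/*.csv.gz").map(to_row)

# Write 1,000 snappy-compressed Parquet files (snappy is not Spark 1.x's default codec).
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
sqlContext.createDataFrame(rows).repartition(1000) \
          .write.parquet("/data/WDC_112015/parquet/feeds")
```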
We could now create a Hive table like this:
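Something along these lines does the job; the column names and the location are the ones assumed in the sketch above. It is wrapped in a HiveContext call here to keep the snippets in one language, but the HiveQL inside works just as well from the Hive CLI:

```python
sqlContext.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS feeds (
    page     STRING,
    hostname STRING,
    tag      STRING,
    type     STRING,
    rel      STRING,
    href     STRING,
    title    STRING
  )
  STORED AS PARQUET
  LOCATION '/data/WDC_112015/parquet/feeds'
""")
```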
`EXTERNAL` means the data for this table isn’t moved by Hive to the table’s location; it stays where it is, and Hive loads it directly from these files. Should we drop the table, the data won’t be affected.
I also performed the above `csv.gz` to `snappy.parquet` conversion + hostname extraction for `/data/WDC_112015/data/urls/`, which contains all crawled URLs. The Scala program to do this is very similar to the one listed here and can be found here.
Hive is nice for running some queries, but Spark SQL is equally nice and generally offers better performance. It also allows me to run queries directly from the ipython notebook I’m writing this blog post in, get the query results, and rework them directly in Python.
The first part of this blog post was written in Markdown in my ipython notebook running pyspark; the rest of the post plays with the data directly from ipython, writing pyspark.sql queries and running them with pyspark over the cluster.
I’ve got a handy *sh alias that I use to spawn the notebook inside a screen session.
Let’s work with our extracted feeds and all crawled URLs
First, import some spark-sql libs. `sc` is the Spark context; it’s already there because we’re running ipython on pyspark.
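A setup cell might look like this (a sketch; the exact imports in the original notebook may have differed):

```python
from pyspark.sql import HiveContext
from pyspark.sql import functions as F   # handy for expressions such as F.desc("count")

# sc is already provided by pyspark; wrapping it in a HiveContext gives us
# DataFrames and Spark SQL with Hive support.
sqlContext = HiveContext(sc)
```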
We stored the feeds and urls as snappy-compressed Parquet files. All we have to do is read them as Parquet; everything else is taken care of.
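Reading them back is one line per dataset; the Parquet paths below follow the assumed layout used in the conversion sketch earlier:

```python
# DataFrames backed by the snappy Parquet files on HDFS (paths assumed).
feeds = sqlContext.read.parquet("/data/WDC_112015/parquet/feeds")
urls = sqlContext.read.parquet("/data/WDC_112015/parquet/urls")

# Register them so they can also be queried with plain SQL.
feeds.registerTempTable("feeds")
urls.registerTempTable("urls")
```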
Common Crawl said their dump had 1.82B URLs; let’s check that this is also the number of web pages we used for our extraction.
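Counting is a one-liner on the urls DataFrame (a sketch of the notebook cell):

```python
# Total number of crawled page URLs we extracted.
{'total number of urls': '{:,}'.format(urls.count())}
```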
{'total number of urls': '1,823,130,936'}
Nice, it seems the data matches.
The feeds and URLs files include these two columns: `page` and `hostname`, respectively the original URL of the web page and the hostname of that URL. A single web page can provide several feeds, but we’re only interested in the number of websites that provide at least one feed, so we take the distinct hostnames. Same with the URLs: we crawled a lot of URLs but are only interested in the number of distinct hostnames, so we can compare the two numbers.
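A sketch of the corresponding cells:

```python
# Distinct hostnames across all crawled URLs vs. hostnames exposing at least one feed.
hosts_total = urls.select("hostname").distinct().count()
hosts_with_feed = feeds.select("hostname").distinct().count()

{'hostnames processed': '{:,}'.format(hosts_total),
 'hostnames with at least one feed': '{:,}'.format(hosts_with_feed)}

# And, in the next cell, the ratio of the two:
'{:.2%}'.format(hosts_with_feed / float(hosts_total))
```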
{'hostnames processed': '25,243,438',
'hostnames with at least one feed': '10,294,833'}
'40.78%'
Ok, around 40% of the crawled websites provide at least one XML feed.
But on average, how many feeds per webpage?
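One way to compute this (an assumption about the exact query the notebook ran): divide the total number of feed `<link>` tags by the number of distinct pages declaring at least one feed.

```python
# Average number of feed links per page that has at least one feed link.
feed_links = feeds.count()
pages_with_feed = feeds.select("page").distinct().count()
feed_links / float(pages_with_feed)
```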
1.942656384170972
Now it would be interesting to see which standards these feeds implement.
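The breakdown by declared MIME type is essentially a groupBy on the feeds DataFrame; a minimal sketch (rows without a `type` attribute, if any, are ignored):

```python
# Count feed <link> tags per declared type, most common first.
feeds.where(feeds["type"].isNotNull()) \
     .groupBy("type").count() \
     .orderBy(F.desc("count")).show()
```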
+--------------------+---------+
| type| count|
+--------------------+---------+
| application/rss+xml|877891180|
|application/atom+xml|215504961|
| application/xml| 20276729|
| text/xml| 17086738|
| application/rdf+xml| 14669025|
| application/atom| 545717|
| application/rdf| 425259|
| application/rss| 108139|
| text/rdf| 9637|
| text/rss+xml| 245|
| text/rss| 125|
| text/rdf+xml| 14|
+--------------------+---------+
Let’s do some basic stats:
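A sketch of these aggregations, with each displayed expression in its own notebook cell. The grouping of types into RSS vs Atom follows the table above, while the `rel` column name and its exact values are assumptions:

```python
# Per-type counts collected to the driver (they are tiny), grouped into families.
counts = {row["type"]: row["count"]
          for row in feeds.groupBy("type").count().collect()
          if row["type"] is not None}

rss = sum(c for t, c in counts.items() if "rss" in t)
atom = sum(c for t, c in counts.items() if "atom" in t)
total = sum(counts.values())

{'RSS': '{:,}'.format(rss), 'Atom': '{:,}'.format(atom)}
{'% RSS': '{:.2%}'.format(rss / float(total)), '% Atom': '{:.2%}'.format(atom / float(total))}

# Share of RSS/Atom <link> tags that also carry rel="alternate" ("rel" is an assumed column name).
rss_atom = feeds.where(feeds["type"].rlike("rss|atom"))
'{:.2%}'.format(rss_atom.where(rss_atom["rel"] == "alternate").count() / float(rss_atom.count()))
```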
{'Atom': '216,050,678', 'RSS': '877,999,689'}
{'% Atom': '18.84%', '% RSS': '76.58%'}
'97.09%'
That’s about it, question answered!
Let’s quickly recap:
- Common Crawl is an excellent free dataset.
- WDC Framework is a cool tool but requires some tweaking.
- It is possible to extract a fair amount of data from this 151TB dump for less than 500USD of EC2.
- Out of 1.82B URLs we got 25.2M different hosts.
- Processing this data is really fast on DAPLAB’s cluster with my ipython+pyspark setup. This last (`alternate_total`) query takes <10s.
- 40.8% of these hostnames provide a web syndication feed.
- RSS is the most popular syndication format by far with 76.6%, Atom is at 18.8%.
- 97.1% of the `<link>` tags pointing to an RSS or Atom feed also specify `rel="alternate"`, as the HTML standard recommends.