Could the JCR be used as a data redundancy mechanism to archive remote unsafe content?
The rise of SaaS, cloud computing and other kinds of distributed services (e.g. a distributed Google Wavelet) raises increasingly interesting questions in terms of data redundancy and backup.
Let’s take the concrete example of this blog entry, posted on the Posterous.com web site. For the moment this service is free and does not suffer from any major downtime, so I am more than satisfied. But what if Posterous suddenly goes bankrupt? What would happen to all my blog posts?
And even if Posterous becomes a huge success, what if I want to reuse and repurpose such pieces of content on my company intranet and enhance them with other value-added features (e.g. private ratings or comments restricted to employees)?
This is of course only a simple example. But one could easily imagine more complex and critical use cases once a company begins to host some or all of its documents, social content (blogs, forums, comments, ...) and other pieces of enterprise information on remote, unknown systems.
Do we need a federated and standardized P2P-like JCR mechanism?
Coming back to my blog, I was thinking how nice it would be to have an automated service, built into a standardized JCR repository, that locally replicates some remote content items and keeps them in sync within another, more secure environment. Does it sound like P2P applied to JCR? Yes, perhaps. P2P technology had the merit of transparently reducing the risk of losing content items by automatically replicating some or all of them on distant servers. Such a mechanism does not exist in the JCR (yet), or did I miss something?
On the other hand, one of the key strengths of a JCR repository is that you are not bound to a single content type (e.g. binary files) and that you can add extra values on top of any content node (e.g. by adding certain mixin types).
Combined with versioning capabilities, a company could then simply use such a system to periodically back up externally hosted data. Others could just as easily reuse such locally replicated content as part of another application (e.g. a WCM or a DMS). It could also be seen as a kind of local serialized cache that avoids remote and perhaps slow connections. And if we reverse the problem, one could also easily replicate local content to a cloud-based infrastructure (e.g. Amazon EC2, in order to remotely store all the bandwidth-consuming videos of a company).
There are certainly lots of possible use cases, especially for ECM solutions and for companies which need to better collect their distributed data and ensure proper archival, comply with certain file plans or ease legal eDiscovery processes.
Of course, content synchronization is never easy from a technical point of view, especially when both the source and the destination can be updated (e.g. what if I update my blog post after it has already been replicated?) and when the source does not follow any standard that would give access to a change log or some other kind of observation mechanism. The idea would therefore be to focus not only on a JCR2JCR bridge but on something more generic. For example, I clearly doubt that my Posterous blog is JCR compliant (or even CMIS-ready). In such a scenario I am likely to have nothing more than an RSS feed of my blog posts. So one should be able to add replication or synchronization providers for any remote content API one can think of.
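To make the idea of pluggable providers a bit more concrete, here is a minimal sketch in Python rather than against the real JCR API. Everything in it is an assumption made for illustration: the `ReplicationProvider` interface, the `RemoteItem` record and the `RssProvider` class are hypothetical names, and the feed is a hard-coded sample instead of a real Posterous feed.

```python
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class RemoteItem:
    """One remote content item, reduced to what a local replica needs."""
    uri: str
    title: str
    body: str


class ReplicationProvider(ABC):
    """Hypothetical plug-in point: one implementation per remote content API."""

    @abstractmethod
    def fetch_items(self) -> list:
        """Return the remote items as a list of RemoteItem."""


class RssProvider(ReplicationProvider):
    """Reads items from an RSS 2.0 document that was already downloaded."""

    def __init__(self, feed_xml: str):
        self.feed_xml = feed_xml

    def fetch_items(self) -> list:
        root = ET.fromstring(self.feed_xml)
        items = []
        # Every <item> element of the feed becomes one RemoteItem.
        for node in root.iter("item"):
            items.append(RemoteItem(
                uri=node.findtext("link", default=""),
                title=node.findtext("title", default=""),
                body=node.findtext("description", default=""),
            ))
        return items


# Made-up sample feed, only here so the sketch runs without network access.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>My blog</title>
  <item><title>First post</title><link>http://example.org/1</link>
        <description>Hello</description></item>
  <item><title>Second post</title><link>http://example.org/2</link>
        <description>World</description></item>
</channel></rss>"""

if __name__ == "__main__":
    for item in RssProvider(SAMPLE_FEED).fetch_items():
        print(item.uri, item.title)
```

A CMIS or JCR2JCR provider would implement the same `fetch_items` method, so the replication logic itself would not need to know where the content comes from.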
So someone could certainly come up with a content type based on a URI, which would act as a kind of remote weak reference, combined with some more or less (un)defined types to store a local copy of the remote information, and with a scheduling service on top to periodically check for updates (if any) and do the merge/versioning exercise. But that sounds more like a workaround than a well-thought-out solution.
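The workaround described above could look roughly like the following sketch. It is written in Python rather than against the JCR API, and the `RemoteReference` class and `sync` function are hypothetical names. It only covers the simple one-way case: the remote source changes, and the local copy records a new version. The harder case where both sides are updated is deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class RemoteReference:
    """Hypothetical node: a URI pointing at remote content plus local copies."""
    uri: str
    versions: List[str] = field(default_factory=list)  # oldest first

    @property
    def current(self) -> Optional[str]:
        """Latest local copy, or None if nothing was replicated yet."""
        return self.versions[-1] if self.versions else None


def sync(ref: RemoteReference, fetch: Callable[[str], str]) -> bool:
    """Fetch the remote content and record a new version only if it changed.

    Returns True when a new version was stored, False otherwise.
    `fetch` is whatever function knows how to read the remote URI.
    """
    remote = fetch(ref.uri)
    if ref.current == remote:
        return False  # nothing new, keep the version history as it is
    ref.versions.append(remote)
    return True
```

A scheduling service would then simply call `sync` for each reference at a fixed interval; the version list plays the role that JCR versioning would play in a real repository.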
So would we need a more standardized “synchronizable / replicatable node” mechanism natively as part of the JCR? I am pretty sure this subject has already been discussed, but I am too lazy this morning to dig through the World Wide Web. Any useful pointers out there?
Keeping a backup of content that is posted to external systems is a worry for some organisations. How do we keep the tweets or blog posts that we post to remote systems? JCR could be a solution for that, but it might be a bit too "heavy".
( sorry for the delay, quite busy ;-) )
There are two things to consider. First, getting some lightweight replication API at the repository level could be good. You can however already do it through some differential XML import/export or directly at the DB level (if and only if all your content objects are stored in the DB, which might not always be the case). The problem then is the granularity of the content items you replicate (the whole repo, or only some items according to their "file plans"?).
Secondly, and this was more the idea of this blog post, we currently have no notion of "federated data" in the JCR (or even in CMIS). Think of Linked Data, but with some level of data redundancy. For instance, what if I plug a CMIS connector to SharePoint into my front-end WCM and my SharePoint server is temporarily down? Does it mean that part of the content on my front-end web site will be impacted because of a lack of appropriate data redundancy? And what if my SharePoint server is being slowed down by massive uploads? Does it mean that the generation of my front-end web pages will suddenly become much slower because of the back-end system latency?
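As a rough illustration of what such redundancy could mean for the front end, here is a small Python sketch. It is not tied to CMIS, JCR or SharePoint; the `ReplicatedSource` class is a hypothetical name, and the remote back end is represented by a plain function that is assumed to raise `OSError` (for example `ConnectionError` or `TimeoutError`) when it cannot be reached.

```python
from typing import Callable, Dict


class ReplicatedSource:
    """Hypothetical wrapper: serve remote content, keep a local replica as backup."""

    def __init__(self, fetch_remote: Callable[[str], str]):
        self._fetch_remote = fetch_remote
        self._replica: Dict[str, str] = {}  # last known good copy per key

    def get(self, key: str) -> str:
        try:
            value = self._fetch_remote(key)
        except OSError:
            # Back end unreachable: fall back to the local replica if we have one.
            if key in self._replica:
                return self._replica[key]
            raise  # never replicated, so there is nothing to serve
        # Remote answered: refresh the replica and return the fresh value.
        self._replica[key] = value
        return value
```

This only covers the outage case. To also shield page generation from a slow back end, one would serve from the replica first and refresh it in the background, which is closer to the cache idea mentioned in the post.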
And finally, if you extend such usage beyond federated CMIS or JCR repositories to any linked data, we could envision some kind of federation/data redundancy extension at the JCR level.
Just some thoughts ;-)