Netsight Blog

Cool stuff Netsight are up to in Zope and Plone

Blog > Archive for March 2010

Today I read a great article by Jeff Potts of Optaros: A CMIS API library for Python, Part 1: Introducing cmislib. It talks about how to use the CMIS Python library to mess about with CMIS and access CMIS repositories.

I've been mulling over the significance of CMIS for a while now, and I'm still a little undecided. Yes, I think it's a great idea, but in reality will it actually be useful? As it is a lowest-common-denominator approach to interoperability, I'm betting that much of the value will be lost when using it for wide-scale migration from one CMS to another. There are already plenty of protocols for getting data in and out of systems, so why add another?

I do, however, think it might be interesting to use it to allow Plone to access other systems as a content repository. In this case, I'm not talking about deep-down integration (i.e. I'm not talking about replacing the ZODB with a CMIS backend, or even an Archetypes storage layer). At the moment I'm thinking of quite lightweight integration at specific points. For example, imagine a portlet on a Plone-based intranet which lists the 10 most recent policy documents stored in an Alfresco/SharePoint/FileNet system.

So I grabbed Jeff's library and had a quick play with it this morning to see what sort of thing could be done. I've kept my scope pretty small, and of course there is way more that could be done here, but I just wanted to hack together something quick to see how it worked.

Results

The results worked pretty nicely. The version of cmislib on PyPI (0.3dev) has some issues with Alfresco's latest server, meaning I couldn't do much other than run a search and list the titles, but Jeff says that the SVN trunk version fixes these.

I add my 'CMIS View' object to a Plone site, and give it the repository URL, username and password:

CMISView Edit Screen

And hit Save, and then I get a listing of the items on the CMIS repository. Here we have a listing of all items in the Alfresco public repository which contain the word 'test'.

CMISView View Screen

How I did it

It was actually pretty simple; in fact, most of my time was spent working out how to re-use the existing Plone folder_contents template for my listing and how to actually use CMIS.

So firstly I added the cmislib package, and the package I was going to develop, to a fresh Plone 4.0b1 buildout:

[buildout]
...
eggs =
     cmislib
     collective.cmisview

# Reference any eggs you are developing here, one per line
# e.g.: develop = src/my.package
develop =
     src/collective.cmisview

[instance]
...
zcml =
    collective.cmisview

I then created a simple package:

matt-hamiltons-macbook-air-3:src matth$ ../../bin/paster create -t archetype
Selected and implied templates:
  ZopeSkel#basic_namespace  A basic Python project with a namespace package
  ZopeSkel#plone            A project for Plone products
  ZopeSkel#archetype        A Plone project that uses Archetypes content types

Enter project name: collective.cmisview
Variables:
  egg:      collective.cmisview
  package:  collectivecmisview
  project:  collective.cmisview
Expert Mode? (What question mode would you like? (easy/expert/all)?) ['easy']:
Project Title (Title of the project) ['Example Name']: CMIS View
Version (Version number for project) ['1.0']: 0.1
Description (One-line description of the project) ['']: A simple content type for Plone that allows viewing a CMIS repository
Creating template basic_namespace
Creating directory ./collective.cmisview
...

And added a simple content type:

matt-hamiltons-macbook-air-3:collective.cmisview matth$ ../../../bin/paster addcontent contenttype
Enter contenttype_name (Content type name ) ['Example Type']: CMIS View
Enter contenttype_description (Content type description ) ['Description of the Example Type']: View of a remote CMIS repository
Enter folderish (True/False: Content type is Folderish ) [False]:
Enter global_allow (True/False: Globally addable ) [True]:
Enter allow_discussion (True/False: Allow discussion ) [False]:
  Inserting from README.txt_insert into /Development/py26nsp2/cmis/src/collective.cmisview/collective/cmisview/README.txt
...

I then added the schema to the generated cmisview.py file:

CMISViewSchema = schemata.ATContentTypeSchema.copy() + atapi.Schema((

        atapi.StringField(
            name='cmisrepourl',
            widget=atapi.StringWidget(
                label='CMIS Repository URL',
                ),
            ),

        atapi.StringField(
            name='cmisrepouser',
            widget=atapi.StringWidget(
                label='CMIS Repository Username',
                ),
            ),

        atapi.StringField(
            name='cmisrepopass',
            widget=atapi.PasswordWidget(
                label='CMIS Repository Password',
                ),
            ),

))

and added a view class in browser/cmisbrowserview.py with a method to do the actual CMIS search:

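    def getdocuments(self):
        # Connect using the URL and credentials stored on the content object
        client = CmisClient(self.context.getCmisrepourl(),
                            self.context.getCmisrepouser(),
                            self.context.getCmisrepopass())
        repo = client.defaultRepository
        # Full-text search for documents containing the word 'test'
        res = repo.query("select * from cmis:document where contains('test')")
        return res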

I added a method to iterate over the results and build up something that the standard folder_view template would like. Obviously if I was doing this in earnest I would create a custom template specifically suited to the CMIS results view.

    @property
    def items(self):
        results = []
        i = 0
        for doc in self.docs:
            # Alternate the row class so the listing is striped
            if (i + 1) % 2 == 0:
                table_row_class = "draggable even"
            else:
                table_row_class = "draggable odd"

            props = doc.getProperties()
            # Keys the listing template expects; most are left blank
            results.append(dict(
                url = '',
                url_href_title = '',
                id = doc.id,
                quoted_id = urllib.quote_plus(doc.id),
                path = '',
                title_or_id = safe_unicode(doc.title),
                obj_type = '',
                size = props['cmis:contentStreamLength'],
                modified = props['cmis:lastModificationDate'],
                icon = '',
                type_class = '',
                wf_state = '',
                state_title = '',
                state_class = '',
                is_browser_default = False,
                folderish = False,
                relative_url = '',
                view_url = '',
                table_row_class = table_row_class,
                is_expired = False,
            ))

            i += 1
        return results

And that is pretty much it. Clearly there is a lot more you could do here, such as tidying up the info coming back (formatting the size in the listing) and registering a traversal adapter which would enable you to actually download the document listed. Maybe I'll do this once the next version of cmislib hits PyPI.
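As a rough idea of what "formatting the size in the listing" could look like, here is a minimal sketch. The helper name format_size is hypothetical (it is not part of cmislib or Plone), and it assumes the cmis:contentStreamLength property is either a byte count or None when a document has no content stream.

```python
def format_size(num_bytes):
    """Turn a byte count into a short human-readable string.

    Hypothetical helper: assumes num_bytes is a number of bytes, or None
    when the repository reports no content stream for the document.
    """
    if num_bytes is None:
        return ""
    size = float(num_bytes)
    for unit in ("B", "kB", "MB", "GB"):
        # Stop at the first unit that keeps the number below 1024,
        # or at GB, the largest unit this sketch knows about.
        if size < 1024.0 or unit == "GB":
            if unit == "B":
                return "%d B" % size
            return "%.1f %s" % (size, unit)
        size /= 1024.0


print(format_size(512))
print(format_size(1536))
print(format_size(5 * 1024 * 1024))
```

In the items property above, the size entry could then be passed through such a helper before being handed to the template.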

We are in the process of building a new website for Netsight, and one of the items on the wish-list was a 'related items' portlet for the blog. With Plone you can manually relate items, but there isn't really a way to display related items automatically. Many years ago we wrote something that did this on a client's site in Plone 1, but after a while the client asked for it to be removed as it was bringing up some slightly embarrassing and controversial supposedly related items.

I now wanted to resurrect the idea and create a Plone 3 portlet that did something along those lines. I'd seen topia.termextract a while back and was thinking of a fun use to put it to, and I also saw collective.classification, which sounded like it did something similar but needs various external dependencies to be built (including C libraries, I think, and downloading a large training corpus for it).

My final year thesis at university in 2000 was a full text indexer written in C, and so I had a fairly good grasp of relevance ranking algorithms and the like. Indeed, back in 2002 at the first non-US Zope 3 sprint, I introduced Jim Fulton to 'Managing Gigabytes', a seminal book on indexing and information retrieval... this then led to the creation of ZCTextIndex, the text index used in Zope today.

So I knew from the ZCTextIndex code that much of the information needed for determining similar content should already be calculated in the text index data structures -- which means it should be pretty fast.

The basic idea is this:

  1. Find the most 'important words' in the document you are looking at
  2. Search for other 'similar' documents based on those words


So how do we find the most 'important' words? Well, in text indexing there is a common metric called TF*IDF. This is the Term Frequency multiplied by the Inverse Document Frequency. Basically, an important word is a word that appears in this document more frequently than it does in other documents.

As an example, the word 'the' appears in any one document about as often as it does in every other document on our site, and it appears in virtually every document, so it is not that special. The word 'conference', on the other hand, might appear a number of times in our specific document, yet doesn't appear that often in general and doesn't appear in that many documents overall -- hence it is pretty 'important'.

The SearchableText ZCTextIndex already has a way to find all words in a document efficiently (i.e. without having to parse the document again) as this information is stored in the indexes. It also stores various frequency and weight information -- i.e. everything we need for our calculation.

This means we can iterate over every word in our document, score each one as to how 'important' it is, and then return the top 20 words.
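To make the scoring step concrete, here is a minimal, self-contained sketch of TF*IDF in plain Python. It is not the portlet's code and does not touch ZCTextIndex: the tiny corpus dict, the tokenize helper and the important_words function are all assumptions for illustration, and it re-parses the text, whereas the real index already stores the counts.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def important_words(doc_id, corpus, top_n=20):
    """Return the top_n (word, score) pairs for one document by TF*IDF.

    corpus maps a document id to its raw text (a stand-in for the index).
    """
    tokens = dict((d, tokenize(text)) for d, text in corpus.items())

    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter()
    for words in tokens.values():
        doc_freq.update(set(words))

    term_counts = Counter(tokens[doc_id])
    total_terms = sum(term_counts.values())
    n_docs = len(corpus)

    scores = {}
    for word, count in term_counts.items():
        tf = count / float(total_terms)                   # frequency in this document
        idf = math.log(n_docs / float(doc_freq[word]))    # rarity across all documents
        scores[word] = tf * idf

    # Highest score first; ties broken alphabetically for stable output
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_n]


corpus = {
    "conf": "the conference venue is in bristol and the conference is in october",
    "cms": "the cms is written in python and the cms is open source",
    "blog": "the blog is updated when the team has news",
}
top = important_words("conf", corpus, top_n=3)
print(top)
```

In this toy corpus 'the' occurs in every document, so its inverse document frequency is log(1) = 0 and it scores zero, while 'conference' comes out on top.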

Once we have those words, we are onto the second part of the process: we search the catalog for all documents that match these terms. To do this efficiently, again we have to delve into the internals of ZCTextIndex and call some private methods. I know this is bad form, but it is really needed for efficiency. If we used the public API to do the search then the catalog would treat the 20 terms in our query with an implicit 'and', which means that if any one of the terms doesn't appear in a candidate document then it will be excluded. This is not what we want. We want that document to be included, just ranked lower if a specific term doesn't appear.
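The difference between an implicit 'and' and the ranking we want can be sketched with a toy inverted index. This is only an illustration under stated assumptions: the index dict, the weights and the rank_similar function are made up, and none of it is the ZCTextIndex API or its private methods.

```python
from collections import defaultdict


def rank_similar(query_weights, index, exclude=None, limit=5):
    """Rank documents by the summed weight of the query terms they contain.

    query_weights: {term: weight} for the 'important' words.
    index: {term: {doc_id: weight_of_term_in_doc}}, a toy inverted index.
    A document that lacks a term is not excluded; it simply scores lower.
    """
    scores = defaultdict(float)
    for term, q_weight in query_weights.items():
        for doc_id, d_weight in index.get(term, {}).items():
            scores[doc_id] += q_weight * d_weight
    # Don't list the document we started from as similar to itself
    scores.pop(exclude, None)
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


index = {
    "bristol": {"post-a": 0.5, "post-b": 0.4, "post-c": 0.1},
    "conference": {"post-a": 0.6, "post-b": 0.3},
    "venue": {"post-a": 0.4, "post-c": 0.2},
}
query = {"bristol": 1.0, "conference": 2.0, "venue": 1.0}
similar = rank_similar(query, index, exclude="post-a")
print(similar)
```

Here neither post-b nor post-c contains all three terms, so a strict 'and' query would return nothing besides the original post, whereas the summed scores still rank post-b above post-c.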

The end result is we get a list of documents related to the one we are looking at:

Similar Items portlet

I later did a bit of an experiment to see how topia.termextract would fare in comparison to my TF*IDF approach to determining which words are the most important in a document. The results were actually surprisingly close:

topia.termextract:
['2010', 'venue', 'people', 'idea', 'year', '/', 
'university', 'uk', 'bristol', 'number', 
'community', 'conference', 'city', 'lot', 'room', 
'work', 'conference', 'talk', 'plone']

tf.idf:
['bristol', 'conference', 'venue', 'ploneconf2010', 
'suits', 'rooms', 'vote', 'silicon', 'city', 
'media', 'talks', 'university', 'bid', 'delegates', 
'lots', 'bbc', 'west']

So topia.termextract is returning a list of all nouns it knows of in the document, and tf.idf is returning all words in this document that occur more frequently than average. You can probably guess the content of the document they were both looking at ;)
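To put a rough number on "surprisingly close", the two lists above can be compared directly. The sketch below just intersects them, first exactly and then after a deliberately crude plural-stripping step; the normalise helper is an assumption for illustration, not anything either library does.

```python
topia_terms = ['2010', 'venue', 'people', 'idea', 'year', '/',
               'university', 'uk', 'bristol', 'number',
               'community', 'conference', 'city', 'lot', 'room',
               'work', 'conference', 'talk', 'plone']

tfidf_terms = ['bristol', 'conference', 'venue', 'ploneconf2010',
               'suits', 'rooms', 'vote', 'silicon', 'city',
               'media', 'talks', 'university', 'bid', 'delegates',
               'lots', 'bbc', 'west']


def normalise(word):
    """Crude singularisation: drop one trailing 's' if there is one."""
    return word[:-1] if word.endswith('s') else word


# Terms that appear verbatim in both lists
exact = sorted(set(topia_terms) & set(tfidf_terms))

# Terms that match once simple plurals are folded together
loose = sorted(set(map(normalise, topia_terms)) &
               set(map(normalise, tfidf_terms)))

print(exact)
print(loose)
```

Five terms match exactly, and eight match once 'rooms', 'talks' and 'lots' are folded into their singular forms.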

So for now, I'll keep with my tf.idf approach, but maybe in the future it might be interesting to see how topia.termextract can be integrated, especially if you include the phrases that topia.termextract can find:

['plone projects', 'south west', '600,000 people', 
'plone conference', 'plastercine characters', 
'computer science', 'advocacy work', 
'silicon valley', 'case studies', 
'plone conferences', 'plone community', 
'cycle city', 'industry analysts', 
'technology evaluators', ...]

The product is available to download from plone.org, or via buildout and PyPI. If you want to install it, just add collective.portlet.similarcontent to the eggs and zcml lines in your buildout config file.

Well, we have about seven months until the Plone Conference rocks up in Bristol in October, and we've been busy securing the venue.

I am now proud to announce the Plone Conference 2010 will be held at the 4* Thistle Grand Hotel in central Bristol.

Thistle Grand Hotel

The hotel is located in the old central part of Bristol, with its cobbled streets and old buildings. We have four rooms booked in the hotel for the main conference, the largest holding 400 people and the others holding around 90 people each. The hotel can also offer us around 60 rooms for delegates to stay in during the conference, and there is a wide range of other hotels at various prices, serviced apartments and youth hostels within walking distance.

http://www.netsight.co.uk/news/images/full_thistle_bristol_m_e_wessex.jpg

There is also plenty of social life around, with a large number of restaurants, bars and clubs within a few minutes' walk. All delegates will also have passes to the hotel's leisure centre.

http://www.netsight.co.uk/news/images/full_thistle_bristol_hotel_swimming_pool.jpg

But beyond all that, this hotel actually has quite a nice historical connection with what we are doing -- a connection I only discovered when searching for photos for this blog post.

Our main aim with the work we do with Plone and on the internet in general mostly revolves around communication. We allow others to communicate information better. Whether that be by designing public facing websites, private corporate intranets, or group collaboration spaces... we enable communication.

The White Hart and White Lion Inns

The Thistle Grand stands on the site of a previous important link in the world of communication -- the mail. Before the Grand Hotel there stood the White Hart and White Lion Inns. Looking at the lithograph above and the modern photo at the top you can see the same church tower behind the site of the venue. The White Lion Inn can be traced back to 1606 as an inn and went on to become a very important part of the King's Post:

"The White Lion, Bristol, was one of the most 
famous coaching houses in England, east, west, 
north, or south. It stood in Broad Street, a 
thoroughfare which belied its name as regards 
breadth, and could only be considered broad by 
comparison with the even narrower Small Street, 
which ran parallel with it. Yet at one time 
there were as many coaches passing in and out 
of Broad Street as any street in Bristol, or 
even in London!"

This excerpt is taken from a book originally published in 1905 entitled "The King's Post -- Being a volume of historical facts relating to the Posts, Mail Coaches, Coach Roads, and Railway Mail Services of and connected with the Ancient City of Bristol from 1580 to the present time", published online by Project Gutenberg.

Back then, with no telegraph or telephone (or even internet!), Bristol even used to have its own time. Being 2 degrees 30 minutes west of London, it was approximately 10 minutes behind London. Evidence of this can still be seen on the clock above the Corn Exchange in the centre, which has two minute hands, one for London and one for Bristol:

Bristol Time

With the advent of the railway, and Bristol's famous son, Isambard Kingdom Brunel, who built the Great Western Railway, Bristol Time was subsequently abolished and Railway Time adopted in 1852.

Anyway, that is the history lesson over. The main thing to be excited about is we have a venue :)

We are currently selecting a venue for the conference dinner/party, and our shortlist includes a former railway terminus (the world's earliest surviving purpose-built railway terminus), an art gallery, a Victorian bank, and an arts and media centre. We just need to work out which one will suit us best!

As for talks, training and all that... We'll be putting a call out for training proposals and talk submissions once Plone Symposium East and European Plone Symposium are out of the way. We will be opening booking for the event in June.

-Matt