Yesterday I had an idea about how to manage parsers in a dynamic way. The idea is this:
- A managing class can access the parsers folder and check which parsers (files) are available
- A parser describes which file formats (and which versions of them) it can parse
In a script I have a file f.
I call pf = ParserFactory().
When this constructor is called, the ParserFactory class gathers information from all available parsers. Say they live in a folder named parsers. The ParserFactory then lists the contents of that folder and, for each class file, constructs an object from which it can request information.
For example, psimi.py gives the ParserFactory a PSIMIParser object. The ParserFactory can then call a method like getSupportedFileFormats(), which returns the file formats this PSIMIParser can parse.
Testfile:
import parsers.parserfactory
pf = parsers.parserfactory.ParserFactory()
pf.getAvailableParsers()
Then for a single parser I have a GenericParser interface which every parser should implement:
"""
This GenericParser is an abstract superclass. It defines methods which subclasses should override. This guarantees
that certain functions exist.
@author: Patrick van Kouteren
@version: 0.1
"""
class GenericParser:
    """
    self.fileformats will be a dictionary in the form 'extension : version'. E.g. 'xml : 1.0'
    """
    def __init__(self):
        # Subclasses override this method and must not call it, because it always raises.
        raise NotImplementedError("This is a GenericParser. Please define a dictionary 'self.fileformats' here with the extensions which your parser can parse!")

    def getSupportedFileFormats(self):
        raise NotImplementedError("This is a GenericParser. Please return the dictionary 'self.fileformats' here, which should be defined in __init__")

    def parse(self, filename):
        raise NotImplementedError("This is a GenericParser. Please implement a proper parse function")
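To make the interface concrete, here is a minimal sketch of what a parser implementing it might look like. It is written for Python 3, repeats a compact stand-in for GenericParser so that it runs on its own, and the PSIMIParser body (including the 'xml : 1.0' entry and the trivial parse method) is an assumption for illustration, not the real psimi.py.

```python
class GenericParser:
    """Compact stand-in for the abstract superclass shown above."""

    def __init__(self):
        raise NotImplementedError("define self.fileformats in the subclass")

    def getSupportedFileFormats(self):
        raise NotImplementedError("return self.fileformats in the subclass")

    def parse(self, filename):
        raise NotImplementedError("implement parse in the subclass")


class PSIMIParser(GenericParser):
    """Hypothetical parser; the format table below is illustrative only."""

    def __init__(self):
        # GenericParser.__init__ always raises, so it is deliberately not called.
        self.fileformats = {"xml": "1.0"}

    def getSupportedFileFormats(self):
        return self.fileformats

    def parse(self, filename):
        # Placeholder: a real parser would interpret the content, not just read it.
        with open(filename) as handle:
            return handle.read()


parser = PSIMIParser()
supported = parser.getSupportedFileFormats()
```

Because the superclass constructor raises, forgetting to override __init__ in a new parser fails immediately when the object is created, which is the guarantee the abstract class is meant to give.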
The crux is in this getSupportedFileFormats function. I would like to call this function on every parser file. Each file contains a parser class, so basically I want to create an object from a file, but I don’t know in advance what the class in that file is called.
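One possible way around not knowing the class name is to import the file and ask the resulting module which classes derive from GenericParser. The sketch below assumes Python 3 and uses the standard importlib.util and inspect modules; the helper name load_parser_classes, the stand-in GenericParser and the demo psimi.py written to a temporary folder are all made up for the example.

```python
import importlib.util
import inspect
import os
import tempfile
import textwrap


class GenericParser:
    """Minimal stand-in for the abstract superclass."""

    def getSupportedFileFormats(self):
        raise NotImplementedError


def load_parser_classes(path, base=GenericParser):
    """Import the file at `path` and return the subclasses of `base` found in it."""
    module_name = os.path.splitext(os.path.basename(path))[0]
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    # Assumption for the demo: hand the base class to the module up front,
    # so the demo file does not need an import statement of its own.
    module.GenericParser = base
    spec.loader.exec_module(module)
    return [
        cls
        for _, cls in inspect.getmembers(module, inspect.isclass)
        if issubclass(cls, base) and cls is not base
    ]


# Hypothetical parser file, written to a temporary folder for the demo.
demo_source = textwrap.dedent('''
    class PSIMIParser(GenericParser):
        def __init__(self):
            self.fileformats = {"xml": "1.0"}

        def getSupportedFileFormats(self):
            return self.fileformats
''')

with tempfile.TemporaryDirectory() as parser_dir:
    demo_path = os.path.join(parser_dir, "psimi.py")
    with open(demo_path, "w") as handle:
        handle.write(demo_source)
    found = load_parser_classes(demo_path)

names = [cls.__name__ for cls in found]
formats = found[0]().getSupportedFileFormats()
```

Compared with scanning the source text for import and class lines, this relies on the interpreter to decide what is a subclass, at the cost of executing every file in the folder.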
Currently my ParserFactory looks as follows (note that I’m still working on it, so not all is finished yet!):
"""
This ParserFactory contains knowledge about how to parse files. It can be fed a file and return the parsed data.
It uses the parsers in this parsers folder, but abstracts away various operations.
@author: Patrick van Kouteren
@version: 0.1
"""
import os
import sys
import types

class ParserFactory:
    def __init__(self):
        """
        Gather info about the contents of the database which are important for parsing
        """
        self.importedDatabases = self.checkImportedDatabases()

    def checkImportedDatabases(self):
        """
        Check which databases (and which versions of them) are present in the database
        """
        # Not implemented yet: always reports that no databases are imported.
        databases = {}
        return databases

    def getImportedDatabases(self):
        """
        Return a dictionary of the databases and their versions which are present (imported) in IBIDAS
        """
        return self.importedDatabases
    def getAvailableParsers(self):
        """
        Check the parser directory for files which import the GenericParser class. If a file does so, it is guaranteed
        that we can call certain methods to request properties
        """
        import re
        parserdir = os.path.join(sys.path[0], "parsers")
        classnames = []
        parserfiles = []
        # Only the top level of the parsers folder is scanned, because the
        # import path built below always has the form parsers.<module>.<class>.
        for file in sorted(os.listdir(parserdir)):
            if file.endswith(".py") and not file.startswith("__"):
                parserfiles.append(os.path.join(parserdir, file))
        for parserfile in parserfiles:
            with open(parserfile) as fileobject:
                content = fileobject.read()
            importfound = 0
            for l in content.splitlines():
                # Compare in lower case so that both 'genericparser' and 'GenericParser' match.
                lowered = l.lower()
                if lowered.startswith(("import", "from")):
                    # If this line contains genericparser, we know that this file is interesting
                    if "genericparser" in lowered:
                        importfound += 1
                # If we:
                # * find a class definition
                # * have found an import of generic parser
                # * find a genericparser argument
                # Then we know that this parser is a subclass of genericparser
                if l.startswith("class") and importfound > 0 and "genericparser" in lowered:
                    m = re.split(r"\W+", l)
                    classname = m[m.index("class") + 1]
                    print("file " + parserfile + " contains a callable parser class called " + classname)
                    thisfilename = os.path.basename(parserfile)
                    ppath = "parsers." + os.path.splitext(thisfilename)[0] + "." + classname
                    #print("the full import path will become " + ppath)
                    classnames.append(ppath)
        return classnames
    def getSupportedFileFormats(self):
        """
        Return the supported file formats. The supported file formats are determined by checking all parsers which import
        the GenericParser. We can request the file formats they support and return this list.
        """
        fileformats = []
        parserlist = self.getAvailableParsers()
        for parser in parserlist:
            # _get_func returns the parser class; calling it creates a parser object
            p = self._get_func(parser)()
            for ff in p.getSupportedFileFormats():
                fileformats.append(ff)
        return fileformats
    def _get_mod(self, modulePath):
        """
        Try to import a module. Then we can use this to get its class
        Source: http://code.activestate.com/recipes/223972/
        """
        try:
            aMod = sys.modules[modulePath]
            if not isinstance(aMod, types.ModuleType):
                raise KeyError
        except KeyError:
            # The last [''] is very important!
            aMod = __import__(modulePath, globals(), locals(), [''])
            sys.modules[modulePath] = aMod
        return aMod

    def _get_func(self, fullFuncName):
        """
        Return the class from 'parsers.file.class'
        Source: http://code.activestate.com/recipes/223972/
        Retrieve a function object from a full dotted-package name.
        """
        # Parse out the path, module, and function
        lastDot = fullFuncName.rfind(u".")
        funcName = fullFuncName[lastDot + 1:]
        modPath = fullFuncName[:lastDot]
        aMod = self._get_mod(modPath)
        aFunc = getattr(aMod, funcName)
        # Assert that the function is a *callable* attribute.
        assert callable(aFunc), u"%s is not callable." % fullFuncName
        # Return a reference to the function itself,
        # not the results of the function.
        return aFunc
    # The method below is disabled for now by keeping it inside a string.
    """
    Returns the class where a method is defined
    def find_defining_class(self, obj, meth_name):
        for ty in type(obj).mro():
            if meth_name in ty.__dict__:
                return ty
    """
    def checkPrerequisites(self, file_prerequisites):
        """
        Check if all databases needed to import a file are present
        """
        errors = []
        # TODO: the actual checks still have to be written
        return errors

    def parseList(self, filelist, filelist_prerequisites):
        """
        Parse a list of files. This means that we not only have to check the prerequisites, but also an order in which to
        parse the files as a file can be a prerequisite for another file
        """
        # TODO: findParseOrder still has to be written
        order = self.findParseOrder(filelist)

    def parse(self, file, file_prerequisites, parser=None):
        # First check if all prerequisites are present
        errors = self.checkPrerequisites(file_prerequisites)
        if errors:
            data = "\n".join(errors)
            # prerequisitesError and parserError are exception classes which still have to be defined
            raise prerequisitesError(data)
        if not parser:
            parser = self.findParser(file)
        if parser:
            # TODO: doParsing still has to be written
            self.doParsing(file, parser)
        else:
            raise parserError

    def findParser(self, file):
        """
        Several things are tried in order to find a parser.
        1. The file extension: certain file extensions belong to specific formats
        2. The first line:
        """
        pass
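The first lookup step in findParser, matching on the file extension, could be sketched roughly as follows. This is Python 3 and only an illustration: StubParser and find_parser_by_extension are hypothetical names, the sketch assumes each parser reports its formats as a dictionary keyed by extension, and the second step (inspecting the first line) is left out because it is not worked out yet.

```python
import os


class StubParser:
    """Hypothetical parser that only reports the formats it supports."""

    def __init__(self, fileformats):
        self.fileformats = fileformats

    def getSupportedFileFormats(self):
        return self.fileformats


def find_parser_by_extension(filename, parsers):
    """Return the first parser that lists the file's extension, or None."""
    # 'interactions.XML' -> 'xml'; a file without an extension gives ''.
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    for parser in parsers:
        if extension in parser.getSupportedFileFormats():
            return parser
    return None


xml_parser = StubParser({"xml": "1.0"})
tab_parser = StubParser({"tab": "1.0"})
chosen = find_parser_by_extension("interactions.XML", [tab_parser, xml_parser])
missing = find_parser_by_extension("notes.txt", [tab_parser, xml_parser])
```

Returning None when nothing matches fits the parse method above, which raises parserError if no parser was found.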
Any thoughts, comments and discussions are appreciated. For more information: Chris Leary has posted an improvement here