Saturday, May 12, 2012

Python Code Jam Wrapper

In the later rounds of the code jam, I like to have all of the pesky file input/output stuff sorted, so I can just concentrate on solving the problems. Here I will share the current state of my python wrapper for Code Jam.

It provides the following functionality:

  • By command line arguments (-t [default], -s <number> or -l) specify that I want to use file A-test.in, A-small.attempt0.in or A-large.in
  • By command line arguments (-r [default] or -v) specify that I want to run my code on the .in file, creating a similar file called .out, or I want to check that my code would turn the .in file to the .out file that already exists.
  • I can specify the command line argument -m <number>, and it will automatically do multiprocessing with that number of processes, handing each case on a single process.
My general use case will be as follows:
  1. Copy the sample in and out data into my directory (as A-test.in and A-test.out)
  2. Open a python file I have already created containing
    def reader(inFile):
       return 0

    def solver(case):
       return 0

    if __name__ == "__main__":
        from GCJ import GCJ
        GCJ(reader, solver, "/Users/lpebody/gcj/setup/", "a").run()
  3. Write the reader code and the solver code.
  4. Run the script to test on the test data (using python a.py -v -t)
  5. If this shows issues in my method, goto 3.
  6. Download small data (as a-small-attempt0.in)
  7. Run the script on the small data (using python a.py -s 0)
  8. Assuming no issues, upload the output.
  9. If issues arose (exceptions, took too long, output was wrong) return to 3 (incrementing the number in 7 when necessary). Otherwise revel in my new found points!
  10. (Think for a while about whether my code is good enough to work on the large data set) - optional.
  11. If I think it is already good enough, jump forward to step 14.
  12. Rewrite the slow bits of code
  13. Check that it is working on the test data (using python a.py -v -t) and the small data (using python a.py -v -s <number>). If it isn't, goto 12.
  14. Once I am sure that it will work, think about whether I need multiprocessing.
  15. Download large data (as a-large.in)
  16. Run the script on the large data (using python a.py -l)
  17. Upload the output
  18. (Worry for less than two and half hours on whether the output is correct.) - optional
Note: the checking currently does exact string checking, which is likely to fail for (e.g.) floats. I do intend at some point to alter this code to check floats up to a specified tolerance. 

In order to use the wrapper, you need to provide two functions, a reader and a solver.

The reader takes a single parameter, called inFile, and should return a data structure containing all of the information needed to solve the next case in the data set. inFile is a wrapper around python's own file class that currently provides the following methods:
  • getInt() - read a line containing one integer, and return the integer
  • getInts() - read a line containing whitespace separated integers, and return the integers
  • getFloat() and getFloats() - the same as getInt() and getInts() but with floats
  • getWords() - return the result of running python's split() on the next line of text
  • readline() - return the next line of text
For example, for the problem Diamond Inheritance from 2012 Round 1C, the reader function could read:

def reader(inFile):
    return [[z - 1 for z in inFile.getInts()[1:]] for k in xrange(inFile.getInts())]

The solver is fed in the data structure that the reader created. It should return the solution. The output file will just consist of Case numbers, and whatever the solver returns turned into a string. So again, for Diamond Inheritance, the solver function could read:

def solver(dependents):
    N = len(dependents)
    for i in xrange(N):
        dep = [False] * N
        q = [i]
        while (len(q)):
            for j in dependents[q.pop()]:
                if (dep[j]):
                    return "Yes"
                dep[j] = True
                q.append(j)
    return "No"

Then, finally, to wrap it all together, 
if __name__ == "__main__":
    from GCJ import GCJ
    GCJ(reader, solver, "/Users/lpebody/gcj/setup/", "a").run()

The third and fourth directories are whatever directory your input and output data should be in, and the letter (A, B, C, D, E or F) describing the problem within the contest.

Without any further ado, here is my wrapper code.
from getopt          import getopt, GetoptError
from multiprocessing import Pool, Queue, Manager
from os              import getpid
from sys             import argv, stderr
from time            import strftime

class FileWrapper:
    def __init__(self, file):
        self.file = file
    
    def getInt(self):
        return int(self.file.readline())
    
    def getInts(self):
        return [int(z) for z in self.file.readline().split()]
    
    def getFloat(self):
        return float(self.file.readline())
    
    def getFloats(self):
        return [float(z) for z in self.file.readline().split()]
    
    def getWords(self):
        return self.file.readline().split()
    
    def readline(self):
        return self.file.readline().strip()
    
    def close(self):
        self.file.close

class GCJ:
    """
    Wrapper for a lot of functionality that is useful when trying to solve a 
    Google Code jam question.

    For the case of Problem A of the 2008 Qualification Round, example code
    would be:

    def parse(inFile):
        N = int(inFile.readline())
        searchEngines = [inFile.readline().strip() for z in xrange(N)]
        Q = int(inFile.readline())
        queries = [inFile.readline().strip() for z in xrange(Q)]
        return [searchEngines, queries]

    def solve([searchEngines, queries]):
        N = len(searchEngines)
        dict = {searchEngines[i]: i for i in xrange(N)}
        qs = [dict[k] for k in queries]
        used = [False] * N
        numberused = 0
        resets = 0
        for k in qs:
            if not used[k]:
               if numberused == N - 1:
                  resets += 1
                  numberused = 0
                  used = [False] * N
               used[k] = True
               numberused += 1
        return str(resets)

    if __name__ == "__main__":
        from GCJ import GCJ
        GCJ(parse, solve, "c:\\temp", "a").run()

    This would give a script with command line options -t, -s, -l, -r, -v, -m:
      -t (default)
        input file  = "c:\\temp\\a-test.in"
        output file = "c:\\temp\\a-test.out"
        error file  = "c:\\temp\\a-test.err"
        (the directory "c:\\temp" and the character "a" at the start of the
         filenames come from arguments to GCJ in the source code)
      -s <number>
        input file  = "c:\\temp\\a-small-attempt<number>.in"
        output file = "c:\\temp\\a-small-attempt<number>.out"
        error file  = "c:\\temp\\a-small-attempt<number>.err"
      -l
        input file  = "c:\\temp\\a-large.in"
        output file = "c:\\temp\\a-large.out"
        error file  = "c:\\temp\\a-large.err"
      -r (default)
        read the input file. Read the first line to get the number of cases N
        included in the input. Run parse(infile) N times to get N case objects,
        and then run solve(case object) on each of them. Output the strings
        returned from solve, prepended with "Case #<case #>: ". Put output in 
        output file, overwriting it if it exists.
      -v
        Do the same as above, but check if the contents of the output file now
        are as they would be otherwise. If it is, say "OK". Otherwise, output all
        differences, and write the output into the error file, overwriting it if
        it exists.
      -m <number>
        Set up a pool of <number> worker threads, and put all of the problems
        (along with their problem numbers) on a queue. Each thread, when finished
        with a problem, will take the next case from the queue.
"""

    def __init__(self, reader, solver, directory, question):
        self.reader = reader
        self.solver = solver
        self.question = question
        print >> stderr, "GCJ Wrapper initiated."
        print >> stderr, "Reading command line arguments"
        try:
            opts, args = getopt(argv[1:],
                                "vs:tlm:",
                                ["validate", "small", "test", "large", "multi"])
        except GetoptError, err:
            print str(err)
            exit(2)
        datatype = "test"
        self.job = "run"
        self.multi = 1
        for o,a in opts:
            if o in ("-s", "--small"):
                datatype = "small-attempt" + str(int(a))
            elif o in ("-l", "--large"):
                datatype = "large"
            elif o in ("-v", "--validate"):
                self.job = "validate"
            elif o in ("-m", "--multi"):
                self.multi = int(a)
        filepref = directory + "/" + question + "-" + datatype
        self.infile = filepref + ".in"
        self.outfile = filepref + ".out"
        self.errfile = filepref + ".err"
        if (self.job == "run"):
            print >> stderr, "Creating " + self.outfile + " from " + self.infile
        else:
            print >> stderr, "Testing that " + self.outfile + " would create " + self.infile
            print >> stderr, "Storing output in " + self.errfile + " otherwise"
    def run(self):
        data = self.read()
        if (self.multi == 1):
            answers = self.process(data)
        else:
            answers = self.multiprocess(data, self.multi)
        text = "".join(["Case #%d: %s\n" % (k + 1, answers[k]) for k in xrange(len(answers))])
        if False:
            print text
        self.output(text)
    def read(self):
        inh = FileWrapper(open(self.infile, 'r'))
        N = inh.getInt()
        data = [self.reader(inh) for k in xrange(N)]
        inh.close()
        return data
    def process(self, data):
        N = len(data)
        answers = [None] * N
        for k in xrange(N):
            print "%s:Working on Case %d" % (strftime("%X"), k + 1)
            answers[k] = self.solver(data[k])
            print "%s:Dealt with Case %d: %s" % (strftime("%X"), k + 1, answers[k])
        return answers
    def multiprocess(self, data, numprocs):
        N = len(data)
        manager = Manager()
        queue = manager.Queue()
        data = [[k + 1, data[k], self.solver, queue] for k in xrange(N)]
        pool = Pool(numprocs)
        pool.map_async(multiCase, data, 1)
        answers = [None] * N
        workerPids = []
        workingOn = []
        numTodo = N
        numDoing = 0
        numDone = 0
        for k in xrange(2 * N):
            report = queue.get()
            pid = report[0]
            if pid in workerPids:
                idx = workerPids.index(pid)
            else:
                idx = len(workerPids)
                workerPids += [pid]
                workingOn += [None]
            if report[1] == 0:
                workingOn[idx] = report[2]
                numTodo -= 1
                numDoing += 1
            else:
                workingOn[idx] = None
                answers[report[2] - 1] = report[3]
                numDoing -= 1
                numDone += 1
            print "%s:then %s now %s soon %s | %s" % (strftime("%X"), clean(numDone), clean(numDoing), 
                                                      clean(numTodo), " ".join([clean(z) for z in workingOn]))
        return answers
    def output(self, text):
        if self.job == "run":
            open(self.outfile, 'w').write(text)
        else:
            text2 = open(self.outfile, 'r').read()
            if (text.strip() == text2.strip()):
                print "OK"
            else:
                print "Not OK"
                open(self.errfile, 'w').write(text)
                print "new output written to %s" % self.errfile
                print "Differences:"
                text = text.strip().split("\n")
                text2 = text2.strip().split("\n")
                for k in xrange(min(len(text), len(text2))):
                    if text[k] != text2[k]:
                        print "Line %d: '%s' vs '%s'" % (k + 1, text[k], text2[k])
                if len(text) != len(text2):
                    print "Files have different number of lines: %d vs %d" % (len(text), len(text2))

def clean(r):
    return "   " if (r == None) else ("%3d" % r)
def multiCase(inputList):
    [caseNumber, case, solver, queue] = inputList
    queue.put([getpid(), 0, caseNumber])
    answer = solver(case)
    queue.put([getpid(), 1, caseNumber, answer])

No comments:

Post a Comment