When I first needed to sort files by date, I underestimated the problem. A simple Java implementation using a TreeSet with a comparator on file.lastModified() worked well for file lists of up to 300,000 entries. Beyond that, performance dropped significantly, especially when listing files from remote shares.
From a stack dump it became clear that the system was spending most of its time fetching last-modified times from the file system. The analysis is straightforward: every operation on a TreeSet is O(log N), so adding N elements costs O(N log N) comparisons, and each comparison triggers calls to the (remote) file system. The number of remote calls therefore grows as N log N with the number of files.
A solution to this problem is a cached version of the File object: a subclass of java.io.File that caches the last-modified time, the full path, and any other properties the comparator needs for sorting. This brings the number of remote property lookups down to N, one per file.
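A minimal sketch of that idea, assuming a hypothetical CachedFile subclass (the class and method names are illustrative, not from the original post). Each remote property is read exactly once, in the constructor; the comparator then touches only cached values:

```java
import java.io.File;
import java.util.Comparator;
import java.util.TreeSet;

// Sketch: a File subclass caching the properties the comparator needs,
// so each remote lookup happens once per file, not once per comparison.
class CachedFile extends File {
    private final long cachedLastModified;
    private final String cachedPath;

    CachedFile(File f) {
        super(f.getPath());
        this.cachedLastModified = f.lastModified();  // single remote call
        this.cachedPath = f.getAbsolutePath();       // single remote call
    }

    @Override
    public long lastModified() {
        return cachedLastModified;
    }

    String cachedPath() {
        return cachedPath;
    }
}

public class SortByDate {
    public static void main(String[] args) {
        // Compare on the cached timestamp; break ties by path so the
        // TreeSet does not silently drop files with equal timestamps.
        Comparator<CachedFile> byDate = new Comparator<CachedFile>() {
            public int compare(CachedFile a, CachedFile b) {
                int c = Long.compare(a.lastModified(), b.lastModified());
                return c != 0 ? c : a.cachedPath().compareTo(b.cachedPath());
            }
        };
        TreeSet<CachedFile> sorted = new TreeSet<CachedFile>(byDate);
        for (File f : new File(".").listFiles()) {
            sorted.add(new CachedFile(f));
        }
        for (CachedFile f : sorted) {
            System.out.println(f.lastModified() + "  " + f.getName());
        }
    }
}
```

The tie-breaker on path matters: a TreeSet treats comparator-equal elements as duplicates, so comparing on timestamp alone would discard files modified in the same millisecond.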
Wednesday, January 23, 2008
Caching Complexities
Implementing a simple cache looks easy at first, but over time it becomes complex due to the natural complexities of caching. If an open-source cache implementation matches the requirements, using it is definitely the preferred approach instead of reinventing the wheel. If no such solution is suitable, or it is too heavyweight for the project's needs, the following problems need to be handled in the implementation.
Scope of the Cache:
The first problem to look at is the scope of the cache. Caching global objects in user sessions never makes sense: as the number of sessions grows, so does the memory demand, and the application easily runs into out-of-memory situations. Also, when initializing a global cache, synchronization aspects need to be taken care of.
Object Relationships:
The second problem to handle during design is object relationships. For a one-to-many relationship, caching the parent leads to duplication of the child objects, which again causes out-of-memory issues. To address this, consider a multi-level cache: fold the object relationships into multiple levels. For example, if a user has multiple objects, instead of caching the user together with the list of objects, break it into two caches. The first cache stores the list of keys for the objects rather than the objects themselves; the second-level cache maps each object key to the object. More specific techniques for avoiding replicated objects in the cache can be implemented along the same lines.
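The two-level layout above can be sketched as follows (a hypothetical TwoLevelCache class for illustration; names and methods are assumptions, not from the post). The first map holds only object keys per user; the second maps each key to a single shared instance, so an object referenced by many users is stored once:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a two-level cache: user -> object keys, key -> object.
// A shared object appears once in keyToObject regardless of how many
// users reference it, avoiding the duplication problem.
public class TwoLevelCache {
    private final Map<String, List<String>> userToKeys = new HashMap<String, List<String>>();
    private final Map<String, Object> keyToObject = new HashMap<String, Object>();

    public void put(String user, String key, Object value) {
        List<String> keys = userToKeys.get(user);
        if (keys == null) {
            keys = new ArrayList<String>();
            userToKeys.put(user, keys);
        }
        keys.add(key);
        keyToObject.put(key, value);  // stored once, shared by all referrers
    }

    public List<Object> objectsFor(String user) {
        List<String> keys = userToKeys.get(user);
        if (keys == null) {
            return Collections.emptyList();
        }
        List<Object> result = new ArrayList<Object>();
        for (String key : keys) {
            result.add(keyToObject.get(key));
        }
        return result;
    }
}
```

Invalidating an object then means removing one entry from the second-level map, rather than hunting down every user whose cached list embedded a copy of it.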
Cache Cohesiveness
The third problem to take care of is updating or invalidating the cache. There is no bulletproof solution: every API that writes data has to take this cache aspect into account, and the problem becomes more complicated in multi-node applications. However, based on the nature of the read/write operations, we can optimize cache maintenance. If write operations are not expected by design, there is obviously no need to worry about cache invalidation; if writes are expected only rarely, we can simply tear down the entire cache after each write operation.
Control Cache Size
A cache cannot assume infinite space in memory. Its size must be controlled with an eviction algorithm such as LRU or FIFO. Using Java's LinkedHashMap, this kind of cache is easy to implement.
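For example, LinkedHashMap in access-order mode plus an overridden removeEldestEntry gives a bounded LRU cache in a few lines (LruCache is an illustrative name, not from the post):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// LRU cache built on LinkedHashMap: the accessOrder=true constructor flag
// moves an entry to the end on each get/put, and removeEldestEntry evicts
// the least recently used entry once the size bound is exceeded.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public LruCache(int maxEntries) {
        super(16, 0.75f, true);  // initial capacity, load factor, access order
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```

A FIFO variant is the same class with the accessOrder flag left at false (insertion order), so the oldest-inserted entry is evicted regardless of how recently it was read.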