Wednesday, January 23, 2008

Sorting Files by Date in Java

When I needed to sort files by date, I initially underestimated the problem. A simple Java implementation using a TreeSet with a comparator on file.lastModified() worked well for file lists of up to 300k entries. Beyond that, performance dropped significantly, particularly while getting the list of files from remote shares.

From the stack dump it is clear that the system spends most of its time getting the last modified time from the file system. Here is the analysis of the problem: every operation on a TreeSet is on the order of log N, so adding N elements costs on the order of N log N comparisons, and each comparison calls lastModified() on both files. The number of calls to the remote file system therefore grows as N log N with the number of files.
A solution to this problem is a cached version of the File object: a subclass of java.io.File that caches the last modified time, full path, and any other properties the comparator uses for sorting. This brings the number of calls for file properties down to N, one per file.
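As a sketch of this idea (the class name and the tie-breaking comparator here are illustrative, not the original implementation):

```java
import java.io.File;
import java.util.Comparator;

// A File subclass that reads lastModified() once at construction, so the
// comparator never touches the (possibly remote) file system again.
public class CachedFile extends File {
    private final long cachedLastModified;

    public CachedFile(File f) {
        super(f.getPath());
        this.cachedLastModified = f.lastModified(); // single remote call per file
    }

    @Override
    public long lastModified() {
        return cachedLastModified;
    }

    // Comparator used when inserting into the TreeSet; the path tie-break
    // keeps files with identical timestamps from being dropped by the set.
    public static final Comparator<CachedFile> BY_DATE =
        Comparator.<CachedFile>comparingLong(f -> f.lastModified())
                  .thenComparing(f -> f.getPath());
}
```

The TreeSet is then constructed with CachedFile.BY_DATE, so the sort never goes back to the remote share.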

Caching Complexities

Implementing a simple cache looks easy at first, but over time it becomes complex due to the natural complexities of caching. If an open source cache implementation matches the requirements, it is definitely the preferred approach instead of re-inventing the wheel. If those solutions are not suitable, or are too much for the project requirements, the problems below need to be taken care of in the implementation.


Scope of the Cache:

The very first problem we need to look at is the scope of the cache. Caching global objects in user sessions never makes sense: as the number of sessions increases, so does the memory demand, and the application easily enters out-of-memory situations. Also, while initializing a global cache, we need to take care of synchronization aspects.
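For the global case, one well-known way to get the initialization synchronization right in Java is the class-holder idiom; this is just an illustrative sketch with made-up names:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Application-scoped cache using the class-holder idiom. The JVM guarantees
// the Holder class is initialized exactly once, on first use, so no explicit
// locking is needed on the read path.
public class GlobalCache {
    private GlobalCache() {}

    private static class Holder {
        static final Map<String, Object> INSTANCE = new ConcurrentHashMap<>();
    }

    public static Map<String, Object> instance() {
        return Holder.INSTANCE;
    }
}
```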


Object Relationships:

The second problem to be handled during design is object relationships. For a one-to-many relationship, caching the parent leads to duplication of the child objects, which also leads to out-of-memory issues. To address this problem, consider a multi-level cache: the object relationships need to be folded into multiple levels. For example, if a user has multiple objects, instead of caching the user with its list of objects, we break it into two caches. The first cache stores the list of keys for the objects instead of the actual objects; the second-level cache stores the mapping from object key to object. More specific techniques along the same lines can be implemented to avoid replicated objects in the cache.
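A minimal sketch of such a two-level cache might look like this (all names here are made up for illustration):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Two-level cache for a one-to-many relationship: the first level maps a
// user id to the *keys* of its objects; the second level maps each key to
// the shared object, so an object referenced by many users is stored once.
public class TwoLevelCache {
    private final Map<String, List<Long>> userToKeys = new HashMap<>();
    private final Map<Long, Object> keyToObject = new HashMap<>();

    public void put(String userId, long key, Object value) {
        userToKeys.computeIfAbsent(userId, u -> new ArrayList<>()).add(key);
        keyToObject.put(key, value); // stored once, however many users refer to it
    }

    public List<Object> get(String userId) {
        List<Object> result = new ArrayList<>();
        for (long key : userToKeys.getOrDefault(userId, Collections.emptyList())) {
            result.add(keyToObject.get(key));
        }
        return result;
    }
}
```

Because the second level is keyed by the object key, an object shared by many parents occupies memory only once.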


Cache Cohesiveness

The third problem to be taken care of is updating or invalidating the cache. There is no bullet-proof solution to this problem; every API implementation needs to take this cache aspect into account, and it becomes more complicated in multi-node applications. However, based on the nature of the read and write operations, we can optimize the cache maintenance. If write operations are not expected by design, there is obviously no need to worry about cache invalidation. If write operations are expected only rarely, we can tear down the entire cache after each write operation.

Control Cache Size

A cache cannot assume infinite space in memory, so it is required to control the cache size using algorithms like LRU, FIFO, etc. Using Java's LinkedHashMap we can implement this kind of cache easily.
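For example, an LRU cache takes only a few lines on top of LinkedHashMap; the capacity and names below are illustrative:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// LRU cache via LinkedHashMap: the third constructor argument turns on
// access order, and removeEldestEntry evicts once the size limit is passed.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public LruCache(int maxEntries) {
        super(16, 0.75f, true); // true = access order, not insertion order
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```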

Friday, November 2, 2007

Classification of Software Applications – Customer View

Several classifications of software applications may already exist. I want to classify applications based on their customers, independent of technology.

At a high level below is the classification of software application business segments

  1. End User Applications
  2. End User Service Provider Applications
  3. Enterprise Applications
  4. Enterprise Service Provider Applications

Applications in every segment have their own unique requirements. We cannot impress the users of one segment with an application developed for another segment. In reality there may be a slight overlap, but we can identify the sweet spot easily if we look at the feature set carefully. In this article I will try to explain each segment and its unique properties. These aspects are not just related to product marketing; product development also needs to understand them to build applications that best fit the requirements.

End user applications are targeted to be installed, used, and maintained by the end users. Very good installation wizards and easy or zero maintenance are some of the key properties. The application should work in normal environments without a need to install additional applications. Some examples of this category are editors, games, and utilities like file compression.

End user service provider applications are usually web applications: one service provider hosts the application and several users use the system concurrently. Unlike end user applications, installing and uninstalling the application is not very important, but other features like backup, staging, and restore play a critical role. The system is required to support a huge number of concurrent users.

Enterprise applications are targeted for use in enterprise environments. To be used in an enterprise environment, an application needs to meet some mandatory requirements such as authentication, role-based access, scalability, and performance. The application should integrate easily with existing infrastructure, so integration with external authentication systems and existing management systems makes an application fit enterprise environments.

For small enterprises, maintaining infrastructure is a burden, so they have started relying on service providers, with stringent requirements agreed upon in the form of SLAs. This adds special requirements to applications running in this segment. Application scalability, performance, and user management play a critical role. Unavailability or performance issues may incur a huge loss to the service provider, so monitoring these applications is very important. Securing the privacy of individual users or enterprises is another unique property of applications in this space.

Friday, October 26, 2007

Concurrent DB operations

In one of the projects I worked on, there is a persistence layer that provides its API in the form of a createOrUpdate method. This method starts a transaction and verifies whether any object exists with the same business key. If one exists, it updates the existing object; otherwise it inserts a new one.

While performing DB write operations like create or update from multiple threads, I found two kinds of issues:

  1. Phantom reads. When multiple threads decide between insert and update by performing a DB lookup, two threads may both decide to insert; because of the other thread, the later insert fails with a constraint violation error.
  2. Transaction deadlock. If the transaction is relatively large and another transaction tries to insert the same object, it leads to a deadlock.

Setting the database transaction isolation level to serializable, or serializing these methods with a coarse lock, solves the problem but defeats the purpose of parallel threads.

Below are two workarounds for this issue:
  1. Re-submit failed transaction with random wait.
  2. Lock and write model

The first one is very simple to implement. Whenever a transaction fails due to a recoverable error, wait for some random time and retry the transaction. Identifying which cases can be retried is important for this solution: it does not make sense to retry all failures, only recoverable errors like constraint violations and deadlocks.
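A sketch of this retry loop might look like the following; the isRecoverable check is a placeholder, since a real implementation would inspect the database's SQLState or vendor error codes:

```java
import java.util.Random;
import java.util.concurrent.Callable;

// Retry-with-random-wait workaround: rerun the transaction after a short
// random pause, but only for failures classified as recoverable.
public class RetryingWriter {
    private static final Random RANDOM = new Random();

    public static <T> T withRetry(Callable<T> tx, int maxAttempts) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return tx.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts || !isRecoverable(e)) {
                    throw e; // give up: attempts exhausted or not retryable
                }
                Thread.sleep(RANDOM.nextInt(200)); // random wait before retrying
            }
        }
    }

    // Placeholder classification for the example; a real check would look at
    // constraint-violation and deadlock error codes from the JDBC driver.
    static boolean isRecoverable(Exception e) {
        return e instanceof IllegalStateException;
    }
}
```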

In the second model, we identify the business key that makes the entity unique in the domain and track locks on this key. Before attempting a create/update transaction, every thread needs to obtain the lock on the business key; after completing the transaction, it releases the lock, which notifies the threads waiting for it. Locking at this granularity may not be possible with every database, so implementing it in the application layer also keeps the application database independent.
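A minimal sketch of such per-key locking, with illustrative names (a real implementation would also evict lock entries for keys no longer in use):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Lock-and-write model: one lock object per business key. Threads targeting
// the same key serialize on that key's lock, while threads with different
// keys proceed in parallel.
public class KeyedLocks {
    private final ConcurrentMap<String, Object> locks = new ConcurrentHashMap<>();

    public void withLock(String businessKey, Runnable transaction) {
        Object lock = locks.computeIfAbsent(businessKey, k -> new Object());
        synchronized (lock) {
            transaction.run(); // createOrUpdate happens here
        }
    }
}
```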

Sunday, October 14, 2007

Complexities in Copying Files with Java

Copying a file is a simple operation, but while working with Java I noticed there is enough to say about it.

The most basic form of file copying is opening two streams and copying byte by byte. Even though it is very inefficient, this code snippet is included to cover the topic from the beginning.

FileInputStream fin = new FileInputStream(fromFile);
FileOutputStream fout = new FileOutputStream(newFile);
int aChar;
// read() returns -1 at end of stream; testing for > 0 would stop early at any zero byte
while ((aChar = fin.read()) != -1) {
    fout.write(aChar);
}

We can enhance the above snippet by using a buffer of fixed size instead of copying byte by byte. Choosing an appropriate buffer size affects performance: it should be neither too small nor too large.

FileInputStream fin = new FileInputStream(fromFile);
FileOutputStream fout = new FileOutputStream(newFile);
int charRead;
byte[] buffer = new byte[BUFFER_SIZE];
while ((charRead = fin.read(buffer)) != -1) {
    fout.write(buffer, 0, charRead);
}

We can consider the above code snippet trouble free and it always works, but a much more significant performance improvement is possible using Java NIO. In the above approach, the OS reads the file content into memory, copies it into Java buffers, and follows the same path back to write into the file. Using NIO, we can transfer directly between the OS buffers, without copying into a Java buffer. Copying into the OS buffers is usually performed by the hardware drivers, so this operation takes fewer CPU cycles. The code snippet looks something like this.

FileChannel in = new FileInputStream(src.getAbsoluteFile()).getChannel();
FileChannel out = new FileOutputStream(dst.getAbsoluteFile()).getChannel();
long size = in.size();
long bytesTransferred = 0L;
for (long bytesWritten = 0L; bytesWritten < size; bytesWritten += bytesTransferred) {
    bytesTransferred = in.transferTo(bytesWritten, CHANNEL_TRANSFER_SIZE, out);
}


The above code snippet improves performance thanks to NIO, but it introduces OS-specific dependencies. The OS resources used by NIO, such as the paged pool, are very limited. With multiple threads copying big files, we easily ran out of buffers on Windows 2003. While copying large files, Windows memory pages continue to accumulate from the pool until they reach 160MB (80% of 200MB). Once this limit is reached, the Windows memory manager activates and frees up the pool. If the remaining 20% fills up with subsequent requests before the memory manager completes cleanup, we get an insufficient-system-resources exception (system error 1450).

Microsoft has an article on this subject:
http://support.microsoft.com/kb/304101 . As per this article there are two ways to handle the problem:

  1. Decrease the memory pool threshold so that the memory manager starts cleanup much earlier, leaving sufficient pool memory available while cleanup completes.
  2. Set the memory pool to unlimited, which gives the maximum possible pool memory.

In my experience with this problem, the first alternative worked.


Saturday, September 22, 2007

Three Dimensions for Decision Making

Decision making is not trivial in software development. We follow a systematic approach to decision making in our software development environment that helps us decide fast. Decision making cannot be automated, but this approach helps a lot in making or justifying a decision.

In enterprise software development, we need to deal with alternatives and trade-offs at every stage of development. We need to choose the best alternative, keeping the current situation and future direction in mind, to achieve customer satisfaction.

  • If we don’t make proper decisions about the usability of the product, it builds customer dissatisfaction with the product.
  • If we don’t make decisions at the proper time, it may add significant delay and additional cost to the project execution. Sometimes we may also lose important customers.
  • If we don’t make decisions with the future direction of the product in mind, it may lead to major redesign and sometimes to scrapping existing code.

Making all the decisions about every stage of the project beforehand is practically impossible, whether due to changing requirements or to awareness of more alternatives during the course of development. Sometimes decisions deferred in one stage of development become significant during the next stage, and those decisions may ripple back into previous stages. The difficulty in decision making is more visible in distributed teams than in a single team working in one geographical location, due to the limited communication among the team.

In Innominds, we follow a systematic approach to handle any trade-off by analyzing the solution in three dimensions.

Dimension #1: Usability
Usability of a software solution is very important for reaching customer satisfaction. We need to target the solution that gives the best usability for the product.

Dimension #2: Scalability
While choosing an alternative, we need to analyze whether the solution scales as the size of the problem increases. If we don’t consider realistic scalability requirements, the product may pass quality checks in our lab environments but fail at the very first customer.

Dimension #3: Time-To-Market
Time-To-Market is very important for any software product. To meet the time requirements it is acceptable to sacrifice the scalability aspect, but it is not recommended to sacrifice usability.

Sunday, August 26, 2007

Coding, Testing: Which is First?

I have seen some managers plan that Resource-A will do the coding and Resource-B will do the unit testing. In my opinion this is a total misconception about unit testing. Unit testing is not a task to complete; it is an integral part of coding. Unit testing without coding and coding without unit testing are both unproductive. In a test-driven environment, we need to design and plan our work so that after coding a few lines, we write test cases and verify the code we just completed.

I have noticed some people coding and then testing directly from the UI. In this approach unit tests get lower priority against the anxiety to do the final testing through the UI; they are considered just a distraction and additional work. This leads to poor quality code, with more time spent on debugging during development and testing. Writing unit tests after coding increases development time, and there is a greater tendency to miss important cases, which again leads to poor quality code. The bottom line is: if we don’t do it the right way, adding more time cannot improve the quality.

It is important to understand the reasons behind the questions below:
Why do people tend to do integration testing before unit testing?
How does it matter whether we unit test after completing the coding instead of testing while coding?

Sometimes people are not sure about what they are coding, and not sure whether it is going to work together with the rest of the system. So they focus directly on integration testing. This is understandable, because if in the end they find the solution is not going to work, all the code has to be thrown out along with its unit tests. Instead of jumping straight into coding, it is necessary to do some design homework. Sometimes adding an integration test with stubs, and discussing the solution with peers, helps to improve the situation.

My answer to the second question is that the way we write the unit tests definitely matters. We can apply a generic principle: as the delay in getting feedback increases, its effectiveness decreases. Executing unit tests is a way of getting feedback from the code. We can write better test cases for code written a few minutes ago than for code written last week or last month, and it takes more time to recollect and understand old code. Taking immediate feedback also saves debugging time during integration.