Thursday, June 14, 2007

 

Adding compact Enum and EnumSet support to a Hadoop application

After working on many projects that use a combination of int and String constants in various ways to mimic enumerations, I learned to like Java Enums. I am also a fan of Hadoop, so I decided to see how the two fit together.

The WritableUtils class already supports writing and reading Java Enum values. In WritableUtils an Enum is serialized by converting the value of enum.name() to Text. There is also another way of storing Enum state which requires less space: the ordinal() method gives you the index of the field in the Enum. This technique allows you to store an Enum containing up to 256 different fields in a single byte, provided the byte is read back as an unsigned value. EnumWritable perhaps does not warrant its own class; utility methods capable of writing an Enum to DataOutput and reading it from DataInput would be sufficient.


import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

/** Stores an enum value as a single byte holding its ordinal. */
public class EnumWritable<E extends Enum<E>> implements Writable {

    // Ordinal of the stored constant; always read back as unsigned (0-255).
    private byte storage;

    public EnumWritable() {
        storage = 0;
    }

    public EnumWritable(E value) {
        set(value);
    }

    /** Returns the constant whose ordinal is stored, or null if there is none. */
    public E get(Class<E> enumType) {
        E[] constants = enumType.getEnumConstants();
        // Mask to avoid sign extension for ordinals above 127.
        int ordinal = storage & 0xFF;
        return ordinal < constants.length ? constants[ordinal] : null;
    }

    public void set(E e) {
        storage = (byte) e.ordinal();
    }

    public void readFields(DataInput in) throws IOException {
        storage = in.readByte();
    }

    public void write(DataOutput out) throws IOException {
        out.writeByte(storage);
    }

    @Override
    public String toString() {
        return Integer.toString(storage & 0xFF);
    }

    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof EnumWritable)) {
            return false;
        }
        EnumWritable<?> that = (EnumWritable<?>) obj;
        return this.storage == that.storage;
    }

    @Override
    public int hashCode() {
        return storage;
    }

}
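
As a rough sketch of the utility-method alternative mentioned above, the same one-byte encoding can be written as two static helpers that work directly on DataOutput and DataInput. The class name EnumBytes, the method names and the Stage enum are made up for this illustration; none of them are part of Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical helper (not a Hadoop class): one-byte enum serialization by ordinal.
final class EnumBytes {

    private EnumBytes() {
    }

    /** Writes the ordinal of the value as one unsigned byte. */
    static void writeEnum(DataOutput out, Enum<?> value) throws IOException {
        int ordinal = value.ordinal();
        if (ordinal > 0xFF) {
            throw new IOException("Ordinal " + ordinal + " does not fit in one byte");
        }
        out.writeByte(ordinal);
    }

    /** Reads one unsigned byte and maps it back to a constant of enumType. */
    static <E extends Enum<E>> E readEnum(DataInput in, Class<E> enumType) throws IOException {
        int ordinal = in.readUnsignedByte();
        E[] constants = enumType.getEnumConstants();
        if (ordinal >= constants.length) {
            throw new IOException("No constant with ordinal " + ordinal + " in " + enumType.getName());
        }
        return constants[ordinal];
    }

    // Example enum used only for the demonstration below.
    enum Stage { RAW, PARSED, PROCESSED }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        writeEnum(new DataOutputStream(bytes), Stage.PARSED);
        System.out.println("bytes written: " + bytes.size());

        DataInput in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        System.out.println("read back: " + readEnum(in, Stage.class));
    }
}
```

One thing to keep in mind with any ordinal-based encoding: the stored byte is only meaningful as long as the order of the constants in the enum declaration does not change between writing and reading.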


Another useful structure for storing the equivalent of multiple booleans, or any piece of data that consists of bits (which can be modeled as fields in an enum), is EnumSet. Storage for an EnumSet is also compact if you take advantage of the ordinal() method: you can store an EnumSet using 1 bit per field.

I wrote a prototype EnumSetWritable that takes from 1 to 4 bytes of storage, depending on how many fields the enum contains and which enums are in the set. The 2 least significant bits store the size of the storage (1-4 bytes) and the rest of the bits store the Enums in the EnumSet. Because at most 4 bytes (32 bits) are used and 2 of those bits hold the size, it can store EnumSets of Enums with up to 30 fields. Ordering the fields so that the ones most often present in an EnumSet come first, and therefore get the lowest ordinals, will most probably lead to smaller space consumption.
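
The prototype itself is not shown here, so the following is only a sketch of one way the described layout could work, not the actual EnumSetWritable. It assumes that the 2 size bits hold the byte count minus one, that bit i of the remaining bits corresponds to ordinal i, and that bytes are written lowest byte first so the size bits arrive in the first byte. The class name EnumSetCodec and the Flag enum are hypothetical.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.EnumSet;

// Hypothetical codec illustrating a 1-4 byte EnumSet encoding (assumed layout).
final class EnumSetCodec {

    private static final int MAX_FIELDS = 30; // 32 bits minus 2 size bits

    private EnumSetCodec() {
    }

    static <E extends Enum<E>> void write(DataOutput out, EnumSet<E> set) throws IOException {
        long mask = 0;
        for (E e : set) {
            if (e.ordinal() >= MAX_FIELDS) {
                throw new IOException("Ordinal " + e.ordinal() + " does not fit in 30 bits");
            }
            mask |= 1L << e.ordinal();
        }
        // Smallest byte count whose payload (8 * size - 2 bits) holds the mask.
        int size = 1;
        while (size < 4 && (mask >>> (8 * size - 2)) != 0) {
            size++;
        }
        long packed = (mask << 2) | (size - 1);
        for (int i = 0; i < size; i++) {
            out.writeByte((int) (packed >>> (8 * i))); // lowest byte first
        }
    }

    static <E extends Enum<E>> EnumSet<E> read(DataInput in, Class<E> enumType) throws IOException {
        long packed = in.readUnsignedByte();
        int size = (int) (packed & 0x3) + 1; // size bits live in the first byte
        for (int i = 1; i < size; i++) {
            packed |= ((long) in.readUnsignedByte()) << (8 * i);
        }
        long mask = packed >>> 2;
        EnumSet<E> result = EnumSet.noneOf(enumType);
        for (E e : enumType.getEnumConstants()) {
            if (e.ordinal() < MAX_FIELDS && (mask & (1L << e.ordinal())) != 0) {
                result.add(e);
            }
        }
        return result;
    }

    // Example enum with ten fields, used only for illustration.
    enum Flag { A, B, C, D, E, F, G, H, I, J }
}
```

Under these assumptions a set containing only Flag.A and Flag.B fits in one byte, while a set containing Flag.G (ordinal 6) already needs two, which is why putting the most frequently set fields first tends to save space.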

A real-world application for EnumSet could be, for example, something like this:



enum ContainsData {
    Raw, Parsed, Processed;
}

EnumSetWritable<ContainsData> contains = new EnumSetWritable<ContainsData>();

public void readFields(DataInput in) throws IOException {
    // Read the presence flags first, then only the parts that are present.
    contains.readFields(in);
    for (ContainsData type : contains.getEnumSet(ContainsData.class)) {
        switch (type) {
            case Raw:
                // read raw data
                break;
            case Parsed:
                // read parsed data
                break;
            case Processed:
                // read processed data
                break;
        }
    }
}



Here you would use an EnumSet to store which data structures are present in a Writable, and so avoid, for example, storing a separate boolean (taking one byte of space each) for every piece of that information.
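
To make the space difference concrete, here is a small sketch that writes the same presence information once as individual booleans and once as a single byte of bits. It uses a plain one-byte bitmask rather than the EnumSetWritable format, and the class name PresenceFlagSizes is made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.EnumSet;

// Hypothetical comparison of the two encodings; not part of Hadoop.
final class PresenceFlagSizes {

    enum ContainsData { Raw, Parsed, Processed }

    private PresenceFlagSizes() {
    }

    /** Writes one boolean per enum field and returns the number of bytes used. */
    static int sizeAsBooleans(EnumSet<ContainsData> present) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        for (ContainsData type : ContainsData.values()) {
            out.writeBoolean(present.contains(type)); // one byte per flag
        }
        return bytes.size();
    }

    /** Writes one bit per enum field into a single byte and returns the number of bytes used. */
    static int sizeAsBitmask(EnumSet<ContainsData> present) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        int mask = 0;
        for (ContainsData type : present) {
            mask |= 1 << type.ordinal();
        }
        out.writeByte(mask); // enough for enums with up to 8 fields
        return bytes.size();
    }

    public static void main(String[] args) throws IOException {
        EnumSet<ContainsData> present = EnumSet.of(ContainsData.Raw, ContainsData.Processed);
        System.out.println("booleans: " + sizeAsBooleans(present) + " bytes");
        System.out.println("bitmask:  " + sizeAsBitmask(present) + " bytes");
    }
}
```

With the three fields of ContainsData the boolean version takes 3 bytes and the bitmask version takes 1, and the gap grows with every field added.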
