How can Hadoop help secure the sensitive information in a document and keep it within your private cloud? If you want to know how the amazing guys at Xerox have proposed a novel architecture for this business problem, then read on.
Many documents and applications, such as electronic medical records (EMR), tax forms, surveys, and claims, may contain both sensitive private information and public information, so there needs to be a way to protect the private information while still being able to distribute the public information from the same document.
In the architecture invented by Shanmuganathan Gnanasambandam, Naveen Sharma, and Wendell Lewis Kibler (Xerox), a processor first determines the document structure from interconnected documents and intelligently indicates “specific information, passages, and/or components of the document as sensitive or insensitive information”. The private information is stored as a file, along with metadata, on internal cloud storage, while the public information is stored as a public file on external cloud storage such as Amazon.
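The split into a private file with metadata and a public file can be sketched roughly as follows. This is a minimal illustration and not the patented method: the field names in SENSITIVE_FIELDS, the split_document function, and the dictionary layout are all assumptions made for the example, and a fixed list of field names stands in for the processor that infers sensitivity from document structure.

```python
# Hypothetical field names marked sensitive for this sketch only.
SENSITIVE_FIELDS = {"patient_name", "ssn", "diagnosis"}


def split_document(doc_id, fields, sensitive=SENSITIVE_FIELDS):
    """Split one document into a private file (with metadata) and a public file."""
    private_fields = {k: v for k, v in fields.items() if k in sensitive}
    public_fields = {k: v for k, v in fields.items() if k not in sensitive}
    private_file = {
        "doc_id": doc_id,
        "fields": private_fields,
        # Metadata records the original field order so the document can be rebuilt.
        "metadata": {"field_order": list(fields)},
    }
    public_file = {"doc_id": doc_id, "fields": public_fields}
    return private_file, public_file


# Example record with made-up values.
record = {
    "patient_name": "Jane Doe",
    "ssn": "000-00-0000",
    "visit_date": "2012-05-01",
    "clinic": "Main St",
}
private_file, public_file = split_document("emr-1", record)
```

In this sketch the private file would go to internal storage and the public file to the external cloud; the storage calls themselves are left out.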
Private and public files may be stored in a replicated fashion in a distributed file system (like HDFS) where a file may be replicated and/or split into a plurality of pieces. “Each piece or replica differs slightly from the others in that each piece or replica includes a bit pattern different from the other (i.e., each replica is not identical byte-for-byte to any other replica)”.
The team goes one step further: once the replicas are stored, Hadoop's replication process kicks in to place one or more replicas relatively close to the point of consumption and one or more replicas one or more hops away from it. “As a result, the farther a particular replicated file is from the point of consumption, the larger the number of replicated files to decode or crack and the longer the encryption key”.
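To make the idea of non-identical replicas with distance-dependent keys concrete, here is a toy sketch. Everything in it is an assumption for illustration: the constants BASE_KEY_BYTES and EXTRA_BYTES_PER_HOP, the linear growth rule, and the XOR keystream built from SHAKE-256 are not taken from the patent, and the XOR scheme is not a secure cipher. It only shows the key-length part of the quoted passage, not the larger number of replicas to decode.

```python
import hashlib
import secrets

BASE_KEY_BYTES = 16       # assumed key size for a replica at the point of consumption
EXTRA_BYTES_PER_HOP = 8   # assumed key growth for each additional hop


def key_length_for_hops(hops):
    """Return a key length (in bytes) that grows with distance in hops."""
    return BASE_KEY_BYTES + EXTRA_BYTES_PER_HOP * hops


def xor_with_keystream(data, key):
    """Toy reversible transform: XOR data with a SHAKE-256 keystream derived from key."""
    keystream = hashlib.shake_256(key).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, keystream))


def make_replicas(data, hop_distances):
    """Create one replica per hop distance, each encoded with its own random key."""
    replicas = []
    for hops in hop_distances:
        key = secrets.token_bytes(key_length_for_hops(hops))
        replicas.append({"hops": hops, "key": key, "payload": xor_with_keystream(data, key)})
    return replicas


original = b"private section of the document"
replicas = make_replicas(original, [0, 1, 3])
```

Because every replica gets a different random key, the stored payloads differ from one another, and applying xor_with_keystream again with the matching key recovers the original bytes.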
When a user needs to access the entire document, the client program may access and decrypt the private file from the private cloud, fetch the public file from the public cloud, and merge them to show one consolidated document view. Hadoop is architected here to compute and store documents in accordance with a multi-split/replica approach, which enables this unique design.
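The client-side merge described above might look something like this sketch. The merge_document function, the dictionary layout, and the field names are hypothetical, and base64 decoding is used purely as a stand-in for real decryption so that the example runs on its own.

```python
import base64


def merge_document(private_file, public_file, decrypt=lambda value: value):
    """Combine a private and a public file into one view, in original field order."""
    if private_file["doc_id"] != public_file["doc_id"]:
        raise ValueError("private and public files belong to different documents")
    combined = dict(public_file["fields"])
    # Private values are decoded on the client before being merged in.
    combined.update({k: decrypt(v) for k, v in private_file["fields"].items()})
    order = private_file["metadata"]["field_order"]
    return {k: combined[k] for k in order}


# Made-up example: the private value is base64-encoded as a placeholder for encryption.
private_file = {
    "doc_id": "emr-1",
    "fields": {"patient_name": base64.b64encode(b"Jane Doe").decode()},
    "metadata": {"field_order": ["patient_name", "visit_date"]},
}
public_file = {"doc_id": "emr-1", "fields": {"visit_date": "2012-05-01"}}

view = merge_document(
    private_file,
    public_file,
    decrypt=lambda value: base64.b64decode(value).decode(),
)
```

The metadata kept with the private file is what lets the client restore the original field order, so the public file alone reveals neither the private values nor where they sat in the document.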