Search Engine for ILM?
I was reading an article (reference #1) and it struck me that the problem it describes is really an information retrieval, or search engine, problem.
Some background (from the article)
Information Lifecycle Management (ILM) "defines processes for managing data across an infrastructure throughout the data's useful life...". Because of government regulations such as HIPAA and Gramm-Leach-Bliley, enterprises must "retain certain data for set periods of time and/or produce data for auditors in short order".
"The holy grail of ILM is data management by some sort of classification scheme. Once a scheme is established, data must be tagged to ensure it's migrated across the storage infrastructure according to class. You can construct policies to employ data class and usage characteristics, and to determine what to move and when". "No scheme exists thus far for data classification." "So you must (manually) identify, list and define data objects and classification criteria, and then compare them for similarities. ... Then you must find a way to apply classes consistently - by involving users or by harnessing yet-to-be-developed technologies, so that data movers can pick data and move it by class policy".
The data is mostly unstructured (files), with the rest made up of structured databases and semistructured data such as email, groupware and workflow forms.
"None of the more than 270 hardware and software products called ILM solutions really are." So the problem exists, but "ILM is not a product. There is no silver bullet and no ILM 1.0 product available, so stop shopping for one".
My analysis
This sounds like a classic data/text mining problem. You have large amounts of unstructured data (think: the web). You want to define categories or classes, segment the data into those classes, and then define policies (retention, security, ...) for each cluster/segment of like data.
You can run clustering techniques on a small sample of the data as a form of exploratory data analysis. This segments the sample into clusters. You then analyze each cluster to decide on its class label. At that point you have training data for a classification algorithm: the class labels are defined, and the elements of each labeled cluster serve as examples of that label. The trained classifier can then tag the rest of the data.
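Here is a minimal sketch of that cluster-then-classify idea in Python, using only the standard library. Everything in it is my own illustration, not something from the article: the sample documents, the class labels ("medical-record", "financial-record"), the choice of a simple spherical k-means over bag-of-words vectors, and the nearest-centroid classifier are all assumptions. A real system would likely use a proper text mining library and far more data.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Turn text into a unit-length bag-of-words vector (term -> weight)."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {term: c / norm for term, c in counts.items()}

def dot(a, b):
    """Dot product of two sparse vectors; equals cosine for unit vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(term, 0.0) for term, w in a.items())

def centroid(vectors):
    """Unit-length sum of a group of vectors."""
    total = Counter()
    for v in vectors:
        for term, w in v.items():
            total[term] += w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {term: w / norm for term, w in total.items()}

def nearest(vector, centroids):
    """Index of the centroid most similar to the vector."""
    return max(range(len(centroids)), key=lambda c: dot(vector, centroids[c]))

def kmeans(vectors, k, iterations=10):
    """Simple spherical k-means, seeded with the first k vectors
    so that the result is deterministic (an assumption for this sketch)."""
    centroids = [dict(v) for v in vectors[:k]]
    assignment = []
    for _ in range(iterations):
        assignment = [nearest(v, centroids) for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = centroid(members)
    return assignment, centroids

def classify(text, centroids, labels):
    """Tag a new document with the label of its nearest cluster."""
    return labels[nearest(vectorize(text), centroids)]

# Step 1: cluster a small (hypothetical) sample of documents.
sample = [
    "patient diagnosis and treatment record",
    "quarterly invoice payment ledger",
    "patient lab results and diagnosis",
    "invoice for payment of account ledger",
    "treatment plan for patient",
    "ledger of quarterly payment totals",
]
assignment, centroids = kmeans([vectorize(d) for d in sample], k=2)

# Step 2: a person inspects each cluster and names it (hypothetical labels).
labels = {0: "medical-record", 1: "financial-record"}

# Step 3: use the labeled clusters to tag documents outside the sample.
print(classify("payment received for invoice", centroids, labels))
print(classify("patient diagnosis update", centroids, labels))
```

The manual step is still there (someone has to look at each cluster and name it), but it is done once per cluster rather than once per file, and the labels could then drive whatever retention or security policy applies to that class.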
References:
- ILM Gets a Life (Network Computing Magazine, 9/14/2005)
- Computer Science Corp.'s latest survey of the Financial Executives International regarding IT issues, including data growth and regulatory compliance
- Data Management Institute's free white paper on data-classification strategies, complete with forms for conducting a preliminary analysis
- Discussion boards and content covering ILM implementation and other storage issues
- Web blog on data management