BenMeadowcroft.com

Database research

By Ben Meadowcroft

Introduction

Aim of the report

The goal of this report is to give a brief description of the preferred research area in database technologies, identified from the technologies mentioned in the Asilomar report. The report also gives a projection of how the market will evolve over the next five years (products, competitors, markets, etc.).

After researching the technologies mentioned in the Asilomar report, it is apparent that the company should invest in fast querying of federated database systems and in other technologies that lead to more efficient data mining.

Description

The research areas chosen are important database technologies identified by the Asilomar report. One of the technologies highlighted was the federated database system. In their work on the CDF system, the University of East Anglia used the following description: "A federated database system (FDS) can be defined as a collection of independently managed, heterogeneous database systems that allow partial and controlled sharing of data without affecting existing applications." http://www.sys.uea.ac.uk/TechReport/C94/SYS-C94-04/node2.html

As the Asilomar report also notes "the web is one large federated system".

The ability to offer our customers technologies that let them query federated databases efficiently and mine information sources effectively will be a significant factor in their choosing our technology over the alternatives. Data mining will be a significant growth area for database technologies because it is such a profitable tool for modern businesses.

In his dissertation titled Classification and Association Tempo Miner Project, George Koundourakis noted the following. "Data mining is the automated discovery of non-trivial, previously unknown and potentially useful patterns embedded in databases. With wide applications of computers and automated data collection tools in business transaction processing, massive amounts of transaction data have been collected and stored in large databases. Hence, data mining techniques are of immediate interest. Classification and discovery of association rules are two basic components of data mining that help marketing, decision making and business management." http://timelab.co.umist.ac.uk/publications/theses/1997.html (emphasis not in original)
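Association rules are usually judged by two measures, support and confidence. The following is a minimal sketch of how they can be computed for a single rule; the transactions, item names and function names are invented for illustration and are not taken from the dissertation.

```python
# Hypothetical shopping-basket transactions (illustrative data only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in the itemset."""
    matching = sum(1 for basket in baskets if itemset <= basket)
    return matching / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


# Rule under test: customers who buy bread also buy milk.
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
```

A rule is normally reported only when both measures exceed thresholds chosen by the analyst.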

Development of the Market

Several companies are involved in database research of this kind, including Microsoft, IBM and Bell Laboratories. The technologies they are developing are described below:

Microsoft

"QP Recycler: Reuse - don't recompute"

This project is aimed at optimising database querying. As is apparent from its name, it is a query recycler: its aim is to eliminate redundancy in database querying. The research is based on operating a form of cache that avoids repeating computations common to many queries. As is noted by the Microsoft team, "The stream of queries seen by a database system often exhibits significant redundancy, that is, many queries contain similar selections, joins, and aggregations. Consequently, the system may be wasting valuable resources by recomputing the same result multiple times. To speed up query processing, it may be better to save some results for later reuse." http://www.research.microsoft.com/research/db/qprecycler/
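The idea of saving results for later reuse can be illustrated with a very small sketch. This is not Microsoft's implementation: the class name RecyclingExecutor, the sales table and the whole-query cache key are assumptions made for the example, and a real recycler would also reuse partial results (shared joins or aggregations) and invalidate entries when the data changes.

```python
import sqlite3


class RecyclingExecutor:
    """Hypothetical helper: caches whole query results keyed by the SQL text."""

    def __init__(self, connection):
        self.connection = connection
        self.cache = {}
        self.hits = 0

    @staticmethod
    def cache_key(sql):
        # Collapse runs of whitespace so layout differences share one key.
        return " ".join(sql.split())

    def execute(self, sql):
        key = self.cache_key(sql)
        if key in self.cache:
            self.hits += 1              # reuse: no access to the base tables
            return self.cache[key]
        rows = self.connection.execute(sql).fetchall()
        self.cache[key] = rows          # save the result for later reuse
        return rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10), ("south", 20), ("north", 5)],
)

executor = RecyclingExecutor(conn)
query = "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
first = executor.execute(query)
# Same query with different spacing: answered from the cache.
second = executor.execute(query.replace(" FROM ", "   FROM "))
```

The second call never reaches the database, which is the saving the quotation describes.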

IBM

"Garlic"

This project aims to tie together the data held in various repositories and to enable efficient querying of large amounts of many types of information.

This goal is illustrated by the following example of a practical demonstration of a federated system. "In the medical field, hospitals often have separate information systems for each department. Radiology may store MRI scans, etc., in one system, Cardiology may store EKG's in another and the Lab may store lab reports in a document management system. Doctors, however, need access to all of this information when treating a patient. In the future, hospitals would like to be able to store patient folders on-line, enabling doctors to search within and across folders (`find all folders where the patient has symptoms similar to this one'). However, they are unlikely to move all the data to a new, centralised system, or in fact, to any new system that disrupts their existing applications or threatens the autonomy of the various departments." http://www.almaden.ibm.com/cs/garlic/
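The hospital scenario can be sketched as a thin layer that queries each departmental system where it lives and merges the answers. The schemas, sample rows and the patient_folder function are hypothetical and are not part of Garlic; two in-memory SQLite databases and a dictionary stand in for the independent systems.

```python
import sqlite3

# Radiology and Cardiology each manage their own database (hypothetical schemas).
radiology = sqlite3.connect(":memory:")
radiology.execute("CREATE TABLE scans (patient_id INTEGER, scan_type TEXT)")
radiology.execute("INSERT INTO scans VALUES (1, 'MRI')")

cardiology = sqlite3.connect(":memory:")
cardiology.execute("CREATE TABLE ekgs (patient_id INTEGER, reading TEXT)")
cardiology.execute("INSERT INTO ekgs VALUES (1, 'normal sinus rhythm')")

# The Lab keeps documents rather than relational rows.
lab_reports = {1: ["blood panel"], 2: ["urinalysis"]}


def patient_folder(patient_id):
    """Collect one patient's records from every source without moving the data."""
    scans = [row[0] for row in radiology.execute(
        "SELECT scan_type FROM scans WHERE patient_id = ?", (patient_id,))]
    ekgs = [row[0] for row in cardiology.execute(
        "SELECT reading FROM ekgs WHERE patient_id = ?", (patient_id,))]
    return {
        "scans": scans,
        "ekgs": ekgs,
        "lab_reports": lab_reports.get(patient_id, []),
    }


folder = patient_folder(1)
```

Each department keeps its own store and applications; only the query layer knows about all three.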

Bell Laboratories

"AQUA: Approximate QUery Answering"

Like the QP Recycler research project undertaken by Microsoft, the AQUA project aims to minimise the time it takes to answer a query. The way in which it pursues this goal, however, is different. Rather than optimising the querying procedure using "recycled" results, the AQUA project attempts to reduce or eliminate the need to access the base data at query time.

The way AQUA achieves this is by providing "highly-accurate, approximate answers to queries using small, precomputed synopses of the underlying base data. For SQL queries that traditionally take minutes to answer, Aqua can supply an approximate answer in seconds, providing immediate feedback to the user." http://www.bell-labs.com/project/aqua/
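Answering from a small precomputed synopsis can be illustrated as follows. This is only a sketch: a uniform random sample is assumed as the synopsis, the data is synthetic, and the names base_data, synopsis and approximate_sum are invented for the example rather than taken from the Aqua system.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic "base data": 100,000 order amounts between 1 and 100.
base_data = [rng.randint(1, 100) for _ in range(100_000)]

# Precomputed synopsis: a uniform random sample of 1% of the rows.
SAMPLE_SIZE = 1_000
synopsis = rng.sample(base_data, SAMPLE_SIZE)


def approximate_sum(sample, population_size):
    """Estimate SUM over the base data by scaling up the sample total."""
    return sum(sample) * population_size / len(sample)


estimate = approximate_sum(synopsis, len(base_data))  # touches 1,000 rows
exact = sum(base_data)                                # touches 100,000 rows
relative_error = abs(estimate - exact) / exact
```

The estimate reads a hundredth of the rows, trading a small error for a much faster answer.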

Future Developments

The IBM Almaden research group noted that the web provided the greatest opportunity for data mining technologies to gain a significant market presence. This was because of the "huge collection of data (e.g. Yahoo collection ~50GB every day)" and also "the universal digital distribution medium makes data mining results actionable in fundamentally new ways".

http://www.almaden.ibm.com/cs/quest/papers/kdd99_chasm.ppt

In a tutorial (presented at CIKM'98, ICDE'99 and SIGKDD 99) the following four challenges were raised with regard to Web mining:

  1. "The abundance problem (99% of info of no interest to 99% of people)
  2. Limited coverage of the Web (Internet sources hidden behind search interfaces)
  3. Limited query interface based on keyword-orientated search
  4. Limited customisation to individual users."

http://www.bell-labs.com/project/serendip/Talks/tutorial.ppt (Page 119)

The database technologies that the company develops must address these challenges in order to present an attractive proposition to our potential customers. It is important that we make our product more attractive than our competitors' offerings. The market research firm Dataquest reported the following results for the database industry. "Buoyed by growth of new Internet applications and rising demand for business intelligence applications, the world-wide database software market had a strong year in 1999 with revenue reaching almost $8 billion, an 18 percent increase over 1998 revenue." http://gartner11.gartnerweb.com/dq/static/about/press/pr-b200020.html

The same market research firm also noted "The world-wide database industry is forecast to reach $12.7 billion by 2004. Dataquest analysts said the market will be driven by Internet-related applications, electronic commerce, content management, integrated business intelligence, and new mobile consumer and mobile business applications." http://gartner11.gartnerweb.com/dq/static/about/press/pr-b200020.html
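The two Dataquest figures imply an average annual growth rate, which can be worked out as below. The calculation assumes a starting point of exactly $8 billion, although the press release says "almost $8 billion", so the result is an approximation.

```python
# Figures from the Dataquest press release, in billions of dollars.
revenue_1999 = 8.0     # "almost $8 billion": rounded, so an approximation
forecast_2004 = 12.7
years = 2004 - 1999

# Compound annual growth rate implied by the two figures.
implied_growth = (forecast_2004 / revenue_1999) ** (1 / years) - 1
print(f"Implied annual growth: {implied_growth:.1%}")
```

On these assumptions the forecast implies growth of a little under 10 percent a year, noticeably slower than the 18 percent increase reported for 1999.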

Conclusion

As mentioned earlier, fast querying of federated database systems and efficient data-mining technologies form an area in which the company can make serious advances. The research area that I would propose is the development of a database integration environment that scales from a small organisation tying together its various information sources up to a large company carrying out web-based data mining. This goal focuses on content management and integrated business intelligence, two of the areas that Dataquest expects to drive the database industry to revenues of almost $13 billion.

The environment would also provide quick answers to queries, perhaps using a hybrid of the QP Recycler and AQUA techniques: quick approximate answers combined with an efficient query optimiser. This would enable users to base business decisions on the underlying trends that the mined data reveals.