Keywords: on-line analytical processing, data warehouse, stylometrics, analogies

Abstract: The litOLAP project seeks to apply the Business Intelligence techniques of Data Warehousing and OLAP to the domain of text processing (specifically, computer-aided literary studies). A literary data warehouse is similar to a conventional corpus, but its data is stored and organized in multidimensional cubes, in order to promote efficient end-user queries. An initial implementation exists for litOLAP, and emphasis has been placed on cube-storage methods and on caching intermediate results for reuse. Work continues on improving the query engine, the ETL process, and the user interfaces.
Computer-aided tools can enhance literary studies [11, 12], and the litOLAP system aims to support such studies by adapting the well-developed techniques of data warehousing (DW) and on-line analytical processing (OLAP). These techniques are widely used to support business decision making. DW involves assembling data from various sources, putting it into a common format with consistent semantics, and storing it as multidimensional cubes to enable OLAP or data mining. OLAP techniques enable user-guided browsing of multidimensional and aggregate data. A business DW might integrate a merged company’s inventory and sales transactional data over several decades, and its typical OLAP operations could include “rolling up” (i.e., “zooming out”) from dates-by-week to dates-by-year or imposing range conditions (e.g., “omit 1983–1988 Manitoba data”). OLAP is for domain specialists, not IT professionals: the motto is “disks are cheap, analysts are expensive.” Building a DW over literary data (see Figure 1) differs from creating a corpus or digital library. The Extract-Transform-Load (ETL) stage retains only the information required to build the ordained DW cubes. Thus the DW schema constrains the possible analyses, but anticipated classes of queries can be fast. Dimensional hierarchies control allowable roll-ups, hence they are crucial. Our Word dimension allows roll-up by stem, suffix, and (several layers of) WordNet∗
hypernyms, for instance. (For instance, WordNet hypernyms for “whale” include the category “vertebrate, craniate”.)

∗ Current affiliation: Innovatia Inc., Saint John, New Brunswick. In CaSTA06, Fredericton, Canada, October 2006.

[Figure 1: Literary DW, with existing portions in boldface. Various data sources (a collection of texts from Project Gutenberg, the Amazon.com web service, and Library of Congress summaries) are loaded into warehouse cubes held in ample storage; user-driven queries reach the cubes through a custom query engine and, in the interim, the off-the-shelf Mondrian engine, via a client application (including a web server).]

As an example, our AnalogyCube was inspired by Turney and Littman. It records how many times, when a short sliding window passes over a text, the subsequence word1, joiningWord, word2 is seen. For instance, the count for (whale, beneath, waves) is presumably higher in Moby Dick than in Hamlet. Turney and Littman have mined corpora to discover word-based analogies; via WordNet’s hypernyms, litOLAP supports exploration of category-based analogies. A fuller motivation for litOLAP, our DW’s schema, and some example queries are given elsewhere [5, 6]; this poster primarily reports progress in implementing litOLAP. OLAP using our DW can be contrasted with published text analyses that either required substantial custom programming or had to fit within the narrower range of queries supported by existing programs [13, 4]. We do not believe that the software used for business OLAP can support several of the literary analogy queries planned. Therefore, litOLAP includes a custom engine (see Figure 2) that processes our kinds of queries. End users do not interact directly with the engine, so the custom query language (which supports the usual OLAP operations of slice, dice, and roll-up, as well as some specialized operations) need not be easy to use, but it must be fast.
Figure 2 shows the engine’s major components and technologies. After requesting that the engine open some “base cubes”, the application formulates queries (XML in Java strings), passes them to the engine, and eventually receives the results.

[Figure 2: Custom OLAP engine. The core engine (Java) exposes OLAP cube operations (rollup(), slice(), dice(), glue(), ETL()) over a Lemur library (C++) of storage methods: a B+ tree cube (QDBM based), a rotated B+ tree cube, and an RDBMS cube (libpqxx based). A SWIG interface bridges C++ to Java/Python/O’Caml; a cube and dimension cache and a storage-method chooser sit in between. The engine returns results (a cursor), metadata (XML), and dimension data (XML) to the UIs: UI #1 (Jython), UI #2 (Java), and UI #3 (Web), the last serving remote users over the Internet (AJAX).]

Development thus far has emphasized appropriate cube storage plus the pre-computation and caching of intermediate values. Keith’s thesis describes experiments with the storage methods shown in Figure 2. There was no overall “best storage method”; rather, many factors affected whether a particular operation was fast for a particular cube. For example, the B+ tree sliced efficiently only on the first dimension, while the linear-hashing cube was able to slice equally well on all dimensions. The storage-method chooser evaluates the properties of a cube (size, density, number of dimensions, etc.) and will eventually use predictions of the expected operations’ frequencies. (AnalogyCube is regularly sliced, for instance.) Its goal is to select the most appropriate storage method when a cube is created. Intermediate cubes and dimensions are calculated during query processing. A particular subexpression may have been computed earlier, and its resultant intermediate cube may still be cached, speeding up the query. Deciding which intermediate values to save (and which possibly useful intermediate values to precompute speculatively) is the well-studied materialized-view problem for databases [3, 1]. Automatic rewriting of queries to use cached intermediates is hard and remains to be added to the expression evaluator.
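The reuse of cached intermediates described above can be sketched as a map from the canonical text of a subexpression to its computed cube. The following Python fragment is a minimal, hypothetical illustration (the class and function names are ours, not the engine's), and it sidesteps the hard query-rewriting problem noted above by assuming queries arrive already in canonical form.

```python
class CubeCache:
    """Map canonical subexpression text -> computed cube, with FIFO eviction.
    Hypothetical sketch; the real engine caches C++ cube objects."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self._store = {}   # canonical expression -> cube
        self._order = []   # insertion order, for simple FIFO eviction

    def get_or_compute(self, expr, compute):
        if expr in self._store:
            return self._store[expr]           # reuse cached intermediate
        cube = compute()                       # otherwise evaluate it
        if len(self._store) >= self.capacity:  # evict the oldest entry
            del self._store[self._order.pop(0)]
        self._store[expr] = cube
        self._order.append(expr)
        return cube

cache = CubeCache()
calls = []
def build():
    calls.append(1)
    return {"cells": 42}   # stand-in for an expensively computed cube

c1 = cache.get_or_compute("rollup(slice(Analogy, w1='whale'), year)", build)
c2 = cache.get_or_compute("rollup(slice(Analogy, w1='whale'), year)", build)
print(len(calls))  # 1: the cube was computed only once
```

A production cache would weigh cube size against recomputation cost when evicting, which is exactly where the materialized-view literature [3, 1] applies.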
Currently, the custom engine can answer a variety of “canned” test queries involving AnalogyCube or a PhraseCube (similar to word n-grams), and ETL processes the plain-text files on the Project Gutenberg CD. Near-term enhancements include improving ETL quality, improving the custom engine’s speed, and constructing a useful UI so that we can get end-user feedback. Longer term, the system needs to recognize when switching to specialized data structures (such as suffix arrays) is advantageous. We believe that the overall idea of applying OLAP to literary data is promising. The initial custom engine is too slow for production use, but until more optimization is attempted, the extent of its promise remains unclear.
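As a sketch of why suffix arrays suit substring queries over large texts, the following Python fragment builds a naive suffix array and binary-searches it. It is illustrative only: litOLAP would need a production-quality construction such as Manber and Myers's, and these function names are ours.

```python
def suffix_array(text):
    """Naive O(n^2 log n) suffix array: the starting positions of all
    suffixes, in lexicographic order of the suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def contains(text, sa, pattern):
    """Binary-search the suffix array for a suffix starting with pattern."""
    lo, hi = 0, len(sa)
    while lo < hi:  # lower bound: first suffix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and text[sa[lo]:sa[lo] + len(pattern)] == pattern

text = "the whale beneath the waves"
sa = suffix_array(text)
print(contains(text, sa, "whale"))   # True
print(contains(text, sa, "kraken"))  # False
```

Once the array is built, any substring query costs O(m log n) comparisons, which is the advantage a storage-method switch would buy for phrase-style lookups.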
References

[1] Rada Chirkova, Alon Y. Halevy, and Dan Suciu. A formal perspective on the view selection problem. The VLDB Journal, 11(3):216–237, 2002.
[2] E. F. Codd. Providing OLAP (on-line analytical processing) to user-analysts: an IT mandate. Technical report, E. F. Codd and Associates, 1993.
[3] Alon Y. Halevy. Answering queries using views: a survey. The VLDB Journal, 10(4):270–294, 2001.
[4] Patrick Joula, John Sofko, and Patrick Brennan. A prototype for authorship attribution studies. Literary and Linguistic Computing, 21(2):169–178, 2006.
[5] Steven Keith, Owen Kaser, and Daniel Lemire. Analyzing large collections of electronic text using OLAP. In APICS 2005, October 2005.
[6] Steven Keith, Owen Kaser, and Daniel Lemire. Analyzing large collections of electronic text using OLAP. Technical Report TR-05-001, UNBSJ CSAS, June 2005.
[7] Steven W. Keith. Efficient storage methods for a literary data warehouse. Master’s thesis, UNB, 2006.
[8] Udi Manber and Gene Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948, 1993.
[9] George A. Miller, Christiane Fellbaum, Randee Tengi, Susanne Wolff, Pamela Wakefield, Helen Langone, and Benjamin Haskell. WordNet – a lexical database for the English language. http://wordnet.princeton.edu/.
[10] Project Gutenberg Literary Archive Foundation. gutenberg.org/, 2006.
[11] Stephen Ramsay. Mining Shakespeare. In ACH/ALLC 2005, University of Victoria, June 2005.
[12] Michael Stubbs. Conrad in the computer: examples of quantitative stylistic methods. Language and Literature, 2005.
[13] TAPoR Project. TAPoR prototype of text analysis tools. Online: http://taporware.mcmaster.ca/~taporware/betaTools/index.shtml, 2005. Last accessed 27 June 2006.
[14] P. D. Turney and M. L. Littman. Corpus-based learning of analogies and semantic relations. Machine Learning, 60(1–3):251–278, 2005.