Running head: Holographic Storage
Holographic Storage Devices
Shaun J Whittaker
Strayer University
March 30, 2010
Abstract:
As technology advances in the computer age, developers are always seeking better ways for computer users to work with operating systems conveniently. Holographic storage devices have been on the minds of developers since 1952. Due to developments in technology, holographic storage devices are now on the rise. Holographic storage devices have the same basic design as other storage devices. This paper will focus on multiplexing, pixels, data transfer rates, and wavelengths. These components are like those of other system architectures, but more advanced in holographic devices. The paper will also discuss the benefits and drawbacks of holographic storage devices. The main question is whether consumers and companies are ready for holographic storage devices.
Content of Problem:
Developers have pursued holographic storage devices since the 1960s and early 1970s. Before then, holographic storage existed only in the minds of science fiction writers and scriptwriters. As technology improves over time, holographic storage devices can become a reality for companies and consumers. Are we ready for such a technology? Let us look first at the physical layout of the storage device. Holographic storage uses the light coming from a special electronic binary device known as a spatial light modulator, or SLM (Coufal, Psaltis, & Sincerbox, 2000). In its basic form, a hologram is the photographic record of the spatial interference pattern created by mixing two coherent laser beams. One of the beams usually carries spatial information and is labeled the object beam. The other is distinguished by its particular direction of travel and is labeled the reference beam. As the holographic material becomes thicker, the reconstruction becomes very sensitive to the particular angle of incidence of the reference beam, which allows multiple objects to be recorded. A holographic data storage system is page oriented, with each page a block of data bits that can be spatially impressed onto the object beam.
A holographic data storage system can be constructed to exploit this principle by using a spatial light modulator to properly shape the object beam, an optical beam scanner to point the reference beam, a detector array to convert the reconstructed output object data into an electronic bit stream, and electronics to control the I/O information (Hong & Psaltis, 1996).
The purpose of this paper is to discuss three main system architectures that make up the design of holographic storage devices: multiplexing, pixels, and data transfer rate. These are main components of regular storage devices as well, but in holographic storage devices the architecture is more advanced, with a faster data transfer rate, advanced pixel arrays, and multiplexing.
In practice, the number of holograms that can be stored in and reliably retrieved from a common volume of material is limited to fewer than 10,000, so spatial multiplexing must be used. Multiplexing holograms onto discs is the only way for holograms to work on a disc. Cells are stored and retrieved by angular multiplexing, so the holographic storage disc can rotate (Hong & Psaltis, 1996).
The effective storage density in bits can be increased by using a thick recording layer to record multiple independent pages of data. Each page occupies a different depth in the recording volume. This process is called multiplexing. Retrieval of an individual page with minimum crosstalk from the other pages is a consequence of the volume nature of the recording and its behavior as a highly tuned structure.
Wavelength is also a great part of multiplexing. When the wavelength and angle used during recording and playback are identical, the efficiency of the reconstruction is at a maximum. As the wavelength or readout angle is changed from this condition, the efficiency decreases and eventually becomes zero. As a result, increasing the thickness makes the recording less tolerant of interference and mismatches (Coufal, Psaltis, & Sincerbox, 2000).
Another system architecture that is essential to holographic storage devices is the pixel. Holographic storage devices are based on advanced pixel bits. In order to display an image from the hardware, a pixel has to be developed. When talking about holograms, which are virtual picture displays, it is very important to discuss pixels. “To use volume holography as a storage technology, digital data must be imprinted onto the object beam for recording and then retrieved from the reconstructed object beam during readout. The device for putting data into the system is called a spatial light modulator, discussed earlier in the paper regarding the physical layout. An SLM is a planar array consisting of thousands of pixels. Each pixel is an independent microscopic shutter that can either block or pass light, using liquid crystal or micromirror arrays with 1000 x 800 elements. The pixels in both types of device can be refreshed over 1000 times per second, allowing holographic data storage systems to reach input data rates of 1 gigabit per second. The data are read using an array of detector pixels, such as a CCD camera or a semiconductor sensor. The object beam often passes through a set of lenses that image the SLM pixel pattern onto the output pixel array. To maximize the storage density, the hologram is usually recorded where the object beam is tightly focused.
The readout rate is often dictated by the camera integration time: the reference beam reconstructs a hologram until a sufficient number of photons accumulate to differentiate bright and dark pixels. The goal of the storage device is to allow 1,000 pages of data to be retrieved per second” (Burr, 2000).
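As a rough check of the figures quoted above, the input data rate follows directly from the page size and the refresh rate. The short calculation below is only a back-of-the-envelope sketch; the 1000 x 800 pixel array and 1,000 pages per second are the numbers given in the passage, not measured values.

```python
# Back-of-the-envelope check of the data rates quoted above.
pixels_per_page = 1000 * 800      # SLM array of 1000 x 800 micro-shutters
pages_per_second = 1000           # pixels can be refreshed ~1000 times per second

input_rate_bits = pixels_per_page * pages_per_second
print(f"Input data rate: {input_rate_bits / 1e9:.1f} Gbit/s")   # ~0.8 Gbit/s, on the order of 1 gigabit per second
```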
In most data storage systems designers maximize important figures of merit such as the storage density and data rate by pushing the physical components of the system well beyond the point where the system is error free. Coding and signal processing algorithms are then introduced to reduce the proportion of erroneous bits to acceptable levels.
Burr also describes IBM test platforms such as the PRISM tester and DEMON I, which test the properties of holographic storage components. DEMON I applies coding and signal-processing algorithms in order to suppress the noise. As the paper discusses later, these storage devices have some problems. Noise is one of them: it brings poor imaging into the hologram, which is where multiplexing and crosstalk come into play. Although noise is a problem, there is a tradeoff, according to Burr: “The fundamental tradeoff between the levels of signal and noise is range of storage material. As the number of holograms or the readout rate increases, the amount of power diffracted toward the detector array decreases, reducing the signal-to-noise ratio and increasing the number of incorrect bits” (Burr, 2000).
The final system architecture to be discussed is data transfer rate. The data transfer rate of holographic storage is faster than that of Blu-ray or DVD. This is important to discuss because data transfer rate matters when talking about a new storage device. Developers want to create a device that has the capability to be faster than previous data storage devices. “Holographic storage systems promise discs that have 300 Gbytes of storage capacity. That is 200 times more than a single-sided DVD and 20 times more than a current double-sided Blu-ray disc. Holographic storage has improved durability over magnetic devices such as tape drives, in which the media and drive head are in contact. This proximity can cause problems with both the media and the read heads, and recovering data stored on tape can be problematic if the tape is old or has been used many times. Holography promises incredibly high transfer rates, up to 1 Gbyte per second, 40 times faster than DVD. Holographic systems store and retrieve an entire page of data, about 60,000 bits of information, in one pulse of light. A DVD can only transfer one bit of data per pulse of light” (Adshead, 2007).
Statement of Problem:
As we look into the design and problems of holographic storage devices, one question stands out: are holographic storage devices ready for consumers?
“On the plus side, long-term media stability and reliability is a compelling advantage for deep archiving purposes—discs and tape simply cannot assure reliability out to 50 years.
On the downside, early holographic storage drives will cost $10,000, with media costing about $100 per disc” (Holographic data storage: the next big thing, 2007).
Significance of Study
The paper has looked at some key components of holographic storage devices. The main question of the paper is whether holographic storage devices are ready for consumers and companies. In an article in Communications of the ACM, Orlov states: “Commercialization depends on how long optical storage devices will stay around in the mainstream. Will DVDs require some kind of volumetric storage scheme like holographic devices?” Orlov thinks that holographic storage devices will arise within 10 years, but holographic data storage will require a new and costly manufacturing infrastructure to produce competitively priced products in sufficient quantity and quality to satisfy the storage market, which already has lots of high-performance options (Orlov, 2000).
Holographic memory would be attractive and desirable given the growing commercial interest in data mining, which involves sifting through vast amounts of information in order to find useful relationships. According to Dr. Coufal, this was the cause of IBM’s renewed interest in holographic storage: “It sounds promising. Yet a lot of work still needs to be done before holographic storage devices become fully commercial. Meeting the exacting requirements for aligning lasers, detectors, and spatial light modulators in a low-cost system remains difficult. However, the recent availability of low-cost components, such as solid-state camera chips, from the rapidly maturing optoelectronics industry has provided the devices needed to build holographic memories on a large scale” (IBM, 2003).
The most difficult problem with holographic memory involves the storage materials. Most are inorganic photorefractive crystals such as lithium niobate, barium titanate, or strontium barium niobate, doped with transition metals such as iron or with rare-earth ions. When exposed to light, these materials generate a response to the intensity of light in the interference pattern. However, subsequent illumination, including the illumination required for readout, also gradually erases the charge distribution and thus the data. Charge distributions can be fixed in place by raising the temperature of the crystals, allowing the ions to move as well as the electrons. This creates a read-only memory (Lerner, 199).
This paper’s population will affect developers, information systems professionals, and students interested in advanced topics of system architecture. There are 43.0 million computer science students in graduate and post-doctorate classes that could learn about holographic storage devices. Another population that this study could affect is computer systems designers, of whom there are 93.1 million; they could benefit from the study of holographic storage devices.
This paper focused its research on the components of holographic storage devices. The spatial light modulator (SLM) shapes the laser beam into objects to be recorded; the output is converted into bits, and then the I/O information is stored. Multiplexing is a great part of holographic devices because it records layers into pages, and it controls the crosstalk and noise of the object being reflected. Noise can be bad because it distorts the images, along with crosstalk. Pixels are key because the device needs to reflect a picture. IBM developed DEMON I to test the device and address distortion problems. Data transfer rate is essential because it involves the speed of the device. The capacity is 200 times greater than that of a DVD, making the holographic device a good prospect.
Although holographic devices would be great for storage, they have kinks. The main reason is that the crystal materials that make up the hologram cost a considerable amount of money. Developers proposed that holographic devices would arrive this year, and through my research I found that developers have great hope in the product. But holographic storage devices are not in stores, nor are they in companies. Research shows that it is too expensive to develop such technology. Some say that when optical devices cease, holographic storage will arise. No one knows for sure when the storage device will be on the market. I would predict that within the next ten years the technology of the science fiction writers can become reality.
First-generation CD-ROM discs, for example, held over 600 MB of data, hundreds of times the capacity of a floppy disk, which was the dominant form of removable media at the time. Even hard disks of the time typically had capacities in the tens of megabytes. Optical storage had a huge advantage in the quest for raw capacity. Of course, things changed over time. Even consumer-grade hard disks today can store multiple terabytes of data, and SD cards, USB flash drives, and USB hard drives are now the preferred forms of removable storage. Most computers these days do not even have an optical drive. However, that may eventually change, thanks to holographic storage.
Holographic storage is one of those technologies that has existed seemingly forever, but never really took off. Like other forms of optical media, holographic storage devices use lasers to read and write data. However, the similarities between holographic storage and legacy optical storage technologies such as DVD or Blu-ray end there.
Holographic storage, at least as it currently exists, is a write once read many (WORM) medium. It was intended for the long-term storage of archival data. But it isn’t just the fact that it cannot be overwritten that makes holographic storage so well suited to the task of data archiving. It is also the fact that holographic storage offers tremendous capacity and blazingly fast read speeds--at least in theory.
First-generation devices failed to live up to the technology’s potential. A company called InPhase Technologies, for example, created a removable holographic storage medium that was made commercially available. Although the storage medium itself was only about the size of a DVD-RAM cartridge, the drive used by the medium was both huge and expensive. Worse yet, the holographic storage medium held only about 300 GB of data, and its maximum transfer rate was only about 20 MB per second.
Another company created an optical storage technology that could store 1 TB on a disc the size of a DVD, but the technology never took off.
A newer medium, developed by researchers at the University of Southampton and discussed below, is projected to be the size of a DVD, with a theoretical storage capacity of 360 TB. Furthermore, the medium is designed to avoid bit rot for billions of years. Although the technology has been said to perform well, performance benchmarks have not been made publicly available. Even so, performance is only of secondary concern. The medium’s primary goal is to create high capacity storage that is truly permanent--not necessarily fast.
Inside Holographic Storage
So how does holographic storage work? There have been several different holographic storage devices created over the years, each working in its own unique way. Most of these devices use multiple laser beams (or split beams) to allow data to be written to and read from optical storage three dimensionally.
The University of Southampton researchers have used a completely different approach. While their device does use lasers, it is said to store data in five dimensions. Of course, this does not refer to a literal fifth physical dimension; in physics, the fifth dimension refers to invariant properties of space-time. Instead, researchers at the University of Southampton use the phrase “five dimensions” to refer to the five different characteristics that are used for the storage of data.
The first three dimensions of storage are exactly what you would probably expect them to be. These dimensions are essentially the X, Y, and Z axes, or height, width, and depth. The fourth dimension refers in this case to the physical size of a “data dot” (in physics, the fourth dimension refers to the passage of time). The so-called fifth dimension is the data’s offset, or how the “data dot” is aligned on the media. Each of these properties can be interpreted as a value, thereby contributing to the medium’s massive storage capacity.
But what about the medium’s longevity? The reason why this particular form of holographic storage can be discussed in terms of permanence is because it is essentially made from rock. The five dimensions are made up of nano-structures that have been constructed from quartz crystal.
For right now, this form of holographic storage is not commercially available, but we expect that at some point it will be. In the meantime, the university has been permanently preserving the world’s greatest literary works in holographic storage.
Research Background:
Holographic memory offers the possibility of storing 1 terabyte (TB) of data in a sugar-cube-sized crystal. A terabyte of data equals 1,000 gigabytes, 1 million megabytes, or 1 trillion bytes. Data from more than 1,000 CDs could fit on a holographic memory system. Most computer hard drives only hold 10 to 40 GB of data, a small fraction of what a holographic memory system might hold.
Scientist Pieter J. van Heerden first proposed the idea of holographic (three-dimensional) storage in the early 1960s. A decade later, scientists at RCA Laboratories demonstrated the technology by recording 500 holograms in an iron-doped lithium-niobate crystal, and 550 holograms of high-resolution images in a light-sensitive polymer material. The lack of cheap parts and the advancement of magnetic and semiconductor memories placed the development of holographic data storage on hold.
Content of the Problem
Prototypes developed by Lucent and IBM differ slightly, but most holographic data storage systems (HDSS) are based on the same concept. Here are the basic components that are needed to construct an HDSS:
- Blue-green argon laser
- Beam splitters to split the laser beam
- Mirrors to direct the laser beams
- LCD panel (spatial light modulator)
- Lenses to focus the laser beams
- Lithium-niobate crystal or photopolymer
- Charge-coupled device (CCD) camera
When the blue-green argon laser is fired, a beam splitter creates two beams. One beam, called the object or signal beam, will go straight, bounce off one mirror and travel through a spatial-light modulator (SLM). An SLM is a liquid crystal display (LCD) that shows pages of raw binary data as clear and dark boxes. The information from the page of binary code is carried by the signal beam around to the light-sensitive lithium-niobate crystal. Some systems use a photopolymer in place of the crystal. A second beam, called the reference beam, shoots out the side of the beam splitter and takes a separate path to the crystal. When the two beams meet, the interference pattern that is created stores the data carried by the signal beam in a specific area in the crystal -- the data is stored as a hologram.
An advantage of a holographic memory system is that an entire page of data can be retrieved quickly and at one time. In order to retrieve and reconstruct the holographic page of data stored in the crystal, the reference beam is shined into the crystal at exactly the same angle at which it entered to store that page of data. Each page of data is stored in a different area of the crystal, based on the angle at which the reference beam strikes it. During reconstruction, the beam will be diffracted by the crystal to allow the recreation of the original page that was stored. This reconstructed page is then projected onto the charge-coupled device (CCD) camera, which interprets and forwards the digital information to a computer.
The key component of any holographic data storage system is the angle at which the second reference beam is fired at the crystal to retrieve a page of data. It must match the original reference beam angle exactly. A difference of just a thousandth of a millimeter will result in failure to retrieve that page of data.
- After more than 30 years of research and development, a desktop holographic data storage system (HDSS) is close at hand. Early holographic data storage devices will have capacities of 125 GB and transfer rates of about 40 MB per second. Eventually, these devices could have storage capacities of 1 TB and data rates of more than 1 GB per second -- fast enough to transfer an entire DVD movie in 30 seconds. So why has it taken so long to develop an HDSS, and what is there left to do?
- When the idea of an HDSS was first proposed, the components for constructing such a device were much larger and more expensive. For example, a laser for such a system in the 1960s would have been 6 feet long. Now, with the development of consumer electronics, a laser similar to those used in CD players could be used for the HDSS. LCDs weren't even developed until 1968, and the first ones were very expensive. Today, LCDs are much cheaper and more complex than those developed 30 years ago. Additionally, a CCD sensor wasn't available until the last decade. Almost the entire HDSS device can now be made from off-the-shelf components, which means that it could be mass-produced.
- Although HDSS components are easier to come by today than they were in the 1960s, there are still some technical problems that need to be worked out. For example, if too many pages are stored in one crystal, the strength of each hologram is diminished. If there are too many holograms stored on a crystal, and the reference laser used to retrieve a hologram is not shined at the precise angle, a hologram will pick up a lot of background from the other holograms stored around it. It is also a challenge to align all of these components in a low-cost system.
In Conclusion
10.1 Overview of Mass-Storage Structure
10.1.1 Magnetic Disks
- Traditional magnetic disks have the following basic structure:
- One or more platters in the form of disks covered with magnetic media. Hard disk platters are made of rigid metal, while "floppy" disks are made of more flexible plastic.
- Each platter has two working surfaces. Older hard disk drives would sometimes not use the very top or bottom surface of a stack of platters, as these surfaces were more susceptible to potential damage.
- Each working surface is divided into a number of concentric rings called tracks. The collection of all tracks that are the same distance from the edge of the platter, ( i.e. all tracks immediately above one another in the following diagram ) is called a cylinder.
- Each track is further divided into sectors, traditionally containing 512 bytes of data each, although some modern disks occasionally use larger sector sizes. ( Sectors also include a header and a trailer, including checksum information among other things. Larger sector sizes reduce the fraction of the disk consumed by headers and trailers, but increase internal fragmentation and the amount of disk that must be marked bad in the case of errors. )
- The data on a hard drive is read by read-write heads. The standard configuration ( shown below ) uses one head per surface, each on a separate arm, and controlled by a common arm assembly which moves all heads simultaneously from one cylinder to another. ( Other configurations, including independent read-write heads, may speed up disk access, but involve serious technical difficulties. )
- The storage capacity of a traditional disk drive is equal to the number of heads ( i.e. the number of working surfaces ), times the number of tracks per surface, times the number of sectors per track, times the number of bytes per sector. A particular physical block of data is specified by providing the head-sector-cylinder number at which it is located.
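- The capacity formula above is easy to work through in code. A minimal sketch follows; the geometry numbers are purely illustrative and do not describe any particular drive.
```python
# Capacity = heads (working surfaces) x tracks per surface
#            x sectors per track x bytes per sector.
heads = 16                  # illustrative geometry, not a real drive
tracks_per_surface = 65_536
sectors_per_track = 63
bytes_per_sector = 512

capacity = heads * tracks_per_surface * sectors_per_track * bytes_per_sector
print(f"Capacity: {capacity / 2**30:.1f} GiB")   # ~31.5 GiB for this geometry
```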
Figure 10.1 - Moving-head disk mechanism.
- In operation the disk rotates at high speed, such as 7200 rpm ( 120 revolutions per second. ) The rate at which data can be transferred from the disk to the computer is composed of several steps:
- The positioning time, a.k.a. the seek time or random access time is the time required to move the heads from one cylinder to another, and for the heads to settle down after the move. This is typically the slowest step in the process and the predominant bottleneck to overall transfer rates.
- The rotational latency is the amount of time required for the desired sector to rotate around and come under the read-write head. This can range anywhere from zero to one full revolution, and on the average will equal one-half revolution. This is another physical step and is usually the second slowest step behind seek time. ( For a disk rotating at 7200 rpm, the average rotational latency would be 1/2 revolution / 120 revolutions per second, or just over 4 milliseconds, a long time by computer standards. See the short calculation following this list. )
- The transfer rate, which is the time required to move the data electronically from the disk to the computer. ( Some authors may also use the term transfer rate to refer to the overall transfer rate, including seek time and rotational latency as well as the electronic data transfer rate. )
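- The average rotational latency mentioned above follows directly from the spindle speed. A minimal sketch of the arithmetic, using the 7200 rpm figure from the text:
```python
# Average rotational latency = time for half a revolution.
rpm = 7200
revolutions_per_second = rpm / 60                        # 120 rev/s
time_per_revolution_ms = 1000 / revolutions_per_second   # ~8.33 ms
average_latency_ms = time_per_revolution_ms / 2          # ~4.17 ms

print(f"One revolution:             {time_per_revolution_ms:.2f} ms")
print(f"Average rotational latency: {average_latency_ms:.2f} ms")   # "just over 4 milliseconds"
```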
- Disk heads "fly" over the surface on a very thin cushion of air. If they should accidentally contact the disk, then a head crash occurs, which may or may not permanently damage the disk or even destroy it completely. For this reason it is normal to park the disk heads when turning a computer off, which means to move the heads off the disk or to an area of the disk where there is no data stored.
- Floppy disks are normally removable. Hard drives can also be removable, and some are even hot-swappable, meaning they can be removed while the computer is running, and a new hard drive inserted in their place.
- Disk drives are connected to the computer via a cable known as the I/O Bus. Some of the common interface formats include Enhanced Integrated Drive Electronics, EIDE; Advanced Technology Attachment, ATA; Serial ATA, SATA; Universal Serial Bus, USB; Fibre Channel, FC; and Small Computer Systems Interface, SCSI.
- The host controller is at the computer end of the I/O bus, and the disk controller is built into the disk itself. The CPU issues commands to the host controller via I/O ports. Data is transferred between the magnetic surface and onboard cache by the disk controller, and then the data is transferred from that cache to the host controller and the motherboard memory at electronic speeds.
10.1.2 Solid-State Disks - New
- As technologies improve and economics change, old technologies are often used in different ways. One example of this is the increasing use of solid state disks, or SSDs.
- SSDs use memory technology as a small fast hard disk. Specific implementations may use either flash memory or DRAM chips protected by a battery to sustain the information through power cycles.
- Because SSDs have no moving parts they are much faster than traditional hard drives, and certain problems such as the scheduling of disk accesses simply do not apply.
- However SSDs also have their weaknesses: They are more expensive than hard drives, generally not as large, and may have shorter life spans.
- SSDs are especially useful as a high-speed cache of hard-disk information that must be accessed quickly. One example is to store filesystem meta-data, e.g. directory and inode information, that must be accessed quickly and often. Another variation is a boot disk containing the OS and some application executables, but no vital user data. SSDs are also used in laptops to make them smaller, faster, and lighter.
- Because SSDs are so much faster than traditional hard disks, the throughput of the bus can become a limiting factor, causing some SSDs to be connected directly to the system PCI bus for example.
10.1.3 Magnetic Tapes - was 12.1.2
- Magnetic tapes were once used for common secondary storage before the days of hard disk drives, but today are used primarily for backups.
- Accessing a particular spot on a magnetic tape can be slow, but once reading or writing commences, access speeds are comparable to disk drives.
- Capacities of tape drives can range from 20 to 200 GB, and compression can double that capacity.
10.2 Disk Structure
- The traditional head-sector-cylinder, HSC numbers are mapped to linear block addresses by numbering the first sector on the first head on the outermost track as sector 0. Numbering proceeds with the rest of the sectors on that same track, and then the rest of the tracks on the same cylinder before proceeding through the rest of the cylinders to the center of the disk. ( A short sketch of this mapping follows the list below. ) In modern practice these linear block addresses are used in place of the HSC numbers for a variety of reasons:
- The linear length of tracks near the outer edge of the disk is much longer than for those tracks located near the center, and therefore it is possible to squeeze many more sectors onto outer tracks than onto inner ones.
- All disks have some bad sectors, and therefore disks maintain a few spare sectors that can be used in place of the bad ones. The mapping of spare sectors to bad sectors is managed internally to the disk controller.
- Modern hard drives can have thousands of cylinders, and hundreds of sectors per track on their outermost tracks. These numbers exceed the range of HSC numbers for many ( older ) operating systems, and therefore disks can be configured for any convenient combination of HSC values that falls within the total number of sectors physically on the drive.
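- A minimal sketch of the mapping described above, assuming an idealized geometry in which every track holds the same number of sectors. ( Real drives use zoned recording and sector remapping, so the controller performs the real translation internally. )
```python
# Idealized CHS -> linear block address mapping: sector 0 is the first sector of
# the first track of the outermost cylinder; numbering continues through the rest
# of that track, then the other tracks in the cylinder, then the next cylinder.
def chs_to_lba(cylinder, head, sector, heads_per_cylinder, sectors_per_track):
    return (cylinder * heads_per_cylinder + head) * sectors_per_track + sector

# Illustrative geometry only.
print(chs_to_lba(0, 0, 0, heads_per_cylinder=16, sectors_per_track=63))   # 0
print(chs_to_lba(2, 3, 5, heads_per_cylinder=16, sectors_per_track=63))   # 2210
```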
- There is a limit to how closely packed individual bits can be placed on a physical medium, but that limit keeps increasing as technological advances are made.
- Modern disks pack many more sectors into outer cylinders than inner ones, using one of two approaches:
- With Constant Linear Velocity, CLV, the density of bits is uniform from cylinder to cylinder. Because there are more sectors in outer cylinders, the disk spins slower when reading those cylinders, causing the rate of bits passing under the read-write head to remain constant. This is the approach used by modern CDs and DVDs.
- With Constant Angular Velocity, CAV, the disk rotates at a constant angular speed, with the bit density decreasing on outer cylinders. ( These disks would have a constant number of sectors per track on all cylinders. )
10.3 Disk Attachment
Disk drives can be attached either directly to a particular host ( a local disk ) or to a network.
10.3.1 Host-Attached Storage
- Local disks are accessed through I/O Ports as described earlier.
- The most common interfaces are IDE or ATA, each of which allows up to two drives per host controller.
- SATA is similar with simpler cabling.
- High-end workstations or other systems in need of a larger number of disks typically use SCSI disks:
- The SCSI standard supports up to 16 targets on each SCSI bus, one of which is generally the host adapter and the other 15 of which can be disk or tape drives.
- A SCSI target is usually a single drive, but the standard also supports up to 8 units within each target. These would generally be used for accessing individual disks within a RAID array. ( See below. )
- The SCSI standard also supports multiple host adapters in a single computer, i.e. multiple SCSI busses.
- Modern advancements in SCSI include "fast" and "wide" versions, as well as SCSI-2.
- SCSI cables may be either 50 or 68 conductors. SCSI devices may be external as well as internal.
- See wikipedia for more information on the SCSI interface.
- FC is a high-speed serial architecture that can operate over optical fiber or four-conductor copper wires, and has two variants:
- A large switched fabric having a 24-bit address space. This variant allows for multiple devices and multiple hosts to interconnect, forming the basis for the storage-area networks, SANs, to be discussed in a future section.
- The arbitrated loop, FC-AL, that can address up to 126 devices ( drives and controllers. )
10.3.2 Network-Attached Storage
- Network attached storage connects storage devices to computers using a remote procedure call, RPC, interface, typically with something like NFS filesystem mounts. This is convenient for allowing several computers in a group common access and naming conventions for shared storage.
- NAS can be implemented using SCSI cabling, or with iSCSI, which uses Internet protocols and standard network connections, allowing long-distance remote access to shared files.
- NAS allows computers to easily share data storage, but tends to be less efficient than standard host-attached storage.
Figure 10.2 - Network-attached storage.
10.3.3 Storage-Area Network
- A Storage-Area Network, SAN, connects computers and storage devices in a network, using storage protocols instead of network protocols.
- One advantage of this is that storage access does not tie up regular networking bandwidth.
- SAN is very flexible and dynamic, allowing hosts and devices to attach and detach on the fly.
- SAN is also controllable, allowing restricted access to certain hosts and devices.
Figure 10.3 - Storage-area network.
10.4 Disk Scheduling
- As mentioned earlier, disk transfer speeds are limited primarily by seek times and rotational latency. When multiple requests are to be processed there is also some inherent delay in waiting for other requests to be processed.
- Bandwidth is measured by the amount of data transferred divided by the total amount of time from the first request being made to the last transfer being completed, ( for a series of disk requests. )
- Both bandwidth and access time can be improved by processing requests in a good order.
- Disk requests include the disk address, memory address, number of sectors to transfer, and whether the request is for reading or writing.
10.4.1 FCFS Scheduling
- First-Come First-Serve is simple and intrinsically fair, but not very efficient. Consider in the following sequence the wild swing from cylinder 122 to 14 and then back to 124:
Figure 10.4 - FCFS disk scheduling.
10.4.2 SSTF Scheduling
- Shortest Seek Time First scheduling is more efficient, but may lead to starvation if a constant stream of requests arrives for the same general area of the disk.
- SSTF reduces the total head movement to 236 cylinders, down from 640 required for the same set of requests under FCFS. Note, however, that the distance could be reduced still further to 208 by starting with 37 and then 14 first before processing the rest of the requests. ( A short sketch reproducing these totals follows Figure 10.5. )
Figure 10.5 - SSTF disk scheduling.
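- The head-movement totals quoted above ( 640 cylinders for FCFS, 236 for SSTF ) can be reproduced with a few lines of code. The sketch below assumes the request queue 98, 183, 37, 122, 14, 124, 65, 67 with the head starting at cylinder 53, which appears to be the example the figures are based on.
```python
# Total head movement for FCFS and SSTF on the same request queue.
def fcfs(start, requests):
    moves, pos = 0, start
    for r in requests:
        moves += abs(r - pos)
        pos = r
    return moves

def sstf(start, requests):
    moves, pos, pending = 0, start, list(requests)
    while pending:
        nearest = min(pending, key=lambda r: abs(r - pos))   # closest pending cylinder
        moves += abs(nearest - pos)
        pos = nearest
        pending.remove(nearest)
    return moves

queue = [98, 183, 37, 122, 14, 124, 65, 67]   # assumed example queue, head at 53
print("FCFS:", fcfs(53, queue))   # 640 cylinders
print("SSTF:", sstf(53, queue))   # 236 cylinders
```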
10.4.3 SCAN Scheduling
- The SCAN algorithm, a.k.a. the elevator algorithm moves back and forth from one end of the disk to the other, similarly to an elevator processing requests in a tall building.
Figure 10.6 - SCAN disk scheduling.
- Under the SCAN algorithm, if a request arrives just ahead of the moving head then it will be processed right away, but if it arrives just after the head has passed, then it will have to wait for the head to pass going the other way on the return trip. This leads to a fairly wide variation in access times which can be improved upon.
- Consider, for example, when the head reaches the high end of the disk: Requests with high cylinder numbers just missed the passing head, which means they are all fairly recent requests, whereas requests with low numbers may have been waiting for a much longer time. Making the return scan from high to low then ends up accessing recent requests first and making older requests wait that much longer.
10.4.4 C-SCAN Scheduling
- The Circular-SCAN algorithm improves upon SCAN by treating all requests in a circular queue fashion - Once the head reaches the end of the disk, it returns to the other end without processing any requests, and then starts again from the beginning of the disk:
Figure 10.7 - C-SCAN disk scheduling.
10.4.5 LOOK Scheduling
- LOOK scheduling improves upon SCAN by looking ahead at the queue of pending requests, and not moving the heads any farther towards the end of the disk than is necessary. The following diagram illustrates the circular form of LOOK:
Figure 10.8 - C-LOOK disk scheduling.
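- A minimal sketch of the C-LOOK ordering on the same assumed request queue: serve every pending request at or above the head position in increasing order, then jump back to the lowest pending request and sweep upward again.
```python
# C-LOOK service order: sweep upward from the head, then wrap to the lowest
# pending request and continue upward (no wasted travel to the ends of the disk).
def c_look_order(start, requests):
    upper = sorted(r for r in requests if r >= start)
    lower = sorted(r for r in requests if r < start)
    return upper + lower

queue = [98, 183, 37, 122, 14, 124, 65, 67]      # same assumed queue, head at 53
print(c_look_order(53, queue))
# [65, 67, 98, 122, 124, 183, 14, 37]
```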
10.4.6 Selection of a Disk-Scheduling Algorithm
- With very low loads all algorithms are equal, since there will normally only be one request to process at a time.
- For slightly larger loads, SSTF offers better performance than FCFS, but may lead to starvation when loads become heavy enough.
- For busier systems, SCAN and LOOK algorithms eliminate starvation problems.
- The actual optimal algorithm may be something even more complex than those discussed here, but the incremental improvements are generally not worth the additional overhead.
- Some improvement to overall filesystem access times can be made by intelligent placement of directory and/or inode information. If those structures are placed in the middle of the disk instead of at the beginning of the disk, then the maximum distance from those structures to data blocks is reduced to only one-half of the disk size. If those structures can be further distributed and furthermore have their data blocks stored as close as possible to the corresponding directory structures, then that reduces still further the overall time to find the disk block numbers and then access the corresponding data blocks.
- On modern disks the rotational latency can be almost as significant as the seek time. However, it is not within the OS's control to account for that, because modern disks do not reveal their internal sector mapping schemes, ( particularly when bad blocks have been remapped to spare sectors. )
- Some disk manufacturers provide for disk scheduling algorithms directly on their disk controllers, ( which do know the actual geometry of the disk as well as any remapping ), so that if a series of requests are sent from the computer to the controller then those requests can be processed in an optimal order.
- Unfortunately there are some considerations that the OS must take into account that are beyond the abilities of the on-board disk-scheduling algorithms, such as priorities of some requests over others, or the need to process certain requests in a particular order. For this reason OSes may elect to spoon-feed requests to the disk controller one at a time in certain situations.
10.5 Disk Management
10.5.1 Disk Formatting
- Before a disk can be used, it has to be low-level formatted, which means laying down all of the headers and trailers marking the beginning and ends of each sector. Included in the header and trailer are the linear sector numbers, and error-correcting codes, ECC, which allow damaged sectors to not only be detected, but in many cases for the damaged data to be recovered ( depending on the extent of the damage. ) Sector sizes are traditionally 512 bytes, but may be larger, particularly in larger drives.
- ECC calculation is performed with every disk read or write, and if damage is detected but the data is recoverable, then a soft error has occurred. Soft errors are generally handled by the on-board disk controller, and never seen by the OS. ( See below. )
- Once the disk is low-level formatted, the next step is to partition the drive into one or more separate partitions. This step must be completed even if the disk is to be used as a single large partition, so that the partition table can be written to the beginning of the disk.
- After partitioning, then the filesystems must be logically formatted, which involves laying down the master directory information ( FAT table or inode structure ), initializing free lists, and creating at least the root directory of the filesystem. ( Disk partitions which are to be used as raw devices are not logically formatted. This saves the overhead and disk space of the filesystem structure, but requires that the application program manage its own disk storage requirements. )
10.5.2 Boot Block
- Computer ROM contains a bootstrap program ( OS independent ) with just enough code to find the first sector on the first hard drive on the first controller, load that sector into memory, and transfer control over to it. ( The ROM bootstrap program may look in floppy and/or CD drives before accessing the hard drive, and is smart enough to recognize whether it has found valid boot code or not. )
- The first sector on the hard drive is known as the Master Boot Record, MBR, and contains a very small amount of code in addition to the partition table. The partition table documents how the disk is partitioned into logical disks, and indicates specifically which partition is the active or boot partition.
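- The MBR layout described above ( boot code, a four-entry partition table, and a two-byte signature at the end of the 512-byte sector ) can be parsed with a few lines of code. This is a simplified sketch for classic MBR disks only; GPT disks and extended partitions are not handled.
```python
import struct

# Parse the partition table out of a 512-byte Master Boot Record.
# Layout: 446 bytes of boot code, four 16-byte partition entries, 0x55 0xAA signature.
def parse_mbr(sector):
    assert len(sector) == 512 and sector[510:512] == b"\x55\xaa", "not a valid MBR"
    partitions = []
    for i in range(4):
        entry = sector[446 + 16 * i : 446 + 16 * (i + 1)]
        lba_start, num_sectors = struct.unpack("<II", entry[8:16])
        if num_sectors:                            # skip empty table slots
            partitions.append({
                "bootable": entry[0] == 0x80,      # 0x80 marks the active/boot partition
                "type": entry[4],                  # partition type code
                "lba_start": lba_start,
                "sectors": num_sectors,
            })
    return partitions

# Usage sketch ( requires read access to the raw device ):
# with open("/dev/sda", "rb") as disk:
#     print(parse_mbr(disk.read(512)))
```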
- The boot program then looks to the active partition to find an operating system, possibly loading up a slightly larger / more advanced boot program along the way.
- In a dual-boot ( or larger multi-boot ) system, the user may be given a choice of which operating system to boot, with a default action to be taken in the event of no response within some time frame.
- Once the kernel is found by the boot program, it is loaded into memory and then control is transferred over to the OS. The kernel will normally continue the boot process by initializing all important kernel data structures, launching important system services ( e.g. network daemons, sched, init, etc. ), and finally providing one or more login prompts. Boot options at this stage may include single-user a.k.a. maintenance or safe modes, in which very few system services are started - These modes are designed for system administrators to repair problems or otherwise maintain the system.
Figure 10.9 - Booting from disk in Windows 2000.
10.5.3 Bad Blocks
- No disk can be manufactured to 100% perfection, and all physical objects wear out over time. For these reasons all disks are shipped with a few bad blocks, and additional blocks can be expected to go bad slowly over time. If a large number of blocks go bad then the entire disk will need to be replaced, but a few here and there can be handled through other means.
- In the old days, bad blocks had to be checked for manually. Formatting of the disk or running certain disk-analysis tools would identify bad blocks, and attempt to read the data off of them one last time through repeated tries. Then the bad blocks would be mapped out and taken out of future service. Sometimes the data could be recovered, and sometimes it was lost forever. ( Disk analysis tools could be either destructive or non-destructive. )
- Modern disk controllers make much better use of the error-correcting codes, so that bad blocks can be detected earlier and the data usually recovered. ( Recall that blocks are tested with every write as well as with every read, so often errors can be detected before the write operation is complete, and the data simply written to a different sector instead. )
- Note that re-mapping of sectors from their normal linear progression can throw off the disk scheduling optimization of the OS, especially if the replacement sector is physically far away from the sector it is replacing. For this reason most disks normally keep a few spare sectors on each cylinder, as well as at least one spare cylinder. Whenever possible a bad sector will be mapped to another sector on the same cylinder, or at least a cylinder as close as possible. Sector slipping may also be performed, in which all sectors between the bad sector and the replacement sector are moved down by one, so that the linear progression of sector numbers can be maintained.
- If the data on a bad block cannot be recovered, then a hard error has occurred, which requires replacing the file(s) from backups, or rebuilding them from scratch.
10.6 Swap-Space Management
- Modern systems typically swap out pages as needed, rather than swapping out entire processes. Hence the swapping system is part of the virtual memory management system.
- Managing swap space is obviously an important task for modern OSes.
10.6.1 Swap-Space Use
- The amount of swap space needed by an OS varies greatly according to how it is used. Some systems require an amount equal to physical RAM; some want a multiple of that; some want an amount equal to the amount by which virtual memory exceeds physical RAM, and some systems use little or none at all!
- Some systems support multiple swap spaces on separate disks in order to speed up the virtual memory system.
10.6.2 Swap-Space Location
Swap space can be physically located in one of two locations:
- As a large file which is part of the regular filesystem. This is easy to implement, but inefficient. Not only must the swap space be accessed through the directory system, the file is also subject to fragmentation issues. Caching the block location helps in finding the physical blocks, but that is not a complete fix.
- As a raw partition, possibly on a separate or little-used disk. This allows the OS more control over swap space management, which is usually faster and more efficient. Fragmentation of swap space is generally not a big issue, as the space is re-initialized every time the system is rebooted. The downside of keeping swap space on a raw partition is that it can only be grown by repartitioning the hard drive.
10.6.3 Swap-Space Management: An Example
- Historically OSes swapped out entire processes as needed. Modern systems swap out only individual pages, and only as needed. ( For example process code blocks and other blocks that have not been changed since they were originally loaded are normally just freed from the virtual memory system rather than copying them to swap space, because it is faster to go find them again in the filesystem and read them back in from there than to write them out to swap space and then read them back. )
- In the mapping system shown below for Linux systems, a map of swap space is kept in memory, where each entry corresponds to a 4K block in the swap space. Zeros indicate free slots and non-zeros refer to how many processes have a mapping to that particular block ( >1 for shared pages only. )
Figure 10.10 - The data structures for swapping on Linux systems.
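- A toy sketch of the counting-array idea shown in the figure: one counter per 4 KB swap block, zero meaning free, and counts greater than one occurring only for shared pages. The names and structure here are illustrative, not the actual Linux implementation.
```python
# Toy model of a Linux-style swap map: one reference counter per 4 KB swap block.
swap_map = [0] * 1024              # 1024 blocks => 4 MB of swap in this toy example

def swap_alloc():
    """Find a free block ( counter == 0 ) and give it one reference."""
    for i, count in enumerate(swap_map):
        if count == 0:
            swap_map[i] = 1
            return i
    raise MemoryError("swap space exhausted")

def swap_dup(block):               # another process maps the same shared page
    swap_map[block] += 1

def swap_free(block):              # drop one reference; block is free again at 0
    swap_map[block] -= 1

slot = swap_alloc()
swap_dup(slot)                     # shared by two processes
print(slot, swap_map[slot])        # 0 2
```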
10.7 RAID Structure
- The general idea behind RAID is to employ a group of hard drives together with some form of duplication, either to increase reliability or to speed up operations, ( or sometimes both. )
- RAID originally stood for Redundant Array of Inexpensive Disks, and was designed to use a bunch of cheap small disks in place of one or two larger more expensive ones. Today RAID systems employ large, possibly expensive disks as their components, switching the definition to Independent disks.
10.7.1 Improvement of Reliability via Redundancy
- The more disks a system has, the greater the likelihood that one of them will go bad at any given time. Hence increasing disks on a system actually decreases the Mean Time To Failure, MTTF of the system.
- If, however, the same data was copied onto multiple disks, then the data would not be lost unless both ( or all ) copies of the data were damaged simultaneously, which is a MUCH lower probability than for a single disk going bad. More specifically, the second disk would have to go bad before the first disk was repaired, which brings the Mean Time To Repair into play. For example if two disks were involved, each with a MTTF of 100,000 hours and a MTTR of 10 hours, then the Mean Time to Data Loss would be 500 * 10^6 hours, or 57,000 years!
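- The 500 * 10^6 hour figure above comes from the standard approximation for a mirrored pair, mean time to data loss = MTTF^2 / ( 2 x MTTR ). A quick check of the arithmetic:
```python
# Mean time to data loss for a two-disk mirror (standard approximation).
mttf_hours = 100_000     # mean time to failure of one disk
mttr_hours = 10          # mean time to repair / replace a failed disk

mttdl_hours = mttf_hours ** 2 / (2 * mttr_hours)
print(f"{mttdl_hours:.0f} hours")                  # 500000000 hours = 500 * 10^6
print(f"{mttdl_hours / (24 * 365):.0f} years")     # roughly 57,000 years
```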
- This is the basic idea behind disk mirroring, in which a system contains identical data on two or more disks.
- Note that a power failure during a write operation could cause both disks to contain corrupt data, if both disks were writing simultaneously at the time of the power failure. One solution is to write to the two disks in series, so that they will not both become corrupted ( at least not in the same way ) by a power failure. An alternative solution involves non-volatile RAM as a write cache, which is not lost in the event of a power failure and which is protected by error-correcting codes.
10.7.2 Improvement in Performance via Parallelism
- There is also a performance benefit to mirroring, particularly with respect to reads. Since every block of data is duplicated on multiple disks, read operations can be satisfied from any available copy, and multiple disks can be reading different data blocks simultaneously in parallel. ( Writes could possibly be sped up as well through careful scheduling algorithms, but it would be complicated in practice. )
- Another way of improving disk access time is with striping, which basically means spreading data out across multiple disks that can be accessed simultaneously.
- With bit-level striping the bits of each byte are striped across multiple disks. For example if 8 disks were involved, then each 8-bit byte would be read in parallel by 8 heads on separate disks. A single disk read would access 8 * 512 bytes = 4K worth of data in the time normally required to read 512 bytes. Similarly if 4 disks were involved, then two bits of each byte could be stored on each disk, for 2K worth of disk access per read or write operation.
- Block-level striping spreads a filesystem across multiple disks on a block-by-block basis, so if block N were located on disk 0, then block N + 1 would be on disk 1, and so on. This is particularly useful when filesystems are accessed in clusters of physical blocks. Other striping possibilities exist, with block-level striping being the most common.
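- The block-level striping rule described above ( block N on disk N mod the number of disks ) is a one-line calculation. A minimal sketch, with the disk count chosen arbitrarily for the example:
```python
# Block-level striping: block n lives on disk (n mod num_disks),
# at stripe row (n // num_disks) on that disk.
def place_block(n, num_disks):
    return {"disk": n % num_disks, "stripe": n // num_disks}

for n in range(6):                     # 4-disk array, purely illustrative
    print(n, place_block(n, num_disks=4))
# block 0 -> disk 0, block 1 -> disk 1, ..., block 4 -> disk 0 again
```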
10.7.3 RAID Levels
- Mirroring provides reliability but is expensive; Striping improves performance, but does not improve reliability. Accordingly there are a number of different schemes that combine the principles of mirroring and striping in different ways, in order to balance reliability versus performance versus cost. These are described by different RAID levels, as follows: ( In the diagram that follows, "C" indicates a copy, and "P" indicates parity, i.e. checksum bits. )
- Raid Level 0 - This level includes striping only, with no mirroring.
- Raid Level 1 - This level includes mirroring only, no striping.
- Raid Level 2 - This level stores error-correcting codes on additional disks, allowing for any damaged data to be reconstructed by subtraction from the remaining undamaged data. Note that this scheme requires only three extra disks to protect 4 disks worth of data, as opposed to full mirroring. ( The number of disks required is a function of the error-correcting algorithms, and the means by which the particular bad bit(s) is(are) identified. )
- Raid Level 3 - This level is similar to level 2, except that it takes advantage of the fact that each disk is still doing its own error-detection, so that when an error occurs, there is no question about which disk in the array has the bad data. As a result a single parity bit is all that is needed to recover the lost data from an array of disks. Level 3 also includes striping, which improves performance. The downside with the parity approach is that every disk must take part in every disk access, and the parity bits must be constantly calculated and checked, reducing performance. Hardware-level parity calculations and NVRAM cache can help with both of those issues. In practice level 3 is greatly preferred over level 2.
- Raid Level 4 - This level is similar to level 3, employing block-level striping instead of bit-level striping. The benefits are that multiple blocks can be read independently, and changes to a block only require writing two blocks ( data and parity ) rather than involving all disks. Note that new disks can be added seamlessly to the system provided they are initialized to all zeros, as this does not affect the parity results.
- Raid Level 5 - This level is similar to level 4, except the parity blocks are distributed over all disks, thereby more evenly balancing the load on the system. For any given block on the disk(s), one of the disks will hold the parity information for that block and the other N-1 disks will hold the data. Note that the same disk cannot hold both data and parity for the same block, as both would be lost in the event of a disk crash.
- Raid Level 6 - This level extends raid level 5 by storing multiple bits of error-recovery codes, ( such as the Reed-Solomon codes ), for each bit position of data, rather than a single parity bit. In the example shown below 2 bits of ECC are stored for every 4 bits of data, allowing data recovery in the face of up to two simultaneous disk failures. Note that this still involves only a 50% increase in storage needs, as opposed to 100% for simple mirroring which could only tolerate a single disk failure. ( A short sketch of the simpler single-parity reconstruction used by levels 3 through 5 follows Figure 10.11. )
Figure 10.11 - RAID levels.
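- The single-parity idea behind levels 3 through 5 is just bytewise XOR: the parity block is the XOR of the data blocks, so any one lost block can be rebuilt by XOR-ing everything that survives. A minimal sketch, with block contents made up for the example:
```python
# RAID 3/4/5-style single parity: parity = XOR of all data blocks,
# so any ONE missing block can be rebuilt from the survivors.
def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]             # three data blocks (illustrative)
parity = xor_blocks(data)

# Suppose the disk holding data[1] fails: rebuild it from the survivors + parity.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])                      # True
```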
- There are also two RAID levels which combine RAID levels 0 and 1 ( striping and mirroring ) in different combinations, designed to provide both performance and reliability at the expense of increased cost.
- RAID level 0 + 1 disks are first striped, and then the striped disks mirrored to another set. This level generally provides better performance than RAID level 5.
- RAID level 1 + 0 mirrors disks in pairs, and then stripes the mirrored pairs. The storage capacity, performance, etc. are all the same, but there is an advantage to this approach in the event of multiple disk failures, as illustrated below:
- In diagram (a) below, the 8 disks have been divided into two sets of four, each of which is striped, and then one stripe set is used to mirror the other set.
- If a single disk fails, it wipes out the entire stripe set, but the system can keep on functioning using the remaining set.
- However if a second disk from the other stripe set now fails, then the entire system is lost, as a result of two disk failures.
- In diagram (b), the same 8 disks are divided into four sets of two, each of which is mirrored, and then the file system is striped across the four sets of mirrored disks.
- If a single disk fails, then that mirror set is reduced to a single disk, but the system rolls on, and the other three mirror sets continue mirroring.
- Now if a second disk fails, ( that is not the mirror of the already failed disk ), then another one of the mirror sets is reduced to a single disk, but the system can continue without data loss.
- In fact the second arrangement could handle as many as four simultaneously failed disks, as long as no two of them were from the same mirror pair.
Figure 10.12 - RAID 0 + 1 and 1 + 0
10.7.4 Selecting a RAID Level
- Trade-offs in selecting the optimal RAID level for a particular application include cost, volume of data, need for reliability, need for performance, and rebuild time, the latter of which can affect the likelihood that a second disk will fail while the first failed disk is being rebuilt.
- Other decisions include how many disks are involved in a RAID set and how many disks to protect with a single parity bit. More disks in the set increases performance but increases cost. Protecting more disks per parity bit saves cost, but increases the likelihood that a second disk will fail before the first bad disk is repaired.
10.7.5 Extensions
- RAID concepts have been extended to tape drives ( e.g. striping tapes for faster backups or parity checking tapes for reliability ), and for broadcasting of data.
10.7.6 Problems with RAID
- RAID protects against physical errors, but not against any number of bugs or other errors that could write erroneous data.
- ZFS adds an extra level of protection by including data block checksums in all inodes along with the pointers to the data blocks. If data are mirrored and one copy has the correct checksum and the other does not, then the data with the bad checksum will be replaced with a copy of the data with the good checksum. This increases reliability greatly over RAID alone, at a cost of a performance hit that is acceptable because ZFS is so fast to begin with.
Figure 10.13 - ZFS checksums all metadata and data.
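- A minimal sketch of the self-healing read described above, using a plain SHA-256 digest as a stand-in for the checksum that ZFS keeps alongside each block pointer; the two mirror copies here are just byte strings.
```python
import hashlib

# Self-healing read from a two-way mirror: the stored checksum decides which
# copy is good, and the bad copy is rewritten from the good one.
def self_healing_read(copies, stored_checksum):
    good = next((c for c in copies
                 if hashlib.sha256(c).hexdigest() == stored_checksum), None)
    if good is None:
        raise IOError("both copies corrupt")
    for i, c in enumerate(copies):
        if c != good:
            copies[i] = good                   # repair the bad mirror copy
    return good

block = b"important data"
checksum = hashlib.sha256(block).hexdigest()   # stored with the block pointer
mirror = [block, b"bit-rotted junk"]           # second copy has gone bad
print(self_healing_read(mirror, checksum) == block, mirror[1] == block)   # True True
```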
- Another problem with traditional filesystems is that the sizes are fixed, and relatively difficult to change. Where RAID sets are involved it becomes even harder to adjust filesystem sizes, because a filesystem cannot span multiple volumes.
- ZFS solves these problems by pooling RAID sets, and by dynamically allocating space to filesystems as needed. Filesystem sizes can be limited by quotas, and space can also be reserved to guarantee that a filesystem will be able to grow later, but these parameters can be changed at any time by the filesystem's owner. Otherwise filesystems grow and shrink dynamically as needed.
Figure 10.14 - (a) Traditional volumes and file systems. (b) a ZFS pool and file systems.
10.8 Stable-Storage Implementation ( Optional )
- The concept of stable storage ( first presented in chapter 6 ) involves a storage medium in which data is never lost, even in the face of equipment failure in the middle of a write operation.
- To implement this requires two ( or more ) copies of the data, with separate failure modes.
- An attempted disk write results in one of three possible outcomes:
- The data is successfully and completely written.
- The data is partially written, but not completely. The last block written may be garbled.
- No writing takes place at all.
- Whenever an equipment failure occurs during a write, the system must detect it, and return the system back to a consistent state. To do this requires two physical blocks for every logical block, and the following procedure:
- Write the data to the first physical block.
- After step 1 has completed, then write the data to the second physical block.
- Declare the operation complete only after both physical writes have completed successfully.
- During recovery the pair of blocks is examined.
- If both blocks are identical and there is no sign of damage, then no further action is necessary.
- If one block contains a detectable error but the other does not, then the damaged block is replaced with the good copy. ( This will either undo the operation or complete the operation, depending on which block is damaged and which is undamaged. )
- If neither block shows damage but the data in the blocks differ, then replace the data in the first block with the data in the second block. ( Undo the operation. )
- Because the sequence of operations described above is slow, stable storage usually includes NVRAM as a cache, and declares a write operation complete once it has been written to the NVRAM.
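- The two-block protocol above can be sketched as follows ( illustrative Python, with the two physical copies held in a list rather than on real disks, and the detectable-damage case omitted for brevity ):

    class StableBlock:
        """One logical block backed by two physical copies with separate failure modes."""

        def __init__(self):
            self.phys = [b"", b""]          # physical copies 1 and 2

        def write(self, data):
            self.phys[0] = data             # step 1: write the first physical block
            # a crash here leaves the copies inconsistent until recovery runs
            self.phys[1] = data             # step 2: write the second physical block
            # only now is the logical write declared complete

        def recover(self):
            a, b = self.phys
            if a == b:
                return                      # identical and undamaged: nothing to do
            # Both copies readable but different: the second write never happened,
            # so copy block 2 over block 1 to undo the interrupted operation.
            self.phys[0] = b

    blk = StableBlock()
    blk.write(b"old value")
    blk.phys[0] = b"new value"              # simulate a crash after step 1 only
    blk.recover()
    assert blk.phys == [b"old value", b"old value"]   # the half-done write was undone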
10.9 Summary
Was 12.9 Tertiary-Storage Structure - Optional, Omitted from Ninth Edition
- Primary storage refers to computer memory chips; secondary storage refers to fixed-disk storage systems ( hard drives ); and tertiary storage refers to removable media, such as tape drives, CDs, DVDs, and to a lesser extent floppies, thumb drives, and other detachable devices.
- Tertiary storage is typically characterized by large capacity, low cost per MB, and slow access times, although there are exceptions in any of these categories.
- Tertiary storage is typically used for backups and for long-term archival storage of completed work. Another common use for tertiary storage is to swap large little-used files ( or groups of files ) off of the hard drive, and then swap them back in as needed, in a fashion similar to secondary storage providing swap space for primary storage.
12.9.1 Tertiary-Storage Devices
12.9.1.1 Removable Disks
- Removable magnetic disks ( e.g. floppies ) can be nearly as fast as hard drives, but are at greater risk for damage due to scratches. Variations of removable magnetic disks up to a GB or more in capacity have been developed. ( Hot-swappable hard drives? )
- A magneto-optical disk uses a magnetic disk covered in a clear plastic coating that protects the surface.
- The heads sit a considerable distance away from the magnetic surface, and as a result do not have enough magnetic strength to switch bits at normal room temperature.
- For writing, a laser is used to heat up a specific spot on the disk, to a temperature at which the weak magnetic field of the write head is able to flip the bits.
- For reading, a laser is shined at the disk, and the Kerr effect causes the polarization of the light to become rotated either clockwise or counter-clockwise depending on the orientation of the magnetic field.
- Optical disks do not use magnetism at all, but instead use special materials that can be altered ( by lasers ) to have relatively light or dark spots.
- For example the phase-change disk has a material that can be frozen into either a crystalline or an amorphous state, the latter of which is less transparent and reflects less light when a laser is bounced off a reflective surface under the material.
- Three laser powers are used with phase-change disks: (1) a low-power laser reads the disk without affecting the material, (2) a medium-power laser erases the disk by melting and re-freezing the medium into the crystalline state, and (3) a high-power laser writes to the disk by melting the medium and re-freezing it into the amorphous state.
- The most common examples of these disks are re-writable CD-RWs and DVD-RWs.
- An alternative to the disks described above are Write-Once Read-Many, WORM drives.
- The original version of WORM drives involved a thin layer of aluminum sandwiched between two protective layers of glass or plastic.
- Holes were burned in the aluminum to write bits.
- Because the holes could not be filled back in, there was no way to re-write to the disk. ( Although data could be erased by burning more holes. )
- WORM drives have important legal ramifications for data that must be stored for a very long time and must be provable in court as unaltered since it was originally written. ( Such as long-term storage of medical records. )
- Modern CD-R and DVD-R disks are examples of WORM drives that use organic polymer inks instead of an aluminum layer.
- Read-only disks are similar to WORM disks, except the bits are pressed onto the disk at the factory, rather than being burned on one by one.
12.9.1.2 Tapes
- Tape drives typically cost more than disk drives, but the cost per MB of the tapes themselves is lower.
- Tapes are typically used today for backups, and for enormous volumes of data stored by certain scientific establishments. ( E.g. NASA's archive of space probe and satellite imagery, which is currently being downloaded from numerous sources faster than anyone can actually look at it. )
- Robotic tape changers move tapes from drives to archival tape libraries upon demand.
- ( Never underestimate the bandwidth of a station wagon full of tapes rolling down the highway! )
12.9.1.3 Future Technology
- Solid State Disks, SSDs, are becoming more and more popular.
- Holographic storage uses laser light to store images in a 3-D structure, and the entire data structure can be transferred in a single flash of laser light.
- Micro-Electronic Mechanical Systems, MEMS, employs the technology used for computer chip fabrication to create VERY tiny little machines. One example packs 10,000 read-write heads within a square centimeter of space, and as media are passed over it, all 10,000 heads can read data in parallel.
12.9.2 Operating-System Support
- The OS must provide support for tertiary storage as removable media, including the support to transfer data between different systems.
12.9.2.1 Application Interface
- File systems are typically not stored on tapes. ( It might be technically possible, but it is impractical. )
- Tapes are also not low-level formatted, and do not use fixed-length blocks. Rather, data is written to tapes in variable-length blocks as needed.
- Tapes are normally accessed as raw devices, requiring each application to determine how the data is to be stored and read back. Issues such as header contents and ASCII versus binary encoding ( and byte-ordering ) are generally application specific.
- Basic operations supported for tapes include locate( ), read( ), write( ), and read_position( ).
- Because writes are of variable length, writing to a tape erases all data that follows that point on the tape.
- Writing to a tape places the End of Tape ( EOT ) marker at the end of the data written.
- It is not possible to locate( ) to any spot past the EOT marker.
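- A conceptual sketch of those semantics ( hypothetical Python class, not an actual tape-driver API ): locate( ) seeks to a block, write( ) erases everything past the write point and moves the EOT marker, and locating past EOT fails.

    class Tape:
        """Toy model of a raw tape device with variable-length blocks."""

        def __init__(self):
            self.blocks = []                # variable-length blocks; EOT is len(blocks)
            self.pos = 0

        def locate(self, n):
            if n > len(self.blocks):        # cannot locate past the EOT marker
                raise ValueError("cannot locate past EOT")
            self.pos = n

        def read(self):
            if self.pos >= len(self.blocks):
                raise EOFError("at EOT")
            data = self.blocks[self.pos]
            self.pos += 1
            return data

        def write(self, data):
            # Writing erases all data that follows and places EOT after this block.
            del self.blocks[self.pos:]
            self.blocks.append(data)
            self.pos += 1

        def read_position(self):
            return self.pos

    t = Tape()
    t.write(b"header"); t.write(b"file-1"); t.write(b"file-2")
    t.locate(1)
    t.write(b"replacement")                 # "file-2" is now gone; EOT follows block 1
    assert t.read_position() == 2 and len(t.blocks) == 2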
12.9.2.2 File Naming
- File naming conventions for removable media are not standardized, nor are they necessarily consistent between different systems. ( Two removable disks may contain files with the same name, and there is no clear way for the naming system to distinguish between them. )
- Fortunately music CDs have a common format, readable by all systems. Data CDs and DVDs have only a few format choices, making it easy for a system to support all known formats.
12.9.2.3 Hierarchical Storage Management
- Hierarchical storage involves extending file systems out onto tertiary storage, swapping files from hard drives to tapes in much the same manner as data blocks are swapped from memory to hard drives.
- A placeholder is generally left on the hard drive, storing information about the particular tape ( or other removable media ) to which the file has been swapped out.
- A robotic system transfers data to and from tertiary storage as needed, generally automatically upon demand of the file(s) involved.
12.9.3 Performance Issues
12.9.3.1 Speed
- Sustained Bandwidth is the rate of data transfer during a large file transfer, once the proper tape is loaded and the file located.
- Effective Bandwidth is the effective overall rate of data transfer, including any overhead necessary to load the proper tape and find the file on the tape.
- Access Latency is all of the accumulated waiting time before a file can be actually read from tape. This includes the time it takes to find the file on the tape, the time to load the tape from the tape library, and the time spent waiting in the queue for the tape drive to become available.
- Clearly tertiary storage access is much slower than secondary access, although removable disks ( e.g. a CD jukebox ) have somewhat faster access than a tape library.
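- These definitions can be made concrete with a rough, made-up example: suppose the robot needs 30 seconds to fetch and load a tape, locating the file takes another 60 seconds, and the drive then streams a 10 GB file at 100 MB per second ( all of these numbers are assumed for illustration ).

    load_s, locate_s = 30, 60               # overhead before any data moves (assumed)
    file_mb, stream_mb_s = 10_000, 100      # 10 GB file, 100 MB/s sustained (assumed)

    transfer_s = file_mb / stream_mb_s      # 100 seconds of actual streaming
    sustained_bw = file_mb / transfer_s     # 100 MB/s once the file is located
    effective_bw = file_mb / (load_s + locate_s + transfer_s)
    access_latency = load_s + locate_s      # plus any queueing time, ignored here

    print(sustained_bw, round(effective_bw, 1), access_latency)   # -> 100.0 52.6 90

- In this example the overhead nearly halves the effective bandwidth relative to the sustained bandwidth, which is one reason tertiary storage is poorly suited to frequent small accesses.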
12.9.3.2 Reliability
- Fixed hard drives are generally more reliable than removable drives, because they are less susceptible to the environment.
- Optical disks are generally more reliable than magnetic media.
- A fixed hard drive crash can destroy all data, whereas an optical drive or tape drive failure will often not harm the data media, ( and certainly can't damage any media not in the drive at the time of the failure. )
- Tape drives are mechanical devices, and can wear out tapes over time, ( as the tape head is generally in much closer physical contact with the tape than disk heads are with platters. )
- Some drives may only be able to read tapes a few times whereas other drives may be able to re-use the same tapes millions of times.
- Backup tapes should be read after writing, to verify that the backup tape is readable. ( Unfortunately that may have been the LAST time that particular tape was readable, and the only way to be sure is to read it again, . . . )
- Long-term tape storage can cause degradation, as magnetic fields "drift" from one layer of tape to the adjacent layers. Periodic fast-forwarding and rewinding of tapes can help, by changing which section of tape lays against which other layers.
12.9.3.3 Cost
- The cost per megabyte for removable media is its strongest selling feature, particularly as the amount of storage involved ( i.e. the number of tapes, CDs, etc ) increases.
- However the cost per megabyte for hard drives has dropped more rapidly over the years than the cost of removable media, such that the currently most cost-effective backup solution for many systems is simply an additional ( external ) hard drive.
- ( One good use for old unwanted PCs is to put them on a network as a backup server and/or print server. The downside to this backup solution is that the backups are stored on-site with the original data, and a fire, flood, or burglary could wipe out both the original data and the backups. )
Old Figure 12.15 - Price per megabyte of DRAM, from 1981 to 2008
Old Figure 12.16 - Price per megabyte of magnetic hard disk, from 1981 to 2008.
Old Figure 12.17 - Price per megabyte of a tape drive, from 1984 to 2008.
First generation CD-ROM discs, for example, held over 600 MB of data, hundreds of times the capacity of a floppy disk, which was the dominant form of removable media at the time. Even hard disks of the time typically had capacities in the tens of megabytes. Optical storage had a huge advantage in the quest for raw capacity. Of course, things changed over time. Even consumer-grade hard disks today can store multiple terabytes of data, and SD cards, USB flash drives and USB hard drives are now the preferred forms of removable storage. Most computers these days do not even have an optical drive. However, that may eventually change, thanks to holographic storage.
Holographic storage is one of those technologies that has existed seemingly forever, but never really took off. Like other forms of optical media, holographic storage devices use lasers to read and write data. However, the similarities between holographic storage and legacy optical storage technologies such as DVD or Blu-ray end there.
Holographic storage, at least as it currently exists, is a write once read many (WORM) medium. It was intended for the long-term storage of archival data. But it isn’t just the fact that it cannot be overwritten that makes holographic storage so well suited to the task of data archiving. It is also the fact that holographic storage offers tremendous capacity and blazingly fast read speeds--at least in theory.
First-generation devices failed to live up to the technology’s potential. A company called InPhase Technologies, for example, created a removable holographic storage medium that was made commercially available. Although the storage medium itself was only about the size of a DVD-RAM cartridge, the drive that read it was both huge and expensive. Worse yet, the holographic storage medium held only about 300 GB of data, and its maximum transfer rate was only about 20 MB per second.
Another vendor later created an optical storage technology that could store 1 TB on a disc the size of a DVD, but that technology never took off either.
A newer medium, developed by researchers at the University of Southampton and discussed below, is projected to be the size of a DVD, with a theoretical storage capacity of 360 TB. Furthermore, the medium is designed to avoid bit rot for billions of years. Although the technology has been said to perform well, performance benchmarks have not been made publicly available. Even so, performance is only of secondary concern. The medium’s primary goal is to create high capacity storage that is truly permanent--not necessarily fast.
Inside Holographic Storage
So how does holographic storage work? There have been several different holographic storage devices created over the years, each working in its own unique way. Most of these devices use multiple laser beams (or split beams) to allow data to be written to and read from optical storage three dimensionally.
The University of Southampton researchers have used a completely different approach. While their device does use lasers, it is said to store data in five dimensions. Of course, this does not refer to a literal fifth physical dimension; in physics, the fifth dimension refers to invariant properties of space-time. Instead, the Southampton researchers use the phrase “five dimensions” to refer to the five different characteristics that are used for the storage of data.
The first three dimensions of storage are exactly what you would probably expect them to be. These dimensions are essentially the X, Y and Z axes, or height, width, and depth. The fourth dimension refers in this case to the physical size of a “data dot” (in physics, the fourth dimension refers to the passage of time). The so-called fifth dimension is the data’s offset, or how the “data dot” is aligned on the media. Each of these properties can be interpreted as a value, thereby contributing to the medium’s massive storage capacity.
But what about the medium’s longevity? The reason why this particular form of holographic storage can be discussed in terms of permanence is because it is essentially made from rock. The five dimensions are made up of nano-structures that have been constructed from quartz crystal.
For right now, this form of holographic storage is not commercially available, but we expect that at some point it will be. In the meantime, the university has been permanently preserving the world’s greatest literary works in holographic storage.
Research Background:
Holographic memory offers the possibility of storing 1 terabyte (TB) of data in a sugar-cube-sized crystal. A terabyte of data equals 1,000 gigabytes, 1 million megabytes, or 1 trillion bytes. Data from more than 1,000 CDs could fit on a holographic memory system. Most computer hard drives only hold 10 to 40 GB of data, a small fraction of what a holographic memory system might hold.
Scientist Pieter J. van Heerden first proposed the idea of holographic (three-dimensional) storage in the early 1960s. A decade later, scientists at RCA Laboratories demonstrated the technology by recording 500 holograms in an iron-doped lithium-niobate crystal, and 550 holograms of high-resolution images in a light-sensitive polymer material. The lack of cheap parts and the advancement of magnetic and semiconductor memories placed the development of holographic data storage on hold.
Content of the Problem
Prototypes developed by Lucent and IBM differ slightly, but most holographic data storage systems (HDSS) are based on the same concept. Here are the basic components that are needed to construct an HDSS:
- Blue-green argon laser
- Beam splitters to split the laser beam
- Mirrors to direct the laser beams
- LCD panel (spatial light modulator)
- Lenses to focus the laser beams
- Lithium-niobate crystal or photopolymer
- Charge-coupled device (CCD) camera
When the blue-green argon laser is fired, a beam splitter creates two beams. One beam, called the object or signal beam, will go straight, bounce off one mirror and travel through a spatial light modulator (SLM). An SLM is a liquid crystal display (LCD) that shows pages of raw binary data as clear and dark boxes. The information from the page of binary code is carried by the signal beam around to the light-sensitive lithium-niobate crystal. Some systems use a photopolymer in place of the crystal. A second beam, called the reference beam, shoots out the side of the beam splitter and takes a separate path to the crystal. When the two beams meet, the interference pattern that is created stores the data carried by the signal beam in a specific area in the crystal -- the data is stored as a hologram.
An advantage of a holographic memory system is that an entire page of data can be retrieved quickly and at one time. In order to retrieve and reconstruct the holographic page of data stored in the crystal, the reference beam is shined into the crystal at exactly the same angle at which it entered to store that page of data. Each page of data is stored in a different area of the crystal, based on the angle at which the reference beam strikes it. During reconstruction, the beam will be diffracted by the crystal to allow the recreation of the original page that was stored. This reconstructed page is then projected onto the charge-coupled device (CCD) camera, which interprets and forwards the digital information to a computer.
The key component of any holographic data storage system is the angle at which the second reference beam is fired at the crystal to retrieve a page of data. It must match the original reference beam angle exactly. A difference of just a thousandth of a millimeter will result in failure to retrieve that page of data.
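A purely conceptual sketch of that angular multiplexing may help ( illustrative Python; the class, the angles, and the tolerance are all invented for the example and do not describe a real device interface ): pages are keyed by the reference-beam angle used to record them, and retrieval only succeeds when the read angle matches a recording angle within a tight tolerance.

    TOLERANCE = 0.001                        # arbitrary angular tolerance for the model

    class HolographicCrystal:
        def __init__(self):
            self.pages = {}                  # recording angle -> page of binary data

        def record(self, angle, page):
            self.pages[angle] = page

        def reconstruct(self, angle):
            for stored_angle, page in self.pages.items():
                if abs(stored_angle - angle) < TOLERANCE:
                    return page
            raise LookupError("no hologram reconstructs at this reference angle")

    crystal = HolographicCrystal()
    crystal.record(10.000, b"page 0")
    crystal.record(10.010, b"page 1")        # a neighboring page at a slightly different angle
    assert crystal.reconstruct(10.0101) == b"page 1"
    try:
        crystal.reconstruct(10.005)          # between pages: nothing reconstructs
    except LookupError:
        pass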
- After more than 30 years of research and development, a desktop holographic storage system (HDSS) is close at hand. Early holographic data storage devices will have capacities of 125 GB and transfer rates of about 40 MB per second. Eventually, these devices could have storage capacities of 1 TB and data rates of more than 1 GB per second -- fast enough to transfer an entire DVD movie in 30 seconds. So why has it taken so long to develop an HDSS, and what is there left to do?
- When the idea of an HDSS was first proposed, the components for constructing such a device were much larger and more expensive. For example, a laser for such a system in the 1960s would have been 6 feet long. Now, with the development of consumer electronics, a laser similar to those used in CD players could be used for the HDSS. LCDs weren't even developed until 1968, and the first ones were very expensive. Today, LCDs are much cheaper and more complex than those developed 30 years ago. Additionally, a CCD sensor wasn't available until the last decade. Almost the entire HDSS device can now be made from off-the-shelf components, which means that it could be mass-produced.
- Although HDSS components are easier to come by today than they were in the 1960s, there are still some technical problems that need to be worked out. For example, if too many pages are stored in one crystal, the strength of each hologram is diminished. If there are too many holograms stored on a crystal, and the reference laser used to retrieve a hologram is not shined at the precise angle, a hologram will pick up a lot of background from the other holograms stored around it. It is also a challenge to align all of these components in a low-cost system.
In Conclusion
10.1 Overview of Mass-Storage Structure
10.1.1 Magnetic Disks
- Traditional magnetic disks have the following basic structure:
- One or more platters in the form of disks covered with magnetic media. Hard disk platters are made of rigid metal, while "floppy" disks are made of more flexible plastic.
- Each platter has two working surfaces. Older hard disk drives would sometimes not use the very top or bottom surface of a stack of platters, as these surfaces were more susceptible to potential damage.
- Each working surface is divided into a number of concentric rings called tracks. The collection of all tracks that are the same distance from the edge of the platter, ( i.e. all tracks immediately above one another in the following diagram ) is called a cylinder.
- Each track is further divided into sectors, traditionally containing 512 bytes of data each, although some modern disks occasionally use larger sector sizes. ( Sectors also include a header and a trailer, including checksum information among other things. Larger sector sizes reduce the fraction of the disk consumed by headers and trailers, but increase internal fragmentation and the amount of disk that must be marked bad in the case of errors. )
- The data on a hard drive is read by read-write heads. The standard configuration ( shown below ) uses one head per surface, each on a separate arm, and controlled by a common arm assembly which moves all heads simultaneously from one cylinder to another. ( Other configurations, including independent read-write heads, may speed up disk access, but involve serious technical difficulties. )
- The storage capacity of a traditional disk drive is equal to the number of heads ( i.e. the number of working surfaces ), times the number of tracks per surface, times the number of sectors per track, times the number of bytes per sector. A particular physical block of data is specified by providing the head-sector-cylinder number at which it is located.
Figure 10.1 - Moving-head disk mechanism.
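- For example ( made-up geometry ), a drive with 16 heads, 50,000 cylinders, 1,000 sectors per track, and 512-byte sectors would hold:

    heads, cylinders, sectors_per_track, bytes_per_sector = 16, 50_000, 1_000, 512
    capacity = heads * cylinders * sectors_per_track * bytes_per_sector
    print(capacity, "bytes =", capacity / 10**9, "GB")   # 409600000000 bytes = 409.6 GB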
- In operation the disk rotates at high speed, such as 7200 rpm ( 120 revolutions per second. ) The time required to transfer data from the disk to the computer is composed of several components:
- The positioning time, a.k.a. the seek time or random access time is the time required to move the heads from one cylinder to another, and for the heads to settle down after the move. This is typically the slowest step in the process and the predominant bottleneck to overall transfer rates.
- The rotational latency is the amount of time required for the desired sector to rotate around and come under the read-write head. This can range anywhere from zero to one full revolution, and on the average will equal one-half revolution. This is another physical step and is usually the second slowest step behind seek time. ( For a disk rotating at 7200 rpm, the average rotational latency would be 1/2 revolution / 120 revolutions per second, or just over 4 milliseconds, a long time by computer standards. )
- The transfer rate, which is the time required to move the data electronically from the disk to the computer. ( Some authors may also use the term transfer rate to refer to the overall transfer rate, including seek time and rotational latency as well as the electronic data transfer rate. )
- Disk heads "fly" over the surface on a very thin cushion of air. If they should accidentally contact the disk, then a head crash occurs, which may or may not permanently damage the disk or even destroy it completely. For this reason it is normal to park the disk heads when turning a computer off, which means to move the heads off the disk or to an area of the disk where there is no data stored.
- Floppy disks are normally removable. Hard drives can also be removable, and some are even hot-swappable, meaning they can be removed while the computer is running, and a new hard drive inserted in their place.
- Disk drives are connected to the computer via a cable known as the I/O bus. Some of the common interface formats include Enhanced Integrated Drive Electronics, EIDE; Advanced Technology Attachment, ATA; Serial ATA, SATA; Universal Serial Bus, USB; Fibre Channel, FC; and Small Computer Systems Interface, SCSI.
- The host controller is at the computer end of the I/O bus, and the disk controller is built into the disk itself. The CPU issues commands to the host controller via I/O ports. Data is transferred between the magnetic surface and onboard cache by the disk controller, and then the data is transferred from that cache to the host controller and the motherboard memory at electronic speeds.
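- Putting the timing components described above together gives a rough estimate of a single random access ( the seek time, transfer rate, and request size are assumed figures for illustration ):

    seek_ms = 8.0                            # assumed average seek time
    rotational_ms = 0.5 * (60_000 / 7_200)   # half a revolution at 7200 rpm, ~4.17 ms
    transfer_ms = (4 / (100 * 1024)) * 1000  # 4 KB at 100 MB/s, ~0.04 ms

    total_ms = seek_ms + rotational_ms + transfer_ms
    print(round(rotational_ms, 2), round(total_ms, 2))   # -> 4.17 12.21

- The seek and rotational terms dominate; the electronic transfer itself is negligible by comparison, which is why disk scheduling concentrates on reducing head movement.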
10.1.2 Solid-State Disks - New
- As technologies improve and economics change, old technologies are often used in different ways. One example of this is the increasing use of solid state disks, or SSDs.
- SSDs use memory technology as a small fast hard disk. Specific implementations may use either flash memory or DRAM chips protected by a battery to sustain the information through power cycles.
- Because SSDs have no moving parts they are much faster than traditional hard drives, and certain problems such as the scheduling of disk accesses simply do not apply.
- However SSDs also have their weaknesses: They are more expensive than hard drives, generally not as large, and may have shorter life spans.
- SSDs are especially useful as a high-speed cache of hard-disk information that must be accessed quickly. One example is to store filesystem meta-data, e.g. directory and inode information, that must be accessed quickly and often. Another variation is a boot disk containing the OS and some application executables, but no vital user data. SSDs are also used in laptops to make them smaller, faster, and lighter.
- Because SSDs are so much faster than traditional hard disks, the throughput of the bus can become a limiting factor, causing some SSDs to be connected directly to the system PCI bus for example.
10.1.3 Magnetic Tapes - was 12.1.2
- Magnetic tapes were once used for common secondary storage before the days of hard disk drives, but today are used primarily for backups.
- Accessing a particular spot on a magnetic tape can be slow, but once reading or writing commences, access speeds are comparable to disk drives.
- Capacities of tape drives can range from 20 to 200 GB, and compression can double that capacity.
10.2 Disk Structure
- The traditional head-sector-cylinder, HSC numbers are mapped to linear block addresses by numbering the first sector on the first head on the outermost track as sector 0. Numbering proceeds with the rest of the sectors on that same track, and then the rest of the tracks on the same cylinder before proceeding through the rest of the cylinders to the center of the disk. In modern practice these linear block addresses are used in place of the HSC numbers for a variety of reasons:
- The linear length of tracks near the outer edge of the disk is much longer than for those tracks located near the center, and therefore it is possible to squeeze many more sectors onto outer tracks than onto inner ones.
- All disks have some bad sectors, and therefore disks maintain a few spare sectors that can be used in place of the bad ones. The mapping of spare sectors to bad sectors is managed internally by the disk controller.
- Modern hard drives can have thousands of cylinders, and hundreds of sectors per track on their outermost tracks. These numbers exceed the range of HSC numbers for many ( older ) operating systems, and therefore disks can be configured for any convenient combination of HSC values that falls within the total number of sectors physically on the drive.
- There is a limit to how closely packed individual bits can be placed on a physical medium, but that limit keeps increasing as technological advances are made.
- Modern disks pack many more sectors into outer cylinders than inner ones, using one of two approaches:
- With Constant Linear Velocity, CLV, the density of bits is uniform from cylinder to cylinder. Because there are more sectors in outer cylinders, the disk spins slower when reading those cylinders, causing the rate of bits passing under the read-write head to remain constant. This is the approach used by modern CDs and DVDs.
- With Constant Angular Velocity, CAV, the disk rotates at a constant angular speed, with the bit density decreasing on outer cylinders. ( These disks would have a constant number of sectors per track on all cylinders. )
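- A sketch of the linear numbering just described, assuming a simple fixed geometry ( real drives hide zone recording and sector remapping behind the controller, so this is only the logical model ):

    HEADS, SECTORS_PER_TRACK = 16, 63        # assumed CHS-style geometry

    def chs_to_lba(cylinder, head, sector):
        # Count the sectors of earlier cylinders, then earlier tracks ( heads )
        # of the same cylinder, then the sectors before this one on the track.
        return (cylinder * HEADS + head) * SECTORS_PER_TRACK + sector

    def lba_to_chs(lba):
        cylinder, rest = divmod(lba, HEADS * SECTORS_PER_TRACK)
        head, sector = divmod(rest, SECTORS_PER_TRACK)
        return cylinder, head, sector

    assert chs_to_lba(0, 0, 0) == 0          # first sector, first head, outermost track
    assert lba_to_chs(chs_to_lba(5, 3, 42)) == (5, 3, 42)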
10.3 Disk Attachment
Disk drives can be attached either directly to a particular host ( a local disk ) or to a network.
10.3.1 Host-Attached Storage
- Local disks are accessed through I/O Ports as described earlier.
- The most common interfaces are IDE or ATA, each of which allows up to two drives per host controller.
- SATA is similar with simpler cabling.
- High end workstations or other systems in need of a larger number of disks typically use SCSI disks:
- The SCSI standard supports up to 16 targets on each SCSI bus, one of which is generally the host adapter and the other 15 of which can be disk or tape drives.
- A SCSI target is usually a single drive, but the standard also supports up to 8 units within each target. These would generally be used for accessing individual disks within a RAID array. ( See below. )
- The SCSI standard also supports multiple host adapters in a single computer, i.e. multiple SCSI busses.
- Modern advancements in SCSI include "fast" and "wide" versions, as well as SCSI-2.
- SCSI cables may be either 50 or 68 conductors. SCSI devices may be external as well as internal.
- See wikipedia for more information on the SCSI interface.
- FC is a high-speed serial architecture that can operate over optical fiber or four-conductor copper wires, and has two variants:
- A large switched fabric having a 24-bit address space. This variant allows for multiple devices and multiple hosts to interconnect, forming the basis for the storage-area networks, SANs, to be discussed in a future section.
- The arbitrated loop, FC-AL, that can address up to 126 devices ( drives and controllers. )
10.3.2 Network-Attached Storage
- Network attached storage connects storage devices to computers using a remote procedure call, RPC, interface, typically with something like NFS filesystem mounts. This is convenient for allowing several computers in a group common access and naming conventions for shared storage.
- NAS can be implemented using SCSI cabling, or with iSCSI, which uses Internet protocols and standard network connections, allowing long-distance remote access to shared files.
- NAS allows computers to easily share data storage, but tends to be less efficient than standard host-attached storage.
Figure 10.2 - Network-attached storage.
10.3.3 Storage-Area Network
- A Storage-Area Network, SAN, connects computers and storage devices in a network, using storage protocols instead of network protocols.
- One advantage of this is that storage access does not tie up regular networking bandwidth.
- SAN is very flexible and dynamic, allowing hosts and devices to attach and detach on the fly.
- SAN is also controllable, allowing restricted access to certain hosts and devices.
Figure 10.3 - Storage-area network.
10.4 Disk Scheduling
- As mentioned earlier, disk transfer speeds are limited primarily by seek times and rotational latency. When multiple requests are to be processed there is also some inherent delay in waiting for other requests to be processed.
- Bandwidth is measured by the amount of data transferred divided by the total amount of time from the first request being made to the last transfer being completed, ( for a series of disk requests. )
- Both bandwidth and access time can be improved by processing requests in a good order.
- Disk requests include the disk address, memory address, number of sectors to transfer, and whether the request is for reading or writing.
10.4.1 FCFS Scheduling
- First-Come First-Serve is simple and intrinsically fair, but not very efficient. Consider in the following sequence the wild swing from cylinder 122 to 14 and then back to 124:
Figure 10.4 - FCFS disk scheduling.
10.4.2 SSTF Scheduling
- Shortest Seek Time First scheduling is more efficient, but may lead to starvation if a constant stream of requests arrives for the same general area of the disk.
- SSTF reduces the total head movement to 236 cylinders, down from 640 required for the same set of requests under FCFS. Note, however that the distance could be reduced still further to 208 by starting with 37 and then 14 first before processing the rest of the requests.
Figure 10.5 - SSTF disk scheduling.
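- The totals quoted above can be reproduced with the request queue that matches the figures ( cylinders 98, 183, 37, 122, 14, 124, 65, 67, with the head starting at cylinder 53 ). A short sketch in Python:

    REQUESTS = [98, 183, 37, 122, 14, 124, 65, 67]   # pending requests, in arrival order
    START = 53                                        # initial head position

    def fcfs(requests, head):
        # Service requests strictly in arrival order.
        return sum(abs(r - h) for h, r in zip([head] + requests, requests))

    def sstf(requests, head):
        # Always service the pending request closest to the current head position.
        pending, total = list(requests), 0
        while pending:
            nxt = min(pending, key=lambda r: abs(r - head))
            total += abs(nxt - head)
            head = nxt
            pending.remove(nxt)
        return total

    print(fcfs(REQUESTS, START), sstf(REQUESTS, START))   # -> 640 236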
10.4.3 SCAN Scheduling
- The SCAN algorithm, a.k.a. the elevator algorithm moves back and forth from one end of the disk to the other, similarly to an elevator processing requests in a tall building.
Figure 10.6 - SCAN disk scheduling.
- Under the SCAN algorithm, if a request arrives just ahead of the moving head then it will be processed right away, but if it arrives just after the head has passed, then it will have to wait for the head to pass going the other way on the return trip. This leads to a fairly wide variation in access times which can be improved upon.
- Consider, for example, when the head reaches the high end of the disk: Requests with high cylinder numbers just missed the passing head, which means they are all fairly recent requests, whereas requests with low numbers may have been waiting for a much longer time. Making the return scan from high to low then ends up accessing recent requests first and making older requests wait that much longer.
10.4.4 C-SCAN Scheduling
- The Circular-SCAN algorithm improves upon SCAN by treating all requests in a circular queue fashion - Once the head reaches the end of the disk, it returns to the other end without processing any requests, and then starts again from the beginning of the disk:
Figure 10.7 - C-SCAN disk scheduling.
10.4.5 LOOK Scheduling
- LOOK scheduling improves upon SCAN by looking ahead at the queue of pending requests, and not moving the heads any farther towards the end of the disk than is necessary. The following diagram illustrates the circular form of LOOK:
Figure 10.8 - C-LOOK disk scheduling.
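- Continuing the same example, SCAN and C-LOOK head movement can be sketched as well ( assuming a 200-cylinder disk, the head initially moving toward higher-numbered cylinders, and the return jump of C-LOOK counted as movement ):

    REQUESTS = [98, 183, 37, 122, 14, 124, 65, 67]
    START, MAX_CYL = 53, 199

    def scan(requests, head, max_cyl):
        # Sweep up to the last cylinder servicing requests, then reverse and
        # sweep back down as far as the lowest pending request.
        below = [r for r in requests if r < head]
        moved = max_cyl - head
        if below:
            moved += max_cyl - min(below)
        return moved

    def c_look(requests, head):
        # Go up only as far as the highest request, jump back to the lowest
        # request, then continue upward.
        above = [r for r in requests if r >= head]
        below = [r for r in requests if r < head]
        moved = (max(above) - head) if above else 0
        if below:
            top = max(above) if above else head
            moved += (top - min(below)) + (max(below) - min(below))
        return moved

    print(scan(REQUESTS, START, MAX_CYL), c_look(REQUESTS, START))   # -> 331 322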
10.4.6 Selection of a Disk-Scheduling Algorithm
- With very low loads all algorithms are equal, since there will normally only be one request to process at a time.
- For slightly larger loads, SSTF offers better performance than FCFS, but may lead to starvation when loads become heavy enough.
- For busier systems, SCAN and LOOK algorithms eliminate starvation problems.
- The actual optimal algorithm may be something even more complex than those discussed here, but the incremental improvements are generally not worth the additional overhead.
- Some improvement to overall filesystem access times can be made by intelligent placement of directory and/or inode information. If those structures are placed in the middle of the disk instead of at the beginning of the disk, then the maximum distance from those structures to data blocks is reduced to only one-half of the disk size. If those structures can be further distributed and furthermore have their data blocks stored as close as possible to the corresponding directory structures, then that reduces still further the overall time to find the disk block numbers and then access the corresponding data blocks.
- On modern disks the rotational latency can be almost as significant as the seek time. However it is not within the OS's control to account for that, because modern disks do not reveal their internal sector mapping schemes, ( particularly when bad blocks have been remapped to spare sectors. )
- Some disk manufacturers provide for disk scheduling algorithms directly on their disk controllers, ( which do know the actual geometry of the disk as well as any remapping ), so that if a series of requests are sent from the computer to the controller then those requests can be processed in an optimal order.
- Unfortunately there are some considerations that the OS must take into account that are beyond the abilities of the on-board disk-scheduling algorithms, such as priorities of some requests over others, or the need to process certain requests in a particular order. For this reason OSes may elect to spoon-feed requests to the disk controller one at a time in certain situations.
10.5 Disk Management
10.5.1 Disk Formatting
- Before a disk can be used, it has to be low-level formatted, which means laying down all of the headers and trailers marking the beginning and ends of each sector. Included in the header and trailer are the linear sector numbers, and error-correcting codes, ECC, which allow damaged sectors to not only be detected, but in many cases for the damaged data to be recovered ( depending on the extent of the damage. ) Sector sizes are traditionally 512 bytes, but may be larger, particularly in larger drives.
- ECC calculation is performed with every disk read or write, and if damage is detected but the data is recoverable, then a soft error has occurred. Soft errors are generally handled by the on-board disk controller, and never seen by the OS. ( See below. )
- Once the disk is low-level formatted, the next step is to partition the drive into one or more separate partitions. This step must be completed even if the disk is to be used as a single large partition, so that the partition table can be written to the beginning of the disk.
- After partitioning, then the filesystems must be logically formatted, which involves laying down the master directory information ( FAT table or inode structure ), initializing free lists, and creating at least the root directory of the filesystem. ( Disk partitions which are to be used as raw devices are not logically formatted. This saves the overhead and disk space of the filesystem structure, but requires that the application program manage its own disk storage requirements. )
10.5.2 Boot Block
- Computer ROM contains a bootstrap program ( OS independent ) with just enough code to find the first sector on the first hard drive on the first controller, load that sector into memory, and transfer control over to it. ( The ROM bootstrap program may look in floppy and/or CD drives before accessing the hard drive, and is smart enough to recognize whether it has found valid boot code or not. )
- The first sector on the hard drive is known as the Master Boot Record, MBR, and contains a very small amount of code in addition to the partition table. The partition table documents how the disk is partitioned into logical disks, and indicates specifically which partition is the active or boot partition.
- The boot program then looks to the active partition to find an operating system, possibly loading up a slightly larger / more advanced boot program along the way.
- In a dual-boot ( or larger multi-boot ) system, the user may be given a choice of which operating system to boot, with a default action to be taken in the event of no response within some time frame.
- Once the kernel is found by the boot program, it is loaded into memory and then control is transferred over to the OS. The kernel will normally continue the boot process by initializing all important kernel data structures, launching important system services ( e.g. network daemons, sched, init, etc. ), and finally providing one or more login prompts. Boot options at this stage may include single-user a.k.a. maintenance or safe modes, in which very few system services are started - These modes are designed for system administrators to repair problems or otherwise maintain the system.
Figure 10.9 - Booting from disk in Windows 2000.
10.5.3 Bad Blocks
- No disk can be manufactured to 100% perfection, and all physical objects wear out over time. For these reasons all disks are shipped with a few bad blocks, and additional blocks can be expected to go bad slowly over time. If a large number of blocks go bad then the entire disk will need to be replaced, but a few here and there can be handled through other means.
- In the old days, bad blocks had to be checked for manually. Formatting of the disk or running certain disk-analysis tools would identify bad blocks, and attempt to read the data off of them one last time through repeated tries. Then the bad blocks would be mapped out and taken out of future service. Sometimes the data could be recovered, and sometimes it was lost forever. ( Disk analysis tools could be either destructive or non-destructive. )
- Modern disk controllers make much better use of the error-correcting codes, so that bad blocks can be detected earlier and the data usually recovered. ( Recall that blocks are tested with every write as well as with every read, so often errors can be detected before the write operation is complete, and the data simply written to a different sector instead. )
- Note that re-mapping of sectors from their normal linear progression can throw off the disk scheduling optimization of the OS, especially if the replacement sector is physically far away from the sector it is replacing. For this reason most disks normally keep a few spare sectors on each cylinder, as well as at least one spare cylinder. Whenever possible a bad sector will be mapped to another sector on the same cylinder, or at least a cylinder as close as possible. Sector slipping may also be performed, in which all sectors between the bad sector and the replacement sector are moved down by one, so that the linear progression of sector numbers can be maintained.
- If the data on a bad block cannot be recovered, then a hard error has occurred, which requires replacing the file(s) from backups, or rebuilding them from scratch.
10.6 Swap-Space Management
- Modern systems typically swap out pages as needed, rather than swapping out entire processes. Hence the swapping system is part of the virtual memory management system.
- Managing swap space is obviously an important task for modern OSes.
10.6.1 Swap-Space Use
- The amount of swap space needed by an OS varies greatly according to how it is used. Some systems require an amount equal to physical RAM; some want a multiple of that; some want an amount equal to the amount by which virtual memory exceeds physical RAM, and some systems use little or none at all!
- Some systems support multiple swap spaces on separate disks in order to speed up the virtual memory system.
10.6.2 Swap-Space Location
Swap space can be physically located in one of two locations:
- As a large file which is part of the regular filesystem. This is easy to implement, but inefficient. Not only must the swap space be accessed through the directory system, the file is also subject to fragmentation issues. Caching the block location helps in finding the physical blocks, but that is not a complete fix.
- As a raw partition, possibly on a separate or little-used disk. This allows the OS more control over swap space management, which is usually faster and more efficient. Fragmentation of swap space is generally not a big issue, as the space is re-initialized every time the system is rebooted. The downside of keeping swap space on a raw partition is that it can only be grown by repartitioning the hard drive.
10.6.3 Swap-Space Management: An Example
- Historically OSes swapped out entire processes as needed. Modern systems swap out only individual pages, and only as needed. ( For example process code blocks and other blocks that have not been changed since they were originally loaded are normally just freed from the virtual memory system rather than copying them to swap space, because it is faster to go find them again in the filesystem and read them back in from there than to write them out to swap space and then read them back. )
- In the mapping system shown below for Linux systems, a map of swap space is kept in memory, where each entry corresponds to a 4K block in the swap space. Zeros indicate free slots and non-zeros refer to how many processes have a mapping to that particular block ( >1 for shared pages only. )
Figure 10.10 - The data structures for swapping on Linux systems.
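- A minimal sketch of that swap-map idea ( illustrative only; not the kernel's actual swap_map code ): an array of per-slot counters, zero meaning free and a value above one meaning the slot holds a page shared by more than one process.

    class SwapMap:
        def __init__(self, slots):
            self.counts = [0] * slots        # one counter per 4K swap slot

        def allocate(self):
            slot = self.counts.index(0)      # first free slot ( count == 0 )
            self.counts[slot] = 1
            return slot

        def share(self, slot):
            self.counts[slot] += 1           # another process maps the same slot

        def release(self, slot):
            self.counts[slot] -= 1           # the slot is free again when it reaches zero

    m = SwapMap(8)
    s = m.allocate()
    m.share(s)                               # counts[s] == 2: a shared page
    m.release(s); m.release(s)
    assert m.counts[s] == 0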
10.7 RAID Structure
- The general idea behind RAID is to employ a group of hard drives together with some form of duplication, either to increase reliability or to speed up operations, ( or sometimes both. )
- RAID originally stood for Redundant Array of Inexpensive Disks, and was designed to use a bunch of cheap small disks in place of one or two larger more expensive ones. Today RAID systems employ large possibly expensive disks as their components, switching the definition to Independent disks.
10.7.1 Improvement of Reliability via Redundancy
- The more disks a system has, the greater the likelihood that one of them will go bad at any given time. Hence increasing disks on a system actually decreases the Mean Time To Failure, MTTF of the system.
- If, however, the same data was copied onto multiple disks, then the data would not be lost unless both ( or all ) copies of the data were damaged simultaneously, which is a MUCH lower probability than for a single disk going bad. More specifically, the second disk would have to go bad before the first disk was repaired, which brings the Mean Time To Repair into play. For example if two disks were involved, each with a MTTF of 100,000 hours and a MTTR of 10 hours, then the Mean Time to Data Loss would be 500 * 10^6 hours, or 57,000 years!
- This is the basic idea behind disk mirroring, in which a system contains identical data on two or more disks.
- Note that a power failure during a write operation could cause both disks to contain corrupt data, if both disks were writing simultaneously at the time of the power failure. One solution is to write to the two disks in series, so that they will not both become corrupted ( at least not in the same way ) by a power failure. An alternative solution involves non-volatile RAM as a write cache, which is not lost in the event of a power failure and which is protected by error-correcting codes.
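- The 57,000-year figure above follows from the usual mirrored-pair approximation: data is lost only if the second disk fails within the repair window of the first, giving a mean time to data loss of roughly MTTF squared divided by ( 2 x MTTR ).

    mttf_hours = 100_000                     # mean time to failure of one disk
    mttr_hours = 10                          # mean time to repair / replace it

    mttdl_hours = mttf_hours ** 2 / (2 * mttr_hours)
    print(mttdl_hours, round(mttdl_hours / (24 * 365)))   # -> 500000000.0 57078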
10.7.2 Improvement in Performance via Parallelism
- There is also a performance benefit to mirroring, particularly with respect to reads. Since every block of data is duplicated on multiple disks, read operations can be satisfied from any available copy, and multiple disks can be reading different data blocks simultaneously in parallel. ( Writes could possibly be sped up as well through careful scheduling algorithms, but it would be complicated in practice. )
- Another way of improving disk access time is with striping, which basically means spreading data out across multiple disks that can be accessed simultaneously.
- With bit-level striping the bits of each byte are striped across multiple disks. For example if 8 disks were involved, then each 8-bit byte would be read in parallel by 8 heads on separate disks. A single disk read would access 8 * 512 bytes = 4K worth of data in the time normally required to read 512 bytes. Similarly if 4 disks were involved, then two bits of each byte could be stored on each disk, for 2K worth of disk access per read or write operation.
- Block-level striping spreads a filesystem across multiple disks on a block-by-block basis, so if block N were located on disk 0, then block N + 1 would be on disk 1, and so on. This is particularly useful when filesystems are accessed in clusters of physical blocks. Other striping possibilities exist, with block-level striping being the most common.
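- A sketch of the block-level striping mapping just described: numbering logical blocks from 0, block N lives on disk N mod D at offset N div D ( D is an assumed number of disks ).

    NUM_DISKS = 4                            # assumed size of the stripe set

    def locate_block(n, num_disks=NUM_DISKS):
        disk, offset = n % num_disks, n // num_disks
        return disk, offset

    # Consecutive logical blocks land on consecutive disks:
    print([locate_block(n) for n in range(6)])
    # -> [(0, 0), (1, 0), (2, 0), (3, 0), (0, 1), (1, 1)]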
10.7.3 RAID Levels
- Mirroring provides reliability but is expensive; striping improves performance, but does not improve reliability. Accordingly there are a number of different schemes that combine the principles of mirroring and striping in different ways, in order to balance reliability versus performance versus cost. These are described by different RAID levels, as follows: ( In the diagram that follows, "C" indicates a copy, and "P" indicates parity, i.e. checksum bits. )
- Raid Level 0 - This level includes striping only, with no mirroring.
- Raid Level 1 - This level includes mirroring only, no striping.
- Raid Level 2 - This level stores error-correcting codes on additional disks, allowing for any damaged data to be reconstructed by subtraction from the remaining undamaged data. Note that this scheme requires only three extra disks to protect 4 disks worth of data, as opposed to full mirroring. ( The number of disks required is a function of the error-correcting algorithms, and the means by which the particular bad bit(s) is(are) identified. )
- Raid Level 3 - This level is similar to level 2, except that it takes advantage of the fact that each disk is still doing its own error-detection, so that when an error occurs, there is no question about which disk in the array has the bad data. As a result a single parity bit is all that is needed to recover the lost data from an array of disks. Level 3 also includes striping, which improves performance. The downside with the parity approach is that every disk must take part in every disk access, and the parity bits must be constantly calculated and checked, reducing performance. Hardware-level parity calculations and NVRAM cache can help with both of those issues. In practice level 3 is greatly preferred over level 2.
- Raid Level 4 - This level is similar to level 3, employing block-level striping instead of bit-level striping. The benefits are that multiple blocks can be read independently, and changes to a block only require writing two blocks ( data and parity ) rather than involving all disks. Note that new disks can be added seamlessly to the system provided they are initialized to all zeros, as this does not affect the parity results.
- Raid Level 5 - This level is similar to level 4, except the parity blocks are distributed over all disks, thereby more evenly balancing the load on the system. For any given block on the disk(s), one of the disks will hold the parity information for that block and the other N-1 disks will hold the data. Note that the same disk cannot hold both data and parity for the same block, as both would be lost in the event of a disk crash.
- Raid Level 6 - This level extends raid level 5 by storing multiple bits of error-recovery codes, ( such as the Reed-Solomon codes ), for each bit position of data, rather than a single parity bit. In the example shown below 2 bits of ECC are stored for every 4 bits of data, allowing data recovery in the face of up to two simultaneous disk failures. Note that this still involves only 50% increase in storage needs, as opposed to 100% for simple mirroring which could only tolerate a single disk failure.
Figure 10.11 - RAID levels.
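- The single-parity recovery used by levels 3 through 5 relies on XOR: the parity block is the XOR of the data blocks in a stripe, so any one missing block can be rebuilt by XOR-ing the parity with the surviving blocks. A small sketch:

    from functools import reduce

    def xor_blocks(blocks):
        # XOR corresponding bytes of all blocks in the stripe.
        return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

    data = [b"AAAA", b"BBBB", b"CCCC"]        # three data blocks in one stripe
    parity = xor_blocks(data)                 # stored on the parity disk for this stripe

    lost = data[1]                            # pretend the disk holding block 1 fails
    survivors = [data[0], data[2], parity]
    rebuilt = xor_blocks(survivors)           # parity XOR surviving data = missing block
    assert rebuilt == lost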
- There are also two RAID levels which combine RAID levels 0 and 1 ( striping and mirroring ) in different combinations, designed to provide both performance and reliability at the expense of increased cost.
- RAID level 0 + 1 disks are first striped, and then the striped disks mirrored to another set. This level generally provides better performance than RAID level 5.
- RAID level 1 + 0 mirrors disks in pairs, and then stripes the mirrored pairs. The storage capacity, performance, etc. are all the same, but there is an advantage to this approach in the event of multiple disk failures, as illustrated in Figure 10.12 earlier.
- In diagram (a) below, the 8 disks have been divided into two sets of four, each of which is striped, and then one stripe set is used to mirror the other set.
- If a single disk fails, it wipes out the entire stripe set, but the system can keep on functioning using the remaining set.
- However if a second disk from the other stripe set now fails, then the entire system is lost, as a result of two disk failures.
- In diagram (b), the same 8 disks are divided into four sets of two, each of which is mirrored, and then the file system is striped across the four sets of mirrored disks.
- If a single disk fails, then that mirror set is reduced to a single disk, but the system rolls on, and the other three mirror sets continue mirroring.
- Now if a second disk fails, ( that is not the mirror of the already failed disk ), then another one of the mirror sets is reduced to a single disk, but the system can continue without data loss.
- In fact the second arrangement could handle as many as four simultaneously failed disks, as long as no two of them were from the same mirror pair.
Figure 10.12 - RAID 0 + 1 and 1 + 0
10.7.4 Selecting a RAID Level
- Trade-offs in selecting the optimal RAID level for a particular application include cost, volume of data, need for reliability, need for performance, and rebuild time, the latter of which can affect the likelihood that a second disk will fail while the first failed disk is being rebuilt.
- Other decisions include how many disks are involved in a RAID set and how many disks to protect with a single parity bit. More disks in the set increases performance but increases cost. Protecting more disks per parity bit saves cost, but increases the likelihood that a second disk will fail before the first bad disk is repaired.
10.7.5 Extensions
- RAID concepts have been extended to tape drives ( e.g. striping tapes for faster backups or parity checking tapes for reliability ), and for broadcasting of data.
10.7.6 Problems with RAID
- RAID protects against physical errors, but not against any number of bugs or other errors that could write erroneous data.
- ZFS adds an extra level of protection by including data block checksums in all inodes along with the pointers to the data blocks. If data are mirrored and one copy has the correct checksum and the other does not, then the data with the bad checksum will be replaced with a copy of the data with the good checksum. This increases reliability greatly over RAID alone, at a cost of a performance hit that is acceptable because ZFS is so fast to begin with.
Figure 10.13 - ZFS checksums all metadata and data.
- Another problem with traditional filesystems is that the sizes are fixed, and relatively difficult to change. Where RAID sets are involved it becomes even harder to adjust filesystem sizes, because a filesystem cannot span across multiple filesystems.
- ZFS solves these problems by pooling RAID sets, and by dynamically allocating space to filesystems as needed. Filesystem sizes can be limited by quotas, and space can also be reserved to guarantee that a filesystem will be able to grow later, but these parameters can be changed at any time by the filesystem's owner. Otherwise filesystems grow and shrink dynamically as needed.
Figure 10.14 - (a) Traditional volumes and file systems. (b) a ZFS pool and file systems.
10.8 Stable-Storage Implementation ( Optional )
- The concept of stable storage ( first presented in chapter 6 ) involves a storage medium in which data is never lost, even in the face of equipment failure in the middle of a write operation.
- To implement this requires two ( or more ) copies of the data, with separate failure modes.
- An attempted disk write results in one of three possible outcomes:
- The data is successfully and completely written.
- The data is partially written, but not completely. The last block written may be garbled.
- No writing takes place at all.
- Whenever an equipment failure occurs during a write, the system must detect it, and return the system back to a consistent state. To do this requires two physical blocks for every logical block, and the following procedure:
- Write the data to the first physical block.
- After step 1 had completed, then write the data to the second physical block.
- Declare the operation complete only after both physical writes have completed successfully.
- During recovery the pair of blocks is examined.
- If both blocks are identical and there is no sign of damage, then no further action is necessary.
- If one block contains a detectable error but the other does not, then the damaged block is replaced with the good copy. ( This will either undo the operation or complete the operation, depending on which block is damaged and which is undamaged. )
- If neither block shows damage but the data in the blocks differ, then replace the data in the first block with the data in the second block. ( Undo the operation. )
- Because the sequence of operations described above is slow, stable storage usually includes NVRAM as a cache, and declares a write operation complete once it has been written to the NVRAM.
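- The write and recovery rules above can be expressed compactly in code. The following is only a sketch of the idea, not any real system's implementation: the block layout, the checksum( ) routine, and the disk_read( ) / disk_write( ) primitives are hypothetical placeholders.

#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define BLOCK_SIZE 512

typedef struct {
    uint8_t  data[BLOCK_SIZE];
    uint32_t checksum;                 /* lets recovery detect a garbled block */
} block_t;

/* hypothetical low-level primitives */
bool disk_write(int copy, const block_t *b);    /* may fail or garble the block */
bool disk_read (int copy, block_t *b);
uint32_t checksum(const uint8_t *p, size_t n);  /* e.g. a CRC */

/* stable write: copy 0 first, then copy 1, and only then declare success */
bool stable_write(const uint8_t *data)
{
    block_t b;
    memcpy(b.data, data, BLOCK_SIZE);
    b.checksum = checksum(b.data, BLOCK_SIZE);

    if (!disk_write(0, &b)) return false;       /* step 1: first physical block  */
    if (!disk_write(1, &b)) return false;       /* step 2: second physical block */
    return true;                                /* step 3: operation complete    */
}

/* recovery after a failure: make the two copies consistent again */
void stable_recover(void)
{
    block_t b0, b1;
    disk_read(0, &b0);
    disk_read(1, &b1);

    bool ok0 = (b0.checksum == checksum(b0.data, BLOCK_SIZE));
    bool ok1 = (b1.checksum == checksum(b1.data, BLOCK_SIZE));

    if (ok0 && ok1) {
        if (memcmp(b0.data, b1.data, BLOCK_SIZE) != 0)
            disk_write(0, &b1);     /* both readable but different: undo by copying
                                       the second block over the first            */
    } else if (ok1) {
        disk_write(0, &b1);         /* first copy damaged: restore from the second */
    } else if (ok0) {
        disk_write(1, &b0);         /* second copy damaged: restore from the first */
    }
    /* if both copies are damaged, the failure modes were not independent enough */
}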
10.9 Summary
Was 12.9 Tertiary-Storage Structure - Optional, Omitted from Ninth Edition
- Primary storage refers to computer memory chips; secondary storage refers to fixed-disk storage systems ( hard drives ); and tertiary storage refers to removable media, such as tape drives, CDs, DVDs, and to a lesser extent floppies, thumb drives, and other detachable devices.
- Tertiary storage is typically characterized by large capacity, low cost per MB, and slow access times, although there are exceptions in any of these categories.
- Tertiary storage is typically used for backups and for long-term archival storage of completed work. Another common use for tertiary storage is to swap large little-used files ( or groups of files ) off of the hard drive, and then swap them back in as needed, in a fashion similar to secondary storage providing swap space for primary storage.
12.9.1 Tertiary-Storage Devices
12.9.1.1 Removable Disks
- Removable magnetic disks ( e.g. floppies ) can be nearly as fast as hard drives, but are at greater risk for damage due to scratches. Variations of removable magnetic disks up to a GB or more in capacity have been developed. ( Hot-swappable hard drives? )
- A magneto-optical disk uses a magnetic disk covered in a clear plastic coating that protects the surface.
- The heads sit a considerable distance away from the magnetic surface, and as a result do not have enough magnetic strength to switch bits at normal room temperature.
- For writing, a laser is used to heat up a specific spot on the disk, to a temperature at which the weak magnetic field of the write head is able to flip the bits.
- For reading, a laser is shined at the disk, and the Kerr effect causes the polarization of the light to become rotated either clockwise or counter-clockwise depending on the orientation of the magnetic field.
- Optical disks do not use magnetism at all, but instead use special materials that can be altered ( by lasers ) to have relatively light or dark spots.
- For example the phase-change disk has a material that can be frozen into either a crystalline or an amorphous state, the latter of which is less transparent and reflects less light when a laser is bounced off a reflective surface under the material.
- Three powers of lasers are used with phase-change disks: (1) a low-power laser reads the disk, without affecting the materials. (2) A medium power erases the disk, by melting and re-freezing the medium into a crystalline state, and (3) a high power writes to the disk by melting the medium and re-freezing it into the amorphous state.
- The most common examples of these disks are re-writable CD-RWs and DVD-RWs.
- An alternative to the disks described above are Write-Once Read-Many, WORM drives.
- The original version of WORM drives involved a thin layer of aluminum sandwiched between two protective layers of glass or plastic.
- Holes were burned in the aluminum to write bits.
- Because the holes could not be filled back in, there was no way to re-write to the disk. ( Although data could be erased by burning more holes. )
- WORM drives have important legal ramifications for data that must be stored for a very long time and must be provable in court as unaltered since it was originally written. ( Such as long-term storage of medical records. )
- Modern CD-R and DVD-R disks are examples of WORM drives that use organic polymer inks instead of an aluminum layer.
- Read-only disks are similar to WORM disks, except the bits are pressed onto the disk at the factory, rather than being burned on one by one.
12.9.1.2 Tapes
- Tape drives typically cost more than disk drives, but the cost per MB of the tapes themselves is lower.
- Tapes are typically used today for backups, and for enormous volumes of data stored by certain scientific establishments. ( E.g. NASA's archive of space probe and satellite imagery, which is currently being downloaded from numerous sources faster than anyone can actually look at it. )
- Robotic tape changers move tapes from drives to archival tape libraries upon demand.
- ( Never underestimate the bandwidth of a station wagon full of tapes rolling down the highway! )
12.9.1.3 Future Technology
- Solid State Disks, SSDs, are becoming more and more popular.
- Holographic storage uses laser light to store images in a 3-D structure, and the entire data structure can be transferred in a single flash of laser light.
- Micro-Electronic Mechanical Systems, MEMS, employs the technology used for computer chip fabrication to create VERY tiny little machines. One example packs 10,000 read-write heads within a square centimeter of space, and as media are passed over it, all 10,000 heads can read data in parallel.
12.9.2 Operating-System Support
- The OS must provide support for tertiary storage as removable media, including the support to transfer data between different systems.
12.9.2.1 Application Interface
- File systems are typically not stored on tapes. ( It might be technically possible, but it is impractical. )
- Tapes are also not low-level formatted, and do not use fixed-length blocks. Rather data is written to tapes in variable length blocks as needed.
- Tapes are normally accessed as raw devices, requiring each application to determine how the data is to be stored and read back. Issues such as header contents and ASCII versus binary encoding ( and byte-ordering ) are generally application specific.
- Basic operations supported for tapes include locate( ), read( ), write( ), and read_position( ).
- ( Because of variable length writes ), writing to a tape erases all data that follows that point on the tape.
- Writing to a tape places the End of Tape ( EOT ) marker at the end of the data written.
- It is not possible to locate( ) to any spot past the EOT marker.
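- The effect of these rules can be modeled with a small sketch. The tape_t structure and the function names below are invented for illustration; they are not a real driver interface.

#include <stddef.h>

#define MAX_BLOCKS 1024

/* toy model of a tape: a sequence of variable-length logical blocks */
typedef struct {
    long blocks;                 /* number of blocks before the EOT marker */
    long position;               /* current logical block number           */
    long length[MAX_BLOCKS];     /* length of each block, in bytes         */
} tape_t;

/* locate(): seek to a logical block, but never past the EOT marker */
int tape_locate(tape_t *t, long block)
{
    if (block > t->blocks)
        return -1;                          /* cannot locate past EOT */
    t->position = block;
    return 0;
}

/* read_position(): report the current logical block number */
long tape_read_position(const tape_t *t)
{
    return t->position;
}

/* write(): stores a block at the current position; because block lengths vary,
   everything that used to follow this point is implicitly lost, and the EOT
   marker moves to just after the block that was written                        */
int tape_write(tape_t *t, const void *buf, long len)
{
    (void) buf;                  /* a real driver would transfer the data here */
    if (t->position >= MAX_BLOCKS)
        return -1;
    t->length[t->position] = len;
    t->position += 1;
    t->blocks = t->position;     /* EOT now directly follows the written block */
    return 0;
}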
12.9.2.2 File Naming
- File naming conventions for removable media are not standardized, nor are they necessarily consistent between different systems. ( Two removable disks may contain files with the same name, and there is no clear way for the naming system to distinguish between them. )
- Fortunately music CDs have a common format, readable by all systems. Data CDs and DVDs have only a few format choices, making it easy for a system to support all known formats.
12.9.2.3 Hierarchical Storage Management
- Hierarchical storage involves extending file systems out onto tertiary storage, swapping files from hard drives to tapes in much the same manner as data blocks are swapped from memory to hard drives.
- A placeholder is generally left on the hard drive, storing information about the particular tape ( or other removable media ) to which the file has been swapped out.
- A robotic system transfers data to and from tertiary storage as needed, generally automatically upon demand of the file(s) involved.
12.9.3 Performance Issues
12.9.3.1 Speed
- Sustained Bandwidth is the rate of data transfer during a large file transfer, once the proper tape is loaded and the file located.
- Effective Bandwidth is the effective overall rate of data transfer, including any overhead necessary to load the proper tape and find the file on the tape.
- Access Latency is all of the accumulated waiting time before a file can be actually read from tape. This includes the time it takes to find the file on the tape, the time to load the tape from the tape library, and the time spent waiting in the queue for the tape drive to become available.
- Clearly tertiary storage access is much slower than secondary access, although removable disks ( e.g. a CD jukebox ) have somewhat faster access than a tape library.
12.9.3.2 Reliability
- Fixed hard drives are generally more reliable than removable drives, because they are less susceptible to the environment.
- Optical disks are generally more reliable than magnetic media.
- A fixed hard drive crash can destroy all data, whereas an optical drive or tape drive failure will often not harm the data media, ( and certainly can't damage any media not in the drive at the time of the failure. )
- Tape drives are mechanical devices, and can wear out tapes over time, ( as the tape head is generally in much closer physical contact with the tape than disk heads are with platters. )
- Some drives may only be able to read tapes a few times whereas other drives may be able to re-use the same tapes millions of times.
- Backup tapes should be read after writing, to verify that the backup tape is readable. ( Unfortunately that may have been the LAST time that particular tape was readable, and the only way to be sure is to read it again, . . . )
- Long-term tape storage can cause degradation, as magnetic fields "drift" from one layer of tape to the adjacent layers. Periodic fast-forwarding and rewinding of tapes can help, by changing which section of tape lays against which other layers.
12.9.3.3 Cost
- The cost per megabyte for removable media is its strongest selling feature, particularly as the amount of storage involved ( i.e. the number of tapes, CDs, etc ) increases.
- However the cost per megabyte for hard drives has dropped more rapidly over the years than the cost of removable media, such that the currently most cost-effective backup solution for many systems is simply an additional ( external ) hard drive.
- ( One good use for old unwanted PCs is to put them on a network as a backup server and/or print server. The downside to this backup solution is that the backups are stored on-site with the original data, and a fire, flood, or burglary could wipe out both the original data and the backups. )
Old Figure 12.15 - Price per megabyte of DRAM, from 1981 to 2008
Old Figure 12.16 - Price per megabyte of magnetic hard disk, from 1981 to 2008.
Old Figure 12.17 - Price per megabyte of a tape drive, from 1984 to 2008.
8.2 AES fundamentals
The AES (Application Environment Services) forms the highest level of GEM. It deals with all those parts of GEM that go beyond elementary graphic output and input functions. As the AES works exclusively with VDI and GEMDOS calls, it is completely independent of the graphic hardware, of input devices and of file-systems.
The AES manages two types of user programs: Normal GEM applications with file extensions '.PRG', '.APP' or '.GTP', and desk accessories with file extensions '.ACC'.
Unless you are using a multitasking operating system such as MagiC, MiNT or MultiTOS, the AES can only run one application and six desk accessories at a time. Desk accessories (with an '.ACC' extension) allow quasi-multitasking even with plain TOS: They are usually special GEM programs loaded (wholly or partially) at boot-up from the root directory of the boot drive (normally C:\), which remain in memory and can be called at any time from GEM (and some TOS) programs by clicking on their entry in the first drop/pulldown menu. In other words, desk accessories can be called and used while another application is running and has its window(s) open, even with a single-tasking operating system such as TOS. Note that this is not real multi-tasking, as the main application is suspended while the accessory is executing and only resumes when the accessory is exited.
Unlike applications, desk accessories don't interact with the user immediately; most just initialize themselves and enter a message loop awaiting an AC_OPEN message. Some wait for timer events or special messages from other applications. Once triggered, they usually open a window where a user may interact with them. Under TOS, accessories should not use a menu bar and should never exit after a menu_register call. Loading of any resources should happen before the accessory calls menu_register, and these resources should be embedded in the desk accessory rather than being post-loaded, as on TOS versions earlier than 2.06 memory allocated to a desk accessory is not freed at a resolution change; thus memory allocated with rsrc_load is lost to the system after a change of resolution with the early TOS's.
When a desk accessory is closed under TOS, it is sent an AC_CLOSE message by the system. Following this it should perform any required cleanups to release system resources and close files opened at AC_OPEN (the AES closes the accessory's windows automatically). It should then reenter the event loop and wait for a later AC_OPEN message.
The following points are covered in this section:
- Accessories
- Bindings of the AES
- The desktop window
- Data exchange via the GEM clipboard
- Messages
- AES object structure
- Quarter-screen buffer
- Rectangle-list of a window
- Screen-manager
- Toolbar support
For the AES too there have been some interesting developments, as various programmers have meanwhile announced their own AES clones; at present one can mention projects such as N.AES and XaAES. Besides continued evolution, one may also hope that the source code of these GEM components will become available.
See also: Style guidelines
8.2.1 Accessories
8.2.1.1 Startup-code for accessories
To test whether an application was launched as a program or as a desk accessory, one can proceed as follows:
- If the register a0 has the value zero at program startup, then we are dealing with a normal program launch.
- Otherwise we are dealing with a desk accessory, and register a0 contains a pointer to the (incompletely) filled BASEPAGE. The TPA has already been shrunk appropriately (to the sum of basepage size and the length of the three program segments), but a stack still has to be created.
Note: With this information there is no problem in creating the start-up code for a program in such a way that it recognizes automatically how the application was launched, and continues the initialization appropriately. With most C compilers the external variable _app is initialized automatically by the startup code; it has the value 0 when the application was launched as a desk accessory. This makes it possible to develop applications so that they may be launched either as desk accessories or as normal programs.
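A minimal sketch of such a dual-use main() is shown below. It assumes a Pure-C-style binding in which <aes.h> declares appl_init(), appl_exit(), menu_register() and evnt_mesag(), and a startup code that provides _app; the menu title and the window handling are placeholders.

#include <aes.h>                 /* appl_init, appl_exit, menu_register, evnt_mesag */

extern int _app;                 /* from the startup code: 1 = program, 0 = accessory */

static int16_t msg[8];           /* message buffer for evnt_mesag */

int main(void)
{
    int16_t ap_id = appl_init();

    if (_app)
    {
        /* launched as a normal program: do the work, then terminate */
        /* ... open resources, run the event loop ... */
        appl_exit();
        return 0;
    }

    /* launched as a desk accessory: register a menu entry and never terminate */
    menu_register(ap_id, "  My Accessory");
    for (;;)
    {
        evnt_mesag(msg);
        if (msg[0] == AC_OPEN)
        {
            /* open a window and interact with the user */
        }
        else if (msg[0] == AC_CLOSE)
        {
            /* release system resources, close files opened at AC_OPEN */
        }
    }
}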
See also:
About the AES Accessories in MagiC Program launch and TPA
8.2.1.2 Accessories in MagiC
Under MagiC, desk accessories are almost equal to programs. Their windows are maintained at program changes. They may have menus and desktop backgrounds, post-load programs, allocate memory, open/close/delete/copy files, etc.
As there is no longer any reason to close windows at program changes, there is no AC_CLOSE message any more. The system does not distinguish desk accessories from programs, apart from the fact that they may not terminate themselves. As under GEM/2, accessories can also deregister themselves in the menu, using the AES call menu_unregister.
In place of accessories, under MagiC it is more sensible to use applications that simply register one menu bar with one menu, and lie in the APP autostart folder. These applications can be loaded when required, and also removed again afterwards.
Note: As of MagiC 4, desk accessories can be loaded also while the system is running (not just at boot-up). Furthermore accessories can be unloaded by clicking on the corresponding accessory entry in the first menu while the [Control] key is held down. One disadvantage is that at present accessories may not perform Pexec with mode 104.
See also:
About the AES GEM Startup-code for accessories shel_write
8.2.2 The desktop window
Of the available windows, the desktop or background window plays a special role. It has the ID 0, occupies the whole screen area, is always open and also cannot be closed. The working area is the area below the menu bar. Only in this working area can other programs output to the screen or open their own windows.
Normally the working area of the desktop appears as a solid green area (in colour resolutions) or as a grey raster pattern (in monochrome operation). The screen-manager attends to the screen redraws all on its own; with a call of wind_set, other application programs can anchor any other object tree as a background. In that case too the screen-manager looks after any required redraws of sections of the image. Although this possibility is very alluring, there are several reasons that point against the use of the desktop window; the most important:
- Even under a multitasking-capable GEM (MagiC or MultiTOS), there can be only one screen background. This should be reserved for the program that can make the most use of it - as a rule this is the desktop or a desktop replacement such as the Gemini shell, Thing or Jinnee for instance.
To sum up: If possible, the desktop background should not be used in your own programs.
See also: About the AES wind_set WF_NEWDESK
8.2.3 Data exchange via the GEM clipboard
To store files in the clipboard, one should proceed as follows:
- Delete all clipboard files that match the mask 'scrap.*' and 'SCRAP.*'. Note: The mask 'SCRAP.*' must be allowed for because old programs knew nothing of alternative and case-sensitive file-systems.
- Save the data to be stored in one or several formats.
- Send the message SC_CHANGED to all applications in the system and SH_WDRAW to the system shell.
The filename is always 'scrap.'; the extension (suffix) depends on the format selected. If possible one should always support one of the following standard formats:
Suffix | Meaning
gem    | Vector graphics in metafile format
img    | Pixel images in XIMG format
txt    | ASCII text file, each line terminated with CR/LF
In addition one can support one or more of the following formats (the receiver then has the option of using the variant with the greatest amount of information):
Suffix | Meaning
asc    | ASCII text file, each paragraph terminated with CR/LF
csv    | ASCII file with comma-separated numbers
cvg    | Calamus vector graphic format
dif    | Export file of spreadsheets
eps    | Encapsulated PostScript
1wp    | Wordplus format
rtf    | Microsoft Rich Text Format
tex    | TeX
The receiving program should first check which of the available files contains the most information, and then use this file.
Important: Each of the files in the clipboard contains the same information on principle, just in different formats. The text processor Papyrus, for instance, imports 'scrap.rtf' only if its own format 'scrap.pap' could not be found.
It should be clear from the above explanation that only one data object (though in different formats) can be present in the clipboard at any time.
Note: A few old programs, such as First Word and First Word Plus, are promiscuous and the clipboards they create automatically are scattered all over the place - usually the directory they were launched from. Some other applications may then use this clipboard rather than the 'official' one on the boot drive!
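A sketch of the writing side could look as follows. Only the documented AES call scrp_read() (which returns the clipboard path) and standard C file functions are used; delete_matching() is a hypothetical helper (on TOS one would implement it with Fsfirst/Fsnext and Fdelete), and the SC_CHANGED broadcast is omitted because it depends on the AES version.

#include <stdio.h>
#include <string.h>
#include <aes.h>                           /* scrp_read() */

/* hypothetical helper: delete all files in 'path' matching 'mask' */
extern void delete_matching(const char *path, const char *mask);

/* store a plain-text snippet in the GEM clipboard (sketch only) */
int clipboard_put_text(const char *text)
{
    char path[128], name[160];
    FILE *f;

    if (scrp_read(path) == 0 || path[0] == '\0')
        return -1;                         /* no clipboard directory configured */

    /* step 1: remove any previous clipboard contents */
    delete_matching(path, "scrap.*");
    delete_matching(path, "SCRAP.*");

    /* step 2: save the data in one (or more) of the standard formats;
       note that a trailing '\' in 'path' would have to be handled here */
    sprintf(name, "%s\\scrap.txt", path);
    if ((f = fopen(name, "w")) == NULL)
        return -1;
    fputs(text, f);
    fclose(f);

    /* step 3 (omitted): send SC_CHANGED to all applications and SH_WDRAW
       to the system shell, e.g. via appl_search() and appl_write()       */
    return 0;
}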
See also: Clipboard functions scrp_clear Style guidelines
8.2.4 The object structure
Although the data structure of the object tree is not a tree in the sense of a binary tree, it nevertheless possesses the logical chaining of a tree, with predecessors and successors (generally called 'parents' and 'children' respectively). The specification of parents and children is made via indices into an array.
The tree structure of the individual objects can be illustrated best with a simple example: A menu is composed at first of the menu bar. This in turn contains several title texts. The title texts therefore are contained directly in the menu bar and are all children of the object 'menu bar', so they lie on the same hierarchical level. The object menu bar refers with ob_head to the first menu title and with ob_tail to the last menu title. In the first menu title the entry ob_next serves for addressing the next menu title. Thus the chaining shows the following structure:
Menu bar:
+---------+---------+--------+
| ob_head | ob_tail | ... |
| o | o | |
+----|----+----|----+--------+
| +-------------------------+
V V
+---------+---------+--------+ +---------+---------+--------+
| ... | ob_next | ... | ... | ... | ... | ... |
| | o | | | | | |
+---------+----|----+--------+ +---------+---------+--------+
1st menu title | n-th menu title
+-----> 2nd menu title
The actions that may be performed with a given object are specified in ob_flags. The state of an object is held in the entry ob_state. The entry ob_type determines the object type.
For an exact definition some objects need an additional data structure such as TEDINFO or BITBLK. In that case a pointer to this additional structure will be stored in ob_spec.
Summarising again, the total layout of the data structure for objects (OBJECT):
+-------------+
| ob_next | Index for the next object
+-------------+
| ob_head | Index of the first child
+-------------+
| ob_tail | Index of the last child
+-------------+
| ob_type | Object type
+-------------+
| ob_flags | Manipulation flags
+-------------+
| ob_state | Object status
+-------------+
| ob_spec | See under object type
+-------------+
| ob_x | Relative X-coordinate to parent object
+-------------+
| ob_y | Relative Y-coordinate to parent object
+-------------+
| ob_width | Width of the object
+-------------+
| ob_height | Height of the object
+-------------+
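In C the same layout is usually declared roughly as follows; the exact field types (and whether ob_spec is a union, a pointer or a plain 32-bit value) vary between compilers and bindings, so treat this as a sketch:

#include <stdint.h>

typedef struct object
{
    int16_t  ob_next;     /* index of the next object on the same level (-1: none) */
    int16_t  ob_head;     /* index of the first child (-1: none)                   */
    int16_t  ob_tail;     /* index of the last child (-1: none)                    */
    uint16_t ob_type;     /* object type, e.g. G_BOX, G_STRING                     */
    uint16_t ob_flags;    /* manipulation flags, e.g. SELECTABLE, EXIT             */
    uint16_t ob_state;    /* object status, e.g. SELECTED, DISABLED                */
    int32_t  ob_spec;     /* pointer or packed value, depending on ob_type         */
    int16_t  ob_x;        /* X-coordinate relative to the parent object            */
    int16_t  ob_y;        /* Y-coordinate relative to the parent object            */
    int16_t  ob_width;    /* width of the object                                   */
    int16_t  ob_height;   /* height of the object                                  */
} OBJECT;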
See also:
AES object colours Object types Manipulation flags Object status
8.2.4.1 AES object types
The following types of object are available for selection:
Type | Meaning
G_BOX (20) | Rectangular box with optional border; ob_spec contains sundry information about border width, colour and similar matters
G_TEXT (21) | Formatted graphic text; ob_spec points to a TEDINFO structure
G_BOXTEXT (22) | Rectangular box with formatted graphic text; ob_spec points to a TEDINFO structure
G_IMAGE (23) | Monochrome image; ob_spec points to a BITBLK structure
G_USERDEF (24) | User-defined function for drawing a customized object; ob_spec points to a USERBLK structure. (Note: In some libraries this is called G_PROGDEF for a programmer-defined function)
G_IBOX (25) | Transparent rectangle that can only be seen if the optional border does not have zero width; ob_spec contains further information about the appearance
G_BUTTON (26) | Push-button text with border for option selection; ob_spec points to a text string with the text that is to appear in the button. New as of MagiC Version 3.0: If the object flag WHITEBAK is set, and bit 15 in the object status = 0, then the button will contain an underscored character; for this, (high byte & 0xf) of ob_state gives the desired position of the underscore (with a suitable library one can make the underscored character, when pressed together with the [Alternate] key, select the button in the dialog of a running application). On the other hand if bit 15 = 1 then we are dealing with a special button (radio-button or checkbox). Further specialties: WHITEBAK = 1, bit 15 = 1 and in ob_state (here again (high byte & 0xf) of ob_spec is the underscore position). The presence of these features is best established via the function appl_getinfo (opcode 13).
G_BOXCHAR (27) | Rectangle containing a character; in ob_spec both the appearance of the border and the character are defined
G_STRING (28) | Character string; ob_spec points to the string. New as of MagiC Version 3.0: If the object flag WHITEBAK is set, and the high byte of ob_state != -1, then a character of the string will be underscored; the underscore position is determined by (high byte & 0xf) of ob_state. With the WHITEBAK flag set and the high byte of ob_state = -1 the complete string will be underscored. The presence of these features is best established via the function appl_getinfo (opcode 13).
G_FTEXT (29) | Editable formatted graphic text; ob_spec points to a TEDINFO structure
G_FBOXTEXT (30) | Rectangle with editable formatted graphic text; ob_spec points to a TEDINFO structure
G_ICON (31) | Monochrome icon symbol with mask; ob_spec points to the ICONBLK structure
G_TITLE (32) | Title of a drop-down menu; ob_spec points to the string. As of MagiC 2 one can also underscore one of the characters. This is done as follows: Set WHITEBAK in ob_state
G_CICON (33) | Colour icon (available as of AES V3.3); ob_spec points to the CICONBLK structure
G_CLRICN (33) | Colour icon; ob_spec is a pointer to an ICONBLK structure. Supported in the ViewMAX/3 beta and in FreeGEM.
G_SWBUTTON (34) | Cycle button (i.e. a button which alters its text cyclically when clicked on); ob_spec points to a SWINFO structure. The presence of this object type should be inquired with appl_getinfo (opcode 13).
G_DTMFDB (34) | For internal AES use only: desktop image. The ob_spec is a far pointer to an MFDB structure. Supported in the ViewMAX/3 beta and in FreeGEM.
G_POPUP (35) | Popup menu; ob_spec points to a POPINFO structure. If the menu has more than 16 entries, then it can be scrolled. The presence of this object type should be inquired with appl_getinfo (opcode 13). Note: G_POPUP looks like G_BUTTON but the character string is not centred, so as to line up with the other character strings in the menu if possible.
G_WINTITLE (36) | This object number is used internally by MagiC to depict window titles. The construction of this object type may change at any time and is therefore not documented.
G_EDIT (37) | As of MagiC 5.20 an editable object implemented in a shared library is available; ob_spec points to the object. Warning: This type is not yet supported by the functions form_do, form_xdo, form_button, form_keybd, objc_edit, wdlg_evnt and wdlg_do at present, i.e. the corresponding events need to be passed on to the object (with edit_evnt).
G_SHORTCUT (38) | This type is treated in a similar way to G_STRING, but any keyboard shortcut present is split off and output ranged right. The presence of this object type should be inquired for with appl_getinfo (opcode 13). The introduction of proportional AES fonts required a new strategy for the alignment of menu entries. So as to align keyboard shortcuts ranged right, objects of the type G_STRING inside a menu are therefore split into commands and shortcuts. This strategy however fails for menus that are managed by the program itself, e.g. within a window or a popup menu. This new object type had to be introduced in order to achieve usable alignment in that case too.
G_SLIST (39) | XaAES extended object - scrolling list.
Note: For G_BOX, G_IBOX and G_BOXCHAR, the component ob_spec of the OBJECT structure does not point to another data structure, but contains further information for the appearance of the object. The following apply:
Bits   | Meaning
24..31 | Character to be depicted (only for G_BOXCHAR)
16..23 |
12..15 | Border colour (0..15)
08..11 | Text colour (0..15)
7      | Text transparent (0) or opaque (1)
04..06 |
00..03 | Inner colour (0..15)
The high byte is used by the AES only for submenus. If the highest bit of ob_type is set (0x8000) and the bit SUBMENU in ob_flags is set, then bits 8..14 specify which submenu is coupled with the menu entry. Hence each application can have a maximum of 128 submenus. Apart from the submenu handling, MagiC only reads the low byte of ob_type. TOS reacts cleanly to unknown object types (such as the purely MagiC types G_SWBUTTON etc.), i.e. the objects are simply not drawn.
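The packed value can be taken apart with a few macros, for example. The fields follow the table above; the two entries left blank there (bits 16..23 and 04..06) are assumed here to be the border thickness and the fill pattern, so treat those two macros as an assumption:

/* unpacking the ob_spec value of a G_BOX / G_IBOX / G_BOXCHAR object */
#define BOX_CHAR(spec)        (((spec) >> 24) & 0xffL)                  /* G_BOXCHAR character         */
#define BOX_THICKNESS(spec)   ((signed char)(((spec) >> 16) & 0xffL))   /* assumed: border thickness   */
#define BOX_BORDER_COL(spec)  (((spec) >> 12) & 0x0fL)                  /* border colour (0..15)       */
#define BOX_TEXT_COL(spec)    (((spec) >>  8) & 0x0fL)                  /* text colour (0..15)         */
#define BOX_OPAQUE(spec)      (((spec) >>  7) & 0x01L)                  /* 0 = transparent, 1 = opaque */
#define BOX_FILL(spec)        (((spec) >>  4) & 0x07L)                  /* assumed: fill pattern       */
#define BOX_INNER_COL(spec)   ((spec) & 0x0fL)                          /* inner colour (0..15)        */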
See also: Object structure in AES AES object colours
8.2.4.2 AES object colours
The following table contains the predefined object colours. Of course particulars depend on the selected screen resolution, as well as any settings made by the user.
Number        | Colour       | Standard RGB values
WHITE (00)    | White        | 1000, 1000, 1000
BLACK (01)    | Black        | 0, 0, 0
RED (02)      | Red          | 1000, 0, 0
GREEN (03)    | Green        | 0, 1000, 0
BLUE (04)     | Blue         | 0, 0, 1000
CYAN (05)     | Cyan         | 0, 1000, 1000
YELLOW (06)   | Yellow       | 1000, 1000, 0
MAGENTA (07)  | Magenta      | 1000, 0, 1000
DWHITE (08)   | Light grey   | 752, 752, 752
DBLACK (09)   | Dark grey    | 501, 501, 501
DRED (10)     | Dark red     | 713, 0, 0
DGREEN (11)   | Dark green   | 0, 713, 0
DBLUE (12)    | Dark blue    | 0, 0, 713
DCYAN (13)    | Dark cyan    | 0, 713, 713
DYELLOW (14)  | Dark yellow  | 713, 713, 0
DMAGENTA (15) | Dark magenta | 713, 0, 713
Note: These colours also correspond mostly to the icon colours used under Windows and OS/2. With a suitable CPX module one can set the correct RGB values for the first 16 colours.
See also: Object structure in AES AES object types
8.2.4.3 AES object flags
The manipulation flags of an object determine its properties. The following options can be chosen:
Flag | Meaning
NONE (0x0000) | No properties.
SELECTABLE (0x0001) | The object is selectable by clicking on it with the mouse.
DEFAULT (0x0002) | If the user presses the [Return] or [Enter] key, this object will be selected automatically and the dialog exited; the object will have a thicker outline. This flag is permitted only once in each tree.
EXIT (0x0004) | Clicking on such an object and releasing the mouse button while still over it will terminate the dialog (see also form_do).
EDITABLE (0x0008) | This object may be edited by the user by means of the keyboard.
RBUTTON (0x0010) | If several objects in an object tree have the property RBUTTON (radio button, similar to those on a push-button radio), then only one of these objects can be in a selected state at a time. These objects should all be children of a parent object with the object type G_IBOX. If another object of this group is selected, then the previously selected object will be deselected automatically.
LASTOB (0x0020) | This flag tells the AES that this is the last object within an object tree.
TOUCHEXIT (0x0040) | The dialog (see also form_do) will be exited as soon as the mouse pointer lies above this object and the left mouse button is pressed.
HIDETREE (0x0080) | The object and its children will no longer be noticed by objc_draw and objc_find as soon as this flag is set. Furthermore this flag is also evaluated as of MagiC 5.20 by form_keybd, if objects for keyboard shortcuts are searched for. Input to hidden objects is still possible, however. To prevent this, the EDITABLE flag has to be cleared.
INDIRECT (0x0100) | ob_spec now points to a further pointer that in turn points to the actual value of ob_spec (see also OBJECT). In this way the standard data structures such as TEDINFO etc. can be extended in a simple way.
FL3DIND (0x0200) | Under MultiTOS this object creates a three-dimensional object (under MagiC as of Version 3.0 only from 16-colour resolutions onwards and when the 3D effect has not been switched off). In 3D operation this will be interpreted as an indicator. As a rule, such objects are buttons that display a status, e.g. radio-buttons.
ESCCANCEL (0x0200) | Pressing the [Esc] key corresponds to the selection of the object with this flag. Therefore there may be only one default object in a dialog. Only effective in ViewMAX/2 and later.
FL3DBAK (0x0400) | In 3D operation this object will be treated as an AES background object, and drawn as such. It is recommended to allocate the ROOT object with this flag in dialogs with 3D buttons. The same applies for editable fields and text objects, as only in this way will a consistent background colour be maintained. See also FL3DBAK (0x4000).
BITBUTTON (0x0400) | This flag was introduced with ViewMAX beta, but not used there. Presumably a button with this flag contains a bitmap in place of a text. Only effective in ViewMAX/2 and later.
FL3DACT (0x0600) | In 3D operation this object will be treated as an activator. As a rule such objects are buttons with which one can exit dialogs or trigger some action.
SUBMENU (0x0800) | This is used in MultiTOS and from MagiC 5.10 on to mark submenus. menu_attach sets this bit in a menu entry to signify that a submenu is attached to it. The high byte of ob_type then contains the submenu index (128..255), i.e. bit 15 of ob_type is always set simultaneously with SUBMENU.
SCROLLER (0x0800) | Pressing the [PAGEUP] key corresponds to the selection of the first object with this flag in the dialog; pressing the [PAGEDOWN] key corresponds to the selection of the last object with this flag. Only effective in ViewMAX/2 and later.
FLAG3D (0x1000) | An object with this flag will be drawn with a 3D border. From ViewMAX/2 on every button will be drawn automatically with a 3D border. The colour category (see USECOLOURCAT) will be used for this. Only effective in ViewMAX/2 and later.
USECOLOURCAT (0x2000) | The colour of the object is not a colour index of the VDI, but an entry in a table with colours for designated categories. This table has 16 entries. Probably it is intended to let the categories 0 to 7 be defined by the application, while 8 to 15 are reserved for the system. The settings are stored in ViewMAX.INI (GEM.CFG in FreeGEM) and consist of one foreground, one background, a fill-style and a fill index in each case. Only effective in ViewMAX/2 and later.
FL3DBAK (0x4000) | 3D background (sunken rather than raised). To check for this feature, use appl_init and check that bit 3 of xbuf.abilities is set.
SUBMENU (0x8000) | Not implemented in any known PC AES.
See also: Object structure in AES AES object types
8.2.4.4 AES object stati
The object status determines how an object will be displayed later on the screen. An object status can be of the following type:
Status | Meaning
NORMAL (0x0000) | Normal representation.
SELECTED (0x0001) | Inverse representation, i.e. the object is selected (except for G_CICON, which will use its 'selected' image).
CROSSED (0x0002) | If the object type is BOX, the object will be drawn with a white diagonal cross over it (usually this state can be seen only over a selected or coloured object). See also below.
CHECKED (0x0004) | A checkmark tick will be displayed at the left edge of the object.
DISABLED (0x0008) | The object will be displayed greyed out and is no longer selectable.
OUTLINED (0x0010) | The object gets a border.
SHADOWED (0x0020) | A shadow is drawn under the object.
WHITEBAK (0x0040) | With PC-GEM this causes the icon mask not to be drawn with the icon, which can speed up output in some circumstances. As of MagiC 3 this controls the underscoring of character strings. This feature can be ascertained with appl_getinfo (opcode 13).
DRAW3D (0x0080) | An object is to be drawn with a 3D effect. This flag is of interest only for PC-GEM, and will be ignored by the Atari AES (and also in MagiC).
HIGHLIGHTED (0x0100) | An object with this status will be surrounded by a dashed line that is drawn with MD_XOR. This status was introduced with ViewMAX beta.
UNHIGHLIGHTED (0x0200) | An object with this status will be drawn with the surround explicitly set by the status HIGHLIGHTED removed. For this one has to proceed as follows: First the status HIGHLIGHTED must be cleared, then the status UNHIGHLIGHTED set, and following this the object must be redrawn with the function objc_draw. A redraw of the object without the status UNHIGHLIGHTED would not remove the surround, as it lies outside the area that the object occupies. After the redraw the status UNHIGHLIGHTED should be cleared again. This status was introduced with ViewMAX beta.
UNDERLINE (0x0f00) | This opcode is available in MagiC from Version 2.0 onwards, and sets the position and size of the underscore for objects of the type G_STRING, G_TITLE and G_BUTTON.
XSTATE (0xf000) | This opcode is available in MagiC from Version 2.0 onwards, and serves for switching between the various button types (G_STRING, G_TITLE and G_BUTTON).
In GEM/5, CROSSED makes the object draw in 3D:
- If an object is both CROSSED and SELECTABLE, then it is drawn as a checkbox.
- If it is CROSSED, SELECTABLE and an RBUTTON, it is drawn as a radio button.
- If it is a button or a box and it is CROSSED, then it is drawn as a raised 3D shape, similar to Motif.
- If a button is CROSSED and DEFAULT, a "Return key" symbol appears on it (rather like NEXTSTEP).
- Boxes and text fields that are CROSSED and CHECKED appear sunken.
GEM/5 can be detected by calling vqt_name for font 1. If nothing is returned, GEM/5 is running.
Recent FreeGEM builds contain a system based on the GEM/5 one, but extended and backwards-compatible. The DRAW3D state is used instead of CROSSED:
- If an object is both DRAW3D and SELECTABLE, then it is drawn as a checkbox.
- If it is DRAW3D, SELECTABLE and an RBUTTON, it is drawn as a radio button.
- If a button is DRAW3D and DEFAULT, a "Return key" symbol will be on it.
- If an object with a 3D border has the WHITEBAK state, then the 3D border will not have a black edge.
- If a radio button or checkbox has the WHITEBAK state, then it will be drawn with a white background rather than in the colour used by 3D objects.
To check for these abilities, use appl_init and check that bit 3 of xbuf.abilities is set.
See also: Object structure in AES AES object types
8.2.5 The quarter-screen buffer
The quarter-screen buffer is required by the screen-manager to save the contents of the menu background when drop-down menus drop down. The 'QSB' (the usual abbreviation) is also used for the display of alert boxes. Normally its size should depend on the number of colour planes and the size of the system font, but not on the total size of the screen.
A good formula would be:
500 (characters) * memory requirement of one character cell * number of colour planes
In 'ST High' resolution this would give a value of exactly 8000 (i.e. a quarter of the screen memory). Unfortunately in many cases the AES is not so clever. The following table contains a summary of the algorithm used by some GEM versions:
GEM version | Method for setting the QSB
1.0 and 1.2 | Static, 8000 bytes
1.4         | Dynamic, a quarter of the screen memory
3.0         | Dynamic, half of the screen memory
Note: The GEM versions 1.0 and 1.2 (i.e. up to and including TOS Version 1.02) are therefore not prepared for colour graphics cards - one of several reasons why even with the use of a special VDI driver under these GEM versions one cannot make use of colour graphics cards.
See also: GEM
8.2.6 The rectangle-list of a window
To overcome the problem of windows that overlap each other, the AES holds for each window a so-called rectangle-list; when a window is partially obscured, GEM divides the visible portion of that window into the least possible number of non-overlapping rectangles, the details of which are then stored in the rectangle-list. Thus the elements of this list form a record of the currently completely visible working area of the corresponding window.
To redraw a window (or its contents) one first inquires with the function wind_get(WF_FIRSTXYWH) for the first rectangle of the abovementioned list. Then one checks whether this rectangle overlaps with the screen area to be redrawn; then and only then one can redraw this area with the use of vs_clip.
This method will be continued with all remaining elements of the rectangle-list, until the height and the width of a rectangle have the value zero.
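Such a redraw loop might look like the sketch below. The AES and VDI calls are the documented ones; the window handle, the VDI handle and the application's draw_area() routine are placeholders.

#include <aes.h>
#include <vdi.h>

/* hypothetical application routine that redraws the given screen rectangle */
extern void draw_area(int16_t vh, int16_t x, int16_t y, int16_t w, int16_t h);

/* redraw the part of window 'handle' that intersects the dirty rectangle */
void redraw_window(int16_t handle, int16_t vh,
                   int16_t dx, int16_t dy, int16_t dw, int16_t dh)
{
    int16_t x, y, w, h, pxy[4];

    wind_update(BEG_UPDATE);                      /* lock the screen against the AES */
    wind_get(handle, WF_FIRSTXYWH, &x, &y, &w, &h);
    while (w != 0 || h != 0)
    {
        /* intersect the visible rectangle with the area to be redrawn */
        int16_t x1 = (x > dx) ? x : dx;
        int16_t y1 = (y > dy) ? y : dy;
        int16_t x2 = ((x + w < dx + dw) ? x + w : dx + dw) - 1;
        int16_t y2 = ((y + h < dy + dh) ? y + h : dy + dh) - 1;

        if (x1 <= x2 && y1 <= y2)
        {
            pxy[0] = x1;  pxy[1] = y1;  pxy[2] = x2;  pxy[3] = y2;
            vs_clip(vh, 1, pxy);                  /* clip output to this rectangle */
            draw_area(vh, x1, y1, x2 - x1 + 1, y2 - y1 + 1);
        }
        wind_get(handle, WF_NEXTXYWH, &x, &y, &w, &h);
    }
    vs_clip(vh, 0, pxy);                          /* switch clipping off again */
    wind_update(END_UPDATE);
}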
See also: Clipping WM_REDRAW wind_get wind_update
8.2.7 The screen-manager
The screen-manager is always active and supervises the position of the mouse pointer whenever it leaves the working areas of the applications' windows. The areas in question are the frames of the windows, the drop-down menus and the menu bar.
When touching the menu area, the screen-manager automatically ensures that the section of the screen occupied by the menu is saved and later restored again (the quarter-screen buffer is used for this).
Manipulation of the window controls also does not lead to permanent changes of the screen memory; the results of the interaction with the screen-manager are the so-called message events, which inform the relevant application about the user's actions.
Note: The ID of the screen-manager can, incidentally, be found easily by a call of appl_find("SCRENMGR").
See also: About the AES GEM Messages
8.2.8 Toolbar support
From AES version 4.1 onwards the operating system supports so-called toolbars. A toolbar is an OBJECT tree that is positioned below the information-line of a window (and above the working area) which makes it possible to display buttons, icons etc. in a window.
As already known from the window routines, the management of toolbars is shared between the AES and the application. Here the AES is responsible for the following actions:
- Adaptation of the X- and Y-coordinates of the toolbar when the window is moved or its size is changed.
- Ensuring that the window is configured to the size required by the window components and the toolbar.
- Adjustment of the toolbar's width to the width of the window.
- Redraw of the toolbar on receipt of a WM_REDRAW message.
- Sending of AES messages when the user activates an object of the toolbar.
The application, on the other hand, must look after the following:
- Construction of an OBJECT tree for the toolbar (in particular one has to ensure that all selectable elements of the toolbar have the TOUCHEXIT flag set).
- Adjustment of the width of a toolbar object if this depends on the width of the window (may be required when changing the size of the window).
- Handling of USERDEF objects.
- Redrawing all objects whose appearance is to be changed. In this case it is imperative that the rectangle-list of the toolbar is inquired for and/or taken into account.
- Problems that arise in connection with the screen resolution have to be solved. Thus, for instance, the height of an icon in the ST Medium resolution can differ from the height of the icon in the TT030 Medium resolution.
For supporting toolbars in your own programs, you should respect the following points:
- Communication with the AES window-manager
- Problems with wind_calc
- Redraw and updating of toolbars
- Toolbar support under XaAES
See also:
WF_TOOLBAR WF_FTOOLBAR WF_NTOOLBAR WM_TOOLBAR wind_get wind_set
8.2.8.1 Redraw and updating of toolbars
For redraws of (parts of) the toolbar, one has to take the rectangle-list into account as usual. As the existing wind_get opcodes WF_FIRSTXYWH and WF_NEXTXYWH only cover the working area of a window, however, two new parameters (WF_FTOOLBAR and WF_NTOOLBAR) were introduced, with whose help the rectangle-list of a toolbar can be interrogated.
A redraw of (parts of) the toolbar may be necessary in the following situations:
- The toolbar contains user-defined objects (USERDEF's).
- The status of an object in the toolbar was altered. The area to be redrawn here consists of the size of the object plus the space required for special effects (3D, shadowing, outlining etc.).
Redraw is not necessary in the following cases, for instance:
- The relevant window is iconified. The application does not have to take on any management of the toolbar; this is only required at the restoration of the iconification, the so-called uniconify.
- The toolbar present in the window is to be replaced by another one. In this case a call of wind_set with the opcode WF_TOOLBAR and the address of the new OBJECT tree will suffice.
- The toolbar present in the window is to be removed. In this case a call of wind_set with the opcode WF_TOOLBAR and NULL parameters will suffice.
See also: Rectangle-list of a window Toolbar support
8.2.8.2 Toolbars and the window-manager
For handling toolbars an application can have recourse to the window- manager of the AES. In detail:
For tacking on a toolbar to a window, it is sufficient to call wind_set(handle, WF_TOOLBAR, ...) with the address of the toolbar object tree. If this call is executed while the window is open, then it is itself responsible for the correct calculation of the height of the toolbar.
To exchange a toolbar for another one, one can have recourse to a call of wind_set(handle, WF_TOOLBAR, ...) with the address of the new toolbar. If this call is executed while the window is open, then it is itself responsible for the correct calculation of the height of the (new) toolbar.
To remove a toolbar from a window, it is necessary to call wind_set(handle, WF_TOOLBAR, ...) with NULL parameters. If this call is executed while the window is open, then it is itself responsible for the correct calculation of the height of the toolbar.
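In code, attaching, exchanging and removing a toolbar thus comes down to three wind_set calls. As wind_set only takes 16-bit parameters, the address of the OBJECT tree is assumed here to be passed split into a high and a low word (as with WF_NAME); this is a sketch only:

#include <aes.h>

/* pass a 32-bit pointer through two 16-bit wind_set parameters */
static void wind_set_ptr(int16_t handle, int16_t mode, void *ptr)
{
    int32_t v = (int32_t) ptr;
    wind_set(handle, mode, (int16_t)(v >> 16), (int16_t)(v & 0xffff), 0, 0);
}

/* attach (or exchange) a toolbar */
void set_toolbar(int16_t handle, OBJECT *toolbar)
{
    wind_set_ptr(handle, WF_TOOLBAR, toolbar);
}

/* remove the toolbar again: NULL parameters */
void remove_toolbar(int16_t handle)
{
    wind_set(handle, WF_TOOLBAR, 0, 0, 0, 0);
}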
In addition the following points have to be taken into consideration:
- If a window is closed with wind_close, then any toolbar present will not be removed. At a later reopening the toolbar will be in place once more.
- If a window is removed with wind_delete, then its link to a toolbar will be dissolved.
- To be able to recognize mouse-clicks on toolbar objects, these have to possess the flag TOUCHEXIT. When such an object is clicked on, the AES creates a WM_TOOLBAR message which is sent to the relevant application.
See also: AES GEM Toolbar support
8.2.8.3 Problems with wind_calc in toolbar windows
When applying the function wind_calc to windows that possess a toolbar there are several problems to be taken into account:
As this function is not passed a window ID (window handle), the desired sizes cannot be calculated correctly when a toolbar is present in the window. The reason for this is that, quite simply, the AES in this case has no information about the toolbar, and especially about its size.
Hence the values returned by wind_calc in such cases have to be further refined by the application. As the program can access the relevant OBJECT tree (and with this also the height of the toolbar), this should present no problems. In detail:
- When ascertaining the border areas of the window, the height of the toolbar must be added to the height returned by the function.
- When ascertaining the working area of the window, the height of the toolbar must be added to the Y-value (couty) returned by the function.
Note: Besides the height of the actual object, the height of the toolbar should include also the space requirement for special effects (3D, shadowing, etc.).
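Put into code, the corrections amount to something like the following; toolbar_height is the height of the toolbar's root object plus any space needed for special effects, and the assumption that the work area's height shrinks by the same amount is marked in the comments:

#include <aes.h>

/* work area of a window that carries a toolbar */
void work_area_with_toolbar(int16_t kind, int16_t toolbar_height,
                            int16_t bx, int16_t by, int16_t bw, int16_t bh,
                            int16_t *wx, int16_t *wy, int16_t *ww, int16_t *wh)
{
    wind_calc(WC_WORK, kind, bx, by, bw, bh, wx, wy, ww, wh);
    *wy += toolbar_height;       /* the work area begins below the toolbar         */
    *wh -= toolbar_height;       /* assumption: its height shrinks correspondingly */
}

/* outer (border) area needed for a desired work area plus a toolbar */
void border_area_with_toolbar(int16_t kind, int16_t toolbar_height,
                              int16_t wx, int16_t wy, int16_t ww, int16_t wh,
                              int16_t *bx, int16_t *by, int16_t *bw, int16_t *bh)
{
    wind_calc(WC_BORDER, kind, wx, wy, ww, wh, bx, by, bw, bh);
    *bh += toolbar_height;       /* the window must be taller by the toolbar height */
}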
See also: WF_FTOOLBAR WF_NTOOLBAR WM_TOOLBAR objc_sysvar
8.2.9 AES bindings
The AES is called via a single subprogram that is passed 6 parameters; these are addresses of various arrays that are used for input/output communications. To call an AES function, the following parameter block must be populated with the addresses of the arrays described below:
typedef struct
{
int16_t *cb_pcontrol; /* Pointer to control array */
int16_t *cb_pglobal; /* Pointer to global array */
int16_t *cb_pintin; /* Pointer to int_in array */
int16_t *cb_pintout; /* Pointer to int_out array */
int32_t *cb_padrin; /* Pointer to adr_in array */
int32_t *cb_padrout; /* Pointer to adr_out array */
} AESPB;
The address of this parameter block (which lies on the stack) must be entered in register d1, and subsequently register d0.w must be filled with the magic value 0xc8 (200). With a TRAP #2 system call the AES can then be called directly. For the Pure-Assembler this could look like this, for instance:
.EXPORT aes ; Export function
.CODE ; Start of the code-segment
aes: MOVE.L 4(sp),d1 ; Address of the parameter blocks
MOVE.W #200,d0 ; Opcode of the AES
TRAP #2 ; Call GEM
RTS ; And exit
.END ; End of the module
There is no clear information available about which registers may be altered. In fact, however, the corresponding routines in ROM save all registers.
Now to the individual arrays. Each array serves designated input or output purposes. The following apply:
int16_t control[5]  | With this field, information about the called function and its parameters can be determined. There is no clear information about which elements must be set before an AES call; it is required in each case for elements [0], [1] and [3]. It seems less sensible for the elements [2] and [4] - after all, the AES functions know how many values they return in the output fields.
int16_t global[15]  | This field contains global data for the application and is used partly by appl_init and partly by other AES functions, and is filled automatically.
int16_t int_in[16]  | All 16-bit-sized input parameters are passed with this field.
int16_t int_out[10] | All 16-bit-sized return values are supplied by the AES via this field.
int32_t addr_in[8]  | This field serves for the transmission of pointers (e.g. pointers to character strings) to the AES functions.
int32_t addr_out[2] | All 32-bit-sized return values are supplied by the AES via this field.
Warning: If the operating system supports threads, then it is imperative that a multithread-safe library is used. In particular one must ensure that each thread receives its own global field (see above).
See also: Sample binding VDI bindings TOS list
8.2.9.1 Sample binding for AES functions
The function 'crys_if' (crystal interface) looks after the proper filling of the control array, and performs the actual AES call. It is passed one WORD parameter containing the function's opcode; the opcode minus 10, multiplied by 3, gives an index into a table in which the values for control[1], control[2] and control[3] are entered for each individual AES function.
/* AES arrays, the parameter block and the assembler binding shown above */
int16_t control[5], global[15], int_in[16], int_out[10];
int32_t addr_in[8], addr_out[2];
extern int8_t ctrl_cnts[];              /* 3 bytes per AES function, see below */
extern void aes (AESPB *pb);            /* assembler binding from above        */
AESPB c = { control, global, int_in, int_out, addr_in, addr_out };

int16_t crys_if (int16_t opcode)
{
    int16_t i;
    int8_t *paespb;

    control[0] = opcode;
    paespb = &ctrl_cnts[ (opcode-10)*3 ];   /* control[1..3] for this opcode */
    for (i = 1; i < 4; i++)
        control[i] = *paespb++;
    aes (&c);                               /* perform the actual AES call */
    return int_out[0];
} /* crys_if */
The table used for this could be built up as follows, for instance:
.GLOBAL ctrl_cnts
.DATA
ctrl_cnts: .dc.b 0, 1, 0 ; appl_init
.dc.b 2, 1, 1 ; appl_read
.dc.b 2, 1, 1 ; appl_write
...
...
...
.END
A fuller version is given in The Atari Compendium, pp. 6.39-41. Note that the rsrc_gaddr call must be special-cased in a library if you want to use the crys_if binding to call the AES.
See also: AES bindings GEM
Data Mining and Ethics, Data Warehousing, and Data Analysis’s Effectiveness in Today’s Business World
Shaun J Whittaker
Strayer University
April 2011
Content of the Problem
Data mining is defined as the process of making discoveries from large amounts of detailed data. One of the benefits of data mining is a better understanding of customer behavior, leading to more effective merchandising and marketing strategies. Data mining centers on a better understanding of customers' behavior. The most critical need is to capture transaction details at the point of sale, then promptly break those details apart at the end of the day into summaries of movements over time that companies can analyze. Stores use data warehouse techniques to represent the universe of data available to be mined (Mason, 1995).
Corporations mine data from transactions and study buying habits. Corporations use this information to market to individuals. Pharmaceutical companies use it to market drugs to doctors, and retailers advertise their products to customers through the Internet and other marketing tools. Data mining can be categorized into two segments: it is used through a business strategy called Customer Relationship Management (CRM), and through Business Intelligence (BI), which refers to the skills, technologies, applications and practices used to help a business acquire a better understanding of its commercial context. BI may also refer to the collected information itself (Ryan, 2009).
The paper will first discuss data modeling, which is the foundation of databases. Data models help structure all databases through ER diagrams. Conceptual modeling is an important stage in designing a successful database application. The concepts in a data model are usually represented diagrammatically. A conceptual schema diagram must be powerful in its semantic expressiveness and easily comprehensible, as it serves as a communication medium between professional designers and users who interact during the stage of requirements analysis and modeling. Once approved by users, the conceptual schema is converted into a specific database schema depending on the data model and the Database Management System that is used for implementation. The major problem, however, is to create a good conceptual schema that is semantically correct, complete, easy to use, and comprehensible. The entity relationship (ER) model is one of the most widely used conceptual data models (Peretz Shoval, 2005). Data mining is an advanced database practice that retrieves information from consumers through purchases so that corporations can use that information to advertise to consumers. The questions, ultimately, are how to build such an advanced database well and whether data mining is ethical.
Conventional database technology has laid particular stress on dealing with large amounts of persistent and highly structured data efficiently and on using transactions for concurrency control and recovery (WU, 2000).
Model building is a key objective of data mining and data analysis applications. In the past, such applications required only a few models built by a single data analyst. As more and more data has been collected and real-world problems have become more complex, it has become increasingly difficult for one data analyst to build all the required models and manage them manually. Building a system to help data analysts construct and manage large collections of models is a pressing issue (Tuzhilin, 2008).
The advent of data models has been the impetus for enormous progress in data management. Conceptual advances have provided a framework for elaborating database design, and wonderful tools have enabled data professionals to design databases in practice. Today, data modeling is viewed as a necessary skill in data management. There is the assumption that a data model can capture all the information about the design of a database. This assumption is rarely questioned, but is it true? This is not just a question of whether different data modeling approaches yield different levels of accuracy about how an enterprise sees information in a particular subject area. Can any data model truly specify all the design information for a database? There are real limits to what data models can do, and failure to understand these limitations can result in data management problems at a number of levels (Chisholm, 2007).
The purpose of this paper is to inform the reader about data warehousing, a repository database for data mining. This paper will discuss an advanced database procedure called data mining that corporations utilize. Data mining can be broken down into two major parts. The first is the data warehouse; a typical organization maintains and utilizes a number of operational data sources. These operational data sources include databases and other data repositories that are used to support the organization's day-to-day operations. A data warehouse is created within an organization as a separate data store whose primary purpose is data analysis for the support of management's decision-making process. There are two main reasons that necessitate the creation of data warehouses as a separate analytical data store. The first reason is that the performance of operational queries can be severely diminished if they must compete for computing resources with analytical queries. The second reason lies in the fact that, even if performance is not an issue, it is often impossible to structure a database that can be used (queried) in a straightforward manner for both operational and analytical purposes. Therefore, a data warehouse is created as a separate data store, designed for accommodating analytical queries. A typical data warehouse periodically retrieves selected analytically useful data from the operational data sources. In a so-called active data warehouse, the infrastructure that facilitates the retrieval of data from operational databases into the data warehouse is known as ETL, which stands for Extraction, Transformation, and Load (Jukie, 2006).
The discussion will then go into how the industry can effectively extract this information through data analysis. Modern scientific instruments can collect data at rates that, less than a decade ago, were considered unimaginable. Scientific instruments coupled with data acquisition systems can easily generate terabytes and petabytes of data at rates as high as gigabytes per hour. There is a rapidly widening gap between data collection capabilities and the ability of scientists to analyze the data. The root of the problem is fairly simple: the data is increasing dramatically in size and dimensionality. While it is reasonable to assume that a scientist can work effectively with a few thousand observations each having a small number of measurements, no one can digest millions of data points at a time. Large data sets with high dimensionality can be effectively exploited when a problem is fully understood and the scientist knows what to look for in the data via well-defined procedures. Reducing the data can cut down the millions of records that the scientist has to sift through. The problem of effective manipulation and exploratory data analysis is looming as one of the biggest hurdles standing in the way of exploiting the data (Fayyad, Haussler, & Stolorz, 1996).
The second half of the paper will discuss the ethics of data mining. When transactions are performed in stores, or information is filled out on the Internet, corporations obtain this information. The information that is obtained is used for marketing. Many consumers, if they were aware of this, would be irate. Consumers want privacy. Some businesses that use Customer Relationship Management use data for corrupt reasons. This paper will raise the question: is data mining ethical?
As more and more customers become aware of data mining, many of them are concerned. Ethics are important in all aspects of business, but the development of digital technology provides a powerful tool, and these tools can be used for unethical purposes. Data mining provides the ability to scan and analyze vast masses of electronic data to identify previously unseen patterns.
There are many ethical issues surrounding data mining, as with any other advanced technology; even though it is useful, it can cause many problems. With privacy concerns arising, privacy advocates worry greatly about the abuse of personal information (Ting).
Three issues come into play here: privacy, ownership, and consent. Many consumers feel that their privacy is violated by these information gathering practices. The data gathering companies claim the information they are gathering is a public good gathered in a public sphere and that therefore privacy is not being violated. They assume the users consent to the use of the information gathered when the user voluntarily uses services that are monitored or fills out a form. They do not make an effort to inform the user of the future uses of his data or to provide him with a means of opting out of the practice. In the case of registration forms, many businesses do provide a privacy policy for the user to read before submitting the form. However, these privacy policies can be difficult to understand and may not be consistently followed by the organization. Also, the user doesn’t usually have much choice but to agree if he or she wants to use the services provided by the website. Another concern is the type of data being collected. Some types of personal information are seen as being more sensitive than others. What complicates this issue is that the sensitivity level varies according to the individual. One consumer may view income level as very sensitive and private but not care who knows their marital status. Another consumer may have reasons to view their marital status as very sensitive information that might affect their employability or some other potential opportunity while not caring who knows their income level. This makes it difficult for a company to gauge customer reaction to new uses of their data. Aside from information collected publicly from sources such as the Internet, much information is also bought from more seemingly private sources. This information can include credit history, financial information, employment history, and possibly some medical information. Many consumers would be surprised to know that these types of information are routinely bought (Christina Cary, 2003).
Market researchers are also accumulating, via computerized files, an increasing amount of information about their customers. Given the compiling capabilities of database software, much of the information is aggregated from disparate buying situations. Made easier by widespread consumer acceptance of preferred buyer cards and credit purchasing, such information is combined using personal identifiers such as phone number, household address, driver’s license registration, or even social security number. Then, using data mining techniques (sophisticated multi-variable statistical models that can extract scattered information from large consumer data pools), marketers are able to construct individual consumer profiles for millions of shoppers. Disturbingly, these profiles are then copied and sold to other marketers, who use them to predict likely purchase prospects for their goods and services. As a result, a growing and permanent record exists of what individual consumers buy, where they bought it, the price paid, and the incentives that motivated the transaction. Taking all of this into account, it is understandable that many consumers are troubled by certain technology aided marketing practices that might be construed as prying, irritating, and exploitive. A perusal of the business and popular press suggests that marketing practitioners have already been mounting a defense to the perceived ethical criticisms of their new technologies (Laczniak & Murphy, 2006).
Furthermore, this paper will discuss the complex challenges that database programmers face when dealing with advanced databases such as data warehousing and data mining. In order to establish data mining, a data warehouse has to be harnessed, and data modeling is the foundation of the data warehouse. This paper will show how data warehouses and data models are efficiently built and demonstrate what not to do when constructing a data warehouse. In Tuzhilin’s paper he discusses how to fix complex data modeling. This paper will also focus on data mining ethics. It is a major topic in the information systems field; consumers and Customer Relationship Managers debate whether data mining is ethical and how information can be filtered so that consumers are not harmed by corrupt businesses.
Statement of Problem
How can advanced data warehousing, data analysis, and data mining be utilized effectively?
As more and more data is collected and real world problems have become more complex, it has become increasingly difficult for the data analyst to build all the required models and manage them manually (Tuzhilin, 2008). The growing application of data mining to boost corporate profits is raising many ethical concerns, especially with regard to privacy (Cary, 2003).
In conclusion, this paper will show how to efficiently program data mining, demonstrate that if database developers build a good data model the data warehouse will be effective, and finally show how data mining is regulated by the government.
The sub-problems of the paper will focus on:
- What makes a proficient ER Diagram?
- Can OLAP improve data warehousing?
- How can data be effectively extracted from the warehouse through data analysis?
- Is Data Mining ethical?
Significance of Study
Data mining and data warehousing are key operations for businesses. Businesses use a tool called Customer Relationship Management to measure the needs of customers. This is all done through the advanced databases discussed in this paper: data mining, in which data is stored in a data warehouse. Most businesses are dependent on data warehousing, so it is essential to talk about corporations and how they need these advanced databases. Corporations benefit from data mining and data warehouses because the programmers that they hire will be better equipped to program the corporation’s databases if they are educated and know how to make effective databases.
This study is important because data warehousing has become a standard practice for most large companies worldwide. The data stored in the data warehouse can capture many different aspects of the business process, such as manufacturing, distribution, sales, and marketing. This data explicitly and implicitly reflects customer patterns and trends, the effectiveness of business strategies and resultant practices, and other characteristics. Such data is of vital importance to the success of the business whose state it captures. This is why companies decide to engage in the relatively expensive undertaking of creating and maintaining a data warehouse, where the costs routinely reach millions of dollars (Jukie, 2006).
A fundamental fact of computing is that data entry is expensive and data storage is cheap. The financial sector can reap huge benefits from using information generated by data warehousing. This process was developed to gather together and process the information already held in existing databases so that new facts could be revealed, making better use of the information held in the data files of the organization. An important feature of this process is that no new databases are created; the original database is copied into the data warehousing system, where it is processed and used to prepare charts and graphs from which new facts will emerge. The demand for data warehousing solutions is increasingly being driven by a requirement for analytical applications focused around specific market or business needs.
The use of data warehouses has grown significantly, and it will continue to do so. A major contributor to this trend is the increasing interest in using the data warehouse to understand the customer so that companies gain competitive advantage and market themselves more effectively. This inherently requires analyzing data at the level at which the customer interacts with a company: the transaction level.
Data mining is a way to access the information stored. It uses the most sophisticated statistical analysis to discover relationships in the data. Credit card companies use data mining to identify patterns of use indicative of fraud, and data mining techniques can also be used to identify previously unknown relationships in sales data that can be used as the basis for future promotions.
There are two sets of tools for data mining: those for discovery of patterns and trends in the data and those for verification. Discovery tools include data visualization, neural networks, cluster analysis and factor analysis.
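As a small, hypothetical illustration of the discovery side, a cluster analysis can be run in a few lines of Python; scikit-learn is assumed to be available, and the customer figures below are invented rather than drawn from any study cited here.

import numpy as np
from sklearn.cluster import KMeans

# Toy customer matrix: annual spend and number of visits (invented values).
customers = np.array([
    [120.0, 4], [135.0, 5], [900.0, 30],
    [880.0, 28], [400.0, 12], [410.0, 14],
])

# Cluster analysis: group customers with similar behavior into three segments.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(customers)
for customer, segment in zip(customers, labels):
    print(customer, "-> segment", segment)

Verification tools would then test whether the segments found this way hold up on new data.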
During their work with data warehouses PricewaterhouseCoopers found that the majority of data warehouses were built in order to generate increased revenue through better marketing and customer segmentation. Major retailers are making effective use of the data that they collect about their customers through their loyalty card schemes.
However, not all applications are sales focused. Data warehouses are ideal environments for performance measurement, which is a field of growing importance to large organizations. A data warehouse is a business tool that must deliver measurable financial benefits to the investor. The investment can never be justified purely in IT terms or by unclear notions about effective data management (Corbitt, 2006).
Researchers in the past have discussed data warehousing and data modeling problems, and perhaps efficiency can be learned from that work. A data model is the key element in creating a data warehouse, or any database, so this paper will focus on the importance of data modeling, touching on the ER diagram. Useful systems depend on requirements analysis, of which data modeling is an essential part. Yet modeling is often abandoned as a cumbersome and ineffective technique that cannot be understood by potential system users. Because the fault lies with the way the models are designed and not with the underlying complexity of what is being represented, guidelines regarding the way the models are constructed and their components named help ensure the creation of accessible models that are useful for both understanding a business and database design (Hay, 1998).
In order to perform data mining, a data warehouse has to be created. In the present business world, corporations rely on data mining for marketing and other services. Throughout the 1980s and early 1990s, major corporations adopted Business Intelligence (BI) tools such as report writers, spreadsheets, and, more recently, OLAP to gain a competitive advantage in decision making. These systems, although essential for monitoring and planning, are unable to cope with the volumes of data or the sophisticated analysis required for strategic decision making (Rawlings, 1999).
In the old days, knowing your customers was part and parcel of running a business, a natural consequence of living and working in a community. But for today’s big firms, it is much more difficult: a big retailer such as Wal-Mart has no chance of knowing every single one of its customers. So the idea of gathering huge amounts of information and analyzing it to pick out trends indicative of customers’ wants and needs (data mining) has long been trumpeted as a way to return to the intimacy of a small town general store (Clark, 2004).
Business leaders in the current enterprise environment demand fast and complete information on everything, from sales and competition to people and projects, and more. Thanks to the widespread implementation and growth of data warehousing, the information exists. Managers know that there’s no substitute for up-to-the-minute knowledge at their fingertips. And data warehousing has turned knowledge into power for today’s effective and productive business.
Setting up a data warehouse is a huge project that involves many steps and procedures from outlining and understanding core business objectives and requirements to database design (Mining Customer, 1999).
Managers are dependent on detailed, accurate information when they make decisions. Data warehousing has shown that it can satisfy this need for information about customers, suppliers, market trends, the competition and how efficient the company’s business procedures are. These days there is no room for guesswork; decisions have to be made on the basis of careful data analysis.
Whether it’s a decision-support system, a CRM, a data warehouse, or any variant of OLAP, enterprises have embarked on a never-ending quest for more knowledge. Both implementing such systems and using them involve challenges. And implementation is far more difficult than opening a shrink-wrapped package and grafting it onto an existing database.
Launching an enterprise-wide CRM/data warehouse solution may require full scale organizational and procedural change. Everything must be thought through, from marketing goals to business procedures to technology. Some firms already have huge databases, but mining the data within them is another challenge. Common standards must exist before businesses can allow users to have seemingly unlimited access. In some cases, a single CRM product might fill your company’s needs.
The population that this study will affect is internet users who are affected by corporations that are data mining. There are 71.0 million internet users currently in the United States according to the 2009 Statistical Abstract of the United States (Bureau, 2009). Also, there are 43.0 million computer science students in graduate and post-doctorate classes who could learn about the effectiveness of data modeling, data warehousing, and data mining. Another population that this study could affect is computer systems designers, of which there are 93.1 million; they could benefit from the study so that they do not come across database design problems.
The implications of the study might prevent students or database designers from making mistakes when programming databases. Although there are always mistakes, and designing data warehouses is not flawless, maybe one day this paper can influence someone to come up with a design that is flawless, so programming can be easier for development and software in the business world. Ultimately this paper is put together to make the population aware of certain flaws in database design. The second half of the paper will make consumers aware of their rights when it comes to data mining.
This paper impacts a certain population. Although data modeling has been around for nearly 25 years, it is one of the top areas where database application problems arise. These problems range in severity from incorrect functionality to abysmal performance. How can such an established technique yield such terrible results? The answer is unnerving: today’s data modeling tools are amazingly good, but the people using them often lack the knowledge and experience needed for successful data modeling (Scalzo, 2008).
No data warehouse implementation can succeed on its own. The trick is knowing when and how to intervene. Data warehouses have tremendous potential to present information. They provide the foundation for effective business intelligence solutions for companies seeking competitive advantage. While there have been notable successes, there have also been significant failures. What accounts for such conflicting results? In the 1990s, adaptive structuration theory (AST) was developed to understand conflicting results with group decision support systems. This theory analyzes the technological and contextual aspects of the application of a technology, focusing on their interactions. Using AST, this paper can examine the interaction of context and technology and pinpoint seven key interventions specific to that interaction for data warehouse success (Tim Chenoweth, 2006).
The growth of information technology and its enhanced capacity for data mining have given rise to privacy issues for decades. The advent of the Internet and its unprecedented opportunity for communication, community building, commerce, and information retrieval have exacerbated this problem. Online retailers can track users’ site behavior in order to create user profiles, enhance the functionality of their Web sites, and target offerings to customers on subsequent Web site visits (Pollach).
Research and Methodology
The research proposal will use qualitative research. The rationale for using qualitative research for this paper is to seek an “understanding of the complex phenomena” (Leedy and Ormond, 2005, p. 94), which here is programming an effective data warehouse. The paper builds its observations from the start by first discussing data modeling techniques. The process theories that help explain the phenomenon under study are experts’ opinions on efficient programming. The sub-problem questions that will address the effectiveness of data mining, data modeling, and data warehouses are the following.
The sub problems of the paper will focus on:
- What makes a proficient ER Diagram?
- Can OLAP improve data warehousing?
- How can data be effectively extracted from the warehouse through data analysis?
- Is Data Mining ethical?
According to Leedy, this paper will use all four qualitative purposes (Leedy and Ormond, 2005, pp. 134-135). Description is demonstrated by revealing the problems of an ER diagram. Interpretation comes into effect because this paper focuses on complex situations, such as programming difficulties and what to do and what not to do when programming data warehouses, so that a student or entry-level computer scientist can gain insight into the problem within this small population. Verification is provided by the reader reviewing the opinions of experts in the field. Finally, the research shows the effectiveness of the sub-problems by evaluating them against those expert opinions, so evaluation is also part of the approach.
The researcher will use content analysis. According to Leedy (Leedy and Ormond, 2005, p.143) a content analysis is a detailed and systematic examination of the contents of a particular body of material for the purpose of identifying patterns, themes, or biases.
The content analysis involves the following steps:
- A description of the body of material you studied.
- Precise definitions and descriptions of the characteristics you looked for.
- The coding or rating procedure.
- Tabulations for each characteristic.
- A description of patterns that the data reflect.
According to Leedy (Leedy and Ormond, 2005, p. 208), in research, bias is any influence, condition, or set of conditions that singly or together distort data. Data are, in many respects, delicate and sensitive to unintended influences. A contradiction could arise, probably because database design is not perfect. The researcher might not find the right way to build effective data mining, data modeling, or data warehouse databases. The research might only find sources stating that databases are effective, whereas some experts might state that databases can never be programmed effectively. This paper could therefore be biased.
The results of the paper will show the population how to create an excellent database with fewer of the mistakes that beginners make. Nothing is perfect, nor is programming databases, as the researcher will show. The population will learn good programming techniques and common programming problems. The result could possibly make life easier for this population of study.
Organization of Study
The research for this study will be organized and presented in the following eight chapters:
Chapter 1: Introduction
The purpose of the introductory chapter is to present the context of the problem along with the problem statement, the research question, and sub questions. This chapter will also present the purpose and significance of the study, along with the theoretical research that provided the foundation for the study’s motivation and background.
Chapter 2: Review of Related Literature
The purpose of the literature review chapter is to provide a discussion of the related literature reviewed in scholarly journals, including articles housed in academic databases. Statistical websites are also cited in the literature review, and technological databases and websites are provided as background for the research study.
Chapter 3: ER Diagrams.
This chapter will discuss the basics of data modeling. An ER diagram is the outline for constructing a database, and one has to be formed to make a good data model. This chapter is essential for readers to learn proper programming techniques.
Chapter 4: OLAP
This chapter will focus on effective data warehouses. It will talk about OLAP, which is essential to a data warehouse. OLAP is key for data warehouses to be analyzed and information to be queried. This chapter will talk about multidimensional models and cubes with regard to OLAP.
Chapter 5: Data Analysis
This chapter will concentrate on the fact that data analysis plays an important role in data mining. What are good procedures for handling data analysis? What tools should the industry use in order to solve data analysis problems, such as millions of records coming into the data warehouse?
Chapter 6: Data Mining Ethics
This chapter is for the population of internet users. The researcher will present consumers’ opinions on how they feel about data mining. The researcher will discuss the laws governing data mining and try to answer the question: is data mining ethical?
Chapter Summary
The importance of data warehouses in the information systems field is huge, and they matter in almost anyone’s job because databases are in every business and hospital. The purpose of the paper is to educate a certain population on how to effectively program efficient databases, specifically data mining. The background and motivation of this study came from the researcher’s earlier study of data mining and its ethics. The researcher’s interest in databases, especially efficient programming, is deep. The findings of the study will aid others who need to be knowledgeable about its effectiveness and data mining ethics. The intent of the current study is to answer the questions:
- What makes a proficient ER Diagram?
- What is OLAP and what makes it effective for data mining?
- How can data be effectively extracted from the warehouse through data analysis?
- Is Data Mining ethical?
Chapter 2: Review of Literature
The literature review will include four research questions concerning the advanced databases known as data mining and data warehousing. The first question is: how do you effectively create an ER diagram? The second question concerns data warehouses: how do you effectively create a data warehouse, and how efficient is OLAP? The third question is: how do you sift through data as a data analyst? The fourth question is: how do you effectively create a data mining tool, and is data mining ethical?
Data warehouses are designed so that business users can mine information. The paper will communicate how effectively a developer can create a data warehouse and data mining tool. The first thing one has to conceive is the data model. Data modeling is the core, the bare essential, in creating all databases. From a business analyst’s standpoint, professionals have to find out what makes a great data model.
Title: The Importance of Business Understanding in Requirements Structuring
Author: Byran James
Date: April 20 2011
When asking what makes a good ER diagram, the analyst is really asking, “How well does this model support a sound overall system design that meets the business requirements?”
Completeness- Does the model support all the necessary data?
Data Reusability- Can the data be made available to support additional information requirements?
Stability-If the organization’s business requirements change, can the model remain intact?
Flexibility- Can the model be readily extended to support new business requirements?
Elegance-Does the model provide a neat and simple classification of the data?
Communication-Does the model represent concepts that users and programmers will understand?
Integration-Does the proposed model fit with the organization’s existing database?
Conflicting Objectives-Many of the above objectives will conflict with one another. For example, a model may be simple and elegant but do a poor job of capturing all of the necessary business requirements… Does the model provide the best balance among conflicting objectives (James B., 2011)?
Title: Data Modeling Using Entity Relationship Diagrams: A Step Wise Method
Author: Michael Chilton
Poor database designs often result from the inability to achieve data independence in the data model. Data independence prevents or reduces the problem of modification anomalies and allows new and updated applications to be written without having to change the structure of the database. One potential reason for this difficulty is the departure from, or the lack of use of, a simplified method for identifying, grouping, and relating the relevant facts that users need.
The ER model was developed by Chen to take advantage of the strengths of the network model, the relational model, and the entity set model by achieving data independence and capturing the important semantic information found in the real world, and doing so in a way that produces a more natural view. The ER diagram is less confusing and easier to follow for introductory database students because it provides a pictorial representation of the data structure. In addition, it allows a programmer to quickly formulate queries by visually mapping a query to the data. The production of an ER diagram does not guarantee a good database design, however, because the student must know how to form entities that correctly mirror the data that each user is exposed to and how to relate these entities. These processes are used by the systems analyst in his design. They are reflected in data flow diagrams and show the data in motion throughout a system. In database design we are concerned with data at rest. The questions to be answered are: 1) have we stored the data correctly to prevent unwanted redundancies, and 2) have we stored the data so that the systems analysts and programmers can access it and utilize it as either input or output to the processes that users execute?
Building an effective ERD begins as a simple three step process: 1) determining the data requirements; 2) grouping the data together to form entities; and 3) iterating through these steps until all of the data is accounted for (Chilton).
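One informal way to follow those three steps, before any diagramming tool is opened, is to record the emerging entities and relationships as plain data structures and check them against the gathered requirements. The Python sketch below is only an illustration with invented entity names; it is not Chilton's method or a formal ER notation.

from dataclasses import dataclass, field

# Step 1: raw data requirements gathered from users (invented examples).
requirements = ["customer name", "customer phone", "order date", "order total"]

@dataclass
class Entity:
    name: str
    attributes: list = field(default_factory=list)

# Step 2: group related facts into candidate entities and record relationships.
customer = Entity("Customer", ["name", "phone"])
order = Entity("Order", ["date", "total"])
relationships = [("Customer", "1:N", "Order")]

# Step 3: iterate until every requirement is accounted for by some entity.
covered = {a for e in (customer, order) for a in e.attributes}
unassigned = [r for r in requirements if r.split()[-1] not in covered]
print("Still unassigned:", unassigned)   # an empty list means the model covers the data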
Title: The Data Modeling Handbook
Author: Reingruber Gregory
By planning your modeling approach to address each dimension of quality, you can significantly increase the likelihood that your data models will exhibit characteristics rendering them useful for business analysis and information system design. Each dimension contributes uniquely to the overall quality and utility of the model; you cannot ignore one dimension and expect to make up for it in the others. The dimensions of quality include conceptual correctness, which reflects the rules of the enterprise. Syntactic correctness implies that the objects contained in the data model do not violate any of the established syntax rules of the given language. Enterprise awareness is the underlying concept that must be factored into any discussion of data model quality. The authors suggest that you look at the scope of the enterprise and its requirements (Reingruber & Gregory, 1994).
The second research question asks: is OLAP efficient for data warehouses? The answer is yes. Data warehouses need OLAP in order to extract information from the repository.
Title: Power analysis system based on data warehouse
Author: Hui Li, Juan Chu
A data warehouse is primarily a data collection intended to support the decision making of enterprises or corporations; it is subject oriented, its contents are not rewritten once loaded, and it can be extended at any time. Because a data warehouse does not have a strict mathematical theory base, it tends to be an engineering project. Technically it can be divided into key technologies such as data extraction, data storage and management, and data presentation, according to its work process. The data warehouse is the location where data is stored for analysis, and OLAP is a technology allowing client applications to access this data effectively (LI & Chu).
Title: On-Line Analytical Processing (OLAP)
Author: California Software Labs
Performance of OLAP depends upon the following:
- Aggregations: Materializing aggregations usually leads to a faster query response, since less work is needed to answer a request for cell values.
- Partitions: Partitions give you the ability to choose different storage strategies to optimize the tradeoff between processing and querying performance.
- Data slices on partitions: Setting a data slice is an efficient way to avoid querying irrelevant partitions.
When users are requesting access to large amounts of historical information for reporting purposes, we should strongly consider a warehouse or mart. The user will benefit when the information is organized in an efficient manner for this type of access. A data warehouse is often used as the basis for a decision support system. Difficulties often encountered when OLTP databases are used for online analysis include the following:
- Analysts do not have the technical expertise required to create ad hoc queries against the complex data structure.
- Analytical queries that summarize large volumes of data adversely affect the ability of the system to respond to online transactions.
- System performance when responding to complex analysis queries can be slow or unpredictable, providing inadequate support to online analytical users.
- Constantly changing data interferes with the consistency of analytical information.
- Security becomes more complicated when online analysis is combined with online transaction processing.
Data warehousing provides one of the keys to solving these problems, organizing data for the purpose of analysis (California Software Labs).
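The point about materialized aggregations can be sketched in a few lines: compute the summary once, then answer repeated analytical questions from the small aggregate instead of rescanning the transaction detail. The example below assumes pandas is available and uses invented sales figures; it is a sketch of the idea, not of any particular OLAP engine.

import pandas as pd

# Invented transaction-level detail of the kind an OLTP system would hold.
detail = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "West"],
    "product": ["A", "B", "A", "A", "B"],
    "sales":   [100, 150, 200, 120, 80],
})

# Materialize an aggregation once, as an OLAP engine would for common roll-ups.
by_region_product = detail.groupby(["region", "product"], as_index=False)["sales"].sum()

# Subsequent queries read the small aggregate instead of rescanning every transaction.
west_total = by_region_product.loc[by_region_product["region"] == "West", "sales"].sum()
print(by_region_product)
print("West total:", west_total)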
Title: Data Warehousing, Data Mining, OLAP, and OLTP Technologies are essential elements to support decision making process in industries
Author: G. Satyanarayana Reddy
An operational database is designed and tuned for known tasks and workloads, such as indexing using primary keys, searching for particular records, and optimizing canned queries. As data warehouse queries are often complex, they involve the computation of large groups of data at summarized levels and may require the use of special data organization, access, and implementation methods based on multidimensional views (Reddy, Srinivasu, Rao, & Rikkula).
How to extract data from the warehouse through data analysis has several kinks that will be discussed with regard to sifting through data. A good model, built through ER diagrams, queries, and visualization, is a great foundation for data analysis.
Title: Unknown
Author: Unknown
Data extraction is both a political and a technical problem because it requires opening up existing systems with structural and semantic discrepancies. In essence, methodologies proposed for schema integration and multidatabase systems can be applied to cleanse the data. The problem is complicated by the fact that multiple, often incompatible, aggregation hierarchies are used both in the target and in the source systems.
Relational DBMSs tuned for the operational environment cannot be directly transplanted into a DW system. For example, since most queries are read-only, the complex concurrency control mechanism used in the operational DB is overkill.
Title: Research Issues in Data Warehouse
Author: Wu
A concern regarding data warehouses is: what technologies do we still need for data warehousing? Wu and Buchmann look into modeling issues, which makes sense because that is the first step of constructing a data warehouse. Many authors argue about the advantages and disadvantages of multidimensional and relational OLAP. A multidimensional OLAP system consists of a multidimensional data warehouse with an OLAP engine, while a relational OLAP system consists of a relational data warehouse and an OLAP engine that provides multidimensionality. Misunderstandings are found in some of the articles; as Wu and Buchmann state: “Entity Relationship modeling is the heart of the relational model. The explicit relationships between customers and sales orders or between hamburgers and buns are burned into the design of the database.” I strongly agree; it is essential for all developers to have a strong ER model, which will make the database very effective (Wu & Buchman).
Title: Chapter 12 Data Warehousing Framework
Author: Microsoft
OLAP is an increasingly popular technology that can dramatically improve business analysis. Historically, OLAP has been characterized by expensive tools, difficult implementation, and inflexible deployment (Microsoft).
The third question is how you can sift through data effectively through data analysis; the answer is complex, and a few approaches are raised. There is no straightforward answer.
Title: KDD for Science Data Analysis: Issues and Examples
Author: Usama Fayyad
There is a rapidly widening gap between data collection capabilities and the ability of scientists to analyze the data. By reducing data, a scientist is effectively bringing it down in size to a range that is analyzable. The authors believe that data mining and knowledge discovery in databases techniques have an important role to play (Fayyad, Haussler, & Stolorz, KDD for Science Data Analysis: Issues and Examples, 1996).
Title: Data Mining Issues
Author: ucla.edu
Data mining makes it possible to analyze routine business transactions and glean a significant amount of information about individuals’ buying habits. Data analysis can only be as good as the data that is being analyzed. A key challenge is redundant data from different sources (Data Mining Issues).
Title: Taxonomy of Dirty Data
Author: Won Kim
Before data analysis applications are applied against any data, the data must be cleansed to remove or repair dirty data. Further, data from legacy data sources often do not even have metadata (Kim, Choi, Kim, & Doheon, 2003).
Title: An Interactive Visualization Environment for Data Exploration
Author: Mark Derthick
Exploratory data analysis is carried out through queries. For KDD we want to reuse queries on different data. The problem is the number of applications used in the KDD process: with very few exceptions, the user is forced to manage the transmission of data from one application to another, increasing the overhead of exploration (Derthick, Kolojejchick, & Roth, An Interactive Visualization Environment for Data Exploration, 1997).
Title: Towards Development of Solution for Business Process Oriented Data Analysis
Author: M Klimavicius
Data warehouse models are available for multidimensional modeling and ETL process modeling. These researchers solve particular problems in the data warehouse development lifecycle, but do not address the link to business processes (Klimavicius, 2008).
Title: A user driven data warehouse evolution approach for concurrent personalized analysis need.
A data warehouse must be adapted to any changes which occur in the underlying data sources, such as changes of the schemata, as well as to changes in the analysts’ requirements (Bentayeb, Favre, & Boussaid, 2008).
Title: From Data Mining to Knowledge Discovery in Database
KDD is used for data analysis and sifting through data in data mining. The problem is mapping low-level data that is too voluminous to understand and digest easily (Fayyad, Shapiro, & Smyth, From Data Mining to Knowledge Discovery in Databases, 1996).
Finally, the last pending question arises: is data mining ethical? We will look into what the experts have to say about this subject.
Title: Data Mining
Author: Doug Alexander
The drop in price of data storage has given companies willing to make the investment a tremendous resource: Data about their customers and potential customers stored in Data Warehouses. Data mining or knowledge discovery is the computer assisted process of digging through and analyzing enormous sets of data and then extracting the meaning of the data. Data mining tools predict behaviors and future trends, allowing businesses to make proactive, knowledge driven decisions. Data mining derives its name from similarities between searching for valuable information in a large database and mining a mountain for a vein of valuable ore (Alexander).
Title: No Free Lunch: Price Premium for Privacy Seal-Bearing Vendors
Author:
Businesses use customer data in several ways that could lead to privacy concerns. First, many businesses use customer data to direct unsolicited promotions to their customers. When the customer is not given notice that he or she will receive offers in the future, such solicitations intrude against the customer’s right to be left alone. In the online environment, the cost of conducting such promotions is low, making spamming much more prevalent. In addition, advanced data mining techniques enable sellers to profile customers and price discriminate using the collected data. To reduce the information gap between businesses and customers in the privacy context, businesses could acquire privacy seals granted by third parties (Mai, Menon, & Sarkar, 2010).
Title: Data mining: proprietary rights people and proposals
Author:
The increased use of data mining raises many concerns regarding privacy. There is a growing concern among consumers that the right to privacy is being eroded by the increased sophistication of data collection and mining practices by both corporations and government entities. Exacerbating the consumers’ concern is the question of ownership of the consumer’s personal data. The subject of the information search does not have control over his own data, and yet it is he who will suffer from the inaccurate or incorrect assessment. These problems are encompassed in the larger question of privacy rights of those about whom the information is gathered.
Data mining itself is not ethically problematic. The ethical and legal dilemmas arise when mining is executed over data of a personal nature. Perhaps the most immediately apparent of these is the invasion of privacy. Complete privacy is not an inherent part of any society because participation in a society necessitates communication and negotiation which renders absolute privacy unattainable. Hence individual members of a society develop an independent and unique perception of their own privacy. An individual can maintain their privacy by limiting their accessibility to others. In some contexts, this is best achieved by restricting the availability of their personal information (Data mining: proprietary rights people and proposals, 2009) .
Chapter 3: ER Diagramming
The entity relationship model was introduced by P.P. Chen in 1976 as a generalization of the network model formalization of C. Bachmann. The model conceptualizes and graphically represents the structuring of the relational model and is currently used as the main conceptual model. A large number of extensions to this model were proposed in the 1980s and 1990s due to its extensive usage. The extended entity relationship model is mainly used as a language for conceptualizing the structure of information systems applications.
Conceptualization of a database or information system aims at a representation of the logical and physical structure of an information system in a given database management system, so that it contains all the information required by the users and required for the efficient behavior of the whole information system for all users. Furthermore, conceptualization may also aim to specify the database application processes and the user interaction. Description of structuring is currently the main use of the extended ER model (Thalheim).
Experts in ER modeling will readily admit that data modeling is more an art than a science. Although the transformation of an ER diagram to a relational schema follows a set of well defined, straightforward rules, errors in an ER diagram can lead to normalization problems which the transformation rules fail to capture. In general, there are two classes of ER modeling errors that lead to normalization problems: 1) the incomplete data model error and 2) the mis-modeled problem domain error. The incomplete data model error tends to occur in situations where the systems analyst is tasked to build a computer based information system that is limited in scope. A key objective for successful information systems project management is the definition of a limited, yet adequate project scope, a scope that enables the production of system deliverables within a reasonable time period. Limiting a project’s scope often results in information systems that are based on limited data models. Limited information systems are fairly common throughout the IS world, where dissimilar technologies prevent data sharing and work against the concept of a shared, enterprise wide database. The mis-modeled problem domain error is actually a class of errors including those that arise whenever the systems analyst lacks a complete understanding of the problem domain (Bock, 1997).
When discussing what makes an efficient ER diagram, one has to be briefed on normalization. Normalization is a major part of data modeling; it is the glue that holds up databases and data warehouses. The reason ER diagrams are a major part of this paper is that developers have to have a data model or ER diagram before they can start constructing a data warehouse or data mining system. E.F. Codd, the acknowledged father of relational databases and normalization, initially defined normalization as the “very simple elimination procedure” to remove non-simple domains from relations. A relation with simple domains is one whose elements are atomic, or non-decomposable. In other words, the attributes in a relation with simple domains are properly tied only to the attribute that uniquely identifies all the others. Hoffer states it well enough as “Normalization is the process of successively reducing relations with anomalies to produce small, well structured relations.” A properly normalized relation with a couple dozen attributes is not as small as a normalized relation with a half dozen attributes; both large and small relations are a part of the reality in corporate information structures. However, Hoffer goes on to list some of the main goals of normalization: 1) minimize data redundancy, thereby avoiding anomalies and conserving storage space; 2) simplify the enforcement of referential integrity constraints; 3) make it easier to maintain data; and 4) provide a better design that is an improved representation of the real world and a stronger basis for future growth. Indeed, those might be outcomes and benefits, ones that are achieved to varying degrees from one enterprise information structure to the next. However, the real goal of normalization remains as it was in 1970: to produce correctly structured relations.
Normalization is the process of evaluating and correcting table structures to minimize data redundancies, thereby reducing the likelihood of data anomalies. First, we analyze relations rather than tables, as a table does not qualify as a relation unless it is shown to be in at least first normal form. Second, we aim to eliminate modification anomalies rather than merely reduce data anomalies. Those two are admittedly somewhat nitpicky objections. Third, and more problematic, is the notion that normalization is intended to minimize data redundancies. Codd noted that redundancy can exist in the named set of relations or in the stored set of representations; we are primarily concerned here with the former. It is the latter that is properly referred to as data redundancy, while the former could better be called structure redundancy.
Moreover, Codd went on to describe both strong and weak redundancies. Again, it is not the purpose of this paper to thoroughly explain those. Suffice it to again quote Codd about the significance of each: “An important reason for the existence of strong redundancies in the named set of relations is user convenience.” Generally speaking, weak redundancies are inherent in the logical needs of the community of users; they are not removable by the system or database administrator. Thus a certain amount of structure redundancy is desirable. Normalization, if properly applied to relations, will assure that structure redundancies are properly handled. A common example of structure redundancy is the placement of a foreign key into a relation. Since a foreign key is a primary key and an attribute in some other relation, the placement of a foreign key creates redundancy in the overall database structure. Normalization can also minimize the chances that unnecessary data redundancy might occur. Typically, however, protection against data redundancy is enforced via integrity controls imposed when relations are implemented physically. Such protection notwithstanding, data redundancies can be part of the physical design even in a highly normalized set of relations (Carpenter).
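As a small, hypothetical illustration of what the procedure accomplishes, the sketch below takes a single relation in which customer facts are repeated on every order row, a structure prone to modification anomalies, and decomposes it into two well structured relations joined by a foreign key. The table and column names are invented, and pandas is assumed to be available.

import pandas as pd

# An unnormalized relation: customer facts repeat on every order row, so changing
# a phone number would have to be done in several places.
orders_raw = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [10, 10, 11],
    "customer_name": ["Ann Lee", "Ann Lee", "Bob Cruz"],
    "customer_phone": ["555-0101", "555-0101", "555-0147"],
    "order_total": [250.0, 99.0, 310.0],
})

# Decompose: customer attributes depend only on customer_id, so they move to their
# own relation; orders keep customer_id as a foreign key (a structure redundancy).
customers = (orders_raw[["customer_id", "customer_name", "customer_phone"]]
             .drop_duplicates()
             .reset_index(drop=True))
orders = orders_raw[["order_id", "customer_id", "order_total"]]

# The original view can still be reconstructed when needed via a join.
rebuilt = orders.merge(customers, on="customer_id")
print(customers, orders, rebuilt, sep="\n\n")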
A common error that novice designers make is failing to recognize the boundaries of a problem domain. They fail to make a distinction between elements that comprise the content of the database and elements that are outside the scope of the database. Novice designers also frequently confuse entities with their attributes or properties. Occasionally, if properties are complex and play a significant role in the problem domain, then they may be modeled as entities.
Other errors are modeling indirect or redundant relationships and inappropriately modeling object types as relationships rather than as entities. In this case, the indirect relationship simply becomes redundant (Song & Froehlich).
Generally, in the design of a DBMS it is assumed that the conceptual schema is developed first, and then any external schemas are derived from the conceptual schema. However, except for new applications, this order of precedence in design is not necessary; there are often external schemas that exist before the design of the conceptual schema. In the relational model, the conceptual schema consists of all the relations of the database and the external schema consists of the relations corresponding to a particular application. Schema design is basically equated to database normalization. This can be accomplished through synthesis or decomposition (AL-Fedaghi & Scheurmann, 1981).
Chapter 4: OLAP
Marieetti considers the heart of the data warehouse to be the analytical database, which supports OLAP and data visualization. The benefits of data warehousing include immediate information delivery, the ability to do trend and outcome analysis, query and report capabilities, and the ability to integrate data from multiple sources (Grant).
Databases alone no longer meet the growing need to process and analyze large volumes of data, so specialists have implemented a new concept called the data warehouse. At first glance one can think that a warehouse is a collection of databases, which is not completely false; the data have simply gone through a value-adding process involving operations of synthesis, analysis, and interpretation. This process involves data filtering, sorting, queries, and the implementation of various functions or pivot-table-like operations. Therefore warehouses store, organize, group, and correlate terabytes of data to obtain summary reports on which a manager bases decisions.
Processing large volumes of data requires special engines and services for OLAP, or multidimensional data processing. Such a warehouse is called a business data warehouse and can be implemented on mainframes, on UNIX super servers, or on platforms with parallel architectures.
The analysis phase involves a thorough study of the main activity, which allows updating and manipulating data in the warehouse within a reasonable time without affecting existing data. The process of creating a data warehouse is continuous, which means that new data marts or data repositories can always be added. Data must be filtered and processed before storage, so that the data warehouse retains only what is needed to answer the most frequent and important queries it has to respond to (Onis & Bucea-Manea, 2010).
In query specifications the user often invokes sequences of OLAP operations interactively, starting from some basic cube. Each operation yields a resulting cube serving as the basis for the next operation. Obviously not all intermediate cubes are of interest to the user. Therefore an alternative to the stepwise specification style is needed.
The CUBE operation is based on the most straightforward way to model multidimensional data. Its origin is in the relational framework, and it has a single operand relation consisting only of dimension and measure attributes. Shukla and others extended the original dimension attributes in the modeling of a multidimensional cube. The background assumption of the CUBE operation is that one powerful relational operation is sufficient for multidimensional manipulation. Therefore this operation is included as a standard feature in the relational query language SQL (Niemi & Hirvonen, 2003).
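Conceptually, a cube over n dimension attributes is the union of group-bys over every subset of those dimensions, including the grand total. The sketch below imitates what SQL's CUBE operation computes; pandas is assumed to be available, and the fact data is invented for the example.

from itertools import combinations
import pandas as pd

# Invented fact data: two dimension attributes and one measure.
facts = pd.DataFrame({
    "region":  ["East", "East", "West", "West"],
    "product": ["A", "B", "A", "B"],
    "sales":   [100, 150, 200, 80],
})

dimensions = ["region", "product"]

# CUBE = one aggregate per subset of the dimensions, down to the grand total.
for r in range(len(dimensions), -1, -1):
    for dims in combinations(dimensions, r):
        if dims:
            print(facts.groupby(list(dims), as_index=False)["sales"].sum(), "\n")
        else:
            print("Grand total:", facts["sales"].sum())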
The researcher has learned that OLAP is definitely harnessed for data analysis. Efficient OLAP and learning how to extract data through data mining and data warehousing go hand in hand.
I will now examine what makes an efficient OLAP. An article by Nepomjashly discusses the 12 rules of OLAP, which are always the premise for data warehouse requirements. At the basis of the OLAP concept lies the principle of multidimensional data presentation. The "father of relational theory," Dr. E.F. Codd, considered the disadvantages of the relational model, first showing the impossibility of "uniting, viewing, and analyzing data from a multiplicity of dimensions, which is the most understandable method for corporate analysts," and defined common requirements for OLAP tools that extend the functionality of relational DBMSs by including multidimensional analysis as one capability.
These 12 rules (according to Codd), which OLAP software should satisfy, are:
- Multi-Dimensional Conceptual View: The business analyst "sees the company world" as multivariate and multidimensional; accordingly, the conceptual data model of an OLAP product should be multivariate and multidimensional by nature, allowing analysts to perform intuitive operations: slice and dice, rotate, and pivot the directions of consolidation.
- Transparency: The user should not need to know what concrete resources are used for storage and data processing or how the data are organized. Whether or not the OLAP product uses some of the user's own resources should be transparent to the user. If OLAP is provided through client-server computation, that fact should also, wherever possible, be imperceptible to the user. OLAP should be provided within an open architecture, allowing the user to communicate with the server through the analytical tool from wherever he happens to be. In addition, transparency should be achieved in the interaction of the analytical tool with homogeneous and heterogeneous databases.
- Accessibility: The business analyst should be able to carry out analysis within a common conceptual scheme, while the data may remain under the control of old, "inherited" DBMSs yet still be attached to the common analytical model. The OLAP toolkit should superimpose its own logical schema on the physical data arrays, performing all conversions required to support a uniform, consistent, and complete user view of the information.
- Consistent Reporting Performance: As the number of dimensions and the size of the database increase, analysts should not face any decrease in productivity. Stable performance is necessary to maintain the simplicity of use required to bring OLAP to the end user. If the analyst experiences significant differences in performance depending on the number of dimensions, he will try to compensate with development strategies that present the data in other ways, not the way the data really needs to be presented. Time spent working around the system to compensate for its inadequacy is not what analytical products are intended for.
- Client-Server Architecture: Large data volumes requiring operational analytical processing are stored on mainframes but accessed from PCs, so one requirement is the ability of OLAP products to operate in a client-server environment. The main idea is that the server component of the OLAP tool should be intelligent enough to build the common conceptual scheme by generalizing and consolidating the various logical and physical schemes of corporate databases.
- Generic Dimensionality: All dimensions should be equivalent. Additional capabilities may be given to individual dimensions, but since all dimensions are symmetric, that additional functionality may be given to any of them. The basic data structure, formulas, and report formats should not be based on, or biased toward, any one dimension, and each dimension should be usable irrespective of its structure and operational abilities.
- Dynamic Sparse Matrix Handling: The OLAP tool should guarantee optimal processing of sparse matrices. Access speed should be maintained regardless of the layout of data cells and should be constant for models having different numbers of dimensions and different degrees of data sparsity.
- Multi-User Support: Several analysts frequently need to work simultaneously with one analytical model or to create different models based on the same data. The OLAP tool should grant them concurrent access and guarantee integrity and data protection.
- Unrestricted Cross-Dimensional Operations: Data calculation and manipulation across any number of dimensions should not prohibit or limit any relationships among data cells. Conversions requiring arbitrary definition should be expressible in a functionally complete formula language.
- Intuitive Data Manipulation: Consolidation directions, drilling into detail in columns and rows, aggregation, and other data manipulations inherent to the hierarchy structure should be carried out through the most convenient, natural, and comfortable user interface possible.
- Flexible Reporting: Various data visualization methods should be supported; in other words, reports should be presentable in any possible orientation.
- Unlimited Dimensions and Aggregation Levels: It is strongly recommended that each serious OLAP tool support a minimum of 15 (better, more than 20) dimensions in the analytical model. Moreover, each of these dimensions should allow a practically unlimited number of user-defined aggregation levels along any direction of consolidation.
It is necessary to treat this set of requirements, which constitutes the de facto definition of OLAP, as recommendations, and to evaluate concrete OLAP products on the degree to which they correspond to all 12 of the above rules (Nepomjashly).
Chapter 5: Data Analysis
In order to implement data analysis in a data warehouse or data mining environment, one has to have OLAP. Another factor is that KDD can be used for data mining as well. ETL could also be used, but the researcher will leave ETL out and focus on OLAP and KDD.
OLAP, or multidimensional data analysis, seeks to support decision making based on multidimensionally organized summary data. Conventional database systems have mainly been developed to support OLTP applications, which are usually related to operational tasks in organizations. The OLAP approach offers a single source, a multidimensional database, to support advanced decision making (Niemi & Hirvonen, 2003).
Multidimensional data analysis assumes that a decision maker needs summary data related to a specific subject and must consider that data with respect to certain factors. Analysts often need to group data and to consider dimensions at different levels of detail, yet most of today's analysts do not master database programming techniques. It is therefore important to develop high-level, intuitive, declarative OLAP interfaces for them (Niemi & Hirvonen, 2003).
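To make the idea of summary data at different levels of detail concrete, the following is a minimal, hedged sketch in Python using the pandas library; the table and its column names (region, quarter, product, amount) are hypothetical and serve only to illustrate grouping along dimensions.

# Minimal sketch: summarizing hypothetical fact data along dimensions
# at two different levels of detail with pandas.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "West"],
    "quarter": ["Q1", "Q2", "Q1", "Q1", "Q2"],
    "product": ["A", "A", "A", "B", "B"],
    "amount":  [100, 120, 90, 60, 75],
})

# Fine-grained view: one cell per (region, quarter, product) combination.
cube = sales.pivot_table(index="region", columns=["quarter", "product"],
                         values="amount", aggfunc="sum")

# Coarser level of detail: roll products up into quarterly totals per region.
rollup = sales.groupby(["region", "quarter"])["amount"].sum()

print(cube)
print(rollup)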
OLAP is used to summarize, consolidate, view, apply formulae to, and synthesize data according to multiple dimensions. Queries posed on such systems are quite complex and require different views of the data. Data mining can be viewed as the automated application of algorithms to detect patterns and extract knowledge from data; it is one step in the overall process of knowledge discovery in databases (KDD). Large data sets are analyzed to search for patterns and discover rules. Automated data mining techniques can make OLAP more useful and easier to apply in the overall scheme of decision support systems. Techniques such as association, classification, clustering, and trend analysis can be used together with OLAP to discover knowledge from data.
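As a hedged sketch of how an automated mining technique can complement OLAP summaries, the Python fragment below clusters hypothetical per-customer totals (the kind of summary an OLAP query might produce) into segments with scikit-learn's KMeans; the column names and the cluster count are assumptions made only for illustration.

# Minimal sketch: clustering OLAP-style summary data into customer segments.
import pandas as pd
from sklearn.cluster import KMeans

summary = pd.DataFrame({
    "customer":    ["c1", "c2", "c3", "c4", "c5", "c6"],
    "total_spent": [120.0, 95.0, 870.0, 910.0, 30.0, 45.0],
    "order_count": [4, 3, 22, 25, 1, 2],
})

# Group customers with similar purchasing behavior into three segments.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
summary["segment"] = model.fit_predict(summary[["total_spent", "order_count"]])

print(summary.sort_values("segment"))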
Typically, large amounts of data are analyzed for OLAP and data mining applications. Ad hoc analytical queries are posed by analysts who expect the system to provide real-time performance.
Raw data stored in databases are seldom of direct use. In practical applications data are usually presented to the user in a modified form tailored to specific business needs. Even then, people must analyze data more or less manually, acting as sophisticated query processors. This may be satisfactory when the total amount of data being analyzed is relatively small, but it is unacceptable for large amounts of data. What is needed in such cases is automation of data analysis tasks, and that is exactly what KDD and data mining provide. They help people improve the efficiency of the data analysis they perform, and they make it possible for people to become aware of useful facts and relations that hold among the data they analyze and that could not be known otherwise, simply because of the overload caused by heaps of data.
Dramatic improvements in information technology have encouraged the massive collection and storage of data in all areas, from commerce to research. From operational databases where personnel data are kept, to transactional systems that track sales, inventory, and patron data, to full-text document databases and more, databases are growing in size, number, and application. The enormous increase in databases of all sizes and designs is evidence of our ability to collect data, but it also creates the necessity for better methods to access and analyze data. Human capacity to handle the data available in these databases is not adequate for timely examination and analysis, while technology presents opportunities to maximize the use of these data in an economical and timely fashion. Attempts to improve the search and discovery processes when dealing with databases have generated significant interest across many fields, resulting in a multidisciplinary approach. Knowledge discovery in databases employs diverse fields of interest including statistics, computer science, and business, as well as pattern recognition, artificial intelligence, knowledge acquisition for expert systems, and more.
Knowledge discovery in databases encompasses all the processes, both automated and non-automated, that enhance or enable the exploration of databases, large and small, to extract potential knowledge. The most commonly referenced component of these processes is data mining, which involves activities oriented toward identifying patterns or models in data representation, classification, semantics, rules application, and so on.
The emphasis that KDD is a whole process is intended to clarify that knowledge seeking in data collections involves an intellectual and technological undertaking designed to seek useful knowledge, not merely to stir data. Certain basic premises underlie these information bases and needs: 2) finding patterns in data is not equivalent to discovering information; 3) data mining, to be effective, must be structured; 4) the results of any discovery activity have to be evaluated within a context; 5) the search is iterative; and 6) many aspects of KDD are dynamic and interactive in application.
Intelligent data analysis techniques are still not sophisticated enough to resolve some data problems on their own. Determining what is appropriate to include in a database, adding it to the classification and organizational scheme of the database, and providing access points for retrieval are neither trivial nor uniform tasks. The design and implementation of a database has relied on the purpose, scope, data characteristics, and technical limitations of the organization sponsoring the enterprise. The vitality of such a database depends on the imposition of appropriate criteria for inclusion, characterization, and organization, tasks which are common to each discovery effort, as well as on variations in the construction and quality of the data accommodated (Norton, 1999).
Chapter 6: Data Mining Ethics
Data mining and direct marketing are beneficial to the business community because they enable businesses to identify more accurately the target audience for their product or service, thereby reducing marketing costs. But a case must be made that this business practice is ethically justified in light of the potential privacy loss of individuals who transact with a business over the Internet, as well as of general Internet users. Assuming the economic benefits of this strategy, there is a presumptive ethical case in favor of data mining and direct marketing. From the consumer's perspective, these practices appear to be ethically justified because they can be beneficial to the consumer, insofar as reduced marketing costs lower product costs and marketing and advertising become more tailored to the individual's interests. This is essentially a utilitarian justification: lower costs and individual attention outweigh potential invasions of privacy (Morse & Morse, 2002).
From the perspective of the consumer, the practices of data mining and direct marketing appear to be ethically justified because they can be economically beneficial, primarily in two ways. First, through tracking, Internet users' general interests are carefully identified. When this information is combined with relevant information about past purchases, income level, and so on, marketers and advertisers possess a better understanding of the products or services in which a specific consumer might have some interest. Advertisers can target specific consumers in a more direct manner, and a consumer subsequently learns in a more direct manner about products that would interest her. The consumer benefit here is a reduction in the amount of "junk mail" and unnecessary advertisements she receives through a less direct marketing approach. Cattapan notes that "consumers will no longer be bothered by hundreds of unnecessary bothersome ads. Instead they will receive messages relevant to them." Furthermore, arguing that one of the roles of the cookie is to track the number of visits to banner ads on web sites, Cattapan explains that data about banner visitation can be helpful in reducing the number of banners or, as above, in identifying the appropriate banners given the demographics of a certain web site. In summary, the first benefit of data mining and direct marketing is, ideally, that the relevant data allow marketers to better understand the interests and purchasing behavior of Internet users. Presuming that Internet users do not enjoy being bombarded with information in which they have no interest, it is advantageous to marketers, advertisers, and consumers that the marketing and advertising process can be streamlined.
An overwhelming 86 percent of Web users in the United States reported concern about others gathering personal information about them. Internet companies claim this information is used beneficially to customize Web site content to individual interests. However, only 27 percent of respondents agreed that information is collected for users' benefit, while 54 percent viewed such practices as harmful.
The privacy controversy over collecting information on the Internet arises from the far-reaching, unprecedented capability to collect more detailed information and disseminate greater quantities of information. Once information is disclosed, the user loses control over it. The result is a potential for misuse, through secondary use of the information, either by the party who collected it or by a third party who purchases or otherwise obtains it. Further contributing to the privacy concern is the surreptitious nature of involuntary information disclosure on the Web.
LOSS OF ANONYMITY
Consumers may erroneously assume that Web interaction is anonymous. The capability to personally identify Web users, without their knowledge, is afforded through various technological tools, used alone or in combination with other methods. Even when the actual identity of the user is not known, a random identifier, such as a unique numeric user code, associates the information collected with that user. The potential for personal identification is present should a user register, make a purchase, or otherwise provide information such as a name, e-mail address, postal address, or telephone number, since the random number can easily be associated with such personal information.
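To illustrate how such a random identifier can be issued and later tied to personal details, the following is a minimal, hedged Python sketch using only the standard library; the cookie name visitor_id and the idea of joining it to a later registration are hypothetical and not drawn from any particular site's practice.

# Minimal sketch: issuing a random identifier in a cookie. If the visitor
# later registers or makes a purchase, the same identifier can be joined
# to a name, e-mail address, or other personal information.
import uuid
from http import cookies

jar = cookies.SimpleCookie()
jar["visitor_id"] = uuid.uuid4().hex   # random code, reused on every later visit
jar["visitor_id"]["path"] = "/"

# The header a server would send with its response:
print(jar.output())   # Set-Cookie: visitor_id=...; Path=/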
UNINTENDED USES OF DATA COLLECTED
Does any use of data collected, other than the original purpose of gathering that data, constitute unintended use? The underlying motivation for the various technological tools is to enhance the user's Web experience and better serve consumers' needs and desires, so the determination of what counts as unintended use is not so straightforward. In using cookies, for example, the on-screen visual cue of dimming the highlighting of a hot-spot after a user has selected it seems to be within the realm of intended use. A less clear example entails tracking user movements to record the pages viewed, the sequence of viewing, and the duration of viewing for the purpose of designing a custom interaction for that user's next visit. A more clear-cut example of unintended use is tracking navigation to determine what, and when, advertisements should be presented; the various technological tools were not originally intended for the benefit of marketers and advertisers.
SURREPTITIOUS DATA COLLECTION
Ethically, Web sites have an obligation to consumers to obtain informed consent for the collection and use of personal information. However, in the commercially competitive environment of E-commerce, information gathering may be undertaken without consumers' knowledge or permission. The mere awareness, on the part of a Web user, of the existence of data collection may impart an eerie feeling during real-time interaction. The knowledge that someone, somewhere, may be surreptitiously tracking every click of the mouse and every keystroke during Web navigation can be unsettling. Perhaps being both informed when data is being collected and presented with the opportunity to grant permission could remove the stealth reputation of these activities.
Notification and optional acceptance do not resolve the issue of surreptitious data collection. Consumers are currently still not always informed of what specific information is collected and for what purpose. To overcome the surreptitious reputation, some Web sites are posting their privacy policy to inform users about whether and how data is collected.
TRESPASSING INTO WEB USERS' RESOURCES
Web publishers use a visitor's own Web browser to transfer data, and the visitor's hard drive to store and retrieve files. The transfer is undetectable. The idea of others storing data on an individual's hard drive may be unpalatable. Further, the Web provider utilizes the user's resources potentially without the user's knowledge and certainly without the user's explicit permission. Files containing data, such as cookies files, may be deleted, but this requires an uninvited time commitment or monetary investment if specialized software such as cookie managers is purchased.
PROTECTING CONSUMER PRIVACY ON THE INTERNET
The growth in E-commerce has been accompanied by an increase in consumer awareness and concern about the privacy of personal data collected on the Internet. This concern is likely to translate into lost retail E-commerce sales. If privacy concerns are not addressed, E-commerce will not reach its full potential and consumers will not gain the confidence necessary to fully participate in the electronic marketplace.
CONSUMER ACTION
Consumers can take precautionary actions to reduce the potential of data being collected about them on the Internet. Consumers should read and understand the posted privacy policies of the Web sites they visit. Web sites with no posted policy or a policy with which the consumer disagrees should be avoided. Similarly, parents can control what their children access and the personal information they volunteer, either by parental supervision or installing protective software to block unacceptable sites. Various types of protective software can be installed based on consumers' desires to, for example, block access to selected Web sites, block ads and pop-up windows, or eliminate cookies and Web bugs. Consumers, through their own action, can thereby take some control in protecting their own privacy, rather than rely on businesses or the government to afford privacy protection.
U.S. GOVERNMENT ACTION
Since the 1930s, the U.S. Government has been involved in regulating the commercial environment. In response to both commercial interests and the privacy concerns of consumers, legislation applicable to those concerns has been enacted within the U.S.
The FTC Act is intended to ensure that companies uphold their promises to consumers, including those regarding privacy. The FTC Act prohibits "unfair methods of competition" and "unfair and deceptive acts or practices in or affecting commerce." Further, the FTC is specifically charged with protecting the privacy of children through the Children's Online Privacy Protection Act (COPPA), which became effective in April 2000.
Legislation directed toward specific industries has been enacted, including the Gramm-Leach-Bliley (GLB) Financial Services Act, the Fair Credit Reporting Act (FCRA), and the Health Insurance Portability and Accountability Act (HIPAA) of 1996. Effective July 1, 2001, the GLB Act applies to consumer information collected in the fields of banking, credit, and insurance. Financial institutions, as defined in GLB, are required to disclose their privacy policies, and the GLB Act allows consumers, through an "opt-out" option, to disallow their financial institutions from sharing personal financial information with nonaffiliated third parties. Further, financial institutions must have policies for protecting against unauthorized access to consumers' personal information. The FCRA, which was enacted in 1970, limits the purposes for which a consumer credit report can be obtained or provided. HIPAA is a collection of requirements, directed at patient medical records, which mandates the establishment of privacy protections for health-care information. Organizations maintaining or transmitting health information are required to undertake reasonable and appropriate administrative, technical, and physical safeguards.
Two acts directed specifically at telecommunications and computer use were enacted in 1986. The Electronic Communications Privacy Act (ECPA) prohibits the interception of electronic communications and unauthorized access to stored electronic communications. The Computer Fraud and Abuse Act (CFAA) provides both civil and criminal protection against intentional unauthorized access to a computer, through interstate or foreign communication, to obtain information or to cause damage.
At the close of the 107th U.S. Congress, in session from 2000 to 2002, more than 50 bills relating to privacy were introduced, many of which targeted online privacy. In the 108th Congress currently in session, there are multiple proposed bills that impact on various information and Internet privacy issues. These bills are currently being addressed at the committee level of the House of Representatives and the Senate (Sipior & Ward, 2004).
Perhaps the challenge that KDD and data mining pose for normative privacy is best expressed by Fulda when he asks: is it possible for data that does not itself deserve legal protection to contain implicit knowledge that does deserve legal protection, and, if so, what balance must be struck between the freedom to use whatever knowledge one has at one's disposal to further one's own ends and the freedom not to have one's personal data mined into knowledge that will be used as a means to someone else's ends?
Essentially Fulda poses two separate and key questions for us to consider: 1) does data that contains implicit knowledge about persons deserve legal, or perhaps some other kind of normative, protection; and 2) assuming that it does, how can we frame a coherent policy that would balance the interests of data subjects (Tavani, 1999)?
References:
Alexander, D. (n.d.). Data Mining. Retrieved April 2011, from utexas.edu: http://www.laits.utexas.edu/~norman/BUS.FOR/course.mat/Alex/
AL-Fedaghi, S., & Scheurmann, P. (1981). Mapping Considerations in the Design of Schemas for the Relational Model. IEEE Transactions on Software Engineering , SE-7 (1), 101.
Bentayeb, F., Favre, C., & Boussaid, O. (2008). A user-driven data warehouse evolution approach for concurrent personalized analysis needs. Integrated Computer Aided Engineering , 21-25.
Bock, D. B. (1997). Entity Relationship Modeling and Normalization errors. Journal of Database Management , 1-9.
U.S. Census Bureau. (2009). Statistical Abstract 2009. Retrieved from http://www.census.gov/compendia/statab/
California Software Labs. (n.d.). On Line Analytical Processing (OLAP).
Carpenter, D. A. Clarifying Normalization. Journal of Information System Education , 19 (4), 379-380.
Cary, C. H. (2003). Data mining: Consumer privacy, ethical policy, and systems development practices. Human Systems Management , 22 (4).
Chilton, M. A. Data Modeling Using Entity Relationship Diagrams: A Step Wise Method. Journal of Information Systems Education , 17 (4).
Chisholm, M. (2007, October). Data Models are Not Database Design. DM Review , p. 45.
Clark, M. R. (2004). A golden vein. Economist , 371 (8379), 22-23.
Corbitt, T. (2006, April). The Power of Data Mining and Warehousing. Credit Management , 32-33.
Data Mining Issues. (n.d.). Retrieved April 13, 2011, from ucla.edu: http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologie...
Data mining: proprietary rights, people and proposals. (2009). Business Ethics: A European Review, 18 (3), 245-247.
Derthick, M., Kolojejchick, J., & Roth, S. (1997). An Interactive Visualization Environment for Data Exploration. Proceedings of Knowledge Discovery in Databases, AAAI Press , 2-9.
Ellis, P. D. (2005). Research and Strategic Communication. Upper Saddle River, New Jersey: Pearson Education Inc.
Fayyad, U., Haussler, D., & Stolorz, P. (1996). KDD for Science Data Analysis: Issues and Examples. AAAI.org , 50.
Fayyad, U., Shapiro, G., & Smyth, P. (1996). From Data Mining to Knowledge Discovery in Databases. AI Magazine , 17 (3).
Glass, K. (2008). Internet privacy best protected by industry, FTC says. Knight Ridder Tribune Washington Bureau, 1-2.
Grant, G. ERP & Data Warehousing in Organizations: Issues and Challenges. Canada: Carleton University.
Hay, D. C. (1998, Winter). Making data models readable. Information Systems Management , 15 (1), pp. 13-21.
James, B. (2011, April 20). The Importance of Business Understanding in Requirements Structuring. Retrieved May 2011, from umsl.edu: http://www.umsl.edu/~bcjtz4/umsl/er_diagrams.html
Jukic, N. (2006, April). Modeling strategies and alternatives for data warehousing projects. Communications of the ACM, p. 83.
Kim, W., Choi, B. J., Kim, S.-K., & Lee, D. (2003). A Taxonomy of Dirty Data. Data Mining and Knowledge Discovery.
Klimavicuus. (2008). Towards Development of Solution for Business Process Oriented Data Analysis. Proceeding of World Academy of Science , Engineering and Technology , 27.
Laczniak, G. R., & Murphy, P. E. (2006). Marketing, consumers and technology: Perspectives for enhancing ethical transactions. Business Ethics Quarterly, 16 (3), 315-316.
Li, H., & Chu, J. (n.d.). Power analysis system based on data warehouse. Computer Department of Shandong University .
Mai, B., Menon, N., & Sarkar, S. (2010). No Free Lunch: Price Premium for Privacy Seal Bearing Vendors. Journal of Management Information Systems .
Microsoft. (n.d.). Chapter 12 Data Warehousing Framework. Retrieved April 20, 2011, from microsoft.com: http://technet.microsoft.com/en-us/library/cc966470.aspx
Mining Customer. (1999). Sales & Marketing Management , 151 (12).
Morse, J., & Morse, S. (2002). Teaching Temperance to the "Cookie Monster": Ethical Challenges to Data Mining and Direct Marketing. Business and Society Review , 107 (1), 76-97.
Nepomjashly, A. (n.d.). OLAP and Data Warehousing (12 Rules).
Niemi, T., & Hirvonen, L. (2003). Multidimensional Data Model and Query Language for Informetrics. Journal of the American Society for Information Science and Technology, 939-941.
Norton, M. (1999). Knowledge discovery in databases. Library Trends , 48 (1), 9-21.
Onis, R., & Bucea-Manea. (2010). Oracle Cube Maker for SMES. The Bucharest Academy of Economic Studies, Romania , 9 (1), 146-165.
Shoval, P., & Danoch, R. (2005). Hierarchical entity-relationship diagrams: the model, method of creation and experimental evaluation. Department of Information Systems Engineering, 217.
Politano, A. (2001). Salvaging information engineering techniques in a data warehouse environment. Computer Technology Review, 21 (2), 53.
Pollach, I. (2006). Privacy Statements as a Means of Uncertainty Reduction in WWW Interactions. Journal of Organizational and End User Computing, 1-15.
Rawlings, I. (1999, Sept). Using data mining and warehousing for knowledge discovery. Computer Technology Review , 19 (9), p. 20.
Reddy, S., Srinivasu, R., Rao, M., & Rikkula, S. (n.d.). Data warehousing, data mining, OLAP and OLTP technologies are essential elements to support decision making process in industries. International Journal on Computer Science and Engineering, 2869.
Reingruber, M., & Gregory, W. (1994). The Data Modeling Handbook: A Best Practice Approach to Building Quality Data Models. John Wiley & Sons Inc.
Scalzo, B. (2008). TechRepublic. Retrieved from TechRepublic.com.
Sipior, J., & Ward, B. T. (2004). Ethics of Collecting and Using Consumer Internet Data. Information Systems Management (1), 58-66.
Song, I. Y., & Froehlich, K. (n.d.). A Practical Guide to Entity Relationship Modeling. Drexel University , 228-231.
Tavani, H. (1999). KDD, data mining, and the challenge for normative privacy. Ethics and Information Technology.
Thalheim, B. Extended Entity Relationship Model.
Chenoweth, T., & Corral, K. (2006). Seven Key Interventions for Data Warehouse Success. Communications of the ACM, 49 (1), 115-117.
Tuzhilin, A. (2008, Feb). Managing Large Collections of data mining models. Communications of the ACM , p. 85.
Wu, C. M., & Buchman, A. Research Issues in Data Warehousing.
-
-
1.1 What is Virtualization?
Virtualization is the ability to run multiple virtual machines on a single piece of hardware. The hardware runs software that enables you to install multiple operating systems, which run simultaneously and independently, each in its own secure environment, with minimal reduction in performance. Each virtual machine has its own virtual CPU, network interfaces, storage, and operating system.
1.2 Why Virtualize?
With increased server provisioning in the datacenter, several factors stifle growth. Increased power and cooling costs, physical space constraints, manpower, and interconnection complexity all contribute significantly to the cost and feasibility of continued expansion.
Commodity hardware manufacturers have begun to address some of these concerns by shifting their design goals. Rather than focusing solely on raw gigahertz performance, manufacturers have enhanced the feature sets of CPUs and chipsets to include lower-wattage CPUs, multiple cores per CPU die, advanced power management, and a range of virtualization features (a brief check for these CPU features is sketched after the list below). By employing appropriate software to enable these features, several advantages are realized:
-
Server Consolidation: By combining workloads from a number of physical hosts onto a single host, a reduction in servers can be achieved, along with a corresponding decrease in interconnect hardware. Traditionally, these workloads would need to be specially crafted, partially isolated, and well behaved, but with new virtualization techniques none of these requirements is necessary.
-
Reduction of Complexity: Infrastructure costs are massively reduced by removing the need for much of the physical hardware and networking. Instead of having a large number of physical computers, all networked together and consuming power and administration effort, fewer computers can be used to achieve the same goal. Administration and physical setup are less time consuming and costly.
-
Isolation: Virtual machines run in sand-boxed environments. Virtual machines cannot access the resources of other virtual machines. If one virtual machine performs poorly, or crashes, it does not affect any other virtual machine.
-
Platform Uniformity: In a virtualized environment, a broad, heterogeneous array of hardware components is distilled into a uniform set of virtual devices presented to each guest operating system. This reduces the impact across the IT organization: from support, to documentation, to tools engineering.
-
Legacy Support: With traditional bare-metal operating system installations, when the hardware vendor replaces a component of a system, the operating system vendor is required to make a corresponding change to enable the new hardware (for example, an Ethernet card). As an operating system ages, the operating system vendor may no longer provide hardware enabling updates. In a virtualized operating system, the hardware remains constant for as long as the virtual environment is in place, regardless of any changes occurring in the real hardware, including full replacement.
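As a small, hedged illustration of the hardware virtualization features mentioned before this list, the Python sketch below checks a Linux host's /proc/cpuinfo for the Intel VT-x (vmx) or AMD-V (svm) CPU flags; this is only an illustrative check and is not part of Oracle VM or Xen themselves.

# Minimal sketch: detect hardware virtualization support on a Linux host
# by looking for the Intel VT-x (vmx) or AMD-V (svm) CPU flags.
def has_hw_virtualization(cpuinfo_path="/proc/cpuinfo"):
    with open(cpuinfo_path) as f:
        for line in f:
            if line.startswith("flags"):
                flags = line.split(":", 1)[1].split()
                return "vmx" in flags or "svm" in flags
    return False

if __name__ == "__main__":
    print("Hardware virtualization supported:", has_hw_virtualization())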
1.3 Xen™ Technology
The Xen hypervisor is a small, lightweight, software virtual machine monitor, for x86-compatible computers. The Xen hypervisor securely executes multiple virtual machines on one physical system. Each virtual machine has its own guest operating system with almost native performance. The Xen hypervisor was originally created by researchers at Cambridge University, and derived from work done on the Linux kernel.
The Xen hypervisor has been improved and included with Oracle VM Server.
1.4 Oracle VM
Oracle VM is a platform that provides a fully equipped environment for better leveraging the benefits of virtualization technology. Oracle VM enables you to deploy operating systems and application software within a supported virtualization environment. The components of Oracle VM are:
-
Oracle VM Manager: Provides the user interface, which is a standard ADF (Application Development Framework) web application, to manage Oracle VM Servers, virtual machines, and resources. Use Oracle VM Manager to:
-
Create virtual machines
-
Create server pools
-
Power on and off virtual machines
-
Pause and unpause live virtual machines
-
Deploy virtual machines
-
Manage virtual NICs (Network Interface Cards), disks and shared disks
-
Create virtual machine templates from virtual machines
-
Import virtual machines and templates
-
Manage high availability of Oracle VM Servers, server pools, and guest virtual machines
-
Perform live migration of virtual machines
-
Import and manage ISOs
-
-
Oracle VM Server: A self-contained virtualization environment designed to provide a lightweight, secure, server-based platform for running virtual machines. Oracle VM Server is based upon an updated version of the underlying Xen hypervisor technology, and includes Oracle VM Agent.
-
Oracle VM Agent: Installed with Oracle VM Server. Oracle VM Manager communicates with Oracle VM Agent to manage the Oracle VM Servers and virtual machines running on it.
-
-
What are OCR Engines?
Optical Character Recognition, abbreviated as OCR, is the software technology used to convert typed or handwritten content into a machine-readable, editable format. OCR engines are used to read typed (machine-printed) characters. They read upper- and lower-case letters, accented letters, symbols, and punctuation quickly and easily, and broken characters, text lines, and the like are also readily recognized.
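As a hedged illustration of how an application might call an OCR engine, the Python sketch below uses the open-source Tesseract engine through the pytesseract wrapper; the image file name is hypothetical, and Tesseract itself must be installed separately for the wrapper to work.

# Minimal sketch: extracting text from a scanned page with the Tesseract
# OCR engine via the pytesseract wrapper (the file name is hypothetical).
from PIL import Image
import pytesseract

image = Image.open("scanned_page.png")

# Recognize the page; a language pack can be chosen explicitly.
text = pytesseract.image_to_string(image, lang="eng")
print(text)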
Features of Good OCR Engines
An OCR engine should be designed for industrial-strength, corporate-volume scanning and OCR needs. Thorough, robust functionality and configurations for speed, volume, and automation are required. The most common and powerful type of OCR engine can read the more stylized fonts commonly available on the desktop PC. Such engines generally do not perform well on fonts designed specifically for recognition, such as OCR-A, because those fonts have peculiarities that set them apart from more standard fonts. Other OCR engines are trained specifically to read fonts such as OCR-A, OCR-B, and MICR as used on checks. A large dictionary, despeckling, format retention, batch retention, and easy error correction are the features to look for in a good OCR engine.
What is the Advantage of Using an OCR Engine?
The powerful design and accurate recognition features of an OCR engine make it easy to integrate into mobile or handheld devices, and OCR engines are available for easy download from the Internet. High-speed image processing and accurate recognition make an OCR engine useful for unique identification, business card recognition, forms processing, and more. An OCR engine is capable of automatic layout analysis, recognition of selected fields on paper, and so on, and is also compatible with different languages and multiple document types. Users are able to locate a single word within an entire multi-page document, which saves time when searching, editing, and copying, so employees can be assigned other productive tasks. A highly accurate OCR engine identifies text within low-resolution captured documents, documents containing multi-directional text, and documents containing colored text. More accurate OCR results translate into greater efficiency in managing scanned documents.