
SQL filter to show one layer only if another layer is missing

I have two Oracle tables that represent buildings in my town. The first table represents buildings that are surveyed and measured accurately but doesn't cover all parts of the town, while the second table represents less accurately measured buildings but covers the whole town.

table_1 - accurate data - partial coverage
table_2 - not accurate data - full coverage

I want to write some sort of SQL filter to show the data from table_2 only if no data from table_1 is present in that area. In other words, I want the whole town to be covered by buildings: show all the areas that have the accurate buildings and fill up the rest of the areas with the not-so-accurate buildings!

The tables are stored in an Oracle 11g database. The map application I am using is built on MapServer, and the layers are basically defined in XML files that accept SQL filters.

So far, I have done the following:

select table_2.* from table_2, table_1 WHERE SDO_ANYINTERACT(table_2.geom, table_1.geom)= 'TRUE';

The problem with this method is that it gives me the inaccurate buildings in the places where they interact with the accurate buildings. What I really want is to set SDO_ANYINTERACT to 'FALSE' to get the inaccurate buildings in the places where no interaction occurs between the buildings, but of course Oracle gives an error when it is set to 'FALSE'.

Any suggestions?


Assuming both tables use some building ID that is the same for a building in either table, something like

SELECT [columns] FROM table_1 UNION SELECT [columns] FROM table_2 WHERE NOT EXISTS (SELECT 1 FROM table_1 WHERE table_1.BuildingID = table_2.BuildingID)

You have to take a look at your two data sets and find some rule that defines whether two buildings are the same.

For instance, you can check if the centroid of a polygon in one layer is inside a polygon in the other layer.

I am not familiar with the functions in Oracle, but I guess it is about the same as in PostGIS.

So, if you can use the method above, it would look something like this in PostGIS:

SELECT COALESCE(a.geom,b.geom) geom FROM accurateTable a FULL JOIN draftTable b ON ST_Intersects(ST_Centroid(a.geom),b.geom)
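
If you need to stay in Oracle Spatial for the layer filter, a similar anti-join can be sketched with SDO_ANYINTERACT inside NOT EXISTS. This is only a sketch, assuming table_1.geom carries the spatial index; adjust the names to your schema:

-- Keep a table_2 (draft) building only when it interacts with no table_1 (accurate) building
SELECT t2.*
FROM table_2 t2
WHERE NOT EXISTS (
  SELECT 1
  FROM table_1 t1
  WHERE SDO_ANYINTERACT(t1.geom, t2.geom) = 'TRUE'
);

The accurate buildings can then be drawn as their own layer on top, so together the two layers cover the whole town.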


SAP HANA Calculation Views

More and more SAP system users are implementing SAP HANA as the foundation and database for existing SAP BW installations. Some may just be replacing an existing database with the modern SAP HANA database to gain speed in query execution and warehouse management. This is, of course, a keystone of an SAP HANA implementation within SAP BW, but not the only one.

From my point of view, the tight integration of SAP HANA functionality using SAP HANA studio or Eclipse to design new ways of data warehouse management is at least as important as reporting performance.

Why stick to old habits in the SAP application layers, storing more and more data in persistency layers, when it's possible to calculate new reporting data on the fly? Using SAP HANA-based staging to enhance data flows without physically staging incoming data in several layers (SAP staging layer architecture [SLA]) is something I am working on at a customer site.

My approach can help you to be flexible when it comes to design changes because there is no unload or reload of data in staging layers when business logic changes. It also keeps your operation costs for SAP HANA at a minimal level because less data in an SAP HANA database means less expense for licensing.

(In a modern SAP BW powered by SAP HANA environment, it is wise to switch to virtual data staging instead of old-school data staging via persistency layers. SAP HANA calculation views and procedures for complex scenarios help to calculate data on the fly to cut down unload and reload phases to zero when it comes to business-related changes in data staging or extension/reduction in a data model. In the SAP HANA environment using SAP HANA studio or Eclipse, SAP BW operators can very easily switch from the old-fashioned models to the new calculation-view-based modeling concept or run a mixed scenario of both worlds.)

An SAP HANA procedure is a database-stored procedure that gives you programming functionality with the help of SAP HANA SQL Script, similar to ABAP functions. I will now guide you through the creation of SAP HANA data flows with the help of calculation views (graphical and SQL Script-based views) and point out some pitfalls you may run into.

My example (Figure 1) is a simple mixed approach.

An example of the mixed data approach using calculation views

That means you use existing SAP BW inbound data (DataStore object [DSO] based, as in an SAP BW entry layer). You also join SAP BW table information (such as an SAP Data Dictionary [DDIC] table) into one combined calculation view you can use in combination with an SAP HANA Composite Provider for BEx or Analysis for Office reporting. You can also use calculation views directly in Analysis for Office reporting. (I do not cover authorizations such as direct reporting on calculation views without a Composite Provider).

All the screenshots are based on Eclipse as the development studio for SAP HANA BW development. I am running Eclipse Neon 3 and the latest SAP BW and SAP HANA add-ons. The current set of SAP BW modeling tools can be found in SAP Note 1944835 – SAP BW Modeling Tools – Delivery Schedule.

In the following explanations and examples, I use data based on an external SAP HANA view. In my example, I use technical DSO content data to feed and consume data in a graphical calculation view. I use this data in combination with a SQL Script calculation view in a third graphical calculation view to join both data sets. I explain the difference between accessing data via regular joins in a graphical calculation view and via SQL Script-based calculation views, and explain the advantages of SQL Script calculation views. At the end, you can use the third calculation view, joining all views at a single point, to feed a Composite Provider for reporting users with the help of BEx queries or directly in combination with Analysis for Office.

(Note: I assume that you are familiar with SAP HANA studio or Eclipse. Therefore, I do not explain the basics, such as switching perspectives.)

Start Your Walk Through SAP HANA Calculation Views

First, switch to the SAP HANA Development perspective (Figure 2). I recommend using this perspective for SAP HANA development because in all other perspectives you are unable to create SAP HANA hdb procedures.

The SAP HANA Development perspective

The first calculation view is a plain graphical view. Right-click your development package and select New and then Calculation View as shown in Figure 3.

Dialog to choose a calculation view

This action displays the screen shown in Figure 4.

(Note: If your screen looks different, make sure that you have selected the Repositories tab. I recommend this approach because the SAP HANA Development perspective and the Repositories display show all the functionality that can be used, some of which is missing from the SAP HANA modeling perspective and the Systems tab.)

In Figure 4 enter your desired calculation view technical name, such as MY_NEW_VIEW (not shown). Keep the default Type, which is Graphical, and click the Finish button.

By default, the system generates two objects: Semantics and Aggregation objects (Figure 5).

SAP HANA calculation view scenario area

On the left, you find all the available design objects. Start with the Projection option. Drag and drop it to the blank area of the scenario. The Projection gives access to all available objects within SAP HANA, such as DDIC tables, InfoProviders, master data, and already existing SAP HANA views.

When moving your mouse over the projection object (Figure 6), you see a green plus sign.

Projection with displayed data access symbol

Click that plus icon (data access) to open a dialog in which you can search for your desired object. You can read the data and then add that object to your actual view.

The dialog that opens after you click the plus icon allows you to search for your desired object. In my example, I entered the search string for SAP BW technical content object WIS_C03 (a copy of SAP technical cube TCT_C03, which is an InfoCube). As you can see in Figure 7, the search result returns all available SAP HANA objects, such as the InfoProvider itself and all existing partial (e.g., dimensions) tables.

Search result for projections

I choose the InfoCube itself, indicated by the Cxx postfix (the third item from the top), by double-clicking the entry or simply selecting it and clicking the OK button. I recommend giving ADSOs (advanced DSOs) technical names that refer to their type, such as C (for cube).

Besides adding the selection to your projection and displaying the name, you see the structure of the selected object in the studio detail pane (in my example, the InfoCube structure), displaying InfoObjects as well as cube SID entries (Figure 8).

Projection with displayed structure

Clicking the bullet-shaped icons in the detail view adds that particular InfoObject to the output of that projection (Figure 9) and changes its color to orange (selected). This selection is similar to adding fields of a table to a customized view in transaction SE11. The LED-shaped icon works like a toggle switch: a gray color means the field is turned off (not used), while orange means the field has been turned on.

Projection with three selected InfoObjects

Because InfoCubes use SIDs as well as InfoObject keys, the selection might be a bit time-consuming. For a better view of InfoCubes, I recommend activating the external SAP HANA view of that particular InfoCube via the change view in the Administration Workbench. Or you could automate the activation of all cubes/DSOs.

If you already activated the external SAP HANA view, you can select that view directly from the Projection search dialog (Figure 10).

Activated external SAP HANA view

Selecting this external SAP HANA view makes your life with calculation views much easier. As you can see (Figure 11), only InfoObjects and their text elements, if they exist, are displayed. You can add them to the output of your projection.

Projection of external SAP HANA view for DSO type InfoCube WIS_C01

To map the projection to the existing aggregation object in the scenario pane, click the circle icon above Projection_2. Hold down the mouse button, drag a connection (line) to the bottom icon of the aggregation object, and release. Now your projection is connected (Figure 12).

Projection is connected to Aggregation

Selecting the aggregation object unveils the structure of all the fields of your projection. You can decide whether you want all the fields or just a few. If you want all of them, the best approach is to right-click the header area (black header) and select the Add All To Output option (Figure 13), which automatically maps all existing fields to the output without your having to select the entries one by one.

Add all the fields to the output dialog

After this step is done, you can activate the calculation view by clicking the activation icon from the top menu in Eclipse (Figure 14).

Activate the calculation view

After activation, you can directly display the data for each individual object by right-clicking the object (e.g., projection) and selecting Data Preview (Figure 15). (The details of querying the data are beyond the scope of this article.)

Dialog to display the data preview

To enrich my example with some more details, I add the data from a new ADSO. As before, drag Projection to the scenario, click the plus icon, and select your desired object (in my case the InfoProvider is named WIS_D01 as shown in Figure 16).

Add DSO WIS_D01 to the second projection

You should see both projections. The first one is already connected to the aggregation object. Now you join those two tables.

(So that you do not lose all existing mapping, drag and drop the join object from the right side directly onto the connection line between the first projection and the aggregation object. This automatically adds the join without breaking any mapping. Now you should have the scenario shown in Figure 17.)

A scenario with the added join object

The last step is to drag the output of the second projection to the input of the displayed join.

Since you now have a join between both objects (WIS_C01 and WIS_D01), you need to define the join condition between your desired fields.

Just select the output fields by clicking the orange bullets, and for the join condition drag and drop a line between the join objects. I joined on 0CALDAY by transferring the data from Projection_2 and just turning on additional fields of Projection_1. InfoObject 0CALDAY (Figure 18) is now the join condition, but it will not be transferred to the output data, because otherwise you would have the field 0CALDAY twice.

Join with join condition on 0CALDAY

Right-clicking the join connection opens a dialog in which you can swap the tables (important for outer joins) and display the Edit… option (Figure 19).

When you are using the join edit mode, a new dialog appears in which you can set the different join conditions as well as the cardinality (Figure 20). To keep it simple, leave the standard join type (inner join) active.

As you can see, the graphical join does not allow you to define complex join conditions (e.g., with WHERE statements or complex filters) as done in ABAP joins. To create such joins, you need SQL-based calculation views. In these calculation views you can define whatever join condition you like.
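
For comparison, the body of a scripted (SQL Script-based) calculation view could express such a filtered join directly. This is only a sketch; the package, view, and column names below are placeholders rather than objects from the example:

var_out = SELECT c."0CALDAY", c."AMOUNT", d."STATUS"
          FROM "_SYS_BIC"."mypackage/CV_WIS_C01" AS c
          INNER JOIN "_SYS_BIC"."mypackage/CV_WIS_D01" AS d
            ON c."0CALDAY" = d."0CALDAY"
          WHERE d."STATUS" = 'A';   -- e.g., restrict the join to active records only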

As an alternative, you can pass the coding for such joins and table accesses to an SAP HANA procedure to be more flexible. These procedures can be used system wide for all purposes. First, you have to create a new procedure. Right-click your package and select New and then Other… as shown in Figure 21.

In the next dialog select Stored Procedure (Figure 22).

Provide a name and target schema for the new procedure (Figure 23).

The initial dialog for stored procedures (hdb procedures)

Now enter your SQL Script code (Figure 24). My example adds the activation and request information from SAP BW tables RSREQDONE and RSMONICDP to the final calculation view.
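
For illustration, a procedure in this spirit might look roughly like the following. The schema, procedure name, output structure, and join columns are assumptions made for the sketch, not the exact code shown in Figure 24:

CREATE PROCEDURE "MYSCHEMA"."ZP_REQUEST_STATUS" (
  OUT et_requests TABLE (
    RNR       NVARCHAR(30),
    TSTATUS   NVARCHAR(1),
    DATAPAKID NVARCHAR(6)
  )
)
LANGUAGE SQLSCRIPT
SQL SECURITY INVOKER
READS SQL DATA
AS
BEGIN
  -- Combine request status (RSREQDONE) with request monitoring entries (RSMONICDP);
  -- the join on RNR and the selected columns are assumptions for this example.
  et_requests = SELECT d."RNR", d."TSTATUS", m."DATAPAKID"
                FROM "SAPSR3"."RSREQDONE" AS d
                INNER JOIN "SAPSR3"."RSMONICDP" AS m
                  ON d."RNR" = m."RNR";
END;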


3 Answers

How large are these images, and how many do you expect to have? While I mostly agree with @sp_BlitzErik, I think there are some scenarios where it is ok to do this, and so it would help to have a clearer picture of what is actually being requested here.

Some options to consider that alleviate most of the negative aspects pointed out by Erik are:

Both of these options are designed to be a middle ground between storing BLOBs either fully in SQL Server or fully outside (except for a string column to retain the path). They allow BLOBs to be a part of the data model and participate in transactions while not wasting space in the buffer pool (i.e. memory). The BLOB data is still included in backups, which does make them take up more space and take longer to back up and restore. However, I have a hard time seeing this as a true negative given that if it is part of the app then it needs to be backed up somehow, and having only a string column containing the path is completely disconnected and allows for BLOB files to get deleted with no indication of that in the DB (i.e. invalid pointers / missing files). It also allows for files to be "deleted" within the DB but still exist on the file system, which will eventually need to be cleaned up (i.e. a headache). But, if the files are HUGE, then maybe it is best to leave them entirely outside of SQL Server except for the path column.

That helps with the "inside or outside" question, but does not touch on the single table vs. multiple table question. I can say that, beyond this specific question, there are certainly valid cases for splitting tables into groups of columns based on usage patterns. Often when one has 50 or more columns there are some that are accessed frequently and some that are not. Some columns are written to frequently while some are mostly read. Separating frequently accessed vs. infrequently accessed columns into multiple tables having a 1:1 relationship is quite often beneficial because why waste space in the buffer pool for data you probably aren't using (similar to why storing large images in regular VARBINARY(MAX) columns is a problem)? You also increase the performance of the frequently accessed columns by reducing the row size and hence fitting more rows onto a data page, making reads (both physical and logical) more efficient. Of course, you also introduce some inefficiency by needing to duplicate the PK, and now sometimes you need to join the two tables, which also complicates (even if only slightly) some queries.
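
As a hedged illustration of that 1:1 split (table and column names are invented for the example), the frequently accessed columns stay in one table and the BLOB moves to a companion table sharing the primary key:

-- Hot columns only: small rows, more rows per page, cheap scans
CREATE TABLE dbo.Document (
    DocumentID INT NOT NULL CONSTRAINT PK_Document PRIMARY KEY,
    Title      NVARCHAR(200) NOT NULL,
    CreatedAt  DATETIME2 NOT NULL
);

-- Rarely read BLOB data: 1:1 with dbo.Document via the shared key
CREATE TABLE dbo.DocumentImage (
    DocumentID INT NOT NULL
        CONSTRAINT PK_DocumentImage PRIMARY KEY
        CONSTRAINT FK_DocumentImage_Document REFERENCES dbo.Document (DocumentID),
    ImageData  VARBINARY(MAX) NOT NULL
);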

So, there are several approaches you could take, and what is best depends on your environment and what you are trying to accomplish.

I was under the impression that SQL Server only stores a pointer to some dedicated BLOB data structure in the table

Not so simple. You can find some good info here, What is the Size of the LOB Pointer for (MAX) Types Like Varchar, Varbinary, Etc?, but the basics are:


3 Answers

(Adding a new answer which should be definitive, leaving the old around as it's useful debug for how we got here. Credit for pointing to the actual answer in comments goes to @P4cK3tHuNt3R and @dave_thompson_085)

Using Wireshark, I am trying to determine the version of SSL/TLS that is being used with the encryption of data between a client workstation and another workstation on the same LAN running SQL Server.

You are viewing a connection which uses MS-TDS ("Tabular Data Stream Protocol"):

If you view the TDS protocol documentation, it specifies that the SSL packets are encapsulated within a TDS wrapper:

In the Microsoft Message Analyzer screencap you posted, we can see the TDS header (boxed in Red, starts with 0x12), followed several bytes later by the TLS CLIENT_HELLO packet (boxed in Blue, starts with 0x16 0x03 0x03):

0x03 0x03 is the TLS version (TLS 1.2, as per RFC 5246):

The version of the protocol being employed. This document describes TLS Version 1.2, which uses the version { 3, 3 }. The version value 3.3 is historical, deriving from the use of {3, 1} for TLS 1.0.

So the simple answer to your question, "determine the version of SSL/TLS", is "TLS 1.2".

Now, I've seen varying reports as to whether Wireshark can properly parse TDS packets with encoded TLS. I think that the answer is what you started with - it will tell you TLS is there, but won't parse the details as it would with a native TLS session.

As per this StackOverflow question, it appears that Microsoft Network Monitor is capable of parsing both levels of encapsulation. And a comment therein states that Microsoft Message Analyzer is the newer equivalent of that tool.

I just use this filter in Wireshark to find TLS 1.0 traffic:

0x0302 is TLS 1.1 and 0x0303 is TLS 1.2.

(Ignore this answer, which I'm leaving for historical data, and read my other answer, which explains what's actually going on)

Update after an example packet was added to the question -

The packet you've provided is clearly not a TLS packet. Looking at the hex you've provided, the first three octets of the TCP data are 12 01 00 , but for a TLS packet the first three bytes should be 16 03 0X , where 0x16 means TLS "Handshake" record type, 0x03 means SSLv3/TLSv1.*, and the 0x0X indicates the TLS version - 0x01 for TLS 1.0, 0x02 for TLS 1.1, and 0x03 for TLS 1.2.

Additionally, there's a cleartext "sqlexpress2012" string in the packet, which wouldn't be there if this was a TLS Client Hello.

(How did I decide 12 01 00 was the beginning of the data? The first 14 bytes of the packet are the Ethernet header. The next 20 bytes are the IP header. The 13th byte of the TCP header is 0x50, and the first nibble of that byte times 4 is the TCP header length, so 5*4 = 20. So the first bytes of actual data start 54 bytes in at 12 01 00 6c 00 00 . )

So if Wireshark won't display this as TLS, that's because it isn't. You should revisit your server configuration.

Original answer:

Because those packets are not on a standard TLS port (e.g., 443), you need to tell Wireshark to interpret them as TLS packets. By default, port 1433 is not interpreted as carrying TLS; the default for TDS is to be unencrypted. So by itself Wireshark will not parse it as TLS:

In order to change this, right-click on one of the packets and select "Decode As". Make sure the port "value" is set to 1433 and then set "Current" to SSL:

Click OK and when you return to the packets you'll see they're now interpreted in more detail:

Finally, if you look at the detail pane for one of the packets (I suggest using the server hello, not the client hello, in case protocol was adjusted) you'll see the TLS version quite clearly:


Imperva Application Security

Imperva security solutions secure your applications across multiple layers of the OSI model, from the network layer, protected by Imperva DDoS mitigation, to Imperva’s web application firewall (WAF), bot management and API security technology that safeguards the application layer.

To secure applications and networks across the OSI stack, Imperva provides multi-layered protection to make sure websites and applications are available, easily accessible, and safe. The Imperva application security solution includes:


4 Answers

First, hoping that IP address control can add strong security is kind of a brave assumption. IP theft should always be considered a possible attack.

Next, if there is no firewall at all in front of a SQL server, the most serious risk is not legitimate requests (i.e. with a correct username and password) coming from illegitimate origins, but rogue attacks targeting possible flaws in the server itself, like unpatched vulnerabilities or unwanted services left open. This would really be a risk on a production server.

Finally, end users should not directly connect to the database server on modern systems. A database server is a very large piece of code that admins prefer to hide behind an application server, which is the only system that directly connects to the database. But if the application server and the database server are not in the same datacenter, the security of the database is only provided by its passwords. Full stop. At best, you could try to set up a VPN between the application host and the database server to provide an additional security layer, but I do not know whether that is an option on WinHost.

In the end, all this boils down to: security has a cost, and the higher the security, the more expensive it gets. But only you can know whether your security requirements are compatible with WinHost hosting: what are the threats, and what is the risk/cost if the database is compromised.

Someone somehow gets a hold of some credentials to log in.

If you are there, do not worry about IP blocking or not, especially if you have sensitive or valuable information. It is just time to change all the passwords and check the current data against the last safe backup. IP theft is a thing.


5 Answers

Regardless of platform, the following remarks apply.

(-) Nested views:

  • are harder to understand and debug (e.g., what table column does this view column refer to? Lemme dig through 4 levels of view definitions)
  • make it harder for the query optimizer to come up with the most efficient query plan (see this, this, this, and this for anecdotal evidence; compare to this, which shows that the optimizer is often smart enough to correctly unpack nested views and select an optimal plan, but not without a compilation cost)

You can measure the performance cost by comparing the view query to an equivalent one written against the base tables.
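
One simple way to make that comparison, assuming a nested view named dbo.vEligibleStudents and hypothetical base tables:

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Query through the nested views
SELECT COUNT(*) FROM dbo.vEligibleStudents;

-- Equivalent query written directly against the (hypothetical) base tables
SELECT COUNT(*)
FROM dbo.Students AS s
JOIN dbo.Enrollments AS e ON e.StudentID = s.StudentID
WHERE e.Status = 'ELIGIBLE';

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;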

(+) On the other hand, nested views let you:

  • centralize and reuse aggregations or business rules
  • abstract away your underlying structure (say, from other database developers)

I've found that they are rarely necessary.

In your example you are using nested views to centralize and reuse certain business definitions (e.g. "What is an eligible student?"). This is a valid use for nested views. If you are maintaining or tuning this database, weigh the cost of keeping them against that of removing them.

Keep: By keeping the nested views you incur the advantages and disadvantages enumerated above.

Remove: To remove the nested views:

You need to replace all occurrences of the views with their base queries.

You must remember to update all relevant queries if your definition of eligible student/teacher/school changes, as opposed to just updating the relevant view definition.

Sometimes nested views are used to prevent repeating aggregates. Let's say you have a view that counts messages and groups them by userid, you might have a view over that that counts the number of users that have > 100 messages, that kind of thing. This is most effective when the base view is an indexed view - you don't necessarily want to create yet another indexed view to represent the data with a slightly different grouping, because now you're paying for the index maintenance twice where performance is probably adequate against the original view.
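
A minimal sketch of that two-level pattern (object names are invented for the example; the base view would additionally need SCHEMABINDING if you wanted to index it):

-- Base view: message counts per user
CREATE VIEW dbo.UserMessageCounts
AS
SELECT UserID, COUNT_BIG(*) AS MessageCount
FROM dbo.Messages
GROUP BY UserID;
GO

-- Nested view: how many users have more than 100 messages
CREATE VIEW dbo.UsersOver100Messages
AS
SELECT COUNT_BIG(*) AS UserCount
FROM dbo.UserMessageCounts
WHERE MessageCount > 100;
GO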

If these are all just nested views where you're doing select * but changing the ordering or top, it seems this would be better encapsulated as a stored procedure with parameters (or inline table-valued functions) than a bunch of nested views. IMHO.

Later versions of SQL Server (2005+) seem better at optimizing the use of views. Views are best for consolidating business rules. E.g.: where I work we have a telecom product database. Each product is assigned to a rateplan, that rateplan can get swapped out, and rates on the rateplan can get activated/deactivated as rates are increased or modified.

To make it easy, we can make nested views. 1st view just joins the rateplans to their rates using whatever tables are needed, and returning any necessary data the next levels of views would need. 2nd view(s) can isolate only active rateplans and their active rates. Or, just customer rates. Or employee rates (for employee discount). Or business vs. residential customer rates. (rateplans can get complicated). The point is, the foundation view ensures our overall business logic for rateplans and rates are joined together properly in one location. The next layer of views give us more focus on specific rateplans (types, active/inactive, etc).
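
Sketched with made-up table and column names, that layering might look like this; the foundation view carries the shared join logic once, and the second view narrows it to active entries:

-- Foundation view: business logic for joining rateplans to their rates lives here once
CREATE VIEW dbo.RatePlanRates
AS
SELECT rp.RatePlanID, rp.PlanName, r.RateID, r.Amount, r.IsActive
FROM dbo.RatePlans AS rp
JOIN dbo.Rates     AS r ON r.RatePlanID = rp.RatePlanID;
GO

-- Focused view: only active rateplans/rates, built on the foundation view
CREATE VIEW dbo.ActiveRatePlanRates
AS
SELECT RatePlanID, PlanName, RateID, Amount
FROM dbo.RatePlanRates
WHERE IsActive = 1;
GO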

I agree that views can make debugging messy if you're building queries and views at the same time. But, if you're using a tried-n-trusted view, it makes debugging easier. You know that view has already been through the wringer, so you know it's most likely not causing the problem.

Issues can come up with your views, though. "what if a product is associated only to an inactive rateplan?" or "what if a rateplan only has inactive rates on it?" Well, that can get caught at the front-end level with logic that catches user errors. "Error, product is on an inactive rateplan. please correct". We can also run query audits to double check it before a billing run. (select all plans and left join to active rateplan view, only return plans that don't get an active rateplan as problems that need to get addressed).

The good thing about this is the views let you greatly condense down queries for reporting, billing, etc. You can have a customer account view, then a 2nd-level view of just active customers. Team that with a view of customer address. Team that with a view of product(s) (joined on what product(s) customer has). Team that to view of product(s) rateplan. Team that with view of product features. View, view, view, each trial-n-errored to ensure integrity. Your end query using the views is very compact.

As an example of how the view would have been better than just a flat query of tables: we had a temp contractor come in to make some changes. They told him there were views for things, but he decided to flatten all of his queries. Billing was running things off of some of his queries. They kept getting multiple rateplans and rates on things. It turns out his queries were missing the criteria to only allow rates to bill if they fell between the start and end dates during which the rate plan was supposed to use those rates. Oops. If he had used the view, it would have already taken that logic into account.

Basically, you have to weigh performance vs. sanity. Maybe you can do all kinds of fancy stuff to increase the performance of a database. But, if it means it's a nightmare for a new person to take-over / maintain, is it really worth it? Is it really worth the new guy having to play whack-a-mole having to find all the queries that need to get their logic changed (and risk him forgetting / fat-fingering them) b/c someone decided views are "bad" and didn't consolidate some core business logic into one that could get used in 100's of other queries? It's really up to your business and your IT/IS/DB team. But, I'd prefer clarity and single-source consolidation over performance.


2 Answers

If it's been hidden in the viewport in 2.79 you can make it visible in 2.8 by going to the Outliner.

Click on the filter icon and under Restriction toggles select Disable in Viewport (screen icon).

You can now set the collection back to being visible by clicking on the screen icon behind the collection.

I also have such a problem, but I solved it. Here is what you need to do: Right-click on “Collection 1” -> “Visibility” -> “Enable in viewport”. That's it, now hidden layers will become active.

P.S. I also noticed that files saved in the new version of the program unfortunately do not open in older versions. Blender just closes when you try to open the new files. No version compatibility; this is sad.

It seems like I was typing a message for too long, they got ahead of me :)


Interaction with other Tableau features and products

Does Explain Data work with multi-table data sources that use relationships?

In 2020.3, you can use Explain Data with data sources that contain multiple, related tables. Cardinality and Referential Integrity settings for relationships must be set up correctly for Explain Data to analyze multi-table, related data.

In 2020.2, you can use Explain Data with single-table data sources only. Your data source can have a single logical table that is defined by one or more physical tables.

Does Ask Data work with multi-table data sources?

Ask Data fully supports multi-table, normalized data sources.

How do new data modeling capabilities affect using Tableau Bridge?

You will need to update to the latest version of Tableau Bridge for full compatibility with 2020.2 data modeling functionality.

When should I use Tableau Prep vs. authoring in Tableau Desktop, Tableau Online, or Tableau Server to create a data source?

Tableau Prep cleans data, and creates flows, extracts, and published data sources that contain physical tables.

In Tableau Desktop, and in Tableau Online and Tableau Server web authoring, you can create data sources that use normalized data models. These data models can be made of logical tables and physical tables, and your data sources can be saved as live data sources or as extracts.

Only logical tables can be related. Physical tables can be joined and unioned.


Data Warehousing

Layers in an enterprise data warehouse architecture

Data coming into and leaving the data warehouse uses extract, transform, and load (ETL) to pass through the logical structural layers of the architecture, which are connected using data integration technologies, as depicted in Figure 7.1, where the data passes from left to right, from source systems to the data warehouse and then to the business intelligence layer. In many organizations, the enterprise data warehouse is the primary user of data integration and may have sophisticated vendor data integration tools specifically to support the data warehousing requirements. Data integration provides the flow of data between the various layers of the data warehouse architecture, entering and leaving.

Figure 7.1. Data Warehouse Data Flow.

Operational application layer

The operational application layer consists of the various sources of data to be fed into the data warehouse from the applications that perform the primary operational functions of the organization. This layer is where the portfolio of core application systems for the organization resides. Not all reporting is necessarily transferred to the data warehouse. Operational reporting concerning the processing within a particular application may remain within the application because the concerns are specific to the particular functionality and needs associated with the users of the application.

External data

Some data for the data warehouse may be coming from outside the organization. Data may be supplied for the warehouse, with further detail sourced from the organization’s customers, suppliers, or other partners. Standard codes, valid values, and other reference data may be provided from government sources, industry organizations, or business exchanges. Additionally, many data warehouses enhance the data available in the organization with purchased data concerning consumers or customers.

External data must pass through additional security access layers for the network and organization, protecting the organization from harmful data and attacks.

External data should be viewed as less likely to conform to the expected structure of its contents, since communication and agreement between separate organizations is usually somewhat harder than communications within the same organization. Profiling and quality monitoring of data acquired from external sources is very important, even more critical, possibly, than for monitoring data from internal sources. Integration with external data should be kept loosely coupled with the expectation of potential changes in format and content.

Data staging areas coming into a data warehouse

Data coming into a data warehouse is usually staged, or stored in the original source format, in order to allow a loose coupling of the timing between the source and the data warehouse in terms of when the data is sent from the source and when it is loaded into the warehouse. The data staging area also allows for an audit trail of what data was sent, which can be used to analyze problems with data found in the warehouse or in reports.

There is usually a staging area located with each of the data sources, as well as a staging area for all data coming in to the warehouse.

Some data warehouse architectures include an operational data store (ODS) for having data available real time or near real time for analysis and reporting. Real-time data integration techniques will be described in later sections of this book.

Data warehouse data structure

The data in the data warehouse is usually formatted into a consistent logical structure for the enterprise, no longer dependent on the structure of the various sources of data. The structure of data in the data warehouse may be optimized for quick loading of high volumes of data from the various sources. If some analysis is performed directly on data in the warehouse, it may also be structured for efficient high-volume access, but usually that is done in separate data marts and specialized analytical structures in the business intelligence layer.

Metadata concerning data in the data warehouse is very important for its effective use and is an important part of the data warehouse architecture: a clear understanding of the meaning of the data (business metadata), where it came from or its lineage (technical metadata), and when things happened (operational metadata). The metadata associated with the data in the warehouse should accompany the data that is provided to the business intelligence layer for analysis.

Staging from data warehouse to data mart or business intelligence

There may be separate staging areas for data coming out of the data warehouse and into the business intelligence structures in order to provide loose coupling and audit trails, as described earlier for data coming into the data warehouse. However, since writing data to disk and reading from disk (I/O operations) are very slow compared with processing, it may be deemed more efficient to tightly couple the data warehouse and business intelligence structures and skip much of the overhead of staging data coming out of the data warehouse as well as going into the business intelligence structures. An audit trail between the data warehouse and data marts may be a low priority, as it is less important than when the data was last acquired or updated in the data warehouse and in the source application systems. Speed in making the data available for analysis is a larger concern.

Business Intelligence Layer

The business intelligence layer focuses on storing data efficiently for access and analysis.

Data marts are data structures created to provide a particular part of an organization with data relevant to its analytical needs, structured for fast access. Data marts may also be for enterprise-wide use but use specialized structures or technologies.

Extract files from the data warehouse are requested for local user use, for analysis, and for preparation of reports and presentations. Extract files should not usually be manually loaded into analytical and reporting systems. Besides the inefficiency of manually transporting data between systems, the data may be changed in the process between the data warehouse and the target system, losing the chain of custody information that would concern an auditor. A more effective and trusted audit trail is created by automatically feeding data between systems.

Extract files sometimes also need to be passed to external organizations and entities. As with all data passing out of the data warehouse, metadata fully describing the data should accompany extract files leaving the organization.

Data from the data warehouse may also be fed into highly specialized reporting systems, such as for customer statement or regulatory reporting, which may have their own data structures or may read data directly from the data warehouse.

Data in the business intelligence layer may be accessed using internal or external web solutions, specialized reporting and analytical tools, or generic desktop tools. Appropriate access authorizations should be enforced, and audit trails should be kept tracking all data accesses to the data warehouse and business intelligence layers.