Diffstat (limited to 'thirdparty')
-rw-r--r--   thirdparty/etc2comp/AUTHORS                  14
-rw-r--r--   thirdparty/etc2comp/LICENSE                 404
-rw-r--r--   thirdparty/etc2comp/README.md               394
-rw-r--r--   thirdparty/libtheora/x86_vc/mmxencfrag.c   1938
-rw-r--r--   thirdparty/libtheora/x86_vc/mmxfdct.c      1340
-rw-r--r--   thirdparty/nanosvg/LICENSE.txt               36
6 files changed, 2063 insertions, 2063 deletions
diff --git a/thirdparty/etc2comp/AUTHORS b/thirdparty/etc2comp/AUTHORS
index 32daca27fe..e78a7f4d21 100644
--- a/thirdparty/etc2comp/AUTHORS
+++ b/thirdparty/etc2comp/AUTHORS
@@ -1,7 +1,7 @@
-# This is the list of Etc2Comp authors for copyright purposes.
-#
-# This does not necessarily list everyone who has contributed code, since in
-# some cases, their employer may be the copyright holder. To see the full list
-# of contributors, see the revision history in source control.
-Google Inc.
-Blue Shift Inc.
+# This is the list of Etc2Comp authors for copyright purposes.
+#
+# This does not necessarily list everyone who has contributed code, since in
+# some cases, their employer may be the copyright holder. To see the full list
+# of contributors, see the revision history in source control.
+Google Inc.
+Blue Shift Inc.
diff --git a/thirdparty/etc2comp/LICENSE b/thirdparty/etc2comp/LICENSE
index 75b52484ea..d645695673 100644
--- a/thirdparty/etc2comp/LICENSE
+++ b/thirdparty/etc2comp/LICENSE
@@ -1,202 +1,202 @@
-
- Apache License
- Version 2.0, January 2004
- http://www.apache.org/licenses/
-
- TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
-
- 1. Definitions.
-
- "License" shall mean the terms and conditions for use, reproduction,
- and distribution as defined by Sections 1 through 9 of this document.
-
- "Licensor" shall mean the copyright owner or entity authorized by
- the copyright owner that is granting the License.
-
- "Legal Entity" shall mean the union of the acting entity and all
- other entities that control, are controlled by, or are under common
- control with that entity. For the purposes of this definition,
- "control" means (i) the power, direct or indirect, to cause the
- direction or management of such entity, whether by contract or
- otherwise, or (ii) ownership of fifty percent (50%) or more of the
- outstanding shares, or (iii) beneficial ownership of such entity.
-
- "You" (or "Your") shall mean an individual or Legal Entity
- exercising permissions granted by this License.
-
- "Source" form shall mean the preferred form for making modifications,
- including but not limited to software source code, documentation
- source, and configuration files.
-
- "Object" form shall mean any form resulting from mechanical
- transformation or translation of a Source form, including but
- not limited to compiled object code, generated documentation,
- and conversions to other media types.
-
- "Work" shall mean the work of authorship, whether in Source or
- Object form, made available under the License, as indicated by a
- copyright notice that is included in or attached to the work
- (an example is provided in the Appendix below).
-
- "Derivative Works" shall mean any work, whether in Source or Object
- form, that is based on (or derived from) the Work and for which the
- editorial revisions, annotations, elaborations, or other modifications
- represent, as a whole, an original work of authorship. For the purposes
- of this License, Derivative Works shall not include works that remain
- separable from, or merely link (or bind by name) to the interfaces of,
- the Work and Derivative Works thereof.
-
- "Contribution" shall mean any work of authorship, including
- the original version of the Work and any modifications or additions
- to that Work or Derivative Works thereof, that is intentionally
- submitted to Licensor for inclusion in the Work by the copyright owner
- or by an individual or Legal Entity authorized to submit on behalf of
- the copyright owner. For the purposes of this definition, "submitted"
- means any form of electronic, verbal, or written communication sent
- to the Licensor or its representatives, including but not limited to
- communication on electronic mailing lists, source code control systems,
- and issue tracking systems that are managed by, or on behalf of, the
- Licensor for the purpose of discussing and improving the Work, but
- excluding communication that is conspicuously marked or otherwise
- designated in writing by the copyright owner as "Not a Contribution."
-
- "Contributor" shall mean Licensor and any individual or Legal Entity
- on behalf of whom a Contribution has been received by Licensor and
- subsequently incorporated within the Work.
-
- 2. Grant of Copyright License. Subject to the terms and conditions of
- this License, each Contributor hereby grants to You a perpetual,
- worldwide, non-exclusive, no-charge, royalty-free, irrevocable
- copyright license to reproduce, prepare Derivative Works of,
- publicly display, publicly perform, sublicense, and distribute the
- Work and such Derivative Works in Source or Object form.
-
- 3. Grant of Patent License. Subject to the terms and conditions of
- this License, each Contributor hereby grants to You a perpetual,
- worldwide, non-exclusive, no-charge, royalty-free, irrevocable
- (except as stated in this section) patent license to make, have made,
- use, offer to sell, sell, import, and otherwise transfer the Work,
- where such license applies only to those patent claims licensable
- by such Contributor that are necessarily infringed by their
- Contribution(s) alone or by combination of their Contribution(s)
- with the Work to which such Contribution(s) was submitted. If You
- institute patent litigation against any entity (including a
- cross-claim or counterclaim in a lawsuit) alleging that the Work
- or a Contribution incorporated within the Work constitutes direct
- or contributory patent infringement, then any patent licenses
- granted to You under this License for that Work shall terminate
- as of the date such litigation is filed.
-
- 4. Redistribution. You may reproduce and distribute copies of the
- Work or Derivative Works thereof in any medium, with or without
- modifications, and in Source or Object form, provided that You
- meet the following conditions:
-
- (a) You must give any other recipients of the Work or
- Derivative Works a copy of this License; and
-
- (b) You must cause any modified files to carry prominent notices
- stating that You changed the files; and
-
- (c) You must retain, in the Source form of any Derivative Works
- that You distribute, all copyright, patent, trademark, and
- attribution notices from the Source form of the Work,
- excluding those notices that do not pertain to any part of
- the Derivative Works; and
-
- (d) If the Work includes a "NOTICE" text file as part of its
- distribution, then any Derivative Works that You distribute must
- include a readable copy of the attribution notices contained
- within such NOTICE file, excluding those notices that do not
- pertain to any part of the Derivative Works, in at least one
- of the following places: within a NOTICE text file distributed
- as part of the Derivative Works; within the Source form or
- documentation, if provided along with the Derivative Works; or,
- within a display generated by the Derivative Works, if and
- wherever such third-party notices normally appear. The contents
- of the NOTICE file are for informational purposes only and
- do not modify the License. You may add Your own attribution
- notices within Derivative Works that You distribute, alongside
- or as an addendum to the NOTICE text from the Work, provided
- that such additional attribution notices cannot be construed
- as modifying the License.
-
- You may add Your own copyright statement to Your modifications and
- may provide additional or different license terms and conditions
- for use, reproduction, or distribution of Your modifications, or
- for any such Derivative Works as a whole, provided Your use,
- reproduction, and distribution of the Work otherwise complies with
- the conditions stated in this License.
-
- 5. Submission of Contributions. Unless You explicitly state otherwise,
- any Contribution intentionally submitted for inclusion in the Work
- by You to the Licensor shall be under the terms and conditions of
- this License, without any additional terms or conditions.
- Notwithstanding the above, nothing herein shall supersede or modify
- the terms of any separate license agreement you may have executed
- with Licensor regarding such Contributions.
-
- 6. Trademarks. This License does not grant permission to use the trade
- names, trademarks, service marks, or product names of the Licensor,
- except as required for reasonable and customary use in describing the
- origin of the Work and reproducing the content of the NOTICE file.
-
- 7. Disclaimer of Warranty. Unless required by applicable law or
- agreed to in writing, Licensor provides the Work (and each
- Contributor provides its Contributions) on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
- implied, including, without limitation, any warranties or conditions
- of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
- PARTICULAR PURPOSE. You are solely responsible for determining the
- appropriateness of using or redistributing the Work and assume any
- risks associated with Your exercise of permissions under this License.
-
- 8. Limitation of Liability. In no event and under no legal theory,
- whether in tort (including negligence), contract, or otherwise,
- unless required by applicable law (such as deliberate and grossly
- negligent acts) or agreed to in writing, shall any Contributor be
- liable to You for damages, including any direct, indirect, special,
- incidental, or consequential damages of any character arising as a
- result of this License or out of the use or inability to use the
- Work (including but not limited to damages for loss of goodwill,
- work stoppage, computer failure or malfunction, or any and all
- other commercial damages or losses), even if such Contributor
- has been advised of the possibility of such damages.
-
- 9. Accepting Warranty or Additional Liability. While redistributing
- the Work or Derivative Works thereof, You may choose to offer,
- and charge a fee for, acceptance of support, warranty, indemnity,
- or other liability obligations and/or rights consistent with this
- License. However, in accepting such obligations, You may act only
- on Your own behalf and on Your sole responsibility, not on behalf
- of any other Contributor, and only if You agree to indemnify,
- defend, and hold each Contributor harmless for any liability
- incurred by, or claims asserted against, such Contributor by reason
- of your accepting any such warranty or additional liability.
-
- END OF TERMS AND CONDITIONS
-
- APPENDIX: How to apply the Apache License to your work.
-
- To apply the Apache License to your work, attach the following
- boilerplate notice, with the fields enclosed by brackets "[]"
- replaced with your own identifying information. (Don't include
- the brackets!) The text should be enclosed in the appropriate
- comment syntax for the file format. We also recommend that a
- file or class name and description of purpose be included on the
- same "printed page" as the copyright notice for easier
- identification within third-party archives.
-
- Copyright [yyyy] [name of copyright owner]
-
- Licensed under the Apache License, Version 2.0 (the "License");
- you may not use this file except in compliance with the License.
- You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License.
+
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ APPENDIX: How to apply the Apache License to your work.
+
+ To apply the Apache License to your work, attach the following
+ boilerplate notice, with the fields enclosed by brackets "[]"
+ replaced with your own identifying information. (Don't include
+ the brackets!) The text should be enclosed in the appropriate
+ comment syntax for the file format. We also recommend that a
+ file or class name and description of purpose be included on the
+ same "printed page" as the copyright notice for easier
+ identification within third-party archives.
+
+ Copyright [yyyy] [name of copyright owner]
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
diff --git a/thirdparty/etc2comp/README.md b/thirdparty/etc2comp/README.md
index 1c70ae9f4e..2f4363d042 100644
--- a/thirdparty/etc2comp/README.md
+++ b/thirdparty/etc2comp/README.md
@@ -1,197 +1,197 @@
-# Etc2Comp - Texture to ETC2 compressor
-
-Etc2Comp is a command line tool that converts textures (e.g. bitmaps)
-into the [ETC2](https://en.wikipedia.org/wiki/Ericsson_Texture_Compression)
-format. The tool is built with a focus on encoding performance
-to reduce the amount of time required to compile asset heavy applications as
-well as reduce overall application size.
-
-This repo provides source code that can be compiled into a binary. The
-binary can then be used to convert textures to the ETC2 format.
-
-Important: This is not an official Google product. It is an experimental
-library published as-is. Please see the CONTRIBUTORS.md file for information
-about questions or issues.
-
-## Setup
-This project uses [CMake](https://cmake.org/) to generate platform-specific
-build files:
- - Linux: make files
- - OS X: Xcode workspace files
- - Microsoft Windows: Visual Studio solution files
- - Note: CMake supports other formats, but this doc only provides steps for
- one of each platform for brevity.
-
-Refer to each platform's setup section to setup your environment and build
-an Etc2Comp binary. Then skip to the usage section of this page for examples
-of how to use the library.
-
-### Setup for OS X
- build tested on this config:
- OS X 10.9.5 i7 16GB RAM
- Xcode 5.1.1
- cmake 3.2.3
-
-Start by downloading and installing the following components if they are not
-already installed on your development machine.
- - *Xcode* version 5.1.1, or greater
- - [CMake](https://cmake.org/download/) version 3.2.3, or greater
-
-To build the Etc2Comp binary:
- 1. Open a *Terminal* window and navigate to the project directory.
- 1. Run `mkdir build_xcode`
- 1. Run `cd build_xcode`
- 1. Run `cmake -G Xcode ../`
- 1. Open *Xcode* and import the `build_xcode/EtcTest.xcodeproj` file.
- 1. Open the Product menu and choose Build For -> Running.
- 1. Once the build succeeds the binary located at `build_xcode/EtcTool/Debug/EtcTool`
-can be executed.
-
-Optional
-Xcode EtcTool ‘Run’ preferences
-note: if the build_xcode/EtcTest.xcodeproj is manually deleted then some Xcode preferences
-will need to be set by hand after cmake is run (these prefs are retained across
-cmake updates if the .xcodeproj is not deleted/removed)
-
-1. Set the active scheme to ‘EtcTool’
-1. Edit the scheme
-1. Select option ‘Run EtcTool’, then tab ‘Arguments’.
-Add this launch argument: ‘-argfile ../../EtcTool/args.txt’
-1. Select tab ‘Options’ and set a custom working directory to: ‘$(SRCROOT)/Build_Xcode/EtcTool’
-
-### SetUp for Windows
-
-1. Open a *Terminal* window and navigate to the project directory.
-1. Run `mkdir build_vs`
-1. Run `cd build_vs`
-1. Run CMAKE, noting what build version you need, and pointing to the parent directory as the source root;
- For VS 2013 : `cmake -G "Visual Studio 12 2013 Win64" ../`
- For VS 2015 : `cmake -G "Visual Studio 14 2015 Win64" ../`
- NOTE: To see what supported Visual Studio outputs there are, run `cmake -G`
-1. open the 'EtcTest' solution
-1. make the 'EtcTool' project the start up project
-1. (optional) in the project properties, under 'Debugging ->command arguments'
-add the argfile textfile thats included in the EtcTool directory.
-example: -argfile C:\etc2\EtcTool\Args.txt
-
-### Setup For Linux
-The Linux build was tested on this config:
- Ubuntu desktop 14.04
- gcc/g++ 4.8
- cmake 2.8.12.2
-
-1. Verify linux has cmake and C++-11 capable g++ installed
-1. Open shell
-1. Run `mkdir build_linux`
-1. Run `cd build_linux`
-1. Run `cmake ../`
-1. Run `make`
-1. navigate to the newly created EtcTool directory `cd EtcTool`
-1. run the executable: `./EtcTool -argfile ../../EtcTool/args.txt`
-
-Skip to the <a href="#usage">Usage</a> section for more information about using the
-tool.
-
-## Usage
-
-### Command Line Usage
-EtcTool can be run from the command line with the following usage:
- etctool.exe source_image [options ...] -output encoded_image
-
-The encoder will use an array of RGBA floats read from the source_image to create
-an ETC1 or ETC2 encoded image in encoded_image. The RGBA floats should be in the
-range [0:1].
-
-Options:
-
- -analyze <analysis_folder>
- -argfile <arg_file> additional command line arguments read from a file
- -blockAtHV <H V> encodes a single block that contains the
- pixel specified by the H V coordinates
- -compare <comparison_image> compares source_image to comparison_image
- -effort <amount> number between 0 and 100 to specify the encoding quality
- (100 is the highest quality)
- -errormetric <error_metric> specify the error metric, the options are
- rgba, rgbx, rec709, numeric and normalxyz
- -format <etc_format> ETC1, RGB8, SRGB8, RGBA8, SRGB8, RGB8A1,
- SRGB8A1 or R11
- -help prints this message
- -jobs or -j <thread_count> specifies the number of threads (default=1)
- -normalizexyz normalize RGB to have a length of 1
- -verbose or -v shows status information during the encoding
- process
- -mipmaps or -m <mip_count> sets the maximum number of mipaps to generate (default=1)
- -mipwrap or -w <x|y|xy> sets the mipmap filter wrap mode (default=clamp)
-
-* -analyze will run an analysis of the encoding and place it in folder
-"analysis_folder" (e.g. ../analysis/kodim05). within the analysis_folder, a folder
-will be created with a name of the current date/time (e.g. 20151204_153306). this
-date/time folder is used to compare encodings of the same texture over time.
-within the date/time folder is a text file with several encoding stats and a 2x png
-image showing the encoding mode for each 4x4 block.
-
-* -argfile allows additional command line arguments to be placed in a text file
-
-* -blockAtHV selects the 4x4 pixel subset of the source image at position (H,V).
-This is mainly used for debugging
-
-* -compare compares the source image to the created encoded image. The encoding
-will dictate what error analysis is used in the comparison.
-
-* -effort uses an "amount" between 0 and 100 to determine how much additional effort
-to apply during the encoding.
-
-* -errormetric selects the fitting algorithm used by the encoder. "rgba" calculates
-RMS error using RGB components that are weighted by A. "rgbx" calculates RMS error
-using RGBA components, where A is treated as an additional data channel, instead of
-as alpha. "rec709" is similar to "rgba", except the RGB components are also weighted
-according to Rec709. "numeric" calculates RMS error using unweighted RGBA components.
-"normalize" calculates error based on dot product and vector length for RGB and RMS
-error for A.
-
-* -help prints out the usage message
-
-* -jobs enables multi-threading to speed up image encoding
-
-* -normalizexyz normalizes the source RGB to have a length of 1.
-
-* -verbose shows information on the current encoding process. It will then display the
-PSNR and time time it took to encode the image.
-
-* -mipmaps takes an argument that specifies how many mipmaps to generate from the
-source image. The mipmaps are generated with a lanczos3 filter using edge clamping.
-If the mipmaps option is not specified no mipmaps are created.
-
-* -mipwrap takes an argument that specifies the mipmap filter wrap mode. The options
-are "x", "y" and "xy" which specify wrapping in x only, y only or x and y respectively.
-The default options are clamping in both x and y.
-
-Note: Path names can use slashes or backslashes. The tool will convert the
-slashes to the appropriate polarity for the current platform.
-
-
-## API
-
-The library supports two different APIs - a C-like API that is not heavily
-class-based and a class-based API.
-
-main() in EtcTool.cpp contains an example of both APIs.
-
-The Encode() method now returns an EncodingStatus that contains bit flags for
-reporting various warnings and flags encountered when encoding.
-
-
-## Copyright
-Copyright 2015 Etc2Comp Authors.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
+# Etc2Comp - Texture to ETC2 compressor
+
+Etc2Comp is a command line tool that converts textures (e.g. bitmaps)
+into the [ETC2](https://en.wikipedia.org/wiki/Ericsson_Texture_Compression)
+format. The tool is built with a focus on encoding performance
+to reduce the amount of time required to compile asset heavy applications as
+well as reduce overall application size.
+
+This repo provides source code that can be compiled into a binary. The
+binary can then be used to convert textures to the ETC2 format.
+
+Important: This is not an official Google product. It is an experimental
+library published as-is. Please see the CONTRIBUTORS.md file for information
+about questions or issues.
+
+## Setup
+This project uses [CMake](https://cmake.org/) to generate platform-specific
+build files:
+ - Linux: make files
+ - OS X: Xcode workspace files
+ - Microsoft Windows: Visual Studio solution files
+ - Note: CMake supports other formats, but this doc only provides steps for
+ one of each platform for brevity.
+
+Refer to each platform's setup section to setup your environment and build
+an Etc2Comp binary. Then skip to the usage section of this page for examples
+of how to use the library.
+
+### Setup for OS X
+ build tested on this config:
+ OS X 10.9.5 i7 16GB RAM
+ Xcode 5.1.1
+ cmake 3.2.3
+
+Start by downloading and installing the following components if they are not
+already installed on your development machine.
+ - *Xcode* version 5.1.1, or greater
+ - [CMake](https://cmake.org/download/) version 3.2.3, or greater
+
+To build the Etc2Comp binary:
+ 1. Open a *Terminal* window and navigate to the project directory.
+ 1. Run `mkdir build_xcode`
+ 1. Run `cd build_xcode`
+ 1. Run `cmake -G Xcode ../`
+ 1. Open *Xcode* and import the `build_xcode/EtcTest.xcodeproj` file.
+ 1. Open the Product menu and choose Build For -> Running.
+ 1. Once the build succeeds the binary located at `build_xcode/EtcTool/Debug/EtcTool`
+can be executed.
+
+Optional
+Xcode EtcTool ‘Run’ preferences
+note: if the build_xcode/EtcTest.xcodeproj is manually deleted then some Xcode preferences
+will need to be set by hand after cmake is run (these prefs are retained across
+cmake updates if the .xcodeproj is not deleted/removed)
+
+1. Set the active scheme to ‘EtcTool’
+1. Edit the scheme
+1. Select option ‘Run EtcTool’, then tab ‘Arguments’.
+Add this launch argument: ‘-argfile ../../EtcTool/args.txt’
+1. Select tab ‘Options’ and set a custom working directory to: ‘$(SRCROOT)/Build_Xcode/EtcTool’
+
+### SetUp for Windows
+
+1. Open a *Terminal* window and navigate to the project directory.
+1. Run `mkdir build_vs`
+1. Run `cd build_vs`
+1. Run CMAKE, noting what build version you need, and pointing to the parent directory as the source root;
+ For VS 2013 : `cmake -G "Visual Studio 12 2013 Win64" ../`
+ For VS 2015 : `cmake -G "Visual Studio 14 2015 Win64" ../`
+ NOTE: To see what supported Visual Studio outputs there are, run `cmake -G`
+1. open the 'EtcTest' solution
+1. make the 'EtcTool' project the start up project
+1. (optional) in the project properties, under 'Debugging ->command arguments'
+add the argfile textfile thats included in the EtcTool directory.
+example: -argfile C:\etc2\EtcTool\Args.txt
+
+### Setup For Linux
+The Linux build was tested on this config:
+ Ubuntu desktop 14.04
+ gcc/g++ 4.8
+ cmake 2.8.12.2
+
+1. Verify linux has cmake and C++-11 capable g++ installed
+1. Open shell
+1. Run `mkdir build_linux`
+1. Run `cd build_linux`
+1. Run `cmake ../`
+1. Run `make`
+1. navigate to the newly created EtcTool directory `cd EtcTool`
+1. run the executable: `./EtcTool -argfile ../../EtcTool/args.txt`
+
+Skip to the <a href="#usage">Usage</a> section for more information about using the
+tool.
+
+## Usage
+
+### Command Line Usage
+EtcTool can be run from the command line with the following usage:
+ etctool.exe source_image [options ...] -output encoded_image
+
+The encoder will use an array of RGBA floats read from the source_image to create
+an ETC1 or ETC2 encoded image in encoded_image. The RGBA floats should be in the
+range [0:1].
+
+Options:
+
+ -analyze <analysis_folder>
+ -argfile <arg_file> additional command line arguments read from a file
+ -blockAtHV <H V> encodes a single block that contains the
+ pixel specified by the H V coordinates
+ -compare <comparison_image> compares source_image to comparison_image
+ -effort <amount> number between 0 and 100 to specify the encoding quality
+ (100 is the highest quality)
+ -errormetric <error_metric> specify the error metric, the options are
+ rgba, rgbx, rec709, numeric and normalxyz
+ -format <etc_format> ETC1, RGB8, SRGB8, RGBA8, SRGB8, RGB8A1,
+ SRGB8A1 or R11
+ -help prints this message
+ -jobs or -j <thread_count> specifies the number of threads (default=1)
+ -normalizexyz normalize RGB to have a length of 1
+ -verbose or -v shows status information during the encoding
+ process
+ -mipmaps or -m <mip_count> sets the maximum number of mipaps to generate (default=1)
+ -mipwrap or -w <x|y|xy> sets the mipmap filter wrap mode (default=clamp)
+
+* -analyze will run an analysis of the encoding and place it in folder
+"analysis_folder" (e.g. ../analysis/kodim05). within the analysis_folder, a folder
+will be created with a name of the current date/time (e.g. 20151204_153306). this
+date/time folder is used to compare encodings of the same texture over time.
+within the date/time folder is a text file with several encoding stats and a 2x png
+image showing the encoding mode for each 4x4 block.
+
+* -argfile allows additional command line arguments to be placed in a text file
+
+* -blockAtHV selects the 4x4 pixel subset of the source image at position (H,V).
+This is mainly used for debugging
+
+* -compare compares the source image to the created encoded image. The encoding
+will dictate what error analysis is used in the comparison.
+
+* -effort uses an "amount" between 0 and 100 to determine how much additional effort
+to apply during the encoding.
+
+* -errormetric selects the fitting algorithm used by the encoder. "rgba" calculates
+RMS error using RGB components that are weighted by A. "rgbx" calculates RMS error
+using RGBA components, where A is treated as an additional data channel, instead of
+as alpha. "rec709" is similar to "rgba", except the RGB components are also weighted
+according to Rec709. "numeric" calculates RMS error using unweighted RGBA components.
+"normalize" calculates error based on dot product and vector length for RGB and RMS
+error for A.
+
+* -help prints out the usage message
+
+* -jobs enables multi-threading to speed up image encoding
+
+* -normalizexyz normalizes the source RGB to have a length of 1.
+
+* -verbose shows information on the current encoding process. It will then display the
+PSNR and time time it took to encode the image.
+
+* -mipmaps takes an argument that specifies how many mipmaps to generate from the
+source image. The mipmaps are generated with a lanczos3 filter using edge clamping.
+If the mipmaps option is not specified no mipmaps are created.
+
+* -mipwrap takes an argument that specifies the mipmap filter wrap mode. The options
+are "x", "y" and "xy" which specify wrapping in x only, y only or x and y respectively.
+The default options are clamping in both x and y.
+
+Note: Path names can use slashes or backslashes. The tool will convert the
+slashes to the appropriate polarity for the current platform.
+
+
+## API
+
+The library supports two different APIs - a C-like API that is not heavily
+class-based and a class-based API.
+
+main() in EtcTool.cpp contains an example of both APIs.
+
+The Encode() method now returns an EncodingStatus that contains bit flags for
+reporting various warnings and flags encountered when encoding.
+
+
+## Copyright
+Copyright 2015 Etc2Comp Authors.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
diff --git a/thirdparty/libtheora/x86_vc/mmxencfrag.c b/thirdparty/libtheora/x86_vc/mmxencfrag.c
index ac9dacf377..94f1d06513 100644
--- a/thirdparty/libtheora/x86_vc/mmxencfrag.c
+++ b/thirdparty/libtheora/x86_vc/mmxencfrag.c
@@ -1,969 +1,969 @@
-/********************************************************************
- * *
- * THIS FILE IS PART OF THE OggTheora SOFTWARE CODEC SOURCE CODE. *
- * USE, DISTRIBUTION AND REPRODUCTION OF THIS LIBRARY SOURCE IS *
- * GOVERNED BY A BSD-STYLE SOURCE LICENSE INCLUDED WITH THIS SOURCE *
- * IN 'COPYING'. PLEASE READ THESE TERMS BEFORE DISTRIBUTING. *
- * *
- * THE Theora SOURCE CODE IS COPYRIGHT (C) 2002-2009 *
- * by the Xiph.Org Foundation http://www.xiph.org/ *
- * *
- ********************************************************************
-
- function:
- last mod: $Id: dsp_mmx.c 14579 2008-03-12 06:42:40Z xiphmont $
-
- ********************************************************************/
-#include <stddef.h>
-#include "x86enc.h"
-
-#if defined(OC_X86_ASM)
-
-unsigned oc_enc_frag_sad_mmxext(const unsigned char *_src,
- const unsigned char *_ref,int _ystride){
- ptrdiff_t ret;
- __asm{
-#define SRC esi
-#define REF edx
-#define YSTRIDE ecx
-#define YSTRIDE3 edi
- mov YSTRIDE,_ystride
- mov SRC,_src
- mov REF,_ref
- /*Load the first 4 rows of each block.*/
- movq mm0,[SRC]
- movq mm1,[REF]
- movq mm2,[SRC][YSTRIDE]
- movq mm3,[REF][YSTRIDE]
- lea YSTRIDE3,[YSTRIDE+YSTRIDE*2]
- movq mm4,[SRC+YSTRIDE*2]
- movq mm5,[REF+YSTRIDE*2]
- movq mm6,[SRC+YSTRIDE3]
- movq mm7,[REF+YSTRIDE3]
- /*Compute their SADs and add them in mm0*/
- psadbw mm0,mm1
- psadbw mm2,mm3
- lea SRC,[SRC+YSTRIDE*4]
- paddw mm0,mm2
- lea REF,[REF+YSTRIDE*4]
- /*Load the next 3 rows as registers become available.*/
- movq mm2,[SRC]
- movq mm3,[REF]
- psadbw mm4,mm5
- psadbw mm6,mm7
- paddw mm0,mm4
- movq mm5,[REF+YSTRIDE]
- movq mm4,[SRC+YSTRIDE]
- paddw mm0,mm6
- movq mm7,[REF+YSTRIDE*2]
- movq mm6,[SRC+YSTRIDE*2]
- /*Start adding their SADs to mm0*/
- psadbw mm2,mm3
- psadbw mm4,mm5
- paddw mm0,mm2
- psadbw mm6,mm7
- /*Load last row as registers become available.*/
- movq mm2,[SRC+YSTRIDE3]
- movq mm3,[REF+YSTRIDE3]
- /*And finish adding up their SADs.*/
- paddw mm0,mm4
- psadbw mm2,mm3
- paddw mm0,mm6
- paddw mm0,mm2
- movd [ret],mm0
-#undef SRC
-#undef REF
-#undef YSTRIDE
-#undef YSTRIDE3
- }
- return (unsigned)ret;
-}
-
-unsigned oc_enc_frag_sad_thresh_mmxext(const unsigned char *_src,
- const unsigned char *_ref,int _ystride,unsigned _thresh){
- /*Early termination is for suckers.*/
- return oc_enc_frag_sad_mmxext(_src,_ref,_ystride);
-}
-
-#define OC_SAD2_LOOP __asm{ \
- /*We want to compute (mm0+mm1>>1) on unsigned bytes without overflow, but \
- pavgb computes (mm0+mm1+1>>1). \
- The latter is exactly 1 too large when the low bit of two corresponding \
- bytes is only set in one of them. \
- Therefore we pxor the operands, pand to mask out the low bits, and psubb to \
- correct the output of pavgb.*/ \
- __asm movq mm6,mm0 \
- __asm lea REF1,[REF1+YSTRIDE*2] \
- __asm pxor mm0,mm1 \
- __asm pavgb mm6,mm1 \
- __asm lea REF2,[REF2+YSTRIDE*2] \
- __asm movq mm1,mm2 \
- __asm pand mm0,mm7 \
- __asm pavgb mm2,mm3 \
- __asm pxor mm1,mm3 \
- __asm movq mm3,[REF2+YSTRIDE] \
- __asm psubb mm6,mm0 \
- __asm movq mm0,[REF1] \
- __asm pand mm1,mm7 \
- __asm psadbw mm4,mm6 \
- __asm movd mm6,RET \
- __asm psubb mm2,mm1 \
- __asm movq mm1,[REF2] \
- __asm lea SRC,[SRC+YSTRIDE*2] \
- __asm psadbw mm5,mm2 \
- __asm movq mm2,[REF1+YSTRIDE] \
- __asm paddw mm5,mm4 \
- __asm movq mm4,[SRC] \
- __asm paddw mm6,mm5 \
- __asm movq mm5,[SRC+YSTRIDE] \
- __asm movd RET,mm6 \
-}
-
-/*Same as above, but does not pre-load the next two rows.*/
-#define OC_SAD2_TAIL __asm{ \
- __asm movq mm6,mm0 \
- __asm pavgb mm0,mm1 \
- __asm pxor mm6,mm1 \
- __asm movq mm1,mm2 \
- __asm pand mm6,mm7 \
- __asm pavgb mm2,mm3 \
- __asm pxor mm1,mm3 \
- __asm psubb mm0,mm6 \
- __asm pand mm1,mm7 \
- __asm psadbw mm4,mm0 \
- __asm psubb mm2,mm1 \
- __asm movd mm6,RET \
- __asm psadbw mm5,mm2 \
- __asm paddw mm5,mm4 \
- __asm paddw mm6,mm5 \
- __asm movd RET,mm6 \
-}
-
-unsigned oc_enc_frag_sad2_thresh_mmxext(const unsigned char *_src,
- const unsigned char *_ref1,const unsigned char *_ref2,int _ystride,
- unsigned _thresh){
- ptrdiff_t ret;
- __asm{
-#define REF1 ecx
-#define REF2 edi
-#define YSTRIDE esi
-#define SRC edx
-#define RET eax
- mov YSTRIDE,_ystride
- mov SRC,_src
- mov REF1,_ref1
- mov REF2,_ref2
- movq mm0,[REF1]
- movq mm1,[REF2]
- movq mm2,[REF1+YSTRIDE]
- movq mm3,[REF2+YSTRIDE]
- xor RET,RET
- movq mm4,[SRC]
- pxor mm7,mm7
- pcmpeqb mm6,mm6
- movq mm5,[SRC+YSTRIDE]
- psubb mm7,mm6
- OC_SAD2_LOOP
- OC_SAD2_LOOP
- OC_SAD2_LOOP
- OC_SAD2_TAIL
- mov [ret],RET
-#undef REF1
-#undef REF2
-#undef YSTRIDE
-#undef SRC
-#undef RET
- }
- return (unsigned)ret;
-}
-
-/*Load an 8x4 array of pixel values from %[src] and %[ref] and compute their
- 16-bit difference in mm0...mm7.*/
-#define OC_LOAD_SUB_8x4(_off) __asm{ \
- __asm movd mm0,[_off+SRC] \
- __asm movd mm4,[_off+REF] \
- __asm movd mm1,[_off+SRC+SRC_YSTRIDE] \
- __asm lea SRC,[SRC+SRC_YSTRIDE*2] \
- __asm movd mm5,[_off+REF+REF_YSTRIDE] \
- __asm lea REF,[REF+REF_YSTRIDE*2] \
- __asm movd mm2,[_off+SRC] \
- __asm movd mm7,[_off+REF] \
- __asm movd mm3,[_off+SRC+SRC_YSTRIDE] \
- __asm movd mm6,[_off+REF+REF_YSTRIDE] \
- __asm punpcklbw mm0,mm4 \
- __asm lea SRC,[SRC+SRC_YSTRIDE*2] \
- __asm punpcklbw mm4,mm4 \
- __asm lea REF,[REF+REF_YSTRIDE*2] \
- __asm psubw mm0,mm4 \
- __asm movd mm4,[_off+SRC] \
- __asm movq [_off*2+BUF],mm0 \
- __asm movd mm0,[_off+REF] \
- __asm punpcklbw mm1,mm5 \
- __asm punpcklbw mm5,mm5 \
- __asm psubw mm1,mm5 \
- __asm movd mm5,[_off+SRC+SRC_YSTRIDE] \
- __asm punpcklbw mm2,mm7 \
- __asm punpcklbw mm7,mm7 \
- __asm psubw mm2,mm7 \
- __asm movd mm7,[_off+REF+REF_YSTRIDE] \
- __asm punpcklbw mm3,mm6 \
- __asm lea SRC,[SRC+SRC_YSTRIDE*2] \
- __asm punpcklbw mm6,mm6 \
- __asm psubw mm3,mm6 \
- __asm movd mm6,[_off+SRC] \
- __asm punpcklbw mm4,mm0 \
- __asm lea REF,[REF+REF_YSTRIDE*2] \
- __asm punpcklbw mm0,mm0 \
- __asm lea SRC,[SRC+SRC_YSTRIDE*2] \
- __asm psubw mm4,mm0 \
- __asm movd mm0,[_off+REF] \
- __asm punpcklbw mm5,mm7 \
- __asm neg SRC_YSTRIDE \
- __asm punpcklbw mm7,mm7 \
- __asm psubw mm5,mm7 \
- __asm movd mm7,[_off+SRC+SRC_YSTRIDE] \
- __asm punpcklbw mm6,mm0 \
- __asm lea REF,[REF+REF_YSTRIDE*2] \
- __asm punpcklbw mm0,mm0 \
- __asm neg REF_YSTRIDE \
- __asm psubw mm6,mm0 \
- __asm movd mm0,[_off+REF+REF_YSTRIDE] \
- __asm lea SRC,[SRC+SRC_YSTRIDE*8] \
- __asm punpcklbw mm7,mm0 \
- __asm neg SRC_YSTRIDE \
- __asm punpcklbw mm0,mm0 \
- __asm lea REF,[REF+REF_YSTRIDE*8] \
- __asm psubw mm7,mm0 \
- __asm neg REF_YSTRIDE \
- __asm movq mm0,[_off*2+BUF] \
-}
-
-/*Load an 8x4 array of pixel values from %[src] into %%mm0...%%mm7.*/
-#define OC_LOAD_8x4(_off) __asm{ \
- __asm movd mm0,[_off+SRC] \
- __asm movd mm1,[_off+SRC+YSTRIDE] \
- __asm movd mm2,[_off+SRC+YSTRIDE*2] \
- __asm pxor mm7,mm7 \
- __asm movd mm3,[_off+SRC+YSTRIDE3] \
- __asm punpcklbw mm0,mm7 \
- __asm movd mm4,[_off+SRC4] \
- __asm punpcklbw mm1,mm7 \
- __asm movd mm5,[_off+SRC4+YSTRIDE] \
- __asm punpcklbw mm2,mm7 \
- __asm movd mm6,[_off+SRC4+YSTRIDE*2] \
- __asm punpcklbw mm3,mm7 \
- __asm movd mm7,[_off+SRC4+YSTRIDE3] \
- __asm punpcklbw mm4,mm4 \
- __asm punpcklbw mm5,mm5 \
- __asm psrlw mm4,8 \
- __asm psrlw mm5,8 \
- __asm punpcklbw mm6,mm6 \
- __asm punpcklbw mm7,mm7 \
- __asm psrlw mm6,8 \
- __asm psrlw mm7,8 \
-}
-
-/*Performs the first two stages of an 8-point 1-D Hadamard transform.
- The transform is performed in place, except that outputs 0-3 are swapped with
- outputs 4-7.
- Outputs 2, 3, 6 and 7 from the second stage are negated (which allows us to
- perform this stage in place with no temporary registers).*/
-#define OC_HADAMARD_AB_8x4 __asm{ \
- /*Stage A: \
- Outputs 0-3 are swapped with 4-7 here.*/ \
- __asm paddw mm5,mm1 \
- __asm paddw mm6,mm2 \
- __asm paddw mm1,mm1 \
- __asm paddw mm2,mm2 \
- __asm psubw mm1,mm5 \
- __asm psubw mm2,mm6 \
- __asm paddw mm7,mm3 \
- __asm paddw mm4,mm0 \
- __asm paddw mm3,mm3 \
- __asm paddw mm0,mm0 \
- __asm psubw mm3,mm7 \
- __asm psubw mm0,mm4 \
- /*Stage B:*/ \
- __asm paddw mm0,mm2 \
- __asm paddw mm1,mm3 \
- __asm paddw mm4,mm6 \
- __asm paddw mm5,mm7 \
- __asm paddw mm2,mm2 \
- __asm paddw mm3,mm3 \
- __asm paddw mm6,mm6 \
- __asm paddw mm7,mm7 \
- __asm psubw mm2,mm0 \
- __asm psubw mm3,mm1 \
- __asm psubw mm6,mm4 \
- __asm psubw mm7,mm5 \
-}
-
-/*Performs the last stage of an 8-point 1-D Hadamard transform in place.
- Ouputs 1, 3, 5, and 7 are negated (which allows us to perform this stage in
- place with no temporary registers).*/
-#define OC_HADAMARD_C_8x4 __asm{ \
- /*Stage C:*/ \
- __asm paddw mm0,mm1 \
- __asm paddw mm2,mm3 \
- __asm paddw mm4,mm5 \
- __asm paddw mm6,mm7 \
- __asm paddw mm1,mm1 \
- __asm paddw mm3,mm3 \
- __asm paddw mm5,mm5 \
- __asm paddw mm7,mm7 \
- __asm psubw mm1,mm0 \
- __asm psubw mm3,mm2 \
- __asm psubw mm5,mm4 \
- __asm psubw mm7,mm6 \
-}
-
-/*Performs an 8-point 1-D Hadamard transform.
- The transform is performed in place, except that outputs 0-3 are swapped with
- outputs 4-7.
- Outputs 1, 2, 5 and 6 are negated (which allows us to perform the transform
- in place with no temporary registers).*/
-#define OC_HADAMARD_8x4 __asm{ \
- OC_HADAMARD_AB_8x4 \
- OC_HADAMARD_C_8x4 \
-}
-
-/*Performs the first part of the final stage of the Hadamard transform and
- summing of absolute values.
- At the end of this part, mm1 will contain the DC coefficient of the
- transform.*/
-#define OC_HADAMARD_C_ABS_ACCUM_A_8x4(_r6,_r7) __asm{ \
- /*We use the fact that \
- (abs(a+b)+abs(a-b))/2=max(abs(a),abs(b)) \
- to merge the final butterfly with the abs and the first stage of \
- accumulation. \
- Thus we can avoid using pabsw, which is not available until SSSE3. \
- Emulating pabsw takes 3 instructions, so the straightforward MMXEXT \
- implementation would be (3+3)*8+7=55 instructions (+4 for spilling \
- registers). \
- Even with pabsw, it would be (3+1)*8+7=39 instructions (with no spills). \
- This implementation is only 26 (+4 for spilling registers).*/ \
- __asm movq [_r7+BUF],mm7 \
- __asm movq [_r6+BUF],mm6 \
- /*mm7={0x7FFF}x4 \
- mm0=max(abs(mm0),abs(mm1))-0x7FFF*/ \
- __asm pcmpeqb mm7,mm7 \
- __asm movq mm6,mm0 \
- __asm psrlw mm7,1 \
- __asm paddw mm6,mm1 \
- __asm pmaxsw mm0,mm1 \
- __asm paddsw mm6,mm7 \
- __asm psubw mm0,mm6 \
- /*mm2=max(abs(mm2),abs(mm3))-0x7FFF \
- mm4=max(abs(mm4),abs(mm5))-0x7FFF*/ \
- __asm movq mm6,mm2 \
- __asm movq mm1,mm4 \
- __asm pmaxsw mm2,mm3 \
- __asm pmaxsw mm4,mm5 \
- __asm paddw mm6,mm3 \
- __asm paddw mm1,mm5 \
- __asm movq mm3,[_r7+BUF] \
-}
-
-/*Performs the second part of the final stage of the Hadamard transform and
- summing of absolute values.*/
-#define OC_HADAMARD_C_ABS_ACCUM_B_8x4(_r6,_r7) __asm{ \
- __asm paddsw mm6,mm7 \
- __asm movq mm5,[_r6+BUF] \
- __asm paddsw mm1,mm7 \
- __asm psubw mm2,mm6 \
- __asm psubw mm4,mm1 \
- /*mm7={1}x4 (needed for the horizontal add that follows) \
- mm0+=mm2+mm4+max(abs(mm3),abs(mm5))-0x7FFF*/ \
- __asm movq mm6,mm3 \
- __asm pmaxsw mm3,mm5 \
- __asm paddw mm0,mm2 \
- __asm paddw mm6,mm5 \
- __asm paddw mm0,mm4 \
- __asm paddsw mm6,mm7 \
- __asm paddw mm0,mm3 \
- __asm psrlw mm7,14 \
- __asm psubw mm0,mm6 \
-}
-
-/*Performs the last stage of an 8-point 1-D Hadamard transform, takes the
- absolute value of each component, and accumulates everything into mm0.
- This is the only portion of SATD which requires MMXEXT (we could use plain
- MMX, but it takes 4 instructions and an extra register to work around the
- lack of a pmaxsw, which is a pretty serious penalty).*/
-#define OC_HADAMARD_C_ABS_ACCUM_8x4(_r6,_r7) __asm{ \
- OC_HADAMARD_C_ABS_ACCUM_A_8x4(_r6,_r7) \
- OC_HADAMARD_C_ABS_ACCUM_B_8x4(_r6,_r7) \
-}
-
-/*Performs an 8-point 1-D Hadamard transform, takes the absolute value of each
- component, and accumulates everything into mm0.
- Note that mm0 will have an extra 4 added to each column, and that after
- removing this value, the remainder will be half the conventional value.*/
-#define OC_HADAMARD_ABS_ACCUM_8x4(_r6,_r7) __asm{ \
- OC_HADAMARD_AB_8x4 \
- OC_HADAMARD_C_ABS_ACCUM_8x4(_r6,_r7) \
-}
-
-/*Performs two 4x4 transposes (mostly) in place.
- On input, {mm0,mm1,mm2,mm3} contains rows {e,f,g,h}, and {mm4,mm5,mm6,mm7}
- contains rows {a,b,c,d}.
- On output, {0x40,0x50,0x60,0x70}+_off+BUF contains {e,f,g,h}^T, and
- {mm4,mm5,mm6,mm7} contains the transposed rows {a,b,c,d}^T.*/
-#define OC_TRANSPOSE_4x4x2(_off) __asm{ \
- /*First 4x4 transpose:*/ \
- __asm movq [0x10+_off+BUF],mm5 \
- /*mm0 = e3 e2 e1 e0 \
- mm1 = f3 f2 f1 f0 \
- mm2 = g3 g2 g1 g0 \
- mm3 = h3 h2 h1 h0*/ \
- __asm movq mm5,mm2 \
- __asm punpcklwd mm2,mm3 \
- __asm punpckhwd mm5,mm3 \
- __asm movq mm3,mm0 \
- __asm punpcklwd mm0,mm1 \
- __asm punpckhwd mm3,mm1 \
- /*mm0 = f1 e1 f0 e0 \
- mm3 = f3 e3 f2 e2 \
- mm2 = h1 g1 h0 g0 \
- mm5 = h3 g3 h2 g2*/ \
- __asm movq mm1,mm0 \
- __asm punpckldq mm0,mm2 \
- __asm punpckhdq mm1,mm2 \
- __asm movq mm2,mm3 \
- __asm punpckhdq mm3,mm5 \
- __asm movq [0x40+_off+BUF],mm0 \
- __asm punpckldq mm2,mm5 \
- /*mm0 = h0 g0 f0 e0 \
- mm1 = h1 g1 f1 e1 \
- mm2 = h2 g2 f2 e2 \
- mm3 = h3 g3 f3 e3*/ \
- __asm movq mm5,[0x10+_off+BUF] \
- /*Second 4x4 transpose:*/ \
- /*mm4 = a3 a2 a1 a0 \
- mm5 = b3 b2 b1 b0 \
- mm6 = c3 c2 c1 c0 \
- mm7 = d3 d2 d1 d0*/ \
- __asm movq mm0,mm6 \
- __asm punpcklwd mm6,mm7 \
- __asm movq [0x50+_off+BUF],mm1 \
- __asm punpckhwd mm0,mm7 \
- __asm movq mm7,mm4 \
- __asm punpcklwd mm4,mm5 \
- __asm movq [0x60+_off+BUF],mm2 \
- __asm punpckhwd mm7,mm5 \
- /*mm4 = b1 a1 b0 a0 \
- mm7 = b3 a3 b2 a2 \
- mm6 = d1 c1 d0 c0 \
- mm0 = d3 c3 d2 c2*/ \
- __asm movq mm5,mm4 \
- __asm punpckldq mm4,mm6 \
- __asm movq [0x70+_off+BUF],mm3 \
- __asm punpckhdq mm5,mm6 \
- __asm movq mm6,mm7 \
- __asm punpckhdq mm7,mm0 \
- __asm punpckldq mm6,mm0 \
- /*mm4 = d0 c0 b0 a0 \
- mm5 = d1 c1 b1 a1 \
- mm6 = d2 c2 b2 a2 \
- mm7 = d3 c3 b3 a3*/ \
-}
-
-static unsigned oc_int_frag_satd_thresh_mmxext(const unsigned char *_src,
- int _src_ystride,const unsigned char *_ref,int _ref_ystride,unsigned _thresh){
- OC_ALIGN8(ogg_int16_t buf[64]);
- ogg_int16_t *bufp;
- unsigned ret1;
- unsigned ret2;
- bufp=buf;
- __asm{
-#define SRC esi
-#define REF eax
-#define SRC_YSTRIDE ecx
-#define REF_YSTRIDE edx
-#define BUF edi
-#define RET eax
-#define RET2 edx
- mov SRC,_src
- mov SRC_YSTRIDE,_src_ystride
- mov REF,_ref
- mov REF_YSTRIDE,_ref_ystride
- mov BUF,bufp
- OC_LOAD_SUB_8x4(0x00)
- OC_HADAMARD_8x4
- OC_TRANSPOSE_4x4x2(0x00)
- /*Finish swapping out this 8x4 block to make room for the next one.
- mm0...mm3 have been swapped out already.*/
- movq [0x00+BUF],mm4
- movq [0x10+BUF],mm5
- movq [0x20+BUF],mm6
- movq [0x30+BUF],mm7
- OC_LOAD_SUB_8x4(0x04)
- OC_HADAMARD_8x4
- OC_TRANSPOSE_4x4x2(0x08)
- /*Here the first 4x4 block of output from the last transpose is the second
- 4x4 block of input for the next transform.
- We have cleverly arranged that it already be in the appropriate place, so
- we only have to do half the loads.*/
- movq mm1,[0x10+BUF]
- movq mm2,[0x20+BUF]
- movq mm3,[0x30+BUF]
- movq mm0,[0x00+BUF]
- OC_HADAMARD_ABS_ACCUM_8x4(0x28,0x38)
- /*Up to this point, everything fit in 16 bits (8 input + 1 for the
- difference + 2*3 for the two 8-point 1-D Hadamards - 1 for the abs - 1
- for the factor of two we dropped + 3 for the vertical accumulation).
- Now we finally have to promote things to dwords.
- We break this part out of OC_HADAMARD_ABS_ACCUM_8x4 to hide the long
- latency of pmaddwd by starting the next series of loads now.*/
- mov RET2,_thresh
- pmaddwd mm0,mm7
- movq mm1,[0x50+BUF]
- movq mm5,[0x58+BUF]
- movq mm4,mm0
- movq mm2,[0x60+BUF]
- punpckhdq mm0,mm0
- movq mm6,[0x68+BUF]
- paddd mm4,mm0
- movq mm3,[0x70+BUF]
- movd RET,mm4
- movq mm7,[0x78+BUF]
- /*The sums produced by OC_HADAMARD_ABS_ACCUM_8x4 each have an extra 4
- added to them, and a factor of two removed; correct the final sum here.*/
- lea RET,[RET+RET-32]
- movq mm0,[0x40+BUF]
- cmp RET,RET2
- movq mm4,[0x48+BUF]
- jae at_end
- OC_HADAMARD_ABS_ACCUM_8x4(0x68,0x78)
- pmaddwd mm0,mm7
- /*There isn't much to stick in here to hide the latency this time, but the
- alternative to pmaddwd is movq->punpcklwd->punpckhwd->paddd, whose
- latency is even worse.*/
- sub RET,32
- movq mm4,mm0
- punpckhdq mm0,mm0
- paddd mm4,mm0
- movd RET2,mm4
- lea RET,[RET+RET2*2]
- align 16
-at_end:
- mov ret1,RET
-#undef SRC
-#undef REF
-#undef SRC_YSTRIDE
-#undef REF_YSTRIDE
-#undef BUF
-#undef RET
-#undef RET2
- }
- return ret1;
-}
-
-unsigned oc_enc_frag_satd_thresh_mmxext(const unsigned char *_src,
- const unsigned char *_ref,int _ystride,unsigned _thresh){
- return oc_int_frag_satd_thresh_mmxext(_src,_ystride,_ref,_ystride,_thresh);
-}
-
-
-/*Our internal implementation of frag_copy2 takes an extra stride parameter so
- we can share code with oc_enc_frag_satd2_thresh_mmxext().*/
-static void oc_int_frag_copy2_mmxext(unsigned char *_dst,int _dst_ystride,
- const unsigned char *_src1,const unsigned char *_src2,int _src_ystride){
- __asm{
- /*Load the first 3 rows.*/
-#define DST_YSTRIDE edi
-#define SRC_YSTRIDE esi
-#define DST eax
-#define SRC1 edx
-#define SRC2 ecx
- mov DST_YSTRIDE,_dst_ystride
- mov SRC_YSTRIDE,_src_ystride
- mov DST,_dst
- mov SRC1,_src1
- mov SRC2,_src2
- movq mm0,[SRC1]
- movq mm1,[SRC2]
- movq mm2,[SRC1+SRC_YSTRIDE]
- lea SRC1,[SRC1+SRC_YSTRIDE*2]
- movq mm3,[SRC2+SRC_YSTRIDE]
- lea SRC2,[SRC2+SRC_YSTRIDE*2]
- pxor mm7,mm7
- movq mm4,[SRC1]
- pcmpeqb mm6,mm6
- movq mm5,[SRC2]
- /*mm7={1}x8.*/
- psubb mm7,mm6
- /*Start averaging mm0 and mm1 into mm6.*/
- movq mm6,mm0
- pxor mm0,mm1
- pavgb mm6,mm1
- /*mm1 is free, start averaging mm3 into mm2 using mm1.*/
- movq mm1,mm2
- pand mm0,mm7
- pavgb mm2,mm3
- pxor mm1,mm3
- /*mm3 is free.*/
- psubb mm6,mm0
- /*mm0 is free, start loading the next row.*/
- movq mm0,[SRC1+SRC_YSTRIDE]
- /*Start averaging mm5 and mm4 using mm3.*/
- movq mm3,mm4
- /*mm6 [row 0] is done; write it out.*/
- movq [DST],mm6
- pand mm1,mm7
- pavgb mm4,mm5
- psubb mm2,mm1
- /*mm1 is free, continue loading the next row.*/
- movq mm1,[SRC2+SRC_YSTRIDE]
- pxor mm3,mm5
- lea SRC1,[SRC1+SRC_YSTRIDE*2]
- /*mm2 [row 1] is done; write it out.*/
- movq [DST+DST_YSTRIDE],mm2
- pand mm3,mm7
- /*Start loading the next row.*/
- movq mm2,[SRC1]
- lea DST,[DST+DST_YSTRIDE*2]
- psubb mm4,mm3
- lea SRC2,[SRC2+SRC_YSTRIDE*2]
- /*mm4 [row 2] is done; write it out.*/
- movq [DST],mm4
- /*Continue loading the next row.*/
- movq mm3,[SRC2]
- /*Start averaging mm0 and mm1 into mm6.*/
- movq mm6,mm0
- pxor mm0,mm1
- /*Start loading the next row.*/
- movq mm4,[SRC1+SRC_YSTRIDE]
- pavgb mm6,mm1
- /*mm1 is free; start averaging mm3 into mm2 using mm1.*/
- movq mm1,mm2
- pand mm0,mm7
- /*Continue loading the next row.*/
- movq mm5,[SRC2+SRC_YSTRIDE]
- pavgb mm2,mm3
- lea SRC1,[SRC1+SRC_YSTRIDE*2]
- pxor mm1,mm3
- /*mm3 is free.*/
- psubb mm6,mm0
- /*mm0 is free, start loading the next row.*/
- movq mm0,[SRC1]
- /*Start averaging mm5 into mm4 using mm3.*/
- movq mm3,mm4
- /*mm6 [row 3] is done; write it out.*/
- movq [DST+DST_YSTRIDE],mm6
- pand mm1,mm7
- lea SRC2,[SRC2+SRC_YSTRIDE*2]
- pavgb mm4,mm5
- lea DST,[DST+DST_YSTRIDE*2]
- psubb mm2,mm1
- /*mm1 is free; continue loading the next row.*/
- movq mm1,[SRC2]
- pxor mm3,mm5
- /*mm2 [row 4] is done; write it out.*/
- movq [DST],mm2
- pand mm3,mm7
- /*Start loading the next row.*/
- movq mm2,[SRC1+SRC_YSTRIDE]
- psubb mm4,mm3
- /*Start averaging mm0 and mm1 into mm6.*/
- movq mm6,mm0
- /*Continue loading the next row.*/
- movq mm3,[SRC2+SRC_YSTRIDE]
- /*mm4 [row 5] is done; write it out.*/
- movq [DST+DST_YSTRIDE],mm4
- pxor mm0,mm1
- pavgb mm6,mm1
- /*mm4 is free; start averaging mm3 into mm2 using mm4.*/
- movq mm4,mm2
- pand mm0,mm7
- pavgb mm2,mm3
- pxor mm4,mm3
- lea DST,[DST+DST_YSTRIDE*2]
- psubb mm6,mm0
- pand mm4,mm7
- /*mm6 [row 6] is done, write it out.*/
- movq [DST],mm6
- psubb mm2,mm4
- /*mm2 [row 7] is done, write it out.*/
- movq [DST+DST_YSTRIDE],mm2
-#undef SRC1
-#undef SRC2
-#undef SRC_YSTRIDE
-#undef DST_YSTRIDE
-#undef DST
- }
-}
-
-unsigned oc_enc_frag_satd2_thresh_mmxext(const unsigned char *_src,
- const unsigned char *_ref1,const unsigned char *_ref2,int _ystride,
- unsigned _thresh){
- OC_ALIGN8(unsigned char ref[64]);
- oc_int_frag_copy2_mmxext(ref,8,_ref1,_ref2,_ystride);
- return oc_int_frag_satd_thresh_mmxext(_src,_ystride,ref,8,_thresh);
-}
-
-unsigned oc_enc_frag_intra_satd_mmxext(const unsigned char *_src,
- int _ystride){
- OC_ALIGN8(ogg_int16_t buf[64]);
- ogg_int16_t *bufp;
- unsigned ret1;
- unsigned ret2;
- bufp=buf;
- __asm{
-#define SRC eax
-#define SRC4 esi
-#define BUF edi
-#define RET eax
-#define RET_WORD ax
-#define RET2 ecx
-#define YSTRIDE edx
-#define YSTRIDE3 ecx
- mov SRC,_src
- mov BUF,bufp
- mov YSTRIDE,_ystride
- /* src4 = src+4*ystride */
- lea SRC4,[SRC+YSTRIDE*4]
- /* ystride3 = 3*ystride */
- lea YSTRIDE3,[YSTRIDE+YSTRIDE*2]
- OC_LOAD_8x4(0x00)
- OC_HADAMARD_8x4
- OC_TRANSPOSE_4x4x2(0x00)
- /*Finish swapping out this 8x4 block to make room for the next one.
- mm0...mm3 have been swapped out already.*/
- movq [0x00+BUF],mm4
- movq [0x10+BUF],mm5
- movq [0x20+BUF],mm6
- movq [0x30+BUF],mm7
- OC_LOAD_8x4(0x04)
- OC_HADAMARD_8x4
- OC_TRANSPOSE_4x4x2(0x08)
- /*Here the first 4x4 block of output from the last transpose is the second
- 4x4 block of input for the next transform.
- We have cleverly arranged that it already be in the appropriate place, so
- we only have to do half the loads.*/
- movq mm1,[0x10+BUF]
- movq mm2,[0x20+BUF]
- movq mm3,[0x30+BUF]
- movq mm0,[0x00+BUF]
- /*We split out the stages here so we can save the DC coefficient in the
- middle.*/
- OC_HADAMARD_AB_8x4
- OC_HADAMARD_C_ABS_ACCUM_A_8x4(0x28,0x38)
- movd RET,mm1
- OC_HADAMARD_C_ABS_ACCUM_B_8x4(0x28,0x38)
- /*Up to this point, everything fit in 16 bits (8 input + 1 for the
- difference + 2*3 for the two 8-point 1-D Hadamards - 1 for the abs - 1
- for the factor of two we dropped + 3 for the vertical accumulation).
- Now we finally have to promote things to dwords.
- We break this part out of OC_HADAMARD_ABS_ACCUM_8x4 to hide the long
- latency of pmaddwd by starting the next series of loads now.*/
- pmaddwd mm0,mm7
- movq mm1,[0x50+BUF]
- movq mm5,[0x58+BUF]
- movq mm2,[0x60+BUF]
- movq mm4,mm0
- movq mm6,[0x68+BUF]
- punpckhdq mm0,mm0
- movq mm3,[0x70+BUF]
- paddd mm4,mm0
- movq mm7,[0x78+BUF]
- movd RET2,mm4
- movq mm0,[0x40+BUF]
- movq mm4,[0x48+BUF]
- OC_HADAMARD_ABS_ACCUM_8x4(0x68,0x78)
- pmaddwd mm0,mm7
- /*We assume that the DC coefficient is always positive (which is true,
- because the input to the INTRA transform was not a difference).*/
- movzx RET,RET_WORD
- add RET2,RET2
- sub RET2,RET
- movq mm4,mm0
- punpckhdq mm0,mm0
- paddd mm4,mm0
- movd RET,mm4
- lea RET,[-64+RET2+RET*2]
- mov [ret1],RET
-#undef SRC
-#undef SRC4
-#undef BUF
-#undef RET
-#undef RET_WORD
-#undef RET2
-#undef YSTRIDE
-#undef YSTRIDE3
- }
- return ret1;
-}
-
-void oc_enc_frag_sub_mmx(ogg_int16_t _residue[64],
- const unsigned char *_src, const unsigned char *_ref,int _ystride){
- int i;
- __asm pxor mm7,mm7
- for(i=4;i-->0;){
- __asm{
-#define SRC edx
-#define YSTRIDE esi
-#define RESIDUE eax
-#define REF ecx
- mov YSTRIDE,_ystride
- mov RESIDUE,_residue
- mov SRC,_src
- mov REF,_ref
- /*mm0=[src]*/
- movq mm0,[SRC]
- /*mm1=[ref]*/
- movq mm1,[REF]
- /*mm4=[src+ystride]*/
- movq mm4,[SRC+YSTRIDE]
- /*mm5=[ref+ystride]*/
- movq mm5,[REF+YSTRIDE]
- /*Compute [src]-[ref].*/
- movq mm2,mm0
- punpcklbw mm0,mm7
- movq mm3,mm1
- punpckhbw mm2,mm7
- punpcklbw mm1,mm7
- punpckhbw mm3,mm7
- psubw mm0,mm1
- psubw mm2,mm3
- /*Compute [src+ystride]-[ref+ystride].*/
- movq mm1,mm4
- punpcklbw mm4,mm7
- movq mm3,mm5
- punpckhbw mm1,mm7
- lea SRC,[SRC+YSTRIDE*2]
- punpcklbw mm5,mm7
- lea REF,[REF+YSTRIDE*2]
- punpckhbw mm3,mm7
- psubw mm4,mm5
- psubw mm1,mm3
- /*Write the answer out.*/
- movq [RESIDUE+0x00],mm0
- movq [RESIDUE+0x08],mm2
- movq [RESIDUE+0x10],mm4
- movq [RESIDUE+0x18],mm1
- lea RESIDUE,[RESIDUE+0x20]
- mov _residue,RESIDUE
- mov _src,SRC
- mov _ref,REF
-#undef SRC
-#undef YSTRIDE
-#undef RESIDUE
-#undef REF
- }
- }
-}
-
-void oc_enc_frag_sub_128_mmx(ogg_int16_t _residue[64],
- const unsigned char *_src,int _ystride){
- __asm{
-#define YSTRIDE edx
-#define YSTRIDE3 edi
-#define RESIDUE ecx
-#define SRC eax
- mov YSTRIDE,_ystride
- mov RESIDUE,_residue
- mov SRC,_src
- /*mm0=[src]*/
- movq mm0,[SRC]
- /*mm1=[src+ystride]*/
- movq mm1,[SRC+YSTRIDE]
- /*mm6={-1}x4*/
- pcmpeqw mm6,mm6
- /*mm2=[src+2*ystride]*/
- movq mm2,[SRC+YSTRIDE*2]
- /*[ystride3]=3*[ystride]*/
- lea YSTRIDE3,[YSTRIDE+YSTRIDE*2]
- /*mm6={1}x4*/
- psllw mm6,15
- /*mm3=[src+3*ystride]*/
- movq mm3,[SRC+YSTRIDE3]
- /*mm6={128}x4*/
- psrlw mm6,8
- /*mm7=0*/
- pxor mm7,mm7
- /*[src]=[src]+4*[ystride]*/
- lea SRC,[SRC+YSTRIDE*4]
- /*Compute [src]-128 and [src+ystride]-128*/
- movq mm4,mm0
- punpcklbw mm0,mm7
- movq mm5,mm1
- punpckhbw mm4,mm7
- psubw mm0,mm6
- punpcklbw mm1,mm7
- psubw mm4,mm6
- punpckhbw mm5,mm7
- psubw mm1,mm6
- psubw mm5,mm6
- /*Write the answer out.*/
- movq [RESIDUE+0x00],mm0
- movq [RESIDUE+0x08],mm4
- movq [RESIDUE+0x10],mm1
- movq [RESIDUE+0x18],mm5
- /*mm0=[src+4*ystride]*/
- movq mm0,[SRC]
- /*mm1=[src+5*ystride]*/
- movq mm1,[SRC+YSTRIDE]
- /*Compute [src+2*ystride]-128 and [src+3*ystride]-128*/
- movq mm4,mm2
- punpcklbw mm2,mm7
- movq mm5,mm3
- punpckhbw mm4,mm7
- psubw mm2,mm6
- punpcklbw mm3,mm7
- psubw mm4,mm6
- punpckhbw mm5,mm7
- psubw mm3,mm6
- psubw mm5,mm6
- /*Write the answer out.*/
- movq [RESIDUE+0x20],mm2
- movq [RESIDUE+0x28],mm4
- movq [RESIDUE+0x30],mm3
- movq [RESIDUE+0x38],mm5
-    /*Load [src+6*ystride] and [src+7*ystride];
-      compute [src+4*ystride]-128 and [src+5*ystride]-128*/
- movq mm2,[SRC+YSTRIDE*2]
- movq mm3,[SRC+YSTRIDE3]
- movq mm4,mm0
- punpcklbw mm0,mm7
- movq mm5,mm1
- punpckhbw mm4,mm7
- psubw mm0,mm6
- punpcklbw mm1,mm7
- psubw mm4,mm6
- punpckhbw mm5,mm7
- psubw mm1,mm6
- psubw mm5,mm6
- /*Write the answer out.*/
- movq [RESIDUE+0x40],mm0
- movq [RESIDUE+0x48],mm4
- movq [RESIDUE+0x50],mm1
- movq [RESIDUE+0x58],mm5
- /*Compute [src+6*ystride]-128 and [src+7*ystride]-128*/
- movq mm4,mm2
- punpcklbw mm2,mm7
- movq mm5,mm3
- punpckhbw mm4,mm7
- psubw mm2,mm6
- punpcklbw mm3,mm7
- psubw mm4,mm6
- punpckhbw mm5,mm7
- psubw mm3,mm6
- psubw mm5,mm6
- /*Write the answer out.*/
- movq [RESIDUE+0x60],mm2
- movq [RESIDUE+0x68],mm4
- movq [RESIDUE+0x70],mm3
- movq [RESIDUE+0x78],mm5
-#undef YSTRIDE
-#undef YSTRIDE3
-#undef RESIDUE
-#undef SRC
- }
-}
-
-void oc_enc_frag_copy2_mmxext(unsigned char *_dst,
- const unsigned char *_src1,const unsigned char *_src2,int _ystride){
- oc_int_frag_copy2_mmxext(_dst,_ystride,_src1,_src2,_ystride);
-}
-
-#endif
+/********************************************************************
+ * *
+ * THIS FILE IS PART OF THE OggTheora SOFTWARE CODEC SOURCE CODE. *
+ * USE, DISTRIBUTION AND REPRODUCTION OF THIS LIBRARY SOURCE IS *
+ * GOVERNED BY A BSD-STYLE SOURCE LICENSE INCLUDED WITH THIS SOURCE *
+ * IN 'COPYING'. PLEASE READ THESE TERMS BEFORE DISTRIBUTING. *
+ * *
+ * THE Theora SOURCE CODE IS COPYRIGHT (C) 2002-2009 *
+ * by the Xiph.Org Foundation http://www.xiph.org/ *
+ * *
+ ********************************************************************
+
+ function:
+ last mod: $Id: dsp_mmx.c 14579 2008-03-12 06:42:40Z xiphmont $
+
+ ********************************************************************/
+#include <stddef.h>
+#include "x86enc.h"
+
+#if defined(OC_X86_ASM)
+
+unsigned oc_enc_frag_sad_mmxext(const unsigned char *_src,
+ const unsigned char *_ref,int _ystride){
+ ptrdiff_t ret;
+ __asm{
+#define SRC esi
+#define REF edx
+#define YSTRIDE ecx
+#define YSTRIDE3 edi
+ mov YSTRIDE,_ystride
+ mov SRC,_src
+ mov REF,_ref
+ /*Load the first 4 rows of each block.*/
+ movq mm0,[SRC]
+ movq mm1,[REF]
+ movq mm2,[SRC][YSTRIDE]
+ movq mm3,[REF][YSTRIDE]
+ lea YSTRIDE3,[YSTRIDE+YSTRIDE*2]
+ movq mm4,[SRC+YSTRIDE*2]
+ movq mm5,[REF+YSTRIDE*2]
+ movq mm6,[SRC+YSTRIDE3]
+ movq mm7,[REF+YSTRIDE3]
+ /*Compute their SADs and add them in mm0*/
+ psadbw mm0,mm1
+ psadbw mm2,mm3
+ lea SRC,[SRC+YSTRIDE*4]
+ paddw mm0,mm2
+ lea REF,[REF+YSTRIDE*4]
+ /*Load the next 3 rows as registers become available.*/
+ movq mm2,[SRC]
+ movq mm3,[REF]
+ psadbw mm4,mm5
+ psadbw mm6,mm7
+ paddw mm0,mm4
+ movq mm5,[REF+YSTRIDE]
+ movq mm4,[SRC+YSTRIDE]
+ paddw mm0,mm6
+ movq mm7,[REF+YSTRIDE*2]
+ movq mm6,[SRC+YSTRIDE*2]
+ /*Start adding their SADs to mm0*/
+ psadbw mm2,mm3
+ psadbw mm4,mm5
+ paddw mm0,mm2
+ psadbw mm6,mm7
+ /*Load last row as registers become available.*/
+ movq mm2,[SRC+YSTRIDE3]
+ movq mm3,[REF+YSTRIDE3]
+ /*And finish adding up their SADs.*/
+ paddw mm0,mm4
+ psadbw mm2,mm3
+ paddw mm0,mm6
+ paddw mm0,mm2
+ movd [ret],mm0
+#undef SRC
+#undef REF
+#undef YSTRIDE
+#undef YSTRIDE3
+ }
+ return (unsigned)ret;
+}
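For reference, the psadbw loop above computes a plain 8x8 sum of absolute
differences. A minimal scalar sketch of the same metric (the helper name and
the plain-C form are illustrative only, not part of libtheora):

#include <stdlib.h>

/*Scalar reference for the 8x8 SAD computed by the MMXEXT routine above.*/
static unsigned sad_8x8_ref(const unsigned char *_src,
 const unsigned char *_ref,int _ystride){
  unsigned sad;
  int      i;
  int      j;
  sad=0;
  for(i=0;i<8;i++){
    for(j=0;j<8;j++)sad+=abs(_src[j]-_ref[j]);
    _src+=_ystride;
    _ref+=_ystride;
  }
  return sad;
}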
+
+unsigned oc_enc_frag_sad_thresh_mmxext(const unsigned char *_src,
+ const unsigned char *_ref,int _ystride,unsigned _thresh){
+ /*Early termination is for suckers.*/
+ return oc_enc_frag_sad_mmxext(_src,_ref,_ystride);
+}
+
+#define OC_SAD2_LOOP __asm{ \
+ /*We want to compute (mm0+mm1>>1) on unsigned bytes without overflow, but \
+ pavgb computes (mm0+mm1+1>>1). \
+ The latter is exactly 1 too large when the low bit of two corresponding \
+ bytes is only set in one of them. \
+ Therefore we pxor the operands, pand to mask out the low bits, and psubb to \
+ correct the output of pavgb.*/ \
+ __asm movq mm6,mm0 \
+ __asm lea REF1,[REF1+YSTRIDE*2] \
+ __asm pxor mm0,mm1 \
+ __asm pavgb mm6,mm1 \
+ __asm lea REF2,[REF2+YSTRIDE*2] \
+ __asm movq mm1,mm2 \
+ __asm pand mm0,mm7 \
+ __asm pavgb mm2,mm3 \
+ __asm pxor mm1,mm3 \
+ __asm movq mm3,[REF2+YSTRIDE] \
+ __asm psubb mm6,mm0 \
+ __asm movq mm0,[REF1] \
+ __asm pand mm1,mm7 \
+ __asm psadbw mm4,mm6 \
+ __asm movd mm6,RET \
+ __asm psubb mm2,mm1 \
+ __asm movq mm1,[REF2] \
+ __asm lea SRC,[SRC+YSTRIDE*2] \
+ __asm psadbw mm5,mm2 \
+ __asm movq mm2,[REF1+YSTRIDE] \
+ __asm paddw mm5,mm4 \
+ __asm movq mm4,[SRC] \
+ __asm paddw mm6,mm5 \
+ __asm movq mm5,[SRC+YSTRIDE] \
+ __asm movd RET,mm6 \
+}
+
+/*Same as above, but does not pre-load the next two rows.*/
+#define OC_SAD2_TAIL __asm{ \
+ __asm movq mm6,mm0 \
+ __asm pavgb mm0,mm1 \
+ __asm pxor mm6,mm1 \
+ __asm movq mm1,mm2 \
+ __asm pand mm6,mm7 \
+ __asm pavgb mm2,mm3 \
+ __asm pxor mm1,mm3 \
+ __asm psubb mm0,mm6 \
+ __asm pand mm1,mm7 \
+ __asm psadbw mm4,mm0 \
+ __asm psubb mm2,mm1 \
+ __asm movd mm6,RET \
+ __asm psadbw mm5,mm2 \
+ __asm paddw mm5,mm4 \
+ __asm paddw mm6,mm5 \
+ __asm movd RET,mm6 \
+}
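The correction described at the top of OC_SAD2_LOOP can be stated in scalar
terms: pavgb computes (a+b+1)>>1, and subtracting (a^b)&1 turns that into the
truncating average (a+b)>>1. A minimal sketch of the identity (hypothetical
helper, for illustration only):

/*((a+b+1)>>1)-((a^b)&1)==(a+b)>>1 for unsigned bytes a and b.*/
static unsigned char avg_trunc(unsigned char _a,unsigned char _b){
  unsigned rounded;
  /*What pavgb computes.*/
  rounded=((unsigned)_a+_b+1)>>1;
  /*Remove the rounding bias; the result equals (_a+_b)>>1.*/
  return (unsigned char)(rounded-((_a^_b)&1));
}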
+
+unsigned oc_enc_frag_sad2_thresh_mmxext(const unsigned char *_src,
+ const unsigned char *_ref1,const unsigned char *_ref2,int _ystride,
+ unsigned _thresh){
+ ptrdiff_t ret;
+ __asm{
+#define REF1 ecx
+#define REF2 edi
+#define YSTRIDE esi
+#define SRC edx
+#define RET eax
+ mov YSTRIDE,_ystride
+ mov SRC,_src
+ mov REF1,_ref1
+ mov REF2,_ref2
+ movq mm0,[REF1]
+ movq mm1,[REF2]
+ movq mm2,[REF1+YSTRIDE]
+ movq mm3,[REF2+YSTRIDE]
+ xor RET,RET
+ movq mm4,[SRC]
+ pxor mm7,mm7
+ pcmpeqb mm6,mm6
+ movq mm5,[SRC+YSTRIDE]
+ psubb mm7,mm6
+ OC_SAD2_LOOP
+ OC_SAD2_LOOP
+ OC_SAD2_LOOP
+ OC_SAD2_TAIL
+ mov [ret],RET
+#undef REF1
+#undef REF2
+#undef YSTRIDE
+#undef SRC
+#undef RET
+ }
+ return (unsigned)ret;
+}
+
+/*Load an 8x4 array of pixel values from %[src] and %[ref] and compute their
+ 16-bit difference in mm0...mm7.*/
+#define OC_LOAD_SUB_8x4(_off) __asm{ \
+ __asm movd mm0,[_off+SRC] \
+ __asm movd mm4,[_off+REF] \
+ __asm movd mm1,[_off+SRC+SRC_YSTRIDE] \
+ __asm lea SRC,[SRC+SRC_YSTRIDE*2] \
+ __asm movd mm5,[_off+REF+REF_YSTRIDE] \
+ __asm lea REF,[REF+REF_YSTRIDE*2] \
+ __asm movd mm2,[_off+SRC] \
+ __asm movd mm7,[_off+REF] \
+ __asm movd mm3,[_off+SRC+SRC_YSTRIDE] \
+ __asm movd mm6,[_off+REF+REF_YSTRIDE] \
+ __asm punpcklbw mm0,mm4 \
+ __asm lea SRC,[SRC+SRC_YSTRIDE*2] \
+ __asm punpcklbw mm4,mm4 \
+ __asm lea REF,[REF+REF_YSTRIDE*2] \
+ __asm psubw mm0,mm4 \
+ __asm movd mm4,[_off+SRC] \
+ __asm movq [_off*2+BUF],mm0 \
+ __asm movd mm0,[_off+REF] \
+ __asm punpcklbw mm1,mm5 \
+ __asm punpcklbw mm5,mm5 \
+ __asm psubw mm1,mm5 \
+ __asm movd mm5,[_off+SRC+SRC_YSTRIDE] \
+ __asm punpcklbw mm2,mm7 \
+ __asm punpcklbw mm7,mm7 \
+ __asm psubw mm2,mm7 \
+ __asm movd mm7,[_off+REF+REF_YSTRIDE] \
+ __asm punpcklbw mm3,mm6 \
+ __asm lea SRC,[SRC+SRC_YSTRIDE*2] \
+ __asm punpcklbw mm6,mm6 \
+ __asm psubw mm3,mm6 \
+ __asm movd mm6,[_off+SRC] \
+ __asm punpcklbw mm4,mm0 \
+ __asm lea REF,[REF+REF_YSTRIDE*2] \
+ __asm punpcklbw mm0,mm0 \
+ __asm lea SRC,[SRC+SRC_YSTRIDE*2] \
+ __asm psubw mm4,mm0 \
+ __asm movd mm0,[_off+REF] \
+ __asm punpcklbw mm5,mm7 \
+ __asm neg SRC_YSTRIDE \
+ __asm punpcklbw mm7,mm7 \
+ __asm psubw mm5,mm7 \
+ __asm movd mm7,[_off+SRC+SRC_YSTRIDE] \
+ __asm punpcklbw mm6,mm0 \
+ __asm lea REF,[REF+REF_YSTRIDE*2] \
+ __asm punpcklbw mm0,mm0 \
+ __asm neg REF_YSTRIDE \
+ __asm psubw mm6,mm0 \
+ __asm movd mm0,[_off+REF+REF_YSTRIDE] \
+ __asm lea SRC,[SRC+SRC_YSTRIDE*8] \
+ __asm punpcklbw mm7,mm0 \
+ __asm neg SRC_YSTRIDE \
+ __asm punpcklbw mm0,mm0 \
+ __asm lea REF,[REF+REF_YSTRIDE*8] \
+ __asm psubw mm7,mm0 \
+ __asm neg REF_YSTRIDE \
+ __asm movq mm0,[_off*2+BUF] \
+}
+
+/*Load an 8x4 array of pixel values from %[src] into %%mm0...%%mm7.*/
+#define OC_LOAD_8x4(_off) __asm{ \
+ __asm movd mm0,[_off+SRC] \
+ __asm movd mm1,[_off+SRC+YSTRIDE] \
+ __asm movd mm2,[_off+SRC+YSTRIDE*2] \
+ __asm pxor mm7,mm7 \
+ __asm movd mm3,[_off+SRC+YSTRIDE3] \
+ __asm punpcklbw mm0,mm7 \
+ __asm movd mm4,[_off+SRC4] \
+ __asm punpcklbw mm1,mm7 \
+ __asm movd mm5,[_off+SRC4+YSTRIDE] \
+ __asm punpcklbw mm2,mm7 \
+ __asm movd mm6,[_off+SRC4+YSTRIDE*2] \
+ __asm punpcklbw mm3,mm7 \
+ __asm movd mm7,[_off+SRC4+YSTRIDE3] \
+ __asm punpcklbw mm4,mm4 \
+ __asm punpcklbw mm5,mm5 \
+ __asm psrlw mm4,8 \
+ __asm psrlw mm5,8 \
+ __asm punpcklbw mm6,mm6 \
+ __asm punpcklbw mm7,mm7 \
+ __asm psrlw mm6,8 \
+ __asm psrlw mm7,8 \
+}
+
+/*Performs the first two stages of an 8-point 1-D Hadamard transform.
+ The transform is performed in place, except that outputs 0-3 are swapped with
+ outputs 4-7.
+ Outputs 2, 3, 6 and 7 from the second stage are negated (which allows us to
+ perform this stage in place with no temporary registers).*/
+#define OC_HADAMARD_AB_8x4 __asm{ \
+ /*Stage A: \
+ Outputs 0-3 are swapped with 4-7 here.*/ \
+ __asm paddw mm5,mm1 \
+ __asm paddw mm6,mm2 \
+ __asm paddw mm1,mm1 \
+ __asm paddw mm2,mm2 \
+ __asm psubw mm1,mm5 \
+ __asm psubw mm2,mm6 \
+ __asm paddw mm7,mm3 \
+ __asm paddw mm4,mm0 \
+ __asm paddw mm3,mm3 \
+ __asm paddw mm0,mm0 \
+ __asm psubw mm3,mm7 \
+ __asm psubw mm0,mm4 \
+ /*Stage B:*/ \
+ __asm paddw mm0,mm2 \
+ __asm paddw mm1,mm3 \
+ __asm paddw mm4,mm6 \
+ __asm paddw mm5,mm7 \
+ __asm paddw mm2,mm2 \
+ __asm paddw mm3,mm3 \
+ __asm paddw mm6,mm6 \
+ __asm paddw mm7,mm7 \
+ __asm psubw mm2,mm0 \
+ __asm psubw mm3,mm1 \
+ __asm psubw mm6,mm4 \
+ __asm psubw mm7,mm5 \
+}
+
+/*Performs the last stage of an 8-point 1-D Hadamard transform in place.
+  Outputs 1, 3, 5, and 7 are negated (which allows us to perform this stage in
+ place with no temporary registers).*/
+#define OC_HADAMARD_C_8x4 __asm{ \
+ /*Stage C:*/ \
+ __asm paddw mm0,mm1 \
+ __asm paddw mm2,mm3 \
+ __asm paddw mm4,mm5 \
+ __asm paddw mm6,mm7 \
+ __asm paddw mm1,mm1 \
+ __asm paddw mm3,mm3 \
+ __asm paddw mm5,mm5 \
+ __asm paddw mm7,mm7 \
+ __asm psubw mm1,mm0 \
+ __asm psubw mm3,mm2 \
+ __asm psubw mm5,mm4 \
+ __asm psubw mm7,mm6 \
+}
+
+/*Performs an 8-point 1-D Hadamard transform.
+ The transform is performed in place, except that outputs 0-3 are swapped with
+ outputs 4-7.
+ Outputs 1, 2, 5 and 6 are negated (which allows us to perform the transform
+ in place with no temporary registers).*/
+#define OC_HADAMARD_8x4 __asm{ \
+ OC_HADAMARD_AB_8x4 \
+ OC_HADAMARD_C_8x4 \
+}
+
+/*Performs the first part of the final stage of the Hadamard transform and
+ summing of absolute values.
+ At the end of this part, mm1 will contain the DC coefficient of the
+ transform.*/
+#define OC_HADAMARD_C_ABS_ACCUM_A_8x4(_r6,_r7) __asm{ \
+ /*We use the fact that \
+ (abs(a+b)+abs(a-b))/2=max(abs(a),abs(b)) \
+ to merge the final butterfly with the abs and the first stage of \
+ accumulation. \
+ Thus we can avoid using pabsw, which is not available until SSSE3. \
+ Emulating pabsw takes 3 instructions, so the straightforward MMXEXT \
+ implementation would be (3+3)*8+7=55 instructions (+4 for spilling \
+ registers). \
+ Even with pabsw, it would be (3+1)*8+7=39 instructions (with no spills). \
+ This implementation is only 26 (+4 for spilling registers).*/ \
+ __asm movq [_r7+BUF],mm7 \
+ __asm movq [_r6+BUF],mm6 \
+ /*mm7={0x7FFF}x4 \
+ mm0=max(abs(mm0),abs(mm1))-0x7FFF*/ \
+ __asm pcmpeqb mm7,mm7 \
+ __asm movq mm6,mm0 \
+ __asm psrlw mm7,1 \
+ __asm paddw mm6,mm1 \
+ __asm pmaxsw mm0,mm1 \
+ __asm paddsw mm6,mm7 \
+ __asm psubw mm0,mm6 \
+ /*mm2=max(abs(mm2),abs(mm3))-0x7FFF \
+ mm4=max(abs(mm4),abs(mm5))-0x7FFF*/ \
+ __asm movq mm6,mm2 \
+ __asm movq mm1,mm4 \
+ __asm pmaxsw mm2,mm3 \
+ __asm pmaxsw mm4,mm5 \
+ __asm paddw mm6,mm3 \
+ __asm paddw mm1,mm5 \
+ __asm movq mm3,[_r7+BUF] \
+}
+
+/*Performs the second part of the final stage of the Hadamard transform and
+ summing of absolute values.*/
+#define OC_HADAMARD_C_ABS_ACCUM_B_8x4(_r6,_r7) __asm{ \
+ __asm paddsw mm6,mm7 \
+ __asm movq mm5,[_r6+BUF] \
+ __asm paddsw mm1,mm7 \
+ __asm psubw mm2,mm6 \
+ __asm psubw mm4,mm1 \
+ /*mm7={1}x4 (needed for the horizontal add that follows) \
+ mm0+=mm2+mm4+max(abs(mm3),abs(mm5))-0x7FFF*/ \
+ __asm movq mm6,mm3 \
+ __asm pmaxsw mm3,mm5 \
+ __asm paddw mm0,mm2 \
+ __asm paddw mm6,mm5 \
+ __asm paddw mm0,mm4 \
+ __asm paddsw mm6,mm7 \
+ __asm paddw mm0,mm3 \
+ __asm psrlw mm7,14 \
+ __asm psubw mm0,mm6 \
+}
+
+/*Performs the last stage of an 8-point 1-D Hadamard transform, takes the
+ absolute value of each component, and accumulates everything into mm0.
+ This is the only portion of SATD which requires MMXEXT (we could use plain
+ MMX, but it takes 4 instructions and an extra register to work around the
+ lack of a pmaxsw, which is a pretty serious penalty).*/
+#define OC_HADAMARD_C_ABS_ACCUM_8x4(_r6,_r7) __asm{ \
+ OC_HADAMARD_C_ABS_ACCUM_A_8x4(_r6,_r7) \
+ OC_HADAMARD_C_ABS_ACCUM_B_8x4(_r6,_r7) \
+}
+
+/*Performs an 8-point 1-D Hadamard transform, takes the absolute value of each
+ component, and accumulates everything into mm0.
+ Note that mm0 will have an extra 4 added to each column, and that after
+ removing this value, the remainder will be half the conventional value.*/
+#define OC_HADAMARD_ABS_ACCUM_8x4(_r6,_r7) __asm{ \
+ OC_HADAMARD_AB_8x4 \
+ OC_HADAMARD_C_ABS_ACCUM_8x4(_r6,_r7) \
+}
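The identity quoted in OC_HADAMARD_C_ABS_ACCUM_A_8x4 is what lets the final
butterfly, the absolute values, and the first step of accumulation collapse
into a pmaxsw. It can be checked in scalar form; a small sketch (not part of
the library, with inputs widened to int so nothing overflows):

#include <stdlib.h>

/*Returns nonzero iff (abs(a+b)+abs(a-b))/2==max(abs(a),abs(b)),
  which holds for all 16-bit a and b.*/
static int maxabs_identity_holds(short _a,short _b){
  int lhs;
  int rhs;
  lhs=(abs(_a+_b)+abs(_a-_b))/2;
  rhs=abs(_a)>abs(_b)?abs(_a):abs(_b);
  return lhs==rhs;
}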
+
+/*Performs two 4x4 transposes (mostly) in place.
+ On input, {mm0,mm1,mm2,mm3} contains rows {e,f,g,h}, and {mm4,mm5,mm6,mm7}
+ contains rows {a,b,c,d}.
+ On output, {0x40,0x50,0x60,0x70}+_off+BUF contains {e,f,g,h}^T, and
+ {mm4,mm5,mm6,mm7} contains the transposed rows {a,b,c,d}^T.*/
+#define OC_TRANSPOSE_4x4x2(_off) __asm{ \
+ /*First 4x4 transpose:*/ \
+ __asm movq [0x10+_off+BUF],mm5 \
+ /*mm0 = e3 e2 e1 e0 \
+ mm1 = f3 f2 f1 f0 \
+ mm2 = g3 g2 g1 g0 \
+ mm3 = h3 h2 h1 h0*/ \
+ __asm movq mm5,mm2 \
+ __asm punpcklwd mm2,mm3 \
+ __asm punpckhwd mm5,mm3 \
+ __asm movq mm3,mm0 \
+ __asm punpcklwd mm0,mm1 \
+ __asm punpckhwd mm3,mm1 \
+ /*mm0 = f1 e1 f0 e0 \
+ mm3 = f3 e3 f2 e2 \
+ mm2 = h1 g1 h0 g0 \
+ mm5 = h3 g3 h2 g2*/ \
+ __asm movq mm1,mm0 \
+ __asm punpckldq mm0,mm2 \
+ __asm punpckhdq mm1,mm2 \
+ __asm movq mm2,mm3 \
+ __asm punpckhdq mm3,mm5 \
+ __asm movq [0x40+_off+BUF],mm0 \
+ __asm punpckldq mm2,mm5 \
+ /*mm0 = h0 g0 f0 e0 \
+ mm1 = h1 g1 f1 e1 \
+ mm2 = h2 g2 f2 e2 \
+ mm3 = h3 g3 f3 e3*/ \
+ __asm movq mm5,[0x10+_off+BUF] \
+ /*Second 4x4 transpose:*/ \
+ /*mm4 = a3 a2 a1 a0 \
+ mm5 = b3 b2 b1 b0 \
+ mm6 = c3 c2 c1 c0 \
+ mm7 = d3 d2 d1 d0*/ \
+ __asm movq mm0,mm6 \
+ __asm punpcklwd mm6,mm7 \
+ __asm movq [0x50+_off+BUF],mm1 \
+ __asm punpckhwd mm0,mm7 \
+ __asm movq mm7,mm4 \
+ __asm punpcklwd mm4,mm5 \
+ __asm movq [0x60+_off+BUF],mm2 \
+ __asm punpckhwd mm7,mm5 \
+ /*mm4 = b1 a1 b0 a0 \
+ mm7 = b3 a3 b2 a2 \
+ mm6 = d1 c1 d0 c0 \
+ mm0 = d3 c3 d2 c2*/ \
+ __asm movq mm5,mm4 \
+ __asm punpckldq mm4,mm6 \
+ __asm movq [0x70+_off+BUF],mm3 \
+ __asm punpckhdq mm5,mm6 \
+ __asm movq mm6,mm7 \
+ __asm punpckhdq mm7,mm0 \
+ __asm punpckldq mm6,mm0 \
+ /*mm4 = d0 c0 b0 a0 \
+ mm5 = d1 c1 b1 a1 \
+ mm6 = d2 c2 b2 a2 \
+ mm7 = d3 c3 b3 a3*/ \
+}
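The punpck sequence above amounts to two ordinary 4x4 transposes of 16-bit
values; a scalar equivalent for one of them (illustrative only, using short in
place of ogg_int16_t):

/*Scalar equivalent of one of the 4x4 word transposes performed above.*/
static void transpose_4x4(short _out[4][4],const short _in[4][4]){
  int i;
  int j;
  for(i=0;i<4;i++)for(j=0;j<4;j++)_out[j][i]=_in[i][j];
}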
+
+static unsigned oc_int_frag_satd_thresh_mmxext(const unsigned char *_src,
+ int _src_ystride,const unsigned char *_ref,int _ref_ystride,unsigned _thresh){
+ OC_ALIGN8(ogg_int16_t buf[64]);
+ ogg_int16_t *bufp;
+ unsigned ret1;
+ unsigned ret2;
+ bufp=buf;
+ __asm{
+#define SRC esi
+#define REF eax
+#define SRC_YSTRIDE ecx
+#define REF_YSTRIDE edx
+#define BUF edi
+#define RET eax
+#define RET2 edx
+ mov SRC,_src
+ mov SRC_YSTRIDE,_src_ystride
+ mov REF,_ref
+ mov REF_YSTRIDE,_ref_ystride
+ mov BUF,bufp
+ OC_LOAD_SUB_8x4(0x00)
+ OC_HADAMARD_8x4
+ OC_TRANSPOSE_4x4x2(0x00)
+ /*Finish swapping out this 8x4 block to make room for the next one.
+ mm0...mm3 have been swapped out already.*/
+ movq [0x00+BUF],mm4
+ movq [0x10+BUF],mm5
+ movq [0x20+BUF],mm6
+ movq [0x30+BUF],mm7
+ OC_LOAD_SUB_8x4(0x04)
+ OC_HADAMARD_8x4
+ OC_TRANSPOSE_4x4x2(0x08)
+ /*Here the first 4x4 block of output from the last transpose is the second
+ 4x4 block of input for the next transform.
+ We have cleverly arranged that it already be in the appropriate place, so
+ we only have to do half the loads.*/
+ movq mm1,[0x10+BUF]
+ movq mm2,[0x20+BUF]
+ movq mm3,[0x30+BUF]
+ movq mm0,[0x00+BUF]
+ OC_HADAMARD_ABS_ACCUM_8x4(0x28,0x38)
+ /*Up to this point, everything fit in 16 bits (8 input + 1 for the
+ difference + 2*3 for the two 8-point 1-D Hadamards - 1 for the abs - 1
+ for the factor of two we dropped + 3 for the vertical accumulation).
+ Now we finally have to promote things to dwords.
+ We break this part out of OC_HADAMARD_ABS_ACCUM_8x4 to hide the long
+ latency of pmaddwd by starting the next series of loads now.*/
+ mov RET2,_thresh
+ pmaddwd mm0,mm7
+ movq mm1,[0x50+BUF]
+ movq mm5,[0x58+BUF]
+ movq mm4,mm0
+ movq mm2,[0x60+BUF]
+ punpckhdq mm0,mm0
+ movq mm6,[0x68+BUF]
+ paddd mm4,mm0
+ movq mm3,[0x70+BUF]
+ movd RET,mm4
+ movq mm7,[0x78+BUF]
+ /*The sums produced by OC_HADAMARD_ABS_ACCUM_8x4 each have an extra 4
+ added to them, and a factor of two removed; correct the final sum here.*/
+ lea RET,[RET+RET-32]
+ movq mm0,[0x40+BUF]
+ cmp RET,RET2
+ movq mm4,[0x48+BUF]
+ jae at_end
+ OC_HADAMARD_ABS_ACCUM_8x4(0x68,0x78)
+ pmaddwd mm0,mm7
+ /*There isn't much to stick in here to hide the latency this time, but the
+ alternative to pmaddwd is movq->punpcklwd->punpckhwd->paddd, whose
+ latency is even worse.*/
+ sub RET,32
+ movq mm4,mm0
+ punpckhdq mm0,mm0
+ paddd mm4,mm0
+ movd RET2,mm4
+ lea RET,[RET+RET2*2]
+ align 16
+at_end:
+ mov ret1,RET
+#undef SRC
+#undef REF
+#undef SRC_YSTRIDE
+#undef REF_YSTRIDE
+#undef BUF
+#undef RET
+#undef RET2
+ }
+ return ret1;
+}
+
+unsigned oc_enc_frag_satd_thresh_mmxext(const unsigned char *_src,
+ const unsigned char *_ref,int _ystride,unsigned _thresh){
+ return oc_int_frag_satd_thresh_mmxext(_src,_ystride,_ref,_ystride,_thresh);
+}
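What the two functions above measure is an 8x8 SATD: the sum of absolute
values of the 2-D Hadamard transform of the difference block. A rough scalar
sketch follows; the MMX path drops a factor of two internally and corrects it
together with the +4-per-column bias (the -32 above), so this sketch matches
only up to that normalization, and the helper names are illustrative:

#include <stdlib.h>

/*Unnormalized 8-point Walsh-Hadamard transform, in place.*/
static void wht8(int _t[8]){
  int i;
  int j;
  int k;
  for(k=1;k<8;k<<=1){
    for(i=0;i<8;i+=k<<1){
      for(j=i;j<i+k;j++){
        int a;
        int b;
        a=_t[j];
        b=_t[j+k];
        _t[j]=a+b;
        _t[j+k]=a-b;
      }
    }
  }
}

/*Scalar sketch of SATD: sum of absolute values of the 2-D Hadamard transform
  of the 8x8 difference block (scaling differs from the MMX path).*/
static unsigned satd_8x8_ref(const unsigned char *_src,
 const unsigned char *_ref,int _ystride){
  int      d[8][8];
  int      col[8];
  unsigned satd;
  int      i;
  int      j;
  for(i=0;i<8;i++)for(j=0;j<8;j++){
    d[i][j]=_src[i*_ystride+j]-_ref[i*_ystride+j];
  }
  for(i=0;i<8;i++)wht8(d[i]);
  satd=0;
  for(j=0;j<8;j++){
    for(i=0;i<8;i++)col[i]=d[i][j];
    wht8(col);
    for(i=0;i<8;i++)satd+=abs(col[i]);
  }
  return satd;
}

The early-out on _thresh in the MMX version simply skips the second half of
the accumulation (the cmp/jae above) once the first half already exceeds the
threshold.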
+
+
+/*Our internal implementation of frag_copy2 takes an extra stride parameter so
+ we can share code with oc_enc_frag_satd2_thresh_mmxext().*/
+static void oc_int_frag_copy2_mmxext(unsigned char *_dst,int _dst_ystride,
+ const unsigned char *_src1,const unsigned char *_src2,int _src_ystride){
+ __asm{
+ /*Load the first 3 rows.*/
+#define DST_YSTRIDE edi
+#define SRC_YSTRIDE esi
+#define DST eax
+#define SRC1 edx
+#define SRC2 ecx
+ mov DST_YSTRIDE,_dst_ystride
+ mov SRC_YSTRIDE,_src_ystride
+ mov DST,_dst
+ mov SRC1,_src1
+ mov SRC2,_src2
+ movq mm0,[SRC1]
+ movq mm1,[SRC2]
+ movq mm2,[SRC1+SRC_YSTRIDE]
+ lea SRC1,[SRC1+SRC_YSTRIDE*2]
+ movq mm3,[SRC2+SRC_YSTRIDE]
+ lea SRC2,[SRC2+SRC_YSTRIDE*2]
+ pxor mm7,mm7
+ movq mm4,[SRC1]
+ pcmpeqb mm6,mm6
+ movq mm5,[SRC2]
+ /*mm7={1}x8.*/
+ psubb mm7,mm6
+ /*Start averaging mm0 and mm1 into mm6.*/
+ movq mm6,mm0
+ pxor mm0,mm1
+ pavgb mm6,mm1
+ /*mm1 is free, start averaging mm3 into mm2 using mm1.*/
+ movq mm1,mm2
+ pand mm0,mm7
+ pavgb mm2,mm3
+ pxor mm1,mm3
+ /*mm3 is free.*/
+ psubb mm6,mm0
+ /*mm0 is free, start loading the next row.*/
+ movq mm0,[SRC1+SRC_YSTRIDE]
+ /*Start averaging mm5 and mm4 using mm3.*/
+ movq mm3,mm4
+ /*mm6 [row 0] is done; write it out.*/
+ movq [DST],mm6
+ pand mm1,mm7
+ pavgb mm4,mm5
+ psubb mm2,mm1
+ /*mm1 is free, continue loading the next row.*/
+ movq mm1,[SRC2+SRC_YSTRIDE]
+ pxor mm3,mm5
+ lea SRC1,[SRC1+SRC_YSTRIDE*2]
+ /*mm2 [row 1] is done; write it out.*/
+ movq [DST+DST_YSTRIDE],mm2
+ pand mm3,mm7
+ /*Start loading the next row.*/
+ movq mm2,[SRC1]
+ lea DST,[DST+DST_YSTRIDE*2]
+ psubb mm4,mm3
+ lea SRC2,[SRC2+SRC_YSTRIDE*2]
+ /*mm4 [row 2] is done; write it out.*/
+ movq [DST],mm4
+ /*Continue loading the next row.*/
+ movq mm3,[SRC2]
+ /*Start averaging mm0 and mm1 into mm6.*/
+ movq mm6,mm0
+ pxor mm0,mm1
+ /*Start loading the next row.*/
+ movq mm4,[SRC1+SRC_YSTRIDE]
+ pavgb mm6,mm1
+ /*mm1 is free; start averaging mm3 into mm2 using mm1.*/
+ movq mm1,mm2
+ pand mm0,mm7
+ /*Continue loading the next row.*/
+ movq mm5,[SRC2+SRC_YSTRIDE]
+ pavgb mm2,mm3
+ lea SRC1,[SRC1+SRC_YSTRIDE*2]
+ pxor mm1,mm3
+ /*mm3 is free.*/
+ psubb mm6,mm0
+ /*mm0 is free, start loading the next row.*/
+ movq mm0,[SRC1]
+ /*Start averaging mm5 into mm4 using mm3.*/
+ movq mm3,mm4
+ /*mm6 [row 3] is done; write it out.*/
+ movq [DST+DST_YSTRIDE],mm6
+ pand mm1,mm7
+ lea SRC2,[SRC2+SRC_YSTRIDE*2]
+ pavgb mm4,mm5
+ lea DST,[DST+DST_YSTRIDE*2]
+ psubb mm2,mm1
+ /*mm1 is free; continue loading the next row.*/
+ movq mm1,[SRC2]
+ pxor mm3,mm5
+ /*mm2 [row 4] is done; write it out.*/
+ movq [DST],mm2
+ pand mm3,mm7
+ /*Start loading the next row.*/
+ movq mm2,[SRC1+SRC_YSTRIDE]
+ psubb mm4,mm3
+ /*Start averaging mm0 and mm1 into mm6.*/
+ movq mm6,mm0
+ /*Continue loading the next row.*/
+ movq mm3,[SRC2+SRC_YSTRIDE]
+ /*mm4 [row 5] is done; write it out.*/
+ movq [DST+DST_YSTRIDE],mm4
+ pxor mm0,mm1
+ pavgb mm6,mm1
+ /*mm4 is free; start averaging mm3 into mm2 using mm4.*/
+ movq mm4,mm2
+ pand mm0,mm7
+ pavgb mm2,mm3
+ pxor mm4,mm3
+ lea DST,[DST+DST_YSTRIDE*2]
+ psubb mm6,mm0
+ pand mm4,mm7
+ /*mm6 [row 6] is done, write it out.*/
+ movq [DST],mm6
+ psubb mm2,mm4
+ /*mm2 [row 7] is done, write it out.*/
+ movq [DST+DST_YSTRIDE],mm2
+#undef SRC1
+#undef SRC2
+#undef SRC_YSTRIDE
+#undef DST_YSTRIDE
+#undef DST
+ }
+}
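In scalar terms, the routine above writes the truncating per-pixel average of
the two reference blocks, using the same pavgb correction shown earlier for
OC_SAD2_LOOP. A minimal sketch (illustrative name, not part of the library):

/*Scalar equivalent of the 8x8 two-reference copy above: truncating average.*/
static void frag_copy2_ref(unsigned char *_dst,int _dst_ystride,
 const unsigned char *_src1,const unsigned char *_src2,int _src_ystride){
  int i;
  int j;
  for(i=0;i<8;i++){
    for(j=0;j<8;j++)_dst[j]=(unsigned char)((_src1[j]+_src2[j])>>1);
    _dst+=_dst_ystride;
    _src1+=_src_ystride;
    _src2+=_src_ystride;
  }
}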
+
+unsigned oc_enc_frag_satd2_thresh_mmxext(const unsigned char *_src,
+ const unsigned char *_ref1,const unsigned char *_ref2,int _ystride,
+ unsigned _thresh){
+ OC_ALIGN8(unsigned char ref[64]);
+ oc_int_frag_copy2_mmxext(ref,8,_ref1,_ref2,_ystride);
+ return oc_int_frag_satd_thresh_mmxext(_src,_ystride,ref,8,_thresh);
+}
+
+unsigned oc_enc_frag_intra_satd_mmxext(const unsigned char *_src,
+ int _ystride){
+ OC_ALIGN8(ogg_int16_t buf[64]);
+ ogg_int16_t *bufp;
+ unsigned ret1;
+ unsigned ret2;
+ bufp=buf;
+ __asm{
+#define SRC eax
+#define SRC4 esi
+#define BUF edi
+#define RET eax
+#define RET_WORD ax
+#define RET2 ecx
+#define YSTRIDE edx
+#define YSTRIDE3 ecx
+ mov SRC,_src
+ mov BUF,bufp
+ mov YSTRIDE,_ystride
+ /* src4 = src+4*ystride */
+ lea SRC4,[SRC+YSTRIDE*4]
+ /* ystride3 = 3*ystride */
+ lea YSTRIDE3,[YSTRIDE+YSTRIDE*2]
+ OC_LOAD_8x4(0x00)
+ OC_HADAMARD_8x4
+ OC_TRANSPOSE_4x4x2(0x00)
+ /*Finish swapping out this 8x4 block to make room for the next one.
+ mm0...mm3 have been swapped out already.*/
+ movq [0x00+BUF],mm4
+ movq [0x10+BUF],mm5
+ movq [0x20+BUF],mm6
+ movq [0x30+BUF],mm7
+ OC_LOAD_8x4(0x04)
+ OC_HADAMARD_8x4
+ OC_TRANSPOSE_4x4x2(0x08)
+ /*Here the first 4x4 block of output from the last transpose is the second
+ 4x4 block of input for the next transform.
+ We have cleverly arranged that it already be in the appropriate place, so
+ we only have to do half the loads.*/
+ movq mm1,[0x10+BUF]
+ movq mm2,[0x20+BUF]
+ movq mm3,[0x30+BUF]
+ movq mm0,[0x00+BUF]
+ /*We split out the stages here so we can save the DC coefficient in the
+ middle.*/
+ OC_HADAMARD_AB_8x4
+ OC_HADAMARD_C_ABS_ACCUM_A_8x4(0x28,0x38)
+ movd RET,mm1
+ OC_HADAMARD_C_ABS_ACCUM_B_8x4(0x28,0x38)
+ /*Up to this point, everything fit in 16 bits (8 input + 1 for the
+ difference + 2*3 for the two 8-point 1-D Hadamards - 1 for the abs - 1
+ for the factor of two we dropped + 3 for the vertical accumulation).
+ Now we finally have to promote things to dwords.
+ We break this part out of OC_HADAMARD_ABS_ACCUM_8x4 to hide the long
+ latency of pmaddwd by starting the next series of loads now.*/
+ pmaddwd mm0,mm7
+ movq mm1,[0x50+BUF]
+ movq mm5,[0x58+BUF]
+ movq mm2,[0x60+BUF]
+ movq mm4,mm0
+ movq mm6,[0x68+BUF]
+ punpckhdq mm0,mm0
+ movq mm3,[0x70+BUF]
+ paddd mm4,mm0
+ movq mm7,[0x78+BUF]
+ movd RET2,mm4
+ movq mm0,[0x40+BUF]
+ movq mm4,[0x48+BUF]
+ OC_HADAMARD_ABS_ACCUM_8x4(0x68,0x78)
+ pmaddwd mm0,mm7
+ /*We assume that the DC coefficient is always positive (which is true,
+ because the input to the INTRA transform was not a difference).*/
+ movzx RET,RET_WORD
+ add RET2,RET2
+ sub RET2,RET
+ movq mm4,mm0
+ punpckhdq mm0,mm0
+ paddd mm4,mm0
+ movd RET,mm4
+ lea RET,[-64+RET2+RET*2]
+ mov [ret1],RET
+#undef SRC
+#undef SRC4
+#undef BUF
+#undef RET
+#undef RET_WORD
+#undef RET2
+#undef YSTRIDE
+#undef YSTRIDE3
+ }
+ return ret1;
+}
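The intra variant transforms the source block itself, so its DC coefficient is
just the (scaled) sum of the pixels; the movzx/sub sequence above removes that
term from the total before the final bias correction. A scalar sketch of the
idea, reusing the hypothetical wht8() helper and <stdlib.h> abs() from the
SATD sketch above (scaling again differs from the MMX path):

/*Scalar sketch of intra SATD: sum of absolute 2-D Hadamard coefficients of
  the source block itself, with the DC coefficient excluded.*/
static unsigned intra_satd_8x8_ref(const unsigned char *_src,int _ystride){
  int      d[8][8];
  int      col[8];
  unsigned satd;
  int      dc;
  int      i;
  int      j;
  for(i=0;i<8;i++)for(j=0;j<8;j++)d[i][j]=_src[i*_ystride+j];
  for(i=0;i<8;i++)wht8(d[i]);
  satd=0;
  dc=0;
  for(j=0;j<8;j++){
    for(i=0;i<8;i++)col[i]=d[i][j];
    wht8(col);
    /*Column 0, row 0 of the 2-D transform is the DC term.*/
    if(j==0)dc=col[0];
    for(i=0;i<8;i++)satd+=abs(col[i]);
  }
  return satd-(unsigned)dc;
}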
+
+void oc_enc_frag_sub_mmx(ogg_int16_t _residue[64],
+ const unsigned char *_src, const unsigned char *_ref,int _ystride){
+ int i;
+ __asm pxor mm7,mm7
+ for(i=4;i-->0;){
+ __asm{
+#define SRC edx
+#define YSTRIDE esi
+#define RESIDUE eax
+#define REF ecx
+ mov YSTRIDE,_ystride
+ mov RESIDUE,_residue
+ mov SRC,_src
+ mov REF,_ref
+ /*mm0=[src]*/
+ movq mm0,[SRC]
+ /*mm1=[ref]*/
+ movq mm1,[REF]
+ /*mm4=[src+ystride]*/
+ movq mm4,[SRC+YSTRIDE]
+ /*mm5=[ref+ystride]*/
+ movq mm5,[REF+YSTRIDE]
+ /*Compute [src]-[ref].*/
+ movq mm2,mm0
+ punpcklbw mm0,mm7
+ movq mm3,mm1
+ punpckhbw mm2,mm7
+ punpcklbw mm1,mm7
+ punpckhbw mm3,mm7
+ psubw mm0,mm1
+ psubw mm2,mm3
+ /*Compute [src+ystride]-[ref+ystride].*/
+ movq mm1,mm4
+ punpcklbw mm4,mm7
+ movq mm3,mm5
+ punpckhbw mm1,mm7
+ lea SRC,[SRC+YSTRIDE*2]
+ punpcklbw mm5,mm7
+ lea REF,[REF+YSTRIDE*2]
+ punpckhbw mm3,mm7
+ psubw mm4,mm5
+ psubw mm1,mm3
+ /*Write the answer out.*/
+ movq [RESIDUE+0x00],mm0
+ movq [RESIDUE+0x08],mm2
+ movq [RESIDUE+0x10],mm4
+ movq [RESIDUE+0x18],mm1
+ lea RESIDUE,[RESIDUE+0x20]
+ mov _residue,RESIDUE
+ mov _src,SRC
+ mov _ref,REF
+#undef SRC
+#undef YSTRIDE
+#undef RESIDUE
+#undef REF
+ }
+ }
+}
+
+void oc_enc_frag_sub_128_mmx(ogg_int16_t _residue[64],
+ const unsigned char *_src,int _ystride){
+ __asm{
+#define YSTRIDE edx
+#define YSTRIDE3 edi
+#define RESIDUE ecx
+#define SRC eax
+ mov YSTRIDE,_ystride
+ mov RESIDUE,_residue
+ mov SRC,_src
+ /*mm0=[src]*/
+ movq mm0,[SRC]
+ /*mm1=[src+ystride]*/
+ movq mm1,[SRC+YSTRIDE]
+ /*mm6={-1}x4*/
+ pcmpeqw mm6,mm6
+ /*mm2=[src+2*ystride]*/
+ movq mm2,[SRC+YSTRIDE*2]
+ /*[ystride3]=3*[ystride]*/
+ lea YSTRIDE3,[YSTRIDE+YSTRIDE*2]
+ /*mm6={1}x4*/
+ psllw mm6,15
+ /*mm3=[src+3*ystride]*/
+ movq mm3,[SRC+YSTRIDE3]
+ /*mm6={128}x4*/
+ psrlw mm6,8
+ /*mm7=0*/
+ pxor mm7,mm7
+ /*[src]=[src]+4*[ystride]*/
+ lea SRC,[SRC+YSTRIDE*4]
+ /*Compute [src]-128 and [src+ystride]-128*/
+ movq mm4,mm0
+ punpcklbw mm0,mm7
+ movq mm5,mm1
+ punpckhbw mm4,mm7
+ psubw mm0,mm6
+ punpcklbw mm1,mm7
+ psubw mm4,mm6
+ punpckhbw mm5,mm7
+ psubw mm1,mm6
+ psubw mm5,mm6
+ /*Write the answer out.*/
+ movq [RESIDUE+0x00],mm0
+ movq [RESIDUE+0x08],mm4
+ movq [RESIDUE+0x10],mm1
+ movq [RESIDUE+0x18],mm5
+ /*mm0=[src+4*ystride]*/
+ movq mm0,[SRC]
+ /*mm1=[src+5*ystride]*/
+ movq mm1,[SRC+YSTRIDE]
+ /*Compute [src+2*ystride]-128 and [src+3*ystride]-128*/
+ movq mm4,mm2
+ punpcklbw mm2,mm7
+ movq mm5,mm3
+ punpckhbw mm4,mm7
+ psubw mm2,mm6
+ punpcklbw mm3,mm7
+ psubw mm4,mm6
+ punpckhbw mm5,mm7
+ psubw mm3,mm6
+ psubw mm5,mm6
+ /*Write the answer out.*/
+ movq [RESIDUE+0x20],mm2
+ movq [RESIDUE+0x28],mm4
+ movq [RESIDUE+0x30],mm3
+ movq [RESIDUE+0x38],mm5
+    /*Load [src+6*ystride] and [src+7*ystride];
+      compute [src+4*ystride]-128 and [src+5*ystride]-128*/
+ movq mm2,[SRC+YSTRIDE*2]
+ movq mm3,[SRC+YSTRIDE3]
+ movq mm4,mm0
+ punpcklbw mm0,mm7
+ movq mm5,mm1
+ punpckhbw mm4,mm7
+ psubw mm0,mm6
+ punpcklbw mm1,mm7
+ psubw mm4,mm6
+ punpckhbw mm5,mm7
+ psubw mm1,mm6
+ psubw mm5,mm6
+ /*Write the answer out.*/
+ movq [RESIDUE+0x40],mm0
+ movq [RESIDUE+0x48],mm4
+ movq [RESIDUE+0x50],mm1
+ movq [RESIDUE+0x58],mm5
+ /*Compute [src+6*ystride]-128 and [src+7*ystride]-128*/
+ movq mm4,mm2
+ punpcklbw mm2,mm7
+ movq mm5,mm3
+ punpckhbw mm4,mm7
+ psubw mm2,mm6
+ punpcklbw mm3,mm7
+ psubw mm4,mm6
+ punpckhbw mm5,mm7
+ psubw mm3,mm6
+ psubw mm5,mm6
+ /*Write the answer out.*/
+ movq [RESIDUE+0x60],mm2
+ movq [RESIDUE+0x68],mm4
+ movq [RESIDUE+0x70],mm3
+ movq [RESIDUE+0x78],mm5
+#undef YSTRIDE
+#undef YSTRIDE3
+#undef RESIDUE
+#undef SRC
+ }
+}
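Both residue routines above have the same trivial scalar form: the inter
version subtracts a reference block, the intra version subtracts the constant
128 bias. A minimal sketch (illustrative names; ogg_int16_t comes from the
libogg headers pulled in via x86enc.h):

/*Scalar equivalents of the two residue routines above.*/
static void frag_sub_ref(ogg_int16_t _residue[64],
 const unsigned char *_src,const unsigned char *_ref,int _ystride){
  int i;
  int j;
  for(i=0;i<8;i++){
    for(j=0;j<8;j++)_residue[i*8+j]=(ogg_int16_t)(_src[j]-_ref[j]);
    _src+=_ystride;
    _ref+=_ystride;
  }
}

static void frag_sub_128_ref(ogg_int16_t _residue[64],
 const unsigned char *_src,int _ystride){
  int i;
  int j;
  for(i=0;i<8;i++){
    for(j=0;j<8;j++)_residue[i*8+j]=(ogg_int16_t)(_src[j]-128);
    _src+=_ystride;
  }
}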
+
+void oc_enc_frag_copy2_mmxext(unsigned char *_dst,
+ const unsigned char *_src1,const unsigned char *_src2,int _ystride){
+ oc_int_frag_copy2_mmxext(_dst,_ystride,_src1,_src2,_ystride);
+}
+
+#endif
diff --git a/thirdparty/libtheora/x86_vc/mmxfdct.c b/thirdparty/libtheora/x86_vc/mmxfdct.c
index dcf17c9fa7..d908ce2413 100644
--- a/thirdparty/libtheora/x86_vc/mmxfdct.c
+++ b/thirdparty/libtheora/x86_vc/mmxfdct.c
@@ -1,670 +1,670 @@
-/********************************************************************
- * *
- * THIS FILE IS PART OF THE OggTheora SOFTWARE CODEC SOURCE CODE. *
- * USE, DISTRIBUTION AND REPRODUCTION OF THIS LIBRARY SOURCE IS *
- * GOVERNED BY A BSD-STYLE SOURCE LICENSE INCLUDED WITH THIS SOURCE *
- * IN 'COPYING'. PLEASE READ THESE TERMS BEFORE DISTRIBUTING. *
- * *
- * THE Theora SOURCE CODE IS COPYRIGHT (C) 1999-2006 *
- * by the Xiph.Org Foundation http://www.xiph.org/ *
- * *
- ********************************************************************/
- /*MMX fDCT implementation for x86_32*/
-/*$Id: fdct_ses2.c 14579 2008-03-12 06:42:40Z xiphmont $*/
-#include "x86enc.h"
-
-#if defined(OC_X86_ASM)
-
-#define OC_FDCT_STAGE1_8x4 __asm{ \
- /*Stage 1:*/ \
- /*mm0=t7'=t0-t7*/ \
- __asm psubw mm0,mm7 \
- __asm paddw mm7,mm7 \
- /*mm1=t6'=t1-t6*/ \
- __asm psubw mm1, mm6 \
- __asm paddw mm6,mm6 \
- /*mm2=t5'=t2-t5*/ \
- __asm psubw mm2,mm5 \
- __asm paddw mm5,mm5 \
- /*mm3=t4'=t3-t4*/ \
- __asm psubw mm3,mm4 \
- __asm paddw mm4,mm4 \
- /*mm7=t0'=t0+t7*/ \
- __asm paddw mm7,mm0 \
- /*mm6=t1'=t1+t6*/ \
- __asm paddw mm6,mm1 \
- /*mm5=t2'=t2+t5*/ \
- __asm paddw mm5,mm2 \
- /*mm4=t3'=t3+t4*/ \
- __asm paddw mm4,mm3\
-}
-
-#define OC_FDCT8x4(_r0,_r1,_r2,_r3,_r4,_r5,_r6,_r7) __asm{ \
- /*Stage 2:*/ \
- /*mm7=t3''=t0'-t3'*/ \
- __asm psubw mm7,mm4 \
- __asm paddw mm4,mm4 \
- /*mm6=t2''=t1'-t2'*/ \
- __asm psubw mm6,mm5 \
- __asm movq [Y+_r6],mm7 \
- __asm paddw mm5,mm5 \
- /*mm1=t5''=t6'-t5'*/ \
- __asm psubw mm1,mm2 \
- __asm movq [Y+_r2],mm6 \
- /*mm4=t0''=t0'+t3'*/ \
- __asm paddw mm4,mm7 \
- __asm paddw mm2,mm2 \
- /*mm5=t1''=t1'+t2'*/ \
- __asm movq [Y+_r0],mm4 \
- __asm paddw mm5,mm6 \
- /*mm2=t6''=t6'+t5'*/ \
- __asm paddw mm2,mm1 \
- __asm movq [Y+_r4],mm5 \
- /*mm0=t7', mm1=t5'', mm2=t6'', mm3=t4'.*/ \
- /*mm4, mm5, mm6, mm7 are free.*/ \
- /*Stage 3:*/ \
- /*mm6={2}x4, mm7={27146,0xB500>>1}x2*/ \
- __asm mov A,0x5A806A0A \
- __asm pcmpeqb mm6,mm6 \
- __asm movd mm7,A \
- __asm psrlw mm6,15 \
- __asm punpckldq mm7,mm7 \
- __asm paddw mm6,mm6 \
-  /*mm0=0, mm2={-1}x4 \
- mm5:mm4=t5''*27146+0xB500*/ \
- __asm movq mm4,mm1 \
- __asm movq mm5,mm1 \
- __asm punpcklwd mm4,mm6 \
- __asm movq [Y+_r3],mm2 \
- __asm pmaddwd mm4,mm7 \
- __asm movq [Y+_r7],mm0 \
- __asm punpckhwd mm5,mm6 \
- __asm pxor mm0,mm0 \
- __asm pmaddwd mm5,mm7 \
- __asm pcmpeqb mm2,mm2 \
- /*mm2=t6'', mm1=t5''+(t5''!=0) \
- mm4=(t5''*27146+0xB500>>16)*/ \
- __asm pcmpeqw mm0,mm1 \
- __asm psrad mm4,16 \
- __asm psubw mm0,mm2 \
- __asm movq mm2, [Y+_r3] \
- __asm psrad mm5,16 \
- __asm paddw mm1,mm0 \
- __asm packssdw mm4,mm5 \
- /*mm4=s=(t5''*27146+0xB500>>16)+t5''+(t5''!=0)>>1*/ \
- __asm paddw mm4,mm1 \
- __asm movq mm0, [Y+_r7] \
- __asm psraw mm4,1 \
- __asm movq mm1,mm3 \
- /*mm3=t4''=t4'+s*/ \
- __asm paddw mm3,mm4 \
- /*mm1=t5'''=t4'-s*/ \
- __asm psubw mm1,mm4 \
- /*mm1=0, mm3={-1}x4 \
- mm5:mm4=t6''*27146+0xB500*/ \
- __asm movq mm4,mm2 \
- __asm movq mm5,mm2 \
- __asm punpcklwd mm4,mm6 \
- __asm movq [Y+_r5],mm1 \
- __asm pmaddwd mm4,mm7 \
- __asm movq [Y+_r1],mm3 \
- __asm punpckhwd mm5,mm6 \
- __asm pxor mm1,mm1 \
- __asm pmaddwd mm5,mm7 \
- __asm pcmpeqb mm3,mm3 \
- /*mm2=t6''+(t6''!=0), mm4=(t6''*27146+0xB500>>16)*/ \
- __asm psrad mm4,16 \
- __asm pcmpeqw mm1,mm2 \
- __asm psrad mm5,16 \
- __asm psubw mm1,mm3 \
- __asm packssdw mm4,mm5 \
- __asm paddw mm2,mm1 \
- /*mm1=t1'' \
- mm4=s=(t6''*27146+0xB500>>16)+t6''+(t6''!=0)>>1*/ \
- __asm paddw mm4,mm2 \
- __asm movq mm1,[Y+_r4] \
- __asm psraw mm4,1 \
- __asm movq mm2,mm0 \
- /*mm7={54491-0x7FFF,0x7FFF}x2 \
- mm0=t7''=t7'+s*/ \
- __asm paddw mm0,mm4 \
- /*mm2=t6'''=t7'-s*/ \
- __asm psubw mm2,mm4 \
- /*Stage 4:*/ \
- /*mm0=0, mm2=t0'' \
- mm5:mm4=t1''*27146+0xB500*/ \
- __asm movq mm4,mm1 \
- __asm movq mm5,mm1 \
- __asm punpcklwd mm4,mm6 \
- __asm movq [Y+_r3],mm2 \
- __asm pmaddwd mm4,mm7 \
- __asm movq mm2,[Y+_r0] \
- __asm punpckhwd mm5,mm6 \
- __asm movq [Y+_r7],mm0 \
- __asm pmaddwd mm5,mm7 \
- __asm pxor mm0,mm0 \
- /*mm7={27146,0x4000>>1}x2 \
- mm0=s=(t1''*27146+0xB500>>16)+t1''+(t1''!=0)*/ \
- __asm psrad mm4,16 \
- __asm mov A,0x20006A0A \
- __asm pcmpeqw mm0,mm1 \
- __asm movd mm7,A \
- __asm psrad mm5,16 \
- __asm psubw mm0,mm3 \
- __asm packssdw mm4,mm5 \
- __asm paddw mm0,mm1 \
- __asm punpckldq mm7,mm7 \
- __asm paddw mm0,mm4 \
- /*mm6={0x00000E3D}x2 \
- mm1=-(t0''==0), mm5:mm4=t0''*27146+0x4000*/ \
- __asm movq mm4,mm2 \
- __asm movq mm5,mm2 \
- __asm punpcklwd mm4,mm6 \
- __asm mov A,0x0E3D \
- __asm pmaddwd mm4,mm7 \
- __asm punpckhwd mm5,mm6 \
- __asm movd mm6,A \
- __asm pmaddwd mm5,mm7 \
- __asm pxor mm1,mm1 \
- __asm punpckldq mm6,mm6 \
- __asm pcmpeqw mm1,mm2 \
- /*mm4=r=(t0''*27146+0x4000>>16)+t0''+(t0''!=0)*/ \
- __asm psrad mm4,16 \
- __asm psubw mm1,mm3 \
- __asm psrad mm5,16 \
- __asm paddw mm2,mm1 \
- __asm packssdw mm4,mm5 \
- __asm movq mm1,[Y+_r5] \
- __asm paddw mm4,mm2 \
- /*mm2=t6'', mm0=_y[0]=u=r+s>>1 \
- The naive implementation could cause overflow, so we use \
- u=(r&s)+((r^s)>>1).*/ \
- __asm movq mm2,[Y+_r3] \
- __asm movq mm7,mm0 \
- __asm pxor mm0,mm4 \
- __asm pand mm7,mm4 \
- __asm psraw mm0,1 \
- __asm mov A,0x7FFF54DC \
- __asm paddw mm0,mm7 \
- __asm movd mm7,A \
- /*mm7={54491-0x7FFF,0x7FFF}x2 \
- mm4=_y[4]=v=r-u*/ \
- __asm psubw mm4,mm0 \
- __asm punpckldq mm7,mm7 \
- __asm movq [Y+_r4],mm4 \
- /*mm0=0, mm7={36410}x4 \
- mm1=(t5'''!=0), mm5:mm4=54491*t5'''+0x0E3D*/ \
- __asm movq mm4,mm1 \
- __asm movq mm5,mm1 \
- __asm punpcklwd mm4,mm1 \
- __asm mov A,0x8E3A8E3A \
- __asm pmaddwd mm4,mm7 \
- __asm movq [Y+_r0],mm0 \
- __asm punpckhwd mm5,mm1 \
- __asm pxor mm0,mm0 \
- __asm pmaddwd mm5,mm7 \
- __asm pcmpeqw mm1,mm0 \
- __asm movd mm7,A \
- __asm psubw mm1,mm3 \
- __asm punpckldq mm7,mm7 \
- __asm paddd mm4,mm6 \
- __asm paddd mm5,mm6 \
- /*mm0=0 \
- mm3:mm1=36410*t6'''+((t5'''!=0)<<16)*/ \
- __asm movq mm6,mm2 \
- __asm movq mm3,mm2 \
- __asm pmulhw mm6,mm7 \
- __asm paddw mm1,mm2 \
- __asm pmullw mm3,mm7 \
- __asm pxor mm0,mm0 \
- __asm paddw mm6,mm1 \
- __asm movq mm1,mm3 \
- __asm punpckhwd mm3,mm6 \
- __asm punpcklwd mm1,mm6 \
- /*mm3={-1}x4, mm6={1}x4 \
- mm4=_y[5]=u=(54491*t5'''+36410*t6'''+0x0E3D>>16)+(t5'''!=0)*/ \
- __asm paddd mm5,mm3 \
- __asm paddd mm4,mm1 \
- __asm psrad mm5,16 \
- __asm pxor mm6,mm6 \
- __asm psrad mm4,16 \
- __asm pcmpeqb mm3,mm3 \
- __asm packssdw mm4,mm5 \
- __asm psubw mm6,mm3 \
- /*mm1=t7'', mm7={26568,0x3400}x2 \
- mm2=s=t6'''-(36410*u>>16)*/ \
- __asm movq mm1,mm4 \
- __asm mov A,0x340067C8 \
- __asm pmulhw mm4,mm7 \
- __asm movd mm7,A \
- __asm movq [Y+_r5],mm1 \
- __asm punpckldq mm7,mm7 \
- __asm paddw mm4,mm1 \
- __asm movq mm1,[Y+_r7] \
- __asm psubw mm2,mm4 \
- /*mm6={0x00007B1B}x2 \
- mm0=(s!=0), mm5:mm4=s*26568+0x3400*/ \
- __asm movq mm4,mm2 \
- __asm movq mm5,mm2 \
- __asm punpcklwd mm4,mm6 \
- __asm pcmpeqw mm0,mm2 \
- __asm pmaddwd mm4,mm7 \
- __asm mov A,0x7B1B \
- __asm punpckhwd mm5,mm6 \
- __asm movd mm6,A \
- __asm pmaddwd mm5,mm7 \
- __asm psubw mm0,mm3 \
- __asm punpckldq mm6,mm6 \
- /*mm7={64277-0x7FFF,0x7FFF}x2 \
- mm2=_y[3]=v=(s*26568+0x3400>>17)+s+(s!=0)*/ \
- __asm psrad mm4,17 \
- __asm paddw mm2,mm0 \
- __asm psrad mm5,17 \
- __asm mov A,0x7FFF7B16 \
- __asm packssdw mm4,mm5 \
- __asm movd mm7,A \
- __asm paddw mm2,mm4 \
- __asm punpckldq mm7,mm7 \
- /*mm0=0, mm7={12785}x4 \
- mm1=(t7''!=0), mm2=t4'', mm5:mm4=64277*t7''+0x7B1B*/ \
- __asm movq mm4,mm1 \
- __asm movq mm5,mm1 \
- __asm movq [Y+_r3],mm2 \
- __asm punpcklwd mm4,mm1 \
- __asm movq mm2,[Y+_r1] \
- __asm pmaddwd mm4,mm7 \
- __asm mov A,0x31F131F1 \
- __asm punpckhwd mm5,mm1 \
- __asm pxor mm0,mm0 \
- __asm pmaddwd mm5,mm7 \
- __asm pcmpeqw mm1,mm0 \
- __asm movd mm7,A \
- __asm psubw mm1,mm3 \
- __asm punpckldq mm7,mm7 \
- __asm paddd mm4,mm6 \
- __asm paddd mm5,mm6 \
- /*mm3:mm1=12785*t4'''+((t7''!=0)<<16)*/ \
- __asm movq mm6,mm2 \
- __asm movq mm3,mm2 \
- __asm pmulhw mm6,mm7 \
- __asm pmullw mm3,mm7 \
- __asm paddw mm6,mm1 \
- __asm movq mm1,mm3 \
- __asm punpckhwd mm3,mm6 \
- __asm punpcklwd mm1,mm6 \
- /*mm3={-1}x4, mm6={1}x4 \
- mm4=_y[1]=u=(12785*t4'''+64277*t7''+0x7B1B>>16)+(t7''!=0)*/ \
- __asm paddd mm5,mm3 \
- __asm paddd mm4,mm1 \
- __asm psrad mm5,16 \
- __asm pxor mm6,mm6 \
- __asm psrad mm4,16 \
- __asm pcmpeqb mm3,mm3 \
- __asm packssdw mm4,mm5 \
- __asm psubw mm6,mm3 \
- /*mm1=t3'', mm7={20539,0x3000}x2 \
- mm4=s=(12785*u>>16)-t4''*/ \
- __asm movq [Y+_r1],mm4 \
- __asm pmulhw mm4,mm7 \
- __asm mov A,0x3000503B \
- __asm movq mm1,[Y+_r6] \
- __asm movd mm7,A \
- __asm psubw mm4,mm2 \
- __asm punpckldq mm7,mm7 \
- /*mm6={0x00006CB7}x2 \
- mm0=(s!=0), mm5:mm4=s*20539+0x3000*/ \
- __asm movq mm5,mm4 \
- __asm movq mm2,mm4 \
- __asm punpcklwd mm4,mm6 \
- __asm pcmpeqw mm0,mm2 \
- __asm pmaddwd mm4,mm7 \
- __asm mov A,0x6CB7 \
- __asm punpckhwd mm5,mm6 \
- __asm movd mm6,A \
- __asm pmaddwd mm5,mm7 \
- __asm psubw mm0,mm3 \
- __asm punpckldq mm6,mm6 \
- /*mm7={60547-0x7FFF,0x7FFF}x2 \
- mm2=_y[7]=v=(s*20539+0x3000>>20)+s+(s!=0)*/ \
- __asm psrad mm4,20 \
- __asm paddw mm2,mm0 \
- __asm psrad mm5,20 \
- __asm mov A,0x7FFF6C84 \
- __asm packssdw mm4,mm5 \
- __asm movd mm7,A \
- __asm paddw mm2,mm4 \
- __asm punpckldq mm7,mm7 \
- /*mm0=0, mm7={25080}x4 \
- mm2=t2'', mm5:mm4=60547*t3''+0x6CB7*/ \
- __asm movq mm4,mm1 \
- __asm movq mm5,mm1 \
- __asm movq [Y+_r7],mm2 \
- __asm punpcklwd mm4,mm1 \
- __asm movq mm2,[Y+_r2] \
- __asm pmaddwd mm4,mm7 \
- __asm mov A,0x61F861F8 \
- __asm punpckhwd mm5,mm1 \
- __asm pxor mm0,mm0 \
- __asm pmaddwd mm5,mm7 \
- __asm movd mm7,A \
- __asm pcmpeqw mm1,mm0 \
- __asm psubw mm1,mm3 \
- __asm punpckldq mm7,mm7 \
- __asm paddd mm4,mm6 \
- __asm paddd mm5,mm6 \
- /*mm3:mm1=25080*t2''+((t3''!=0)<<16)*/ \
- __asm movq mm6,mm2 \
- __asm movq mm3,mm2 \
- __asm pmulhw mm6,mm7 \
- __asm pmullw mm3,mm7 \
- __asm paddw mm6,mm1 \
- __asm movq mm1,mm3 \
- __asm punpckhwd mm3,mm6 \
- __asm punpcklwd mm1,mm6 \
- /*mm1={-1}x4 \
- mm4=u=(25080*t2''+60547*t3''+0x6CB7>>16)+(t3''!=0)*/ \
- __asm paddd mm5,mm3 \
- __asm paddd mm4,mm1 \
- __asm psrad mm5,16 \
- __asm mov A,0x28005460 \
- __asm psrad mm4,16 \
- __asm pcmpeqb mm1,mm1 \
- __asm packssdw mm4,mm5 \
- /*mm5={1}x4, mm6=_y[2]=u, mm7={21600,0x2800}x2 \
- mm4=s=(25080*u>>16)-t2''*/ \
- __asm movq mm6,mm4 \
- __asm pmulhw mm4,mm7 \
- __asm pxor mm5,mm5 \
- __asm movd mm7,A \
- __asm psubw mm5,mm1 \
- __asm punpckldq mm7,mm7 \
- __asm psubw mm4,mm2 \
- /*mm2=s+(s!=0) \
- mm4:mm3=s*21600+0x2800*/ \
- __asm movq mm3,mm4 \
- __asm movq mm2,mm4 \
- __asm punpckhwd mm4,mm5 \
- __asm pcmpeqw mm0,mm2 \
- __asm pmaddwd mm4,mm7 \
- __asm psubw mm0,mm1 \
- __asm punpcklwd mm3,mm5 \
- __asm paddw mm2,mm0 \
- __asm pmaddwd mm3,mm7 \
- /*mm0=_y[4], mm1=_y[7], mm4=_y[0], mm5=_y[5] \
- mm3=_y[6]=v=(s*21600+0x2800>>18)+s+(s!=0)*/ \
- __asm movq mm0,[Y+_r4] \
- __asm psrad mm4,18 \
- __asm movq mm5,[Y+_r5] \
- __asm psrad mm3,18 \
- __asm movq mm1,[Y+_r7] \
- __asm packssdw mm3,mm4 \
- __asm movq mm4,[Y+_r0] \
- __asm paddw mm3,mm2 \
-}
-
-/*On input, mm4=_y[0], mm6=_y[2], mm0=_y[4], mm5=_y[5], mm3=_y[6], mm1=_y[7].
- On output, {_y[4],mm1,mm2,mm3} contains the transpose of _y[4...7] and
- {mm4,mm5,mm6,mm7} contains the transpose of _y[0...3].*/
-#define OC_TRANSPOSE8x4(_r0,_r1,_r2,_r3,_r4,_r5,_r6,_r7) __asm{ \
- /*First 4x4 transpose:*/ \
- /*mm0 = e3 e2 e1 e0 \
- mm5 = f3 f2 f1 f0 \
- mm3 = g3 g2 g1 g0 \
- mm1 = h3 h2 h1 h0*/ \
- __asm movq mm2,mm0 \
- __asm punpcklwd mm0,mm5 \
- __asm punpckhwd mm2,mm5 \
- __asm movq mm5,mm3 \
- __asm punpcklwd mm3,mm1 \
- __asm punpckhwd mm5,mm1 \
- /*mm0 = f1 e1 f0 e0 \
- mm2 = f3 e3 f2 e2 \
- mm3 = h1 g1 h0 g0 \
- mm5 = h3 g3 h2 g2*/ \
- __asm movq mm1,mm0 \
- __asm punpckldq mm0,mm3 \
- __asm movq [Y+_r4],mm0 \
- __asm punpckhdq mm1,mm3 \
- __asm movq mm0,[Y+_r1] \
- __asm movq mm3,mm2 \
- __asm punpckldq mm2,mm5 \
- __asm punpckhdq mm3,mm5 \
- __asm movq mm5,[Y+_r3] \
- /*_y[4] = h0 g0 f0 e0 \
- mm1 = h1 g1 f1 e1 \
- mm2 = h2 g2 f2 e2 \
- mm3 = h3 g3 f3 e3*/ \
- /*Second 4x4 transpose:*/ \
- /*mm4 = a3 a2 a1 a0 \
- mm0 = b3 b2 b1 b0 \
- mm6 = c3 c2 c1 c0 \
- mm5 = d3 d2 d1 d0*/ \
- __asm movq mm7,mm4 \
- __asm punpcklwd mm4,mm0 \
- __asm punpckhwd mm7,mm0 \
- __asm movq mm0,mm6 \
- __asm punpcklwd mm6,mm5 \
- __asm punpckhwd mm0,mm5 \
- /*mm4 = b1 a1 b0 a0 \
- mm7 = b3 a3 b2 a2 \
- mm6 = d1 c1 d0 c0 \
- mm0 = d3 c3 d2 c2*/ \
- __asm movq mm5,mm4 \
- __asm punpckldq mm4,mm6 \
- __asm punpckhdq mm5,mm6 \
- __asm movq mm6,mm7 \
- __asm punpckhdq mm7,mm0 \
- __asm punpckldq mm6,mm0 \
- /*mm4 = d0 c0 b0 a0 \
- mm5 = d1 c1 b1 a1 \
- mm6 = d2 c2 b2 a2 \
- mm7 = d3 c3 b3 a3*/ \
-}
-
-/*MMX implementation of the fDCT.*/
-void oc_enc_fdct8x8_mmx(ogg_int16_t _y[64],const ogg_int16_t _x[64]){
- ptrdiff_t a;
- __asm{
-#define Y eax
-#define A ecx
-#define X edx
- /*Add two extra bits of working precision to improve accuracy; any more and
- we could overflow.*/
- /*We also add biases to correct for some systematic error that remains in
- the full fDCT->iDCT round trip.*/
- mov X, _x
- mov Y, _y
- movq mm0,[0x00+X]
- movq mm1,[0x10+X]
- movq mm2,[0x20+X]
- movq mm3,[0x30+X]
- pcmpeqb mm4,mm4
- pxor mm7,mm7
- movq mm5,mm0
- psllw mm0,2
- pcmpeqw mm5,mm7
- movq mm7,[0x70+X]
- psllw mm1,2
- psubw mm5,mm4
- psllw mm2,2
- mov A,1
- pslld mm5,16
- movd mm6,A
- psllq mm5,16
- mov A,0x10001
- psllw mm3,2
- movd mm4,A
- punpckhwd mm5,mm6
- psubw mm1,mm6
- movq mm6,[0x60+X]
- paddw mm0,mm5
- movq mm5,[0x50+X]
- paddw mm0,mm4
- movq mm4,[0x40+X]
- /*We inline stage1 of the transform here so we can get better instruction
- scheduling with the shifts.*/
- /*mm0=t7'=t0-t7*/
- psllw mm7,2
- psubw mm0,mm7
- psllw mm6,2
- paddw mm7,mm7
- /*mm1=t6'=t1-t6*/
- psllw mm5,2
- psubw mm1,mm6
- psllw mm4,2
- paddw mm6,mm6
- /*mm2=t5'=t2-t5*/
- psubw mm2,mm5
- paddw mm5,mm5
- /*mm3=t4'=t3-t4*/
- psubw mm3,mm4
- paddw mm4,mm4
- /*mm7=t0'=t0+t7*/
- paddw mm7,mm0
- /*mm6=t1'=t1+t6*/
- paddw mm6,mm1
- /*mm5=t2'=t2+t5*/
- paddw mm5,mm2
- /*mm4=t3'=t3+t4*/
- paddw mm4,mm3
- OC_FDCT8x4(0x00,0x10,0x20,0x30,0x40,0x50,0x60,0x70)
- OC_TRANSPOSE8x4(0x00,0x10,0x20,0x30,0x40,0x50,0x60,0x70)
- /*Swap out this 8x4 block for the next one.*/
- movq mm0,[0x08+X]
- movq [0x30+Y],mm7
- movq mm7,[0x78+X]
- movq [0x50+Y],mm1
- movq mm1,[0x18+X]
- movq [0x20+Y],mm6
- movq mm6,[0x68+X]
- movq [0x60+Y],mm2
- movq mm2,[0x28+X]
- movq [0x10+Y],mm5
- movq mm5,[0x58+X]
- movq [0x70+Y],mm3
- movq mm3,[0x38+X]
- /*And increase its working precision, too.*/
- psllw mm0,2
- movq [0x00+Y],mm4
- psllw mm7,2
- movq mm4,[0x48+X]
- /*We inline stage1 of the transform here so we can get better instruction
- scheduling with the shifts.*/
- /*mm0=t7'=t0-t7*/
- psubw mm0,mm7
- psllw mm1,2
- paddw mm7,mm7
- psllw mm6,2
- /*mm1=t6'=t1-t6*/
- psubw mm1,mm6
- psllw mm2,2
- paddw mm6,mm6
- psllw mm5,2
- /*mm2=t5'=t2-t5*/
- psubw mm2,mm5
- psllw mm3,2
- paddw mm5,mm5
- psllw mm4,2
- /*mm3=t4'=t3-t4*/
- psubw mm3,mm4
- paddw mm4,mm4
- /*mm7=t0'=t0+t7*/
- paddw mm7,mm0
- /*mm6=t1'=t1+t6*/
- paddw mm6,mm1
- /*mm5=t2'=t2+t5*/
- paddw mm5,mm2
- /*mm4=t3'=t3+t4*/
- paddw mm4,mm3
- OC_FDCT8x4(0x08,0x18,0x28,0x38,0x48,0x58,0x68,0x78)
- OC_TRANSPOSE8x4(0x08,0x18,0x28,0x38,0x48,0x58,0x68,0x78)
- /*Here the first 4x4 block of output from the last transpose is the second
- 4x4 block of input for the next transform.
- We have cleverly arranged that it already be in the appropriate place,
- so we only have to do half the stores and loads.*/
- movq mm0,[0x00+Y]
- movq [0x58+Y],mm1
- movq mm1,[0x10+Y]
- movq [0x68+Y],mm2
- movq mm2,[0x20+Y]
- movq [0x78+Y],mm3
- movq mm3,[0x30+Y]
- OC_FDCT_STAGE1_8x4
- OC_FDCT8x4(0x00,0x10,0x20,0x30,0x08,0x18,0x28,0x38)
- OC_TRANSPOSE8x4(0x00,0x10,0x20,0x30,0x08,0x18,0x28,0x38)
- /*mm0={-2}x4*/
- pcmpeqw mm0,mm0
- paddw mm0,mm0
- /*Round the results.*/
- psubw mm1,mm0
- psubw mm2,mm0
- psraw mm1,2
- psubw mm3,mm0
- movq [0x18+Y],mm1
- psraw mm2,2
- psubw mm4,mm0
- movq mm1,[0x08+Y]
- psraw mm3,2
- psubw mm5,mm0
- psraw mm4,2
- psubw mm6,mm0
- psraw mm5,2
- psubw mm7,mm0
- psraw mm6,2
- psubw mm1,mm0
- psraw mm7,2
- movq mm0,[0x40+Y]
- psraw mm1,2
- movq [0x30+Y],mm7
- movq mm7,[0x78+Y]
- movq [0x08+Y],mm1
- movq mm1,[0x50+Y]
- movq [0x20+Y],mm6
- movq mm6,[0x68+Y]
- movq [0x28+Y],mm2
- movq mm2,[0x60+Y]
- movq [0x10+Y],mm5
- movq mm5,[0x58+Y]
- movq [0x38+Y],mm3
- movq mm3,[0x70+Y]
- movq [0x00+Y],mm4
- movq mm4,[0x48+Y]
- OC_FDCT_STAGE1_8x4
- OC_FDCT8x4(0x40,0x50,0x60,0x70,0x48,0x58,0x68,0x78)
- OC_TRANSPOSE8x4(0x40,0x50,0x60,0x70,0x48,0x58,0x68,0x78)
- /*mm0={-2}x4*/
- pcmpeqw mm0,mm0
- paddw mm0,mm0
- /*Round the results.*/
- psubw mm1,mm0
- psubw mm2,mm0
- psraw mm1,2
- psubw mm3,mm0
- movq [0x58+Y],mm1
- psraw mm2,2
- psubw mm4,mm0
- movq mm1,[0x48+Y]
- psraw mm3,2
- psubw mm5,mm0
- movq [0x68+Y],mm2
- psraw mm4,2
- psubw mm6,mm0
- movq [0x78+Y],mm3
- psraw mm5,2
- psubw mm7,mm0
- movq [0x40+Y],mm4
- psraw mm6,2
- psubw mm1,mm0
- movq [0x50+Y],mm5
- psraw mm7,2
- movq [0x60+Y],mm6
- psraw mm1,2
- movq [0x70+Y],mm7
- movq [0x48+Y],mm1
-#undef Y
-#undef A
-#undef X
- }
-}
-
-#endif
+/********************************************************************
+ * *
+ * THIS FILE IS PART OF THE OggTheora SOFTWARE CODEC SOURCE CODE. *
+ * USE, DISTRIBUTION AND REPRODUCTION OF THIS LIBRARY SOURCE IS *
+ * GOVERNED BY A BSD-STYLE SOURCE LICENSE INCLUDED WITH THIS SOURCE *
+ * IN 'COPYING'. PLEASE READ THESE TERMS BEFORE DISTRIBUTING. *
+ * *
+ * THE Theora SOURCE CODE IS COPYRIGHT (C) 1999-2006 *
+ * by the Xiph.Org Foundation http://www.xiph.org/ *
+ * *
+ ********************************************************************/
+ /*MMX fDCT implementation for x86_32*/
+/*$Id: fdct_ses2.c 14579 2008-03-12 06:42:40Z xiphmont $*/
+#include "x86enc.h"
+
+#if defined(OC_X86_ASM)
+
+#define OC_FDCT_STAGE1_8x4 __asm{ \
+ /*Stage 1:*/ \
+ /*mm0=t7'=t0-t7*/ \
+ __asm psubw mm0,mm7 \
+ __asm paddw mm7,mm7 \
+ /*mm1=t6'=t1-t6*/ \
+ __asm psubw mm1, mm6 \
+ __asm paddw mm6,mm6 \
+ /*mm2=t5'=t2-t5*/ \
+ __asm psubw mm2,mm5 \
+ __asm paddw mm5,mm5 \
+ /*mm3=t4'=t3-t4*/ \
+ __asm psubw mm3,mm4 \
+ __asm paddw mm4,mm4 \
+ /*mm7=t0'=t0+t7*/ \
+ __asm paddw mm7,mm0 \
+ /*mm6=t1'=t1+t6*/ \
+ __asm paddw mm6,mm1 \
+ /*mm5=t2'=t2+t5*/ \
+ __asm paddw mm5,mm2 \
+ /*mm4=t3'=t3+t4*/ \
+ __asm paddw mm4,mm3\
+}
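Stage 1 above is the first set of fDCT butterflies applied to the eight values
t0...t7 of a row (or column); in scalar form it is simply the loop below
(illustrative only, using short in place of ogg_int16_t; the MMX macro does
the same thing on four columns at once):

/*Scalar form of OC_FDCT_STAGE1_8x4 for one 8-point vector:
  t[7-i] becomes t[i]+t[7-i] and t[i] becomes t[i]-t[7-i], for i=0...3.*/
static void fdct_stage1_ref(short _t[8]){
  int i;
  for(i=0;i<4;i++){
    short a;
    short b;
    a=_t[i];
    b=_t[7-i];
    _t[7-i]=(short)(a+b);
    _t[i]=(short)(a-b);
  }
}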
+
+#define OC_FDCT8x4(_r0,_r1,_r2,_r3,_r4,_r5,_r6,_r7) __asm{ \
+ /*Stage 2:*/ \
+ /*mm7=t3''=t0'-t3'*/ \
+ __asm psubw mm7,mm4 \
+ __asm paddw mm4,mm4 \
+ /*mm6=t2''=t1'-t2'*/ \
+ __asm psubw mm6,mm5 \
+ __asm movq [Y+_r6],mm7 \
+ __asm paddw mm5,mm5 \
+ /*mm1=t5''=t6'-t5'*/ \
+ __asm psubw mm1,mm2 \
+ __asm movq [Y+_r2],mm6 \
+ /*mm4=t0''=t0'+t3'*/ \
+ __asm paddw mm4,mm7 \
+ __asm paddw mm2,mm2 \
+ /*mm5=t1''=t1'+t2'*/ \
+ __asm movq [Y+_r0],mm4 \
+ __asm paddw mm5,mm6 \
+ /*mm2=t6''=t6'+t5'*/ \
+ __asm paddw mm2,mm1 \
+ __asm movq [Y+_r4],mm5 \
+ /*mm0=t7', mm1=t5'', mm2=t6'', mm3=t4'.*/ \
+ /*mm4, mm5, mm6, mm7 are free.*/ \
+ /*Stage 3:*/ \
+ /*mm6={2}x4, mm7={27146,0xB500>>1}x2*/ \
+ __asm mov A,0x5A806A0A \
+ __asm pcmpeqb mm6,mm6 \
+ __asm movd mm7,A \
+ __asm psrlw mm6,15 \
+ __asm punpckldq mm7,mm7 \
+ __asm paddw mm6,mm6 \
+  /*mm0=0, mm2={-1}x4 \
+ mm5:mm4=t5''*27146+0xB500*/ \
+ __asm movq mm4,mm1 \
+ __asm movq mm5,mm1 \
+ __asm punpcklwd mm4,mm6 \
+ __asm movq [Y+_r3],mm2 \
+ __asm pmaddwd mm4,mm7 \
+ __asm movq [Y+_r7],mm0 \
+ __asm punpckhwd mm5,mm6 \
+ __asm pxor mm0,mm0 \
+ __asm pmaddwd mm5,mm7 \
+ __asm pcmpeqb mm2,mm2 \
+ /*mm2=t6'', mm1=t5''+(t5''!=0) \
+ mm4=(t5''*27146+0xB500>>16)*/ \
+ __asm pcmpeqw mm0,mm1 \
+ __asm psrad mm4,16 \
+ __asm psubw mm0,mm2 \
+ __asm movq mm2, [Y+_r3] \
+ __asm psrad mm5,16 \
+ __asm paddw mm1,mm0 \
+ __asm packssdw mm4,mm5 \
+ /*mm4=s=(t5''*27146+0xB500>>16)+t5''+(t5''!=0)>>1*/ \
+ __asm paddw mm4,mm1 \
+ __asm movq mm0, [Y+_r7] \
+ __asm psraw mm4,1 \
+ __asm movq mm1,mm3 \
+ /*mm3=t4''=t4'+s*/ \
+ __asm paddw mm3,mm4 \
+ /*mm1=t5'''=t4'-s*/ \
+ __asm psubw mm1,mm4 \
+ /*mm1=0, mm3={-1}x4 \
+ mm5:mm4=t6''*27146+0xB500*/ \
+ __asm movq mm4,mm2 \
+ __asm movq mm5,mm2 \
+ __asm punpcklwd mm4,mm6 \
+ __asm movq [Y+_r5],mm1 \
+ __asm pmaddwd mm4,mm7 \
+ __asm movq [Y+_r1],mm3 \
+ __asm punpckhwd mm5,mm6 \
+ __asm pxor mm1,mm1 \
+ __asm pmaddwd mm5,mm7 \
+ __asm pcmpeqb mm3,mm3 \
+ /*mm2=t6''+(t6''!=0), mm4=(t6''*27146+0xB500>>16)*/ \
+ __asm psrad mm4,16 \
+ __asm pcmpeqw mm1,mm2 \
+ __asm psrad mm5,16 \
+ __asm psubw mm1,mm3 \
+ __asm packssdw mm4,mm5 \
+ __asm paddw mm2,mm1 \
+ /*mm1=t1'' \
+ mm4=s=(t6''*27146+0xB500>>16)+t6''+(t6''!=0)>>1*/ \
+ __asm paddw mm4,mm2 \
+ __asm movq mm1,[Y+_r4] \
+ __asm psraw mm4,1 \
+ __asm movq mm2,mm0 \
+ /*mm7={54491-0x7FFF,0x7FFF}x2 \
+ mm0=t7''=t7'+s*/ \
+ __asm paddw mm0,mm4 \
+ /*mm2=t6'''=t7'-s*/ \
+ __asm psubw mm2,mm4 \
+ /*Stage 4:*/ \
+ /*mm0=0, mm2=t0'' \
+ mm5:mm4=t1''*27146+0xB500*/ \
+ __asm movq mm4,mm1 \
+ __asm movq mm5,mm1 \
+ __asm punpcklwd mm4,mm6 \
+ __asm movq [Y+_r3],mm2 \
+ __asm pmaddwd mm4,mm7 \
+ __asm movq mm2,[Y+_r0] \
+ __asm punpckhwd mm5,mm6 \
+ __asm movq [Y+_r7],mm0 \
+ __asm pmaddwd mm5,mm7 \
+ __asm pxor mm0,mm0 \
+ /*mm7={27146,0x4000>>1}x2 \
+ mm0=s=(t1''*27146+0xB500>>16)+t1''+(t1''!=0)*/ \
+ __asm psrad mm4,16 \
+ __asm mov A,0x20006A0A \
+ __asm pcmpeqw mm0,mm1 \
+ __asm movd mm7,A \
+ __asm psrad mm5,16 \
+ __asm psubw mm0,mm3 \
+ __asm packssdw mm4,mm5 \
+ __asm paddw mm0,mm1 \
+ __asm punpckldq mm7,mm7 \
+ __asm paddw mm0,mm4 \
+ /*mm6={0x00000E3D}x2 \
+ mm1=-(t0''==0), mm5:mm4=t0''*27146+0x4000*/ \
+ __asm movq mm4,mm2 \
+ __asm movq mm5,mm2 \
+ __asm punpcklwd mm4,mm6 \
+ __asm mov A,0x0E3D \
+ __asm pmaddwd mm4,mm7 \
+ __asm punpckhwd mm5,mm6 \
+ __asm movd mm6,A \
+ __asm pmaddwd mm5,mm7 \
+ __asm pxor mm1,mm1 \
+ __asm punpckldq mm6,mm6 \
+ __asm pcmpeqw mm1,mm2 \
+ /*mm4=r=(t0''*27146+0x4000>>16)+t0''+(t0''!=0)*/ \
+ __asm psrad mm4,16 \
+ __asm psubw mm1,mm3 \
+ __asm psrad mm5,16 \
+ __asm paddw mm2,mm1 \
+ __asm packssdw mm4,mm5 \
+ __asm movq mm1,[Y+_r5] \
+ __asm paddw mm4,mm2 \
+ /*mm2=t6'', mm0=_y[0]=u=r+s>>1 \
+ The naive implementation could cause overflow, so we use \
+ u=(r&s)+((r^s)>>1).*/ \
+ __asm movq mm2,[Y+_r3] \
+ __asm movq mm7,mm0 \
+ __asm pxor mm0,mm4 \
+ __asm pand mm7,mm4 \
+ __asm psraw mm0,1 \
+ __asm mov A,0x7FFF54DC \
+ __asm paddw mm0,mm7 \
+ __asm movd mm7,A \
+ /*mm7={54491-0x7FFF,0x7FFF}x2 \
+ mm4=_y[4]=v=r-u*/ \
+ __asm psubw mm4,mm0 \
+ __asm punpckldq mm7,mm7 \
+ __asm movq [Y+_r4],mm4 \
+ /*mm0=0, mm7={36410}x4 \
+ mm1=(t5'''!=0), mm5:mm4=54491*t5'''+0x0E3D*/ \
+ __asm movq mm4,mm1 \
+ __asm movq mm5,mm1 \
+ __asm punpcklwd mm4,mm1 \
+ __asm mov A,0x8E3A8E3A \
+ __asm pmaddwd mm4,mm7 \
+ __asm movq [Y+_r0],mm0 \
+ __asm punpckhwd mm5,mm1 \
+ __asm pxor mm0,mm0 \
+ __asm pmaddwd mm5,mm7 \
+ __asm pcmpeqw mm1,mm0 \
+ __asm movd mm7,A \
+ __asm psubw mm1,mm3 \
+ __asm punpckldq mm7,mm7 \
+ __asm paddd mm4,mm6 \
+ __asm paddd mm5,mm6 \
+ /*mm0=0 \
+ mm3:mm1=36410*t6'''+((t5'''!=0)<<16)*/ \
+ __asm movq mm6,mm2 \
+ __asm movq mm3,mm2 \
+ __asm pmulhw mm6,mm7 \
+ __asm paddw mm1,mm2 \
+ __asm pmullw mm3,mm7 \
+ __asm pxor mm0,mm0 \
+ __asm paddw mm6,mm1 \
+ __asm movq mm1,mm3 \
+ __asm punpckhwd mm3,mm6 \
+ __asm punpcklwd mm1,mm6 \
+ /*mm3={-1}x4, mm6={1}x4 \
+ mm4=_y[5]=u=(54491*t5'''+36410*t6'''+0x0E3D>>16)+(t5'''!=0)*/ \
+ __asm paddd mm5,mm3 \
+ __asm paddd mm4,mm1 \
+ __asm psrad mm5,16 \
+ __asm pxor mm6,mm6 \
+ __asm psrad mm4,16 \
+ __asm pcmpeqb mm3,mm3 \
+ __asm packssdw mm4,mm5 \
+ __asm psubw mm6,mm3 \
+ /*mm1=t7'', mm7={26568,0x3400}x2 \
+ mm2=s=t6'''-(36410*u>>16)*/ \
+ __asm movq mm1,mm4 \
+ __asm mov A,0x340067C8 \
+ __asm pmulhw mm4,mm7 \
+ __asm movd mm7,A \
+ __asm movq [Y+_r5],mm1 \
+ __asm punpckldq mm7,mm7 \
+ __asm paddw mm4,mm1 \
+ __asm movq mm1,[Y+_r7] \
+ __asm psubw mm2,mm4 \
+ /*mm6={0x00007B1B}x2 \
+ mm0=(s!=0), mm5:mm4=s*26568+0x3400*/ \
+ __asm movq mm4,mm2 \
+ __asm movq mm5,mm2 \
+ __asm punpcklwd mm4,mm6 \
+ __asm pcmpeqw mm0,mm2 \
+ __asm pmaddwd mm4,mm7 \
+ __asm mov A,0x7B1B \
+ __asm punpckhwd mm5,mm6 \
+ __asm movd mm6,A \
+ __asm pmaddwd mm5,mm7 \
+ __asm psubw mm0,mm3 \
+ __asm punpckldq mm6,mm6 \
+ /*mm7={64277-0x7FFF,0x7FFF}x2 \
+ mm2=_y[3]=v=(s*26568+0x3400>>17)+s+(s!=0)*/ \
+ __asm psrad mm4,17 \
+ __asm paddw mm2,mm0 \
+ __asm psrad mm5,17 \
+ __asm mov A,0x7FFF7B16 \
+ __asm packssdw mm4,mm5 \
+ __asm movd mm7,A \
+ __asm paddw mm2,mm4 \
+ __asm punpckldq mm7,mm7 \
+ /*mm0=0, mm7={12785}x4 \
+ mm1=(t7''!=0), mm2=t4'', mm5:mm4=64277*t7''+0x7B1B*/ \
+ __asm movq mm4,mm1 \
+ __asm movq mm5,mm1 \
+ __asm movq [Y+_r3],mm2 \
+ __asm punpcklwd mm4,mm1 \
+ __asm movq mm2,[Y+_r1] \
+ __asm pmaddwd mm4,mm7 \
+ __asm mov A,0x31F131F1 \
+ __asm punpckhwd mm5,mm1 \
+ __asm pxor mm0,mm0 \
+ __asm pmaddwd mm5,mm7 \
+ __asm pcmpeqw mm1,mm0 \
+ __asm movd mm7,A \
+ __asm psubw mm1,mm3 \
+ __asm punpckldq mm7,mm7 \
+ __asm paddd mm4,mm6 \
+ __asm paddd mm5,mm6 \
+ /*mm3:mm1=12785*t4'''+((t7''!=0)<<16)*/ \
+ __asm movq mm6,mm2 \
+ __asm movq mm3,mm2 \
+ __asm pmulhw mm6,mm7 \
+ __asm pmullw mm3,mm7 \
+ __asm paddw mm6,mm1 \
+ __asm movq mm1,mm3 \
+ __asm punpckhwd mm3,mm6 \
+ __asm punpcklwd mm1,mm6 \
+ /*mm3={-1}x4, mm6={1}x4 \
+ mm4=_y[1]=u=(12785*t4'''+64277*t7''+0x7B1B>>16)+(t7''!=0)*/ \
+ __asm paddd mm5,mm3 \
+ __asm paddd mm4,mm1 \
+ __asm psrad mm5,16 \
+ __asm pxor mm6,mm6 \
+ __asm psrad mm4,16 \
+ __asm pcmpeqb mm3,mm3 \
+ __asm packssdw mm4,mm5 \
+ __asm psubw mm6,mm3 \
+ /*mm1=t3'', mm7={20539,0x3000}x2 \
+ mm4=s=(12785*u>>16)-t4''*/ \
+ __asm movq [Y+_r1],mm4 \
+ __asm pmulhw mm4,mm7 \
+ __asm mov A,0x3000503B \
+ __asm movq mm1,[Y+_r6] \
+ __asm movd mm7,A \
+ __asm psubw mm4,mm2 \
+ __asm punpckldq mm7,mm7 \
+ /*mm6={0x00006CB7}x2 \
+ mm0=(s!=0), mm5:mm4=s*20539+0x3000*/ \
+ __asm movq mm5,mm4 \
+ __asm movq mm2,mm4 \
+ __asm punpcklwd mm4,mm6 \
+ __asm pcmpeqw mm0,mm2 \
+ __asm pmaddwd mm4,mm7 \
+ __asm mov A,0x6CB7 \
+ __asm punpckhwd mm5,mm6 \
+ __asm movd mm6,A \
+ __asm pmaddwd mm5,mm7 \
+ __asm psubw mm0,mm3 \
+ __asm punpckldq mm6,mm6 \
+ /*mm7={60547-0x7FFF,0x7FFF}x2 \
+ mm2=_y[7]=v=(s*20539+0x3000>>20)+s+(s!=0)*/ \
+ __asm psrad mm4,20 \
+ __asm paddw mm2,mm0 \
+ __asm psrad mm5,20 \
+ __asm mov A,0x7FFF6C84 \
+ __asm packssdw mm4,mm5 \
+ __asm movd mm7,A \
+ __asm paddw mm2,mm4 \
+ __asm punpckldq mm7,mm7 \
+ /*mm0=0, mm7={25080}x4 \
+ mm2=t2'', mm5:mm4=60547*t3''+0x6CB7*/ \
+ __asm movq mm4,mm1 \
+ __asm movq mm5,mm1 \
+ __asm movq [Y+_r7],mm2 \
+ __asm punpcklwd mm4,mm1 \
+ __asm movq mm2,[Y+_r2] \
+ __asm pmaddwd mm4,mm7 \
+ __asm mov A,0x61F861F8 \
+ __asm punpckhwd mm5,mm1 \
+ __asm pxor mm0,mm0 \
+ __asm pmaddwd mm5,mm7 \
+ __asm movd mm7,A \
+ __asm pcmpeqw mm1,mm0 \
+ __asm psubw mm1,mm3 \
+ __asm punpckldq mm7,mm7 \
+ __asm paddd mm4,mm6 \
+ __asm paddd mm5,mm6 \
+ /*mm3:mm1=25080*t2''+((t3''!=0)<<16)*/ \
+ __asm movq mm6,mm2 \
+ __asm movq mm3,mm2 \
+ __asm pmulhw mm6,mm7 \
+ __asm pmullw mm3,mm7 \
+ __asm paddw mm6,mm1 \
+ __asm movq mm1,mm3 \
+ __asm punpckhwd mm3,mm6 \
+ __asm punpcklwd mm1,mm6 \
+ /*mm1={-1}x4 \
+ mm4=u=(25080*t2''+60547*t3''+0x6CB7>>16)+(t3''!=0)*/ \
+ __asm paddd mm5,mm3 \
+ __asm paddd mm4,mm1 \
+ __asm psrad mm5,16 \
+ __asm mov A,0x28005460 \
+ __asm psrad mm4,16 \
+ __asm pcmpeqb mm1,mm1 \
+ __asm packssdw mm4,mm5 \
+ /*mm5={1}x4, mm6=_y[2]=u, mm7={21600,0x2800}x2 \
+ mm4=s=(25080*u>>16)-t2''*/ \
+ __asm movq mm6,mm4 \
+ __asm pmulhw mm4,mm7 \
+ __asm pxor mm5,mm5 \
+ __asm movd mm7,A \
+ __asm psubw mm5,mm1 \
+ __asm punpckldq mm7,mm7 \
+ __asm psubw mm4,mm2 \
+ /*mm2=s+(s!=0) \
+ mm4:mm3=s*21600+0x2800*/ \
+ __asm movq mm3,mm4 \
+ __asm movq mm2,mm4 \
+ __asm punpckhwd mm4,mm5 \
+ __asm pcmpeqw mm0,mm2 \
+ __asm pmaddwd mm4,mm7 \
+ __asm psubw mm0,mm1 \
+ __asm punpcklwd mm3,mm5 \
+ __asm paddw mm2,mm0 \
+ __asm pmaddwd mm3,mm7 \
+ /*mm0=_y[4], mm1=_y[7], mm4=_y[0], mm5=_y[5] \
+ mm3=_y[6]=v=(s*21600+0x2800>>18)+s+(s!=0)*/ \
+ __asm movq mm0,[Y+_r4] \
+ __asm psrad mm4,18 \
+ __asm movq mm5,[Y+_r5] \
+ __asm psrad mm3,18 \
+ __asm movq mm1,[Y+_r7] \
+ __asm packssdw mm3,mm4 \
+ __asm movq mm4,[Y+_r0] \
+ __asm paddw mm3,mm2 \
+}
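One trick in the macro above deserves a note: when it forms _y[0]=u=r+s>>1, the comment points out that a naive 16-bit add could overflow, so the code computes u=(r&s)+((r^s)>>1) instead (pand supplies the carry bits, pxor plus psraw the carry-free half-sum). A tiny standalone C illustration of that identity, assuming the usual arithmetic right shift on signed values, which is what psraw provides; the helper name is invented for the sketch:

#include <stdint.h>
/*Floor average of two 16-bit values without the intermediate r+s ever
  leaving the 16-bit range: r+s == 2*(r&s)+(r^s), so
  (r+s)>>1 == (r&s)+((r^s)>>1).*/
static int16_t avg_no_overflow(int16_t r,int16_t s){
  return (int16_t)((r&s)+((r^s)>>1));
}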
+
+/*On input, mm4=_y[0], mm6=_y[2], mm0=_y[4], mm5=_y[5], mm3=_y[6], mm1=_y[7].
+ On output, {_y[4],mm1,mm2,mm3} contains the transpose of _y[4...7] and
+ {mm4,mm5,mm6,mm7} contains the transpose of _y[0...3].*/
+#define OC_TRANSPOSE8x4(_r0,_r1,_r2,_r3,_r4,_r5,_r6,_r7) __asm{ \
+ /*First 4x4 transpose:*/ \
+ /*mm0 = e3 e2 e1 e0 \
+ mm5 = f3 f2 f1 f0 \
+ mm3 = g3 g2 g1 g0 \
+ mm1 = h3 h2 h1 h0*/ \
+ __asm movq mm2,mm0 \
+ __asm punpcklwd mm0,mm5 \
+ __asm punpckhwd mm2,mm5 \
+ __asm movq mm5,mm3 \
+ __asm punpcklwd mm3,mm1 \
+ __asm punpckhwd mm5,mm1 \
+ /*mm0 = f1 e1 f0 e0 \
+ mm2 = f3 e3 f2 e2 \
+ mm3 = h1 g1 h0 g0 \
+ mm5 = h3 g3 h2 g2*/ \
+ __asm movq mm1,mm0 \
+ __asm punpckldq mm0,mm3 \
+ __asm movq [Y+_r4],mm0 \
+ __asm punpckhdq mm1,mm3 \
+ __asm movq mm0,[Y+_r1] \
+ __asm movq mm3,mm2 \
+ __asm punpckldq mm2,mm5 \
+ __asm punpckhdq mm3,mm5 \
+ __asm movq mm5,[Y+_r3] \
+ /*_y[4] = h0 g0 f0 e0 \
+ mm1 = h1 g1 f1 e1 \
+ mm2 = h2 g2 f2 e2 \
+ mm3 = h3 g3 f3 e3*/ \
+ /*Second 4x4 transpose:*/ \
+ /*mm4 = a3 a2 a1 a0 \
+ mm0 = b3 b2 b1 b0 \
+ mm6 = c3 c2 c1 c0 \
+ mm5 = d3 d2 d1 d0*/ \
+ __asm movq mm7,mm4 \
+ __asm punpcklwd mm4,mm0 \
+ __asm punpckhwd mm7,mm0 \
+ __asm movq mm0,mm6 \
+ __asm punpcklwd mm6,mm5 \
+ __asm punpckhwd mm0,mm5 \
+ /*mm4 = b1 a1 b0 a0 \
+ mm7 = b3 a3 b2 a2 \
+ mm6 = d1 c1 d0 c0 \
+ mm0 = d3 c3 d2 c2*/ \
+ __asm movq mm5,mm4 \
+ __asm punpckldq mm4,mm6 \
+ __asm punpckhdq mm5,mm6 \
+ __asm movq mm6,mm7 \
+ __asm punpckhdq mm7,mm0 \
+ __asm punpckldq mm6,mm0 \
+ /*mm4 = d0 c0 b0 a0 \
+ mm5 = d1 c1 b1 a1 \
+ mm6 = d2 c2 b2 a2 \
+ mm7 = d3 c3 b3 a3*/ \
+}
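The transpose macro reaches column order purely with word and doubleword unpacks, as the a0..h3 layout comments trace step by step. For reference, its effect on one 4x4 block of 16-bit values is just an ordinary matrix transpose; a deliberately naive C version (illustrative only, not part of the library):

#include <stdint.h>
/*Swap m[i][j] with m[j][i]; the MMX code gets the same final layout from
  two rounds of punpcklwd/punpckhwd followed by punpckldq/punpckhdq.*/
static void transpose4x4(int16_t m[4][4]){
  int i,j;
  for(i=0;i<4;i++){
    for(j=i+1;j<4;j++){
      int16_t tmp=m[i][j];
      m[i][j]=m[j][i];
      m[j][i]=tmp;
    }
  }
}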
+
+/*MMX implementation of the fDCT.*/
+void oc_enc_fdct8x8_mmx(ogg_int16_t _y[64],const ogg_int16_t _x[64]){
+ ptrdiff_t a;
+ __asm{
+#define Y eax
+#define A ecx
+#define X edx
+ /*Add two extra bits of working precision to improve accuracy; any more and
+ we could overflow.*/
+ /*We also add biases to correct for some systematic error that remains in
+ the full fDCT->iDCT round trip.*/
+ mov X,_x
+ mov Y,_y
+ movq mm0,[0x00+X]
+ movq mm1,[0x10+X]
+ movq mm2,[0x20+X]
+ movq mm3,[0x30+X]
+ pcmpeqb mm4,mm4
+ pxor mm7,mm7
+ movq mm5,mm0
+ psllw mm0,2
+ pcmpeqw mm5,mm7
+ movq mm7,[0x70+X]
+ psllw mm1,2
+ psubw mm5,mm4
+ psllw mm2,2
+ mov A,1
+ pslld mm5,16
+ movd mm6,A
+ psllq mm5,16
+ mov A,0x10001
+ psllw mm3,2
+ movd mm4,A
+ punpckhwd mm5,mm6
+ psubw mm1,mm6
+ movq mm6,[0x60+X]
+ paddw mm0,mm5
+ movq mm5,[0x50+X]
+ paddw mm0,mm4
+ movq mm4,[0x40+X]
+ /*We inline stage1 of the transform here so we can get better instruction
+ scheduling with the shifts.*/
+ /*mm0=t7'=t0-t7*/
+ psllw mm7,2
+ psubw mm0,mm7
+ psllw mm6,2
+ paddw mm7,mm7
+ /*mm1=t6'=t1-t6*/
+ psllw mm5,2
+ psubw mm1,mm6
+ psllw mm4,2
+ paddw mm6,mm6
+ /*mm2=t5'=t2-t5*/
+ psubw mm2,mm5
+ paddw mm5,mm5
+ /*mm3=t4'=t3-t4*/
+ psubw mm3,mm4
+ paddw mm4,mm4
+ /*mm7=t0'=t0+t7*/
+ paddw mm7,mm0
+ /*mm6=t1'=t1+t6*/
+ paddw mm6,mm1
+ /*mm5=t2'=t2+t5*/
+ paddw mm5,mm2
+ /*mm4=t3'=t3+t4*/
+ paddw mm4,mm3
+ OC_FDCT8x4(0x00,0x10,0x20,0x30,0x40,0x50,0x60,0x70)
+ OC_TRANSPOSE8x4(0x00,0x10,0x20,0x30,0x40,0x50,0x60,0x70)
+ /*Swap out this 8x4 block for the next one.*/
+ movq mm0,[0x08+X]
+ movq [0x30+Y],mm7
+ movq mm7,[0x78+X]
+ movq [0x50+Y],mm1
+ movq mm1,[0x18+X]
+ movq [0x20+Y],mm6
+ movq mm6,[0x68+X]
+ movq [0x60+Y],mm2
+ movq mm2,[0x28+X]
+ movq [0x10+Y],mm5
+ movq mm5,[0x58+X]
+ movq [0x70+Y],mm3
+ movq mm3,[0x38+X]
+ /*And increase its working precision, too.*/
+ psllw mm0,2
+ movq [0x00+Y],mm4
+ psllw mm7,2
+ movq mm4,[0x48+X]
+ /*We inline stage1 of the transform here so we can get better instruction
+ scheduling with the shifts.*/
+ /*mm0=t7'=t0-t7*/
+ psubw mm0,mm7
+ psllw mm1,2
+ paddw mm7,mm7
+ psllw mm6,2
+ /*mm1=t6'=t1-t6*/
+ psubw mm1,mm6
+ psllw mm2,2
+ paddw mm6,mm6
+ psllw mm5,2
+ /*mm2=t5'=t2-t5*/
+ psubw mm2,mm5
+ psllw mm3,2
+ paddw mm5,mm5
+ psllw mm4,2
+ /*mm3=t4'=t3-t4*/
+ psubw mm3,mm4
+ paddw mm4,mm4
+ /*mm7=t0'=t0+t7*/
+ paddw mm7,mm0
+ /*mm6=t1'=t1+t6*/
+ paddw mm6,mm1
+ /*mm5=t2'=t2+t5*/
+ paddw mm5,mm2
+ /*mm4=t3'=t3+t4*/
+ paddw mm4,mm3
+ OC_FDCT8x4(0x08,0x18,0x28,0x38,0x48,0x58,0x68,0x78)
+ OC_TRANSPOSE8x4(0x08,0x18,0x28,0x38,0x48,0x58,0x68,0x78)
+ /*Here the first 4x4 block of output from the last transpose is the second
+ 4x4 block of input for the next transform.
+ We have cleverly arranged that it already be in the appropriate place,
+ so we only have to do half the stores and loads.*/
+ movq mm0,[0x00+Y]
+ movq [0x58+Y],mm1
+ movq mm1,[0x10+Y]
+ movq [0x68+Y],mm2
+ movq mm2,[0x20+Y]
+ movq [0x78+Y],mm3
+ movq mm3,[0x30+Y]
+ OC_FDCT_STAGE1_8x4
+ OC_FDCT8x4(0x00,0x10,0x20,0x30,0x08,0x18,0x28,0x38)
+ OC_TRANSPOSE8x4(0x00,0x10,0x20,0x30,0x08,0x18,0x28,0x38)
+ /*mm0={-2}x4*/
+ pcmpeqw mm0,mm0
+ paddw mm0,mm0
+ /*Round the results.*/
+ psubw mm1,mm0
+ psubw mm2,mm0
+ psraw mm1,2
+ psubw mm3,mm0
+ movq [0x18+Y],mm1
+ psraw mm2,2
+ psubw mm4,mm0
+ movq mm1,[0x08+Y]
+ psraw mm3,2
+ psubw mm5,mm0
+ psraw mm4,2
+ psubw mm6,mm0
+ psraw mm5,2
+ psubw mm7,mm0
+ psraw mm6,2
+ psubw mm1,mm0
+ psraw mm7,2
+ movq mm0,[0x40+Y]
+ psraw mm1,2
+ movq [0x30+Y],mm7
+ movq mm7,[0x78+Y]
+ movq [0x08+Y],mm1
+ movq mm1,[0x50+Y]
+ movq [0x20+Y],mm6
+ movq mm6,[0x68+Y]
+ movq [0x28+Y],mm2
+ movq mm2,[0x60+Y]
+ movq [0x10+Y],mm5
+ movq mm5,[0x58+Y]
+ movq [0x38+Y],mm3
+ movq mm3,[0x70+Y]
+ movq [0x00+Y],mm4
+ movq mm4,[0x48+Y]
+ OC_FDCT_STAGE1_8x4
+ OC_FDCT8x4(0x40,0x50,0x60,0x70,0x48,0x58,0x68,0x78)
+ OC_TRANSPOSE8x4(0x40,0x50,0x60,0x70,0x48,0x58,0x68,0x78)
+ /*mm0={-2}x4*/
+ pcmpeqw mm0,mm0
+ paddw mm0,mm0
+ /*Round the results.*/
+ psubw mm1,mm0
+ psubw mm2,mm0
+ psraw mm1,2
+ psubw mm3,mm0
+ movq [0x58+Y],mm1
+ psraw mm2,2
+ psubw mm4,mm0
+ movq mm1,[0x48+Y]
+ psraw mm3,2
+ psubw mm5,mm0
+ movq [0x68+Y],mm2
+ psraw mm4,2
+ psubw mm6,mm0
+ movq [0x78+Y],mm3
+ psraw mm5,2
+ psubw mm7,mm0
+ movq [0x40+Y],mm4
+ psraw mm6,2
+ psubw mm1,mm0
+ movq [0x50+Y],mm5
+ psraw mm7,2
+ movq [0x60+Y],mm6
+ psraw mm1,2
+ movq [0x70+Y],mm7
+ movq [0x48+Y],mm1
+#undef Y
+#undef A
+#undef X
+ }
+}
+
+#endif
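As its comments note, oc_enc_fdct8x8_mmx runs the transform with two extra bits of working precision: every input row is shifted left by two (psllw by 2, plus the small bias corrections the comments mention) before stage 1, and each finished coefficient is rounded back down at the end, where subtracting the {-2}x4 constant and then doing psraw by 2 is simply a rounded divide by four. A one-line C sketch of that final step (the helper name is invented and not part of the library):

#include <stdint.h>
/*Drop the two extra working-precision bits with rounding: adding 2 before
  the arithmetic shift matches psubw with -2 followed by psraw by 2.*/
static int16_t round_off_two_bits(int32_t v){
  return (int16_t)((v+2)>>2);
}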
diff --git a/thirdparty/nanosvg/LICENSE.txt b/thirdparty/nanosvg/LICENSE.txt
index 6fde401cb2..f896f2eb0f 100644
--- a/thirdparty/nanosvg/LICENSE.txt
+++ b/thirdparty/nanosvg/LICENSE.txt
@@ -1,18 +1,18 @@
-Copyright (c) 2013-14 Mikko Mononen memon@inside.org
-
-This software is provided 'as-is', without any express or implied
-warranty. In no event will the authors be held liable for any damages
-arising from the use of this software.
-
-Permission is granted to anyone to use this software for any purpose,
-including commercial applications, and to alter it and redistribute it
-freely, subject to the following restrictions:
-
-1. The origin of this software must not be misrepresented; you must not
-claim that you wrote the original software. If you use this software
-in a product, an acknowledgment in the product documentation would be
-appreciated but is not required.
-2. Altered source versions must be plainly marked as such, and must not be
-misrepresented as being the original software.
-3. This notice may not be removed or altered from any source distribution.
-
+Copyright (c) 2013-14 Mikko Mononen memon@inside.org
+
+This software is provided 'as-is', without any express or implied
+warranty. In no event will the authors be held liable for any damages
+arising from the use of this software.
+
+Permission is granted to anyone to use this software for any purpose,
+including commercial applications, and to alter it and redistribute it
+freely, subject to the following restrictions:
+
+1. The origin of this software must not be misrepresented; you must not
+claim that you wrote the original software. If you use this software
+in a product, an acknowledgment in the product documentation would be
+appreciated but is not required.
+2. Altered source versions must be plainly marked as such, and must not be
+misrepresented as being the original software.
+3. This notice may not be removed or altered from any source distribution.
+