Calling the Baidu OCR Text Recognition Service from PHP


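Documentation for the Baidu OCR service: Baidu OCR documentation
Baidu's API endpoints require authentication.
For details on authentication, see: Baidu API authentication
Here is a rough outline of how Baidu API authentication works:
The authentication field can be carried either in the HTTP request header or in the URL; I put it in the request header.
Add an Authorization field to the request headers, whose value is the signed authentication string.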
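The authentication string is generated according to the rules in the following figure:
[Figure: authentication string generation rules]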
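It consists of three main parts: CanonicalRequest, SigningKey, and Signature. The Signature is computed from the first two with HMAC-SHA256 (a keyed hash rather than encryption): the CanonicalRequest is signed using the SigningKey.
Authentication string format: bce-auth-v1/{accessKeyId}/{timestamp}/{expirationPeriodInSeconds}/{signedHeaders}/{Signature}
bce-auth-v1: the Baidu authentication protocol prefix
accessKeyId: your AK (Access Key ID); the SK is used only to derive the SigningKey and does not appear in the string
timestamp: the UTC time at which the signature takes effect
expirationPeriodInSeconds: how long the signature stays valid, in seconds (default 1800)
signedHeaders: the names of the headers included in the signature
Signature: the resulting signature string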
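To make the layout concrete, here is a minimal sketch of how the parts combine into the final string. The helper name `buildAuthString`, the placeholder credentials, and the hand-written canonical request are assumptions for illustration only; the official demo builds the canonical request itself.

```php
<?php
// Hypothetical helper (not part of the Baidu demo): assembles an auth string
// from an already-built canonical request.
function buildAuthString(
    string $accessKeyId,
    string $secretAccessKey,
    string $canonicalRequest,
    array $signedHeaderNames,
    int $timestamp,
    int $expirationInSeconds = 1800
): string {
    // Prefix: bce-auth-v1/{accessKeyId}/{timestamp}/{expirationPeriodInSeconds}
    $prefix = 'bce-auth-v1/' . $accessKeyId . '/'
        . gmdate('Y-m-d\TH:i:s\Z', $timestamp) . '/' . $expirationInSeconds;

    // SigningKey: HMAC-SHA256 of the prefix, keyed with the SK (hex output).
    $signingKey = hash_hmac('sha256', $prefix, $secretAccessKey);

    // Signature: HMAC-SHA256 of the canonical request, keyed with the SigningKey.
    $signature = hash_hmac('sha256', $canonicalRequest, $signingKey);

    // signedHeaders: lowercase names, sorted, joined with ';'.
    $names = array_map('strtolower', $signedHeaderNames);
    sort($names);

    return $prefix . '/' . implode(';', $names) . '/' . $signature;
}

// Placeholder credentials and canonical request, for illustration only.
$canonicalRequest = "POST\n/api/v1/ocr/general\n\nhost:word.bj.baidubce.com";
echo buildAuthString('my-ak', 'my-sk', $canonicalRequest, ['Host'], 1430123029), "\n";
```

The SK only ever serves as an HMAC key here, which is why it never shows up in the string sent to the server.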
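For the full generation procedure, see the official documentation: Generating the authentication string
Baidu already provides demo code that generates the authentication string, so there is no need to write it by hand: Authentication string generation
The PHP demo: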

<?php
/*
* Copyright (c) 2014 Baidu.com, Inc. All Rights Reserved
*
* Licensed under the Apache License, Version 2.0 (the "License"); you may not
* use this file except in compliance with the License. You may obtain a copy of
* the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
* License for the specific language governing permissions and limitations under
* the License.
*/

namespace BaiduBce\Auth;

class SignOption
{
    const EXPIRATION_IN_SECONDS = 'expirationInSeconds';

    const HEADERS_TO_SIGN = 'headersToSign';

    const TIMESTAMP = 'timestamp';

    const DEFAULT_EXPIRATION_IN_SECONDS = 1800;

    const MIN_EXPIRATION_IN_SECONDS = 300;

    const MAX_EXPIRATION_IN_SECONDS = 129600;
}

class HttpUtil
{
    // Per RFC 3986, every character must be percent-encoded except:
    //   1. upper- and lower-case ASCII letters
    //   2. digits
    //   3. period '.', tilde '~', hyphen '-', and underscore '_'
    public static $PERCENT_ENCODED_STRINGS;

    // Populate the encoding table
    public static function __init()
    {
        HttpUtil::$PERCENT_ENCODED_STRINGS = array();
        for ($i = 0; $i < 256; ++$i) {
            HttpUtil::$PERCENT_ENCODED_STRINGS[$i] = sprintf("%%%02X", $i);
        }

        // a-z are not encoded
        foreach (range('a', 'z') as $ch) {
            HttpUtil::$PERCENT_ENCODED_STRINGS[ord($ch)] = $ch;
        }

        // A-Z are not encoded
        foreach (range('A', 'Z') as $ch) {
            HttpUtil::$PERCENT_ENCODED_STRINGS[ord($ch)] = $ch;
        }

        // 0-9 are not encoded
        foreach (range('0', '9') as $ch) {
            HttpUtil::$PERCENT_ENCODED_STRINGS[ord($ch)] = $ch;
        }

        // The following 4 characters are not encoded
        HttpUtil::$PERCENT_ENCODED_STRINGS[ord('-')] = '-';
        HttpUtil::$PERCENT_ENCODED_STRINGS[ord('.')] = '.';
        HttpUtil::$PERCENT_ENCODED_STRINGS[ord('_')] = '_';
        HttpUtil::$PERCENT_ENCODED_STRINGS[ord('~')] = '~';
    }

    // '/' must not be encoded in a URI path
    public static function urlEncodeExceptSlash($path)
    {
        return str_replace("%2F", "/", HttpUtil::urlEncode($path));
    }

    // Encode using the encoding table
    public static function urlEncode($value)
    {
        $result = '';
        for ($i = 0; $i < strlen($value); ++$i) {
            $result .= HttpUtil::$PERCENT_ENCODED_STRINGS[ord($value[$i])];
        }
        return $result;
    }

    // Build the canonical query string
    public static function getCanonicalQueryString(array $parameters)
    {
        // No parameters: return an empty string
        if (count($parameters) == 0) {
            return '';
        }

        $parameterStrings = array();
        foreach ($parameters as $k => $v) {
            // Skip the Authorization field
            if (strcasecmp('Authorization', $k) == 0) {
                continue;
            }
            if (!isset($k)) {
                throw new \InvalidArgumentException(
                    "parameter key should not be null"
                );
            }
            if (isset($v)) {
                // Has a value: encode key and value and join them with '='
                $parameterStrings[] = HttpUtil::urlEncode($k)
                    . '=' . HttpUtil::urlEncode((string) $v);
            } else {
                // No value: encode the key, append '=', and leave the right side empty
                $parameterStrings[] = HttpUtil::urlEncode($k) . '=';
            }
        }
        // Sort lexicographically
        sort($parameterStrings);

        // Join with '&'
        return implode('&', $parameterStrings);
    }

    // Build the canonical URI
    public static function getCanonicalURIPath($path)
    {
        // An empty path becomes '/'
        if (empty($path)) {
            return '/';
        } else {
            // Every URI must start with '/'
            if ($path[0] == '/') {
                return HttpUtil::urlEncodeExceptSlash($path);
            } else {
                return '/' . HttpUtil::urlEncodeExceptSlash($path);
            }
        }
    }

    // Build the canonical HTTP headers string
    public static function getCanonicalHeaders($headers)
    {
        // No headers: return an empty string
        if (count($headers) == 0) {
            return '';
        }

        $headerStrings = array();
        foreach ($headers as $k => $v) {
            // Skip entries whose key is null
            if ($k === null) {
                continue;
            }
            // A null value is treated as an empty string
            if ($v === null) {
                $v = '';
            }
            // Trim, then encode, then join name and value with ':'
            $headerStrings[] = HttpUtil::urlEncode(strtolower(trim($k))) . ':' . HttpUtil::urlEncode(trim($v));
        }
        // Sort lexicographically
        sort($headerStrings);

        // Join with "\n"
        return implode("\n", $headerStrings);
    }
}
HttpUtil::__init();

class SampleSigner
{

    const BCE_AUTH_VERSION = "bce-auth-v1";
    const BCE_PREFIX = 'x-bce-';

    // When headersToSign is not specified, these HTTP headers are signed by default:
    //    1.host
    //    2.content-length
    //    3.content-type
    //    4.content-md5
    public static $defaultHeadersToSign;

    public static function  __init()
    {
        SampleSigner::$defaultHeadersToSign = array(
            "host",
            "content-length",
            "content-type",
            "content-md5",
        );
    }

    // Signing function
    public function sign(
        array $credentials,
        $httpMethod,
        $path,
        $headers,
        $params,
        $options = array()
    ) {
        // Set how long the signature is valid
        if (!isset($options[SignOption::EXPIRATION_IN_SECONDS])) {
            // Default: 1800 seconds
            $expirationInSeconds = SignOption::DEFAULT_EXPIRATION_IN_SECONDS;
        } else {
            $expirationInSeconds = $options[SignOption::EXPIRATION_IN_SECONDS];
        }

        // Extract the AK and SK
        $accessKeyId = $credentials['ak'];
        $secretAccessKey = $credentials['sk'];

        // Set the timestamp. Note: a caller-supplied timestamp must be in UTC
        if (!isset($options[SignOption::TIMESTAMP])) {
            // Default: the current time
            $timestamp = new \DateTime();
        } else {
            $timestamp = $options[SignOption::TIMESTAMP];
        }
        $timestamp->setTimezone(new \DateTimeZone("UTC"));

        // Build authString
        $authString = SampleSigner::BCE_AUTH_VERSION . '/' . $accessKeyId . '/'
            . $timestamp->format("Y-m-d\TH:i:s\Z") . '/' . $expirationInSeconds;

        // Derive signingKey from the SK and authString
        $signingKey = hash_hmac('sha256', $authString, $secretAccessKey);

        // Build the canonical URI
        $canonicalURI = HttpUtil::getCanonicalURIPath($path);

        // Build the canonical query string
        $canonicalQueryString = HttpUtil::getCanonicalQueryString($params);

        // Populate headersToSign, i.e. which headers take part in the signature
        $headersToSign = null;
        if (isset($options[SignOption::HEADERS_TO_SIGN])) {
            $headersToSign = $options[SignOption::HEADERS_TO_SIGN];
        }

        // Build the canonical headers
        $canonicalHeader = HttpUtil::getCanonicalHeaders(
            SampleSigner::getHeadersToSign($headers, $headersToSign)
        );

        // Join the headersToSign names with ';'
        $signedHeaders = '';
        if ($headersToSign !== null) {
            $signedHeaders = strtolower(
                trim(implode(";", array_keys($headersToSign)))
            );
        }

        // Assemble the canonical request
        $canonicalRequest = "$httpMethod\n$canonicalURI\n"
            . "$canonicalQueryString\n$canonicalHeader";

        // Sign the canonical request with signingKey
        $signature = hash_hmac('sha256', $canonicalRequest, $signingKey);

        // Assemble the final auth string
        $authorizationHeader = "$authString/$signedHeaders/$signature";

        return $authorizationHeader;
    }

    // Filter the headers that should be signed, based on headersToSign
    public static function getHeadersToSign($headers, $headersToSign)
    {
        // Headers whose trimmed value is empty are not signed
        $filter_empty = function($v) {
            return trim((string) $v) !== '';
        };
        $headers = array_filter($headers, $filter_empty);

        // Normalize header keys: trim surrounding whitespace and lowercase
        $trim_and_lower = function($str){
            return strtolower(trim($str));
        };
        $temp = array();
        $process_keys = function($k, $v) use(&$temp, $trim_and_lower) {
            $temp[$trim_and_lower($k)] = $v;
        };
        array_map($process_keys, array_keys($headers), $headers);
        $headers = $temp;

        // Keep the header keys for later use
        $header_keys = array_keys($headers);

        $filtered_keys = null;
        if ($headersToSign !== null) {
            // headersToSign was given: filter by it

            // Normalize headersToSign: trim surrounding whitespace and lowercase
            $headersToSign = array_map($trim_and_lower, $headersToSign);

            // Keep only the headers listed in headersToSign
            $filtered_keys = array_intersect_key($header_keys, $headersToSign);

        } else {
            // No headersToSign: select headers by the default rule
            $filter_by_default = function($k) {
                return SampleSigner::isDefaultHeaderToSign($k);
            };
            $filtered_keys = array_filter($header_keys, $filter_by_default);
        }

        // Return the headers that take part in the signature
        return array_intersect_key($headers, array_flip($filtered_keys));
    }

    // Check whether a header is signed by default:
    // 1. it is one of host, content-type, content-md5, content-length
    // 2. it starts with x-bce-
    public static function isDefaultHeaderToSign($header)
    {
        $header = strtolower(trim($header));
        if (in_array($header, SampleSigner::$defaultHeadersToSign)) {
            return true;
        }
        return substr_compare($header, SampleSigner::BCE_PREFIX, 0, strlen(SampleSigner::BCE_PREFIX)) == 0;
    }
}
SampleSigner::__init();

// Signing example
$signer = new SampleSigner();
$credentials = array("ak" => "0b0f67dfb88244b289b72b142befad0c","sk" => "bad522c2126a4618a8125f4b6cf6356f");
$httpMethod = "PUT";
$path = "/v1/test/myfolder/readme.txt";
$headers = array("Host" => "bj.bcebos.com",
                "Content-Length" => 8,
                "Content-MD5" => "NFzcPqhviddjRNnSOGo4rw==",
                "Content-Type" => "text/plain",
                "x-bce-date" => "2015-04-27T08:23:49Z");
$params = array("partNumber" => 9, "uploadId" => "a44cc9bab11cbd156984767aad637851");
date_default_timezone_set("PRC");
$timestamp = new \DateTime();
$timestamp->setTimestamp(1430123029);
$options = array(SignOption::TIMESTAMP => $timestamp);
$ret = $signer->sign($credentials, $httpMethod, $path, $headers, $params, $options);
print $ret;

So all we need to do is change the BOS endpoint that the demo uses by default to the OCR endpoint:

$signer = new SampleSigner();
$credentials = array("ak" => "your AK","sk" => "your SK");
$httpMethod = "POST";
$host= "word.bj.baidubce.com";
$path = "/api/v1/ocr/general";
$url="http://".$host.$path;
date_default_timezone_set('UTC');
$bceDate = date("Y-m-d") . "T" . date("H:i:s") . "Z";
$headers = array(
    "host" => $host,
    "x-bce-date" => $bceDate);
$params = array();
date_default_timezone_set("PRC");
$timestamp = new \DateTime();
$timestamp->setTimestamp(time());
// HEADERS_TO_SIGN can also be set here to choose which headers are signed,
// e.g. SignOption::HEADERS_TO_SIGN => array("field1" => "value", "field2" => "value")
$options = array(SignOption::TIMESTAMP => $timestamp);
$ret = $signer->sign($credentials, $httpMethod, $path, $headers, $params, $options);
// With the auth string in hand, prepare the API request
$tempfile = "test.jpg";
$file_content = file_get_contents($tempfile);
if ($file_content === false) {
    die("Cannot read {$tempfile}");
}
$encoded = base64_encode($file_content);
$data = "image=" . urlencode($encoded);
$head = array(
    "host:{$host}",
    "x-bce-date:{$bceDate}",
    "Authorization:{$ret}",
    "content-type: application/x-www-form-urlencoded"
);
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HTTPHEADER, $head);
curl_setopt($ch, CURLOPT_POSTFIELDS,$data);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'POST' );
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_0);
$output = curl_exec($ch);
curl_close($ch);
print_r($output);
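The request preparation above can be isolated into a small function, which makes it easier to check without touching the network. This is only a sketch under assumptions: `buildOcrRequest` is a hypothetical helper, the authorization value is a placeholder for whatever `sign()` returns, and `gmdate()` is used so the default timezone does not have to be switched to UTC and back.

```php
<?php
// Hypothetical helper: prepares the header list and form body for the OCR POST.
function buildOcrRequest(string $host, string $bceDate, string $authorization, string $imageBytes): array
{
    $headers = [
        "host: {$host}",
        "x-bce-date: {$bceDate}",
        "Authorization: {$authorization}",
        "content-type: application/x-www-form-urlencoded",
    ];
    // The image is base64-encoded, then URL-encoded as the "image" form field.
    $body = 'image=' . urlencode(base64_encode($imageBytes));
    return [$headers, $body];
}

// gmdate() always formats in UTC, regardless of the default timezone.
$bceDate = gmdate('Y-m-d\TH:i:s\Z');
[$headers, $body] = buildOcrRequest(
    'word.bj.baidubce.com',
    $bceDate,
    'auth-string-returned-by-sign', // placeholder
    'raw image bytes'               // placeholder for the file contents
);

// Sending would follow the cURL calls shown above; it is left out here so the
// sketch runs offline.
print_r($headers);
echo $body, "\n";
```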

However, the official demo has a small pitfall: calls to the OCR endpoint kept failing with

BecResponseException{httpStatus='401', requestId='null', code='AuthError', message='Bad signature or AK string and SK string do not match.'}

The fix is to modify the official demo code as follows:

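        // Populate headersToSign, i.e. which headers take part in the signature
        $headersToSign = null;
        if (isset($options[SignOption::HEADERS_TO_SIGN])) {
            $headersToSign = $options[SignOption::HEADERS_TO_SIGN];
        }
        // Replace headersToSign with the headers that are actually signed,
        // so that signedHeaders is filled in further down
        $headersToSign=SampleSigner::getHeadersToSign($headers, $headersToSign);
        // Build the canonical headers
        $canonicalHeader = HttpUtil::getCanonicalHeaders($headersToSign);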
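With this change, `$headersToSign` always holds the headers that were really signed, so `{signedHeaders}` is no longer empty. The stand-alone sketch below illustrates what that value ends up looking like. It is not the demo's code: `signedHeadersFor` is a hypothetical helper, and it also sorts the names, which Baidu's documentation asks for and the demo leaves to the order of the input array.

```php
<?php
// Hypothetical helper: derives the {signedHeaders} value from request headers,
// keeping only the headers that are signed by default.
function signedHeadersFor(array $headers): string
{
    $defaults = ['host', 'content-length', 'content-type', 'content-md5'];
    $names = [];
    foreach ($headers as $name => $value) {
        $name = strtolower(trim((string) $name));
        // Headers whose trimmed value is empty are not signed.
        if (trim((string) $value) === '') {
            continue;
        }
        if (in_array($name, $defaults, true) || strpos($name, 'x-bce-') === 0) {
            $names[] = $name;
        }
    }
    sort($names);
    return implode(';', $names);
}

echo signedHeadersFor([
    'x-bce-date' => '2015-04-27T08:23:49Z',
    'Host' => 'word.bj.baidubce.com',
    'Accept' => '*/*',
]), "\n"; // prints "host;x-bce-date"
```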

The reason is as follows.
Baidu's documentation states:

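CanonicalHeaders: the result of selectively encoding the headers of the HTTP request.

You may decide for yourself which headers to encode. The only requirement of the Baidu Cloud API is that the Host header must be encoded. In most cases, we recommend encoding the following headers:
    Host
    Content-Length
    Content-Type
    Content-MD5
    All headers that start with x-bce-

If some of these headers do not appear in your HTTP request, the missing ones do not need to be encoded.

If you encode exactly the recommended set, {signedHeaders} in the authentication string can be left empty.

You may also choose your own set of headers to encode. If you encode a header outside the recommended set, or your HTTP request contains a header from the recommended set that you choose not to encode, then you must fill in {signedHeaders} in the authentication string. To do so, convert the names of all headers encoded at this stage to lowercase, sort them lexicographically, and join them with semicolons (;).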

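In other words, signing is recommended for the Host, Content-Length, Content-Type, and Content-MD5 headers and for any header starting with x-bce-. If all of them are signed, signedHeaders may be left empty. Baidu's demo inspects the header array you pass in and automatically signs any of these fields it finds.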
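As a rough illustration of the encoding step described in the documentation, the sketch below builds a CanonicalHeaders string. It is an approximation under assumptions: `canonicalHeaders` is a hypothetical helper, and PHP's built-in `rawurlencode()` stands in for the demo's table-driven `HttpUtil::urlEncode`; both leave letters, digits, '-', '.', '_', and '~' untouched.

```php
<?php
// Hypothetical helper: builds the CanonicalHeaders string for the given headers.
function canonicalHeaders(array $headers): string
{
    $lines = [];
    foreach ($headers as $name => $value) {
        // Lowercase and trim the name, trim the value, percent-encode both,
        // and join them with ':'.
        $lines[] = rawurlencode(strtolower(trim((string) $name)))
            . ':' . rawurlencode(trim((string) $value));
    }
    // Sort lexicographically and join with newlines.
    sort($lines);
    return implode("\n", $lines);
}

echo canonicalHeaders([
    'x-bce-date' => '2015-04-27T08:23:49Z',
    'Host' => 'word.bj.baidubce.com',
]), "\n";
// prints:
// host:word.bj.baidubce.com
// x-bce-date:2015-04-27T08%3A23%3A49Z
```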

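In practice, though, leaving signedHeaders empty produced the signature error shown earlier. A likely cause, going by the documentation quoted here: the actual request also carries Content-Type (and cURL adds Content-Length), but only host and x-bce-date were passed to sign(). With signedHeaders empty, the server presumably includes recommended headers in its calculation that the client never signed. Listing the signed headers explicitly removes the mismatch.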

  • Article URL: https://www.blear.cn/article/baidu-ocr-demo

    Please credit the source with a link when reposting
